Train Benchmark: Enable multiprocess forkserver (CUDA compatibility) #52598
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
self._metrics["validation/step"].get()
+ self._metrics["validation/iter_first_batch"].get()
# Exclude the time it takes to get the first batch.
# + self._metrics["validation/iter_first_batch"].get()
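The effect of this hunk can be sketched as follows: the reported validation time keeps only the per-step time, and the first-batch latency stays available as its own value. This is a minimal sketch, not the benchmark's actual code. It assumes the metrics are already plain floats in a dict (the real code calls `.get()` on timer objects), and `summarize_validation` and the output keys are hypothetical names.

```python
def summarize_validation(metrics: dict[str, float]) -> dict[str, float]:
    """Hypothetical summary that keeps first-batch latency out of validation time."""
    # Steady-state time spent in validation steps.
    step_time = metrics["validation/step"]
    # Time spent waiting for the first batch (worker startup, warm-up).
    first_batch = metrics.get("validation/iter_first_batch", 0.0)
    return {
        # First-batch time is deliberately not added here.
        "validation_time": step_time,
        # Still reported, but as a separate value.
        "time_to_first_batch": first_batch,
    }


result = summarize_validation(
    {"validation/step": 12.5, "validation/iter_first_batch": 3.0}
)
```

Keeping the two values separate means a slower worker start method changes only the first-batch number, not the steady-state figure.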
Can you also export time-to-first-batch as a benchmark result?
This information is already there. If it's useful, we can dashboard it by adding it to the dataset.
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Verified deadlock without the fix:
After fix:
Here is the release test run.
Release test runs: https://buildkite.com/ray-project/release/builds/40180
https://buildkite.com/ray-project/release/builds/40180
Ray Data: https://buildkite.com/ray-project/release/builds/40258#_
release/train_tests/benchmark/image_classification/localfs_image_classification_jpeg/factory.py
…52598) --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> Signed-off-by: jhsu <jhsu@anyscale.com>
Why are these changes needed?
Train release test: Enable multiprocess forkserver (CUDA compatibility)
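For context, the CUDA runtime does not support the `fork` start method once CUDA has been initialized in the parent process, so worker processes need `forkserver` or `spawn`. A minimal sketch of selecting such a start method with the standard library is shown below. The helper name `choose_start_method` is hypothetical, and how the benchmark actually wires this in is not shown in this conversation.

```python
import multiprocessing as mp


def choose_start_method(parent_uses_cuda: bool) -> str:
    """Hypothetical helper: pick a worker start method that is safe with CUDA."""
    available = mp.get_all_start_methods()
    if not parent_uses_cuda:
        # The first entry is the platform default.
        return available[0]
    # Avoid "fork": forked children inherit CUDA state they cannot use.
    # Prefer "forkserver" where it exists, otherwise fall back to "spawn".
    return "forkserver" if "forkserver" in available else "spawn"


# Build a context without changing the process-wide default start method.
ctx = mp.get_context(choose_start_method(parent_uses_cuda=True))
```

The resulting context can be used to create pools or processes. With PyTorch, a start method or context like this can be passed through the `multiprocessing_context` argument of `torch.utils.data.DataLoader`.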
Related issue number
#52641
Checks
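I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.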