Use Blocking Wait if both Blocking Wait and Async Error Handling Are Set #47926
Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.
Differential Revision: [D24928149](https://our.internmc.facebook.com/intern/diff/D24928149/)
💊 CI failures summary and remediations
As of commit 675f47a (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 1 failure confirmed as flaky and can be ignored:
Codecov Report
@@                    Coverage Diff                    @@
##           gh/osalpekar/105/base    #47926     +/-   ##
=========================================================
- Coverage                  81.25%    81.25%   -0.01%
=========================================================
  Files                       1838      1838
  Lines                     198256    198256
=========================================================
- Hits                      161098    161097       -1
- Misses                     37158     37159       +1
This pull request has been merged in 5d51b63.
…Set (pytorch#47926)

Summary: Pull Request resolved: pytorch#47926

Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.

ghstack-source-id: 116553583

Test Plan: Simple FBL run/CI

Reviewed By: jiayisuse

Differential Revision: D24928149

fbshipit-source-id: d42c038ad44607feb3d46dd65925237c564ff7a3
if (blockingWait_ && asyncErrorHandling_) {
  // Both modes were requested: keep blocking wait and turn off async error handling.
  LOG(INFO) << "[Rank " << rank_
            << "] NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING "
            << "should not both be enabled. "
            << "Only NCCL_BLOCKING_WAIT is being used in this process.";
  asyncErrorHandling_ = false;
}
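
For readers who want to try the precedence rule outside of the process group code, here is a minimal standalone sketch. It is not the implementation in this PR: `NcclErrorHandlingConfig`, `envFlagEnabled`, `resolvePrecedence`, and `configFromEnvironment` are hypothetical names, `std::clog` stands in for `LOG(INFO)`, and treating only the value `"1"` as enabled is an assumption made for illustration.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical container for the two settings; not PyTorch's actual class.
struct NcclErrorHandlingConfig {
  bool blockingWait = false;
  bool asyncErrorHandling = false;
};

// Hypothetical helper. Assumption: a flag counts as enabled only when the
// environment variable is set to exactly "1".
bool envFlagEnabled(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr && std::string(value) == "1";
}

// Applies the rule described in this PR: when both modes are requested,
// blocking wait is kept and async error handling is switched off.
NcclErrorHandlingConfig resolvePrecedence(NcclErrorHandlingConfig config,
                                          int rank) {
  if (config.blockingWait && config.asyncErrorHandling) {
    std::clog << "[Rank " << rank
              << "] NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING "
              << "should not both be enabled. "
              << "Only NCCL_BLOCKING_WAIT is being used in this process."
              << '\n';
    config.asyncErrorHandling = false;
  }
  return config;
}

// Reads both flags from the environment and resolves the conflict, if any.
NcclErrorHandlingConfig configFromEnvironment(int rank) {
  NcclErrorHandlingConfig config;
  config.blockingWait = envFlagEnabled("NCCL_BLOCKING_WAIT");
  config.asyncErrorHandling = envFlagEnabled("NCCL_ASYNC_ERROR_HANDLING");
  return resolvePrecedence(config, rank);
}
```

Keeping the resolution step separate from the environment lookup makes the rule easy to check without touching process-wide state.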
@osalpekar Sorry, a bit late on this, but I think we should instead throw an error if both NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING are set and ask the user to set only one. Silently unsetting one option is usually confusing for users: they might not look closely at the logs and may conclude that there is a bug.
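
To make the suggestion concrete, the fail-fast alternative could look roughly like the sketch below. This is only an illustration of the reviewer's idea, not code from the PR: `validateErrorHandlingFlags` is a hypothetical helper, and the choice of `std::invalid_argument` and the message wording are assumptions.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical validation helper: rejects the conflicting configuration
// instead of silently disabling one of the two options.
void validateErrorHandlingFlags(bool blockingWait, bool asyncErrorHandling,
                                int rank) {
  if (blockingWait && asyncErrorHandling) {
    throw std::invalid_argument(
        "[Rank " + std::to_string(rank) +
        "] NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING are both set; "
        "please set only one of them.");
  }
}
```

The trade-off is that an exception cannot be missed the way a log line can, but it would also break existing jobs that currently set both variables.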
Stack from ghstack:
Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.
Fixes: #47943
Differential Revision: D24928149