[NCCL] Add Error log when ProcessGroupNCCL takes down process upon timeout/error #44988
The new NCCL async error handling feature throws an exception from the workCleanup Thread if one of the NCCL operations encounters an error or times out. This PR adds an error log to make it more clear to the user why the training process crashed.

Differential Revision: [D23794801](https://our.internmc.facebook.com/intern/diff/D23794801/)
ghstack-source-id: 112419640
Pull Request resolved: #44988
auto exceptionMsg = c10::str(
    "Some NCCL operations have failed or timed out. Due to the ",
    "asynchronous nature of CUDA kernels, subsequent GPU operations ",
    "might run on corrupted/incomplete data. To avoid this inconsistency, ",
    "we are taking the entire process down.");
LOG(ERROR) << exceptionMsg;
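The log-then-throw pattern in this diff can be illustrated outside of PyTorch with a minimal, self-contained sketch. Only `c10::str` and `LOG(ERROR)` come from the diff itself; `concatStr` and `logAndThrow` below are hypothetical names invented for illustration, the message is shortened, and this is not the actual `ProcessGroupNCCL` code.

```cpp
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for c10::str: streams every argument into one string.
template <typename... Args>
std::string concatStr(const Args&... args) {
  std::ostringstream oss;
  // C++11-compatible pack expansion: evaluates (oss << arg) left to right.
  using expand = int[];
  (void)expand{0, ((void)(oss << args), 0)...};
  return oss.str();
}

// Hypothetical helper: write the explanation to the log first, then throw.
// If nothing catches the exception on the calling thread, the C++ runtime
// calls std::terminate, so the log line is the user's main clue.
[[noreturn]] void logAndThrow(std::ostream& log) {
  const std::string msg = concatStr(
      "Some NCCL operations have failed or timed out. ",
      "To avoid this inconsistency, we are taking the entire process down.");
  log << "[ERROR] " << msg << std::endl;
  throw std::runtime_error(msg);
}
```

Passing the log stream as a parameter is only a convenience for this sketch: a caller can hand in `std::cerr` for real use or a `std::ostringstream` to inspect the message. The ordering is the point: the message is flushed before the throw, so it is visible even if the exception goes uncaught.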
Can we add a unit test for this case using multiprocessing?
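One way such a multi-process check could look is sketched below, assuming a POSIX system. It is not the test that was eventually added to PyTorch; `failingWorker` and `runChildAndGetSignal` are hypothetical names, and the NCCL failure is only simulated. A child process starts a thread that logs and throws, nothing catches the exception, and the parent inspects how the child died.

```cpp
#include <csignal>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical worker: logs the reason, then throws. No one catches the
// exception on this thread, so the runtime calls std::terminate, which
// aborts the whole process.
void failingWorker() {
  std::cerr << "[ERROR] simulated NCCL timeout; taking the process down"
            << std::endl;
  throw std::runtime_error("simulated NCCL timeout");
}

// Forks a child that runs failingWorker on a separate thread.
// Returns the signal that killed the child, 0 if the child exited
// normally, or -1 if fork/waitpid failed.
int runChildAndGetSignal() {
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    // Child: the uncaught exception in the worker ends the process here.
    std::thread worker(failingWorker);
    worker.join();
    _exit(0);  // Only reached if the worker unexpectedly returned.
  }
  // Parent: wait for the child and report how it terminated.
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A test built on this sketch would assert that `runChildAndGetSignal()` returns `SIGABRT`, since `std::terminate` calls `std::abort` by default. Running the failure in a child keeps the test runner itself alive.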
Thanks for adding this! Requesting changes since I think we need to test this works properly.
ghstack-source-id: 113876146
Pull Request resolved: #44988
Stamping since we've added the test in #46044.
ghstack-source-id: 114002493
Pull Request resolved: #44988
This pull request has been merged in 172036a.