[comm] Ensure ncclComm is not aborted before checking exception #124466
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124466
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 1eda3e9 with merge base dba689b:
- FLAKY - The following job failed but was likely due to flakiness present on trunk:
- BROKEN TRUNK - The following job failed but was present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D56347560
ca2c79b to 1616e27 (Compare)
[comm] Ensure ncclComm is not aborted before checking exception (pytorch#124466)

Summary: More details in this pytorch issue: pytorch#124468

It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:

```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```

What can happen is this:

1. dist.destroy_process_group() calls into shutdown() and then calls into abort(): https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1095
2. That calls ncclCommAbort (not graceful, afaict) and also sets ncclAsyncErr_ = ncclSystemError: https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/NCCLUtils.hpp#L388
3. The ncclWatchdog thread may not have woken up while all of this shutdown happens, and shutdown() does not wait for the watchdog thread.
4. The ProcessGroupNCCL dtor is called. It waits for the watchdog thread to join.
5. The watchdog checks the work's isCompleted(), which calls checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it errors out and makes it look like a NCCL error.

So we can mitigate this issue by checking whether the comm was aborted in work.isCompleted()/isStarted(). Some more longer-term discussion is in the issue.

Test Plan:

```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```

no longer errors out.

Differential Revision: D56347560
if (!ncclComm_->isAborted()) {
  checkAndSetException();
}
I guess one question I have is what the behavior would be in the isAborted() == true path.
Yeah, do we need to early-exit the watchdog loop in that case to avoid touching any other APIs? Or is it fine to just keep the loop running?
It seems to be OK to continue as is, although it's definitely not very safe. We'll continue to run things like finishedGPUExecutionInternal, which queries the CUDA event; that's outside NCCL, so it should still be safe.
Yeah, isCompleted() will return true. I guess that might be okay even though isCompleted() is pybinded to a user-facing API, work.is_completed(): the collective is indeed completed, just unsuccessfully.
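For context, work.is_completed() is the user-facing binding mentioned here. A minimal usage sketch, assuming an already-initialized NCCL process group and a current CUDA device per rank (the tensor shapes, the async_op usage, and the polling loop are illustrative, not taken from this PR):

```
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already been called and that
# each rank has set its current CUDA device.
world_size = dist.get_world_size()
tensor_in = torch.ones(world_size, device="cuda")
tensor_out = torch.empty(world_size, device="cuda")

# async_op=True returns a Work handle instead of blocking.
work = dist.all_to_all_single(tensor_out, tensor_in, async_op=True)

# Per the discussion above, is_completed() can report True after an abort,
# because the collective is "completed, just unsuccessfully".
while not work.is_completed():
    pass  # polling only for illustration; real code should prefer work.wait()

work.wait()  # surfaces any exception recorded on the work object
```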
Approving per discussion.
1616e27 to 734c630 (Compare)
6bc96d9 to 1eda3e9 (Compare)
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Differential Revision: D56347560
More details in this pytorch issue: #124468
It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:
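```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```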
What can happen is this:

1. dist.destroy_process_group() calls into shutdown() and then calls into abort() (pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp, line 1095 in b2f6cfd).
2. That calls ncclCommAbort (not graceful, afaict) and also sets ncclAsyncErr_ = ncclSystemError (pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp, line 388 in b2f6cfd).
3. The ncclWatchdog thread may not have woken up while all of this shutdown happens, and shutdown() does not wait for the watchdog thread.
4. The ProcessGroupNCCL dtor is called. It waits for the watchdog thread to join.
5. The watchdog checks the work's isCompleted(), which calls checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it errors out and makes it look like a NCCL error.
So we can mitigate this issue by checking whether the comm was aborted in work.isCompleted()/isStarted(). Some more longer-term discussion is in the issue.
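For completeness, a self-contained sketch of the repro above. The process-group setup, device selection, and tensor sizes are assumptions rather than details from this PR, and the script is meant to be launched with torchrun (one process per GPU):

```
import os

import torch
import torch.distributed as dist


def main():
    # Assumed setup: NCCL backend, one GPU per rank, env:// rendezvous
    # (torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for us).
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    world_size = dist.get_world_size()
    # Illustrative sizes: one element per peer rank.
    tensor_in = torch.full((world_size,), float(dist.get_rank()), device="cuda")
    tensor_out = torch.empty(world_size, device="cuda")

    # The loop from the description: issue many collectives, then tear the
    # process group down while the NCCL watchdog may still be asleep.
    for _ in range(100):
        dist.all_to_all_single(tensor_out, tensor_in)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```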
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k