
Conversation

Contributor

@xw285cornell xw285cornell commented Apr 19, 2024

Differential Revision: D56347560

More details in this pytorch issue: #124468

It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:

```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```

What can happen is this:

1. dist.destroy_process_group() calls into shutdown(), which launches abort() asynchronously (std::async): https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1095
2. abort() calls ncclCommAbort (not graceful, afaict) and also sets ncclAsyncErr_ = ncclSystemError: https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/NCCLUtils.hpp#L388
3. The ncclWatchdog thread may not have woken up while all of this shutdown happens, and shutdown() does not wait for the watchdog thread.
4. The ProcessGroupNCCL dtor is called; it waits for the watchdog thread to join.
5. The watchdog checks the work's isCompleted(), which calls checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it errors out and makes it look like a genuine NCCL error.
So we can mitigate this issue by checking whether the comm was aborted in work.isCompleted()/isStarted().

There is some longer-term discussion in the issue.
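
For reference, a fuller standalone version of the repro above might look like the sketch below. Only the loop and the destroy_process_group() call come from the report; the launcher, rank/device setup, and tensor shapes are illustrative assumptions.

```python
# Hypothetical standalone repro sketch; launch with e.g.
#   torchrun --nproc_per_node=8 repro.py
# Only the loop + destroy_process_group() sequence is from the report above.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # all_to_all_single splits the input evenly across ranks, so make the
    # first dimension a multiple of world_size.
    tensor_in = torch.full((world_size * 4,), float(rank), device="cuda")
    tensor_out = torch.empty_like(tensor_in)

    for _ in range(100):
        dist.all_to_all_single(tensor_out, tensor_in)

    # Before this fix, the NCCL watchdog could observe the comm aborted by
    # shutdown and surface a spurious ncclSystemError here.
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```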

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented Apr 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124466

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 1eda3e9 with merge base dba689b:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Apr 19, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56347560

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 19, 2024
xw285cornell added a commit to xw285cornell/pytorch that referenced this pull request Apr 20, 2024
…rch#124466)

Summary:

More details in this pytorch issue: pytorch#124468

It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:
```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```

What can happen is this:

1. dist.destroy_process_group() calls into shutdown(), which launches abort() asynchronously (std::async): https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1095
2. abort() calls ncclCommAbort (not graceful, afaict) and also sets ncclAsyncErr_ = ncclSystemError: https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/NCCLUtils.hpp#L388
3. The ncclWatchdog thread may not have woken up while all of this shutdown happens, and shutdown() does not wait for the watchdog thread.
4. The ProcessGroupNCCL dtor is called; it waits for the watchdog thread to join.
5. The watchdog checks the work's isCompleted(), which calls checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it errors out and makes it look like a genuine NCCL error.

So we can mitigate this issue by checking if the comm was aborted during work.isCompleted/isStarted

Some more longer term discussion in the issue.

Test Plan:
```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```
no longer errors out

Differential Revision: D56347560
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56347560

Comment on lines +475 to +480
```
if (!ncclComm_->isAborted()) {
  checkAndSetException();
}
```
Contributor

I guess one question I have is what would be the behavior in the isAborted() == true path.

Contributor

yea, do we need to early-exit the watchdog loop in that case to avoid touching any other APIs? or is it fine to just keep the loop running?

Contributor Author

It seems to be OK to continue as is, although it's definitely not very safe. We'll continue to run things like finishedGPUExecutionInternal(), which queries the CUDA event; that's outside NCCL, so it should still be safe.
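
For intuition, that event-based check is the same mechanism exposed in Python as torch.cuda.Event.query(): it asks the CUDA driver whether work recorded before the event has finished and never goes through the NCCL communicator. A minimal illustration (not the c10d internals):

```python
# Illustrative only: polling GPU-side completion via a CUDA event,
# independent of any NCCL/communicator state.
import torch

x = torch.randn(1 << 20, device="cuda")
done = torch.cuda.Event()

y = x * 2.0            # enqueue some GPU work on the current stream
done.record()          # record an event on the stream after that work

# query() returns True once everything recorded before the event has finished
# on the GPU; it does not touch NCCL at all.
print(done.query())

torch.cuda.synchronize()
print(done.query())    # True after the stream has drained
```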

Contributor

yeah, isCompleted() will return true. I guess that might be okay even though isCompleted() is pybind-ed to a user-facing API: work.is_completed(). In a sense, the collective is indeed completed, just unsuccessfully.
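
To make the user-facing path concrete: with async_op=True, the collective returns a Work handle whose is_completed() is the Python side of this isCompleted(). A small usage sketch (assumes a NCCL process group is already initialized; the tensors here are just placeholders):

```python
# Illustrative usage of the user-facing Work handle. Assumes a NCCL process
# group has already been set up (e.g. torchrun + init_process_group("nccl")).
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
tensor_in = torch.ones(world_size * 4, device="cuda")
tensor_out = torch.empty_like(tensor_in)

work = dist.all_to_all_single(tensor_out, tensor_in, async_op=True)

# Non-blocking poll. Per the discussion above, a comm aborted during shutdown
# now reports "completed" here instead of raising a spurious NCCL error.
if work.is_completed():
    print("collective finished")
else:
    work.wait()  # block until the collective finishes
```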

Contributor

@kwen2501 kwen2501 left a comment

Approving per discussion.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56347560

pytorch-bot bot pushed a commit that referenced this pull request May 2, 2024
@xw285cornell xw285cornell force-pushed the export-D56347560 branch 2 times, most recently from 6bc96d9 to 1eda3e9 Compare May 5, 2024 09:00
xw285cornell added a commit to xw285cornell/pytorch that referenced this pull request May 5, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56347560


@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here.


Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- fb-exported
- Merged
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- release notes: distributed (c10d) (release notes category)
