
Conversation

Contributor

@xw285cornell xw285cornell commented Apr 19, 2024

Differential Revision: D56347560

More details in this pytorch issue: #124468

It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:

```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```

What can happen is this:

1. dist.destroy_process_group() calls into shutdown(), which launches abort() asynchronously (std::async): https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1095
2. abort() calls ncclCommAbort (not graceful, afaict) and also sets ncclAsyncErr_ = ncclSystemError: https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/NCCLUtils.hpp#L388
3. The ncclWatchdog thread may not have woken up while all of this shutdown happens, and shutdown() does not wait for the watchdog thread.
4. The ProcessGroupNCCL dtor is called; it waits for the watchdog thread to join.
5. The watchdog checks the work's isCompleted(), which calls checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it errors out and makes it look like a genuine NCCL error.
So we can mitigate this issue by checking whether the comm was aborted in work.isCompleted()/isStarted().

There is some longer-term discussion in the issue.
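
For reference, a fuller standalone version of the repro above might look like the sketch below. Only the loop and the destroy_process_group() call come from the report; the launcher, rank/device setup, and tensor shapes are illustrative assumptions.

```python
# Hypothetical standalone repro sketch; launch with e.g.
#   torchrun --nproc_per_node=8 repro.py
# Only the loop + destroy_process_group() sequence is from the report above.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # all_to_all_single splits the input evenly across ranks, so make the
    # first dimension a multiple of world_size.
    tensor_in = torch.full((world_size * 4,), float(rank), device="cuda")
    tensor_out = torch.empty_like(tensor_in)

    for _ in range(100):
        dist.all_to_all_single(tensor_out, tensor_in)

    # Before this fix, the NCCL watchdog could observe the comm aborted by
    # shutdown and surface a spurious ncclSystemError here.
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```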

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented Apr 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124466

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 1eda3e9 with merge base dba689b:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Apr 19, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56347560

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 19, 2024
xw285cornell added a commit to xw285cornell/pytorch that referenced this pull request Apr 20, 2024
…rch#124466)

Summary:

More details in this pytorch issue: pytorch#124468

It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:
```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```

What can happen is this:

1. dist.destroy_process_group() calls into shutdown(), which launches abort() asynchronously (std::async): https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1095
2. abort() calls ncclCommAbort (not graceful, afaict) and also sets ncclAsyncErr_ = ncclSystemError: https://github.com/pytorch/pytorch/blob/b2f6cfd9c061a212cde8c8768fda41cc75a3110c/torch/csrc/distributed/c10d/NCCLUtils.hpp#L388
3. The ncclWatchdog thread may not have woken up while all of this shutdown happens, and shutdown() does not wait for the watchdog thread.
4. The ProcessGroupNCCL dtor is called; it waits for the watchdog thread to join.
5. The watchdog checks the work's isCompleted(), which calls checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it errors out and makes it look like a genuine NCCL error.

So we can mitigate this issue by checking if the comm was aborted during work.isCompleted/isStarted

Some more longer term discussion in the issue.

Test Plan:
```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```
no longer errors out

Differential Revision: D56347560
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56347560

Comment on lines +475 to +480
```
if (!ncclComm_->isAborted()) {
  checkAndSetException();
}
```
Contributor

I guess one question I have is what would be the behavior in the isAborted() == true path.

Contributor

yea, do we need to early-exit the watchdog loop in that case to avoid touching any other APIs? or is it fine to just keep the loop running?

Contributor Author

It seems to be OK to continue as is, although it's definitely not very safe. We'll continue to run things like finishedGPUExecutionInternal(), which queries the CUDA event; that's outside NCCL, so it should still be safe.
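
For intuition, that event-based check is the same mechanism exposed in Python as torch.cuda.Event.query(): it asks the CUDA driver whether work recorded before the event has finished and never goes through the NCCL communicator. A minimal illustration (not the c10d internals):

```python
# Illustrative only: polling GPU-side completion via a CUDA event,
# independent of any NCCL/communicator state.
import torch

x = torch.randn(1 << 20, device="cuda")
done = torch.cuda.Event()

y = x * 2.0            # enqueue some GPU work on the current stream
done.record()          # record an event on the stream after that work

# query() returns True once everything recorded before the event has finished
# on the GPU; it does not touch NCCL at all.
print(done.query())

torch.cuda.synchronize()
print(done.query())    # True after the stream has drained
```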

Contributor

yeah, isCompleted() will return true. I guess that might be okay even though isCompleted() is pybind-ed to a user-facing API: work.is_completed(). In a sense, the collective is indeed completed, just unsuccessfully.
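
To make the user-facing path concrete: with async_op=True, the collective returns a Work handle whose is_completed() is the Python side of this isCompleted(). A small usage sketch (assumes a NCCL process group is already initialized; the tensors here are just placeholders):

```python
# Illustrative usage of the user-facing Work handle. Assumes a NCCL process
# group has already been set up (e.g. torchrun + init_process_group("nccl")).
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
tensor_in = torch.ones(world_size * 4, device="cuda")
tensor_out = torch.empty_like(tensor_in)

work = dist.all_to_all_single(tensor_out, tensor_in, async_op=True)

# Non-blocking poll. Per the discussion above, a comm aborted during shutdown
# now reports "completed" here instead of raising a spurious NCCL error.
if work.is_completed():
    print("collective finished")
else:
    work.wait()  # block until the collective finishes
```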

Contributor

@kwen2501 kwen2501 left a comment

Approving per discussion.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56347560

pytorch-bot bot pushed a commit that referenced this pull request May 2, 2024
@xw285cornell xw285cornell force-pushed the export-D56347560 branch 2 times, most recently from 6bc96d9 to 1eda3e9 Compare May 5, 2024 09:00
xw285cornell added a commit to xw285cornell/pytorch that referenced this pull request May 5, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56347560


@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here.


Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- fb-exported
- Merged
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- release notes: distributed (c10d) (release notes category)
