
Conversation

@H-Huang
Member

H-Huang commented Sep 19, 2023

Summary: Point-to-point ops don't enqueue their work to `workMetaList_`, which means the NCCL watchdog does not watch over them, and hence they do not respect collective timeouts.

Test Plan:
While trying to add a test, I found we don't have tests that validate the NCCL watchdog. It looks like this is because we don't have a good way to detect when the NCCL watchdog has thrown an error (the exception is thrown in a side thread) in our testing framework / `MultiprocessTestCase`.

I manually tested this change with the script in #109401 (a sketch of that kind of repro is below), but I need to look more closely at how to automate a test for the NCCL watchdog.

Differential Revision: D49418976
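
To make the manual check concrete, a repro of roughly this shape exercises the timeout path (this is a sketch, not the exact script from #109401; the file name, tensor size, and timeout are illustrative):

```python
# Sketch of a manual repro (not the exact script from #109401; values are
# illustrative). Run with: torchrun --nproc_per_node=2 p2p_timeout_repro.py
# Rank 0 posts a recv that rank 1 never matches; with this change the NCCL
# watchdog should detect the timed-out RECV and tear the process down instead
# of hanging forever.
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    # Short timeout so the watchdog fires quickly. Depending on the build's
    # defaults you may also need NCCL_ASYNC_ERROR_HANDLING=1 (and
    # NCCL_DESYNC_DEBUG=1 for the desync report).
    dist.init_process_group("nccl", timeout=timedelta(seconds=10))
    torch.cuda.set_device(rank)
    t = torch.ones(10, device="cuda")
    if rank == 0:
        # No matching send is ever posted by rank 1, so this RECV can never
        # complete. Before this change it was not tracked in workMetaList_,
        # so the watchdog ignored it.
        dist.recv(t, src=1)
        torch.cuda.synchronize()  # block the host on the stuck op
    else:
        # Simulate a hung peer.
        time.sleep(120)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Automating this is the hard part: since the watchdog raises from its own thread, the test would presumably have to assert on the subprocess dying with a non-zero exit code rather than catching an exception.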

@pytorch-bot

pytorch-bot bot commented Sep 19, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109611

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 601093b with merge base d04b35e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D49418976

@wconstab
Contributor

wconstab commented Sep 23, 2023

I tried using this on my PP hang. It seems like it partially works (the watchdog is picking up on a timeout for a RECV op). However, it is giving me this:
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=0, OpType=RECV, NumelIn=80000, NumelOut=80000, Timeout(ms)=20000) ran for 20175 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=0, OpType=RECV, NumelIn=80000, NumelOut=80000, Timeout(ms)=20000) ran for 20189 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:988] Failed to retrieve NCCL_DESYNC_DEBUG report. Please file an issue. Error: traceMap[thisSeq][myRank].second == kEventStart INTERNAL ASSERT FAILED at "/data/users/whc/pytorch/torch/csrc/distributed/c10d/TraceUtils.h":244, please report a bug to PyTorch. Timeout rank [0] last trace item must be kEventStart. thisSeq = 0, col = RECV

It's suspicious that it says the seq number is 0, as I've already executed a bunch of steps of forward/backward, and each step should be waiting on completion of a previous send/recv.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D49418976

@H-Huang
Member Author

H-Huang commented Sep 29, 2023

Thanks @wconstab, it does look like the sequence # also needs to be updated. That might have also caused the issue with desync debug, since it reads the seq #. I will double-check it.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D49418976

@wconstab
Contributor

wconstab commented Oct 3, 2023

OK, this works for me now. At least in a trivial case where I intentionally cause a hang, I get reasonable output from the desync report:

[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=RECV, NumelIn=6, NumelOut=6, Timeout(ms)=20000) ran for 20757 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=RECV, NumelIn=6, NumelOut=6, Timeout(ms)=20000) ran for 20702 milliseconds before timing out.
Done
[E ProcessGroupNCCL.cpp:986]
         - [1] Timeout at collective: RECV, #2
         - To our best knowledge, the lagging/dead/mismatched ranks that caused the desync are:
           - [1] joined but didn't finish collective #2 (count from 1)
         - Snapshot of ranks' latest states:
           #2 started ranks:
             [1] started RECV
           #3 started ranks:
             [0] started RECV
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=RECV, NumelIn=6, NumelOut=6, Timeout(ms)=20000) ran for 20757 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=RECV, NumelIn=6, NumelOut=6, Timeout(ms)=20000) ran for 20757 milliseconds before timing out.
devgpu012:611476:616545 [0] NCCL INFO [Service thread] Connection closed by localRank 0
Done
devgpu012:611476:613548 [0] NCCL INFO comm 0x7c78920 rank 0 nranks 2 cudaDev 0 busId 11000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:986]
         - [0] Timeout at collective: RECV, #3
         - To our best knowledge, the lagging/dead/mismatched ranks that caused the desync are:
           - [1] joined but didn't finish collective #2 (count from 1)
         - Snapshot of ranks' latest states:
           #2 started ranks:
             [1] started RECV
           #3 started ranks:
             [0] started RECV
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=RECV, NumelIn=6, NumelOut=6, Timeout(ms)=20000) ran for 20702 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=RECV, NumelIn=6, NumelOut=6, Timeout(ms)=20000) ran for 20702 milliseconds before timing out.
Traceback (most recent call last):
  File "/data/users/whc/pytorch/hang.py", line 44, in <module>
    mp.spawn(main, args=(world_size,), nprocs=world_size, join=True)
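
For reference, the intentional hang is roughly of this shape (this is a sketch, not the actual hang.py from the traceback; the warm-up exchange and tensor size are guesses):

```python
# Sketch of a trivial intentional hang (not the actual hang.py; the warm-up
# exchange and tensor size are guesses). After one matched send/recv pair,
# both ranks post a recv and neither ever sends, so both RECVs time out and
# the watchdog produces a report like the one above.
import os
from datetime import timedelta

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def main(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        "nccl", rank=rank, world_size=world_size, timeout=timedelta(seconds=20)
    )
    torch.cuda.set_device(rank)
    peer = 1 - rank
    t = torch.zeros(6, device="cuda")

    # One matched exchange first, so the deadlocking ops land at a later SeqNum.
    if rank == 0:
        dist.send(t, dst=peer)
        dist.recv(t, src=peer)
    else:
        dist.recv(t, src=peer)
        dist.send(t, dst=peer)

    # Both ranks now recv and neither sends: a guaranteed deadlock on the GPU.
    # With NCCL the call returns once the kernel is enqueued, so the hang is
    # only visible to the watchdog, which (with this fix) times the RECV out.
    dist.recv(t, src=peer)
    torch.cuda.synchronize()  # block the host on the stuck op


if __name__ == "__main__":
    world_size = 2
    mp.spawn(main, args=(world_size,), nprocs=world_size, join=True)
```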

Contributor

@wconstab wconstab left a comment

Thanks for fixing, @H-Huang.

@H-Huang
Member Author

H-Huang commented Oct 3, 2023

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Oct 3, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

acphile pushed a commit to acphile/pytorch that referenced this pull request Jan 22, 2024
Summary: Point-to-point ops don't enqueue their work to `workMetaList_`, which means the NCCL watchdog does not watch over them, and hence they do not respect collective timeouts.

Test Plan:
While trying to add a test, I found we don't have tests that validate the NCCL watchdog. It looks like this is because we don't have a good way to detect when the NCCL watchdog has thrown an error (the exception is thrown in a side thread) in our testing framework / `MultiprocessTestCase`.

I manually tested this change with the script in pytorch#109401, but need to look more closely at how to automate a test for the NCCL watchdog.

Differential Revision: D49418976

Pull Request resolved: pytorch#109611
Approved by: https://github.com/wconstab

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged, release notes: distributed (c10d)
