[Pipelining] Fix _batch_p2p bug for non-NCCL backends (#132644) #152938
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152938
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures) As of commit a8a1884 with merge base f2ea636:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@pytorchbot label "module: pipelining" |
❌ 🤖 pytorchbot command failed. |
@pytorchbot label "module: pipelining" |
@pytorchbot label "topic: not user facing" |
@pytorchbot rebase main |
❌ 🤖 pytorchbot command failed. |
@pytorchbot rebase -b main |
You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra. |
Thanks. LGTM.
Maybe @H-Huang wants to have a second look?
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Rebase failed. This is likely because the author did not allow edits from maintainers on the PR, or because the repo has additional permission settings that mergebot does not satisfy. |
@pytorchbot rebase |
You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra. |
Seems to be a problem with the cross-org "allow edits from maintainers" setting? |
Overall LGTM. I suspect there are still some issues with hangs on non-NCCL backends because of the dependencies between multiple ops across ranks, versus just one op for NCCL; this would probably show up in 1F1B or interleaved schedules.
Will you be testing other schedules on gloo? Also, feel free to rebase on your fork and update the PR so we can get the testing signal.
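For context, here is a minimal sketch of the failure mode under discussion, assuming two ranks on the Gloo backend launched with torchrun; it is illustrative only and not the PR's test code:

```python
# Sketch of the bug: on Gloo, batch_isend_irecv returns one Work per op,
# so waiting on only one handle (as the old .pop() pattern did) can leave
# the other op unfinished and downstream code stuck.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")  # assumes torchrun --nproc_per_node=2
    rank = dist.get_rank()
    peer = 1 - rank

    send_buf = torch.full((4,), float(rank))
    recv_buf = torch.empty(4)

    # Each rank posts one send and one recv to its peer in a single batch.
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    works = dist.batch_isend_irecv(ops)  # len(works) == 2 on Gloo, 1 on NCCL

    # Buggy pattern (old code): works.pop().wait() waits on a single Work,
    # so the recv may never be awaited and later use of recv_buf can hang.
    # Fixed pattern: wait on every Work that was returned.
    for work in works:
        work.wait()

    print(f"rank {rank} received {recv_buf.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The fix in this PR addresses exactly this by returning and waiting on every `Work` from the batch.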
I'll have a look at this! |
@pytorchbot drci |
@pytorchbot merge -i |
Merge started. Your change will be merged while ignoring the following check: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team. |
Fixes #132644

`_batch_p2p` incorrectly assumes that `dist.batch_isend_irecv` returns a single-element list of `dist.Work`, likely due to NCCL's coalescing behaviour. For non-NCCL backends like Gloo, multiple `dist.Work` objects are returned, causing the code to discard some operations via `.pop()`. This leads to deadlocks during pipeline parallelism.

Changes:
* Modified `_batch_p2p` to return `list[dist.Work]` instead of popping a single element.
* Added `_wait_batch_p2p` to call `wait()` on multiple `dist.Work` objects, consuming the result of `_batch_p2p`.
* Updated references from `dist.Work` to `list[dist.Work]`.

Testing:
* `pippy_bert.py` from "`torch.distributed.pipelining` hang and timeout in CPU gloo backend" (#132644) now works with gloo.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k
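For readers skimming the change, a minimal sketch of the resulting helper shapes, assuming the names from the description above (`_batch_p2p`, `_wait_batch_p2p`); the actual signatures in `torch.distributed.pipelining` may differ:

```python
# Sketch only, based on the helper names in the PR description; not the
# verbatim PyTorch source.
import torch.distributed as dist

def _batch_p2p(p2p_ops: list[dist.P2POp]) -> list[dist.Work]:
    """Post a batch of P2P ops and return every resulting Work handle.

    NCCL coalesces the batch into a single Work, but backends such as Gloo
    return one Work per op, so no element may be dropped via .pop().
    """
    if not p2p_ops:
        return []
    return dist.batch_isend_irecv(p2p_ops)

def _wait_batch_p2p(works: list[dist.Work]) -> None:
    """Wait on every Work produced by _batch_p2p."""
    for work in works:
        work.wait()
```

Call sites that previously received a single `dist.Work` from `_batch_p2p` and called `wait()` on it would instead keep the returned list and pass it to `_wait_batch_p2p`.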