[Pipelining] Fix _batch_p2p bug for non-NCCL backends (#132644) #152938

tom-pollak · 2025-05-06T10:19:19Z

_batch_p2p incorrectly assumes that dist.batch_isend_irecv returns a single-element list of dist.Work, likely due to NCCL's coalescing behaviour.

For non-NCCL backends like Gloo, multiple dist.Work objects are returned, causing the code to discard some operations via .pop(). This leads to deadlocks during pipeline parallelism.
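
The failure mode can be illustrated without a real process group. The following is a minimal sketch in which `FakeWork` and `fake_batch_isend_irecv` are hypothetical stand-ins for `dist.Work` and `dist.batch_isend_irecv`; the `coalesced` flag only mimics the difference between one handle per batch and one handle per op, and is not a real PyTorch parameter.

```python
class FakeWork:
    """Hypothetical stand-in for dist.Work; records whether wait() was called."""

    def __init__(self, name):
        self.name = name
        self.waited = False

    def wait(self):
        self.waited = True


def fake_batch_isend_irecv(op_names, coalesced):
    """Hypothetical stand-in for dist.batch_isend_irecv.

    coalesced=True mimics a backend that returns one handle for the whole
    batch; coalesced=False mimics a backend that returns one handle per op.
    """
    if coalesced:
        return [FakeWork("+".join(op_names))]
    return [FakeWork(name) for name in op_names]


def unwaited_after_pop(op_names, coalesced):
    """Follow the old pattern: keep only the popped handle and wait on it."""
    works = fake_batch_isend_irecv(op_names, coalesced)
    all_works = list(works)  # keep references so the dropped handles can be inspected
    works.pop().wait()       # every other handle is never waited on
    return [w.name for w in all_works if not w.waited]


print(unwaited_after_pop(["send_fwd", "recv_fwd"], coalesced=True))   # []
print(unwaited_after_pop(["send_fwd", "recv_fwd"], coalesced=False))  # ['send_fwd']
```

With a single coalesced handle nothing is lost, which may be why the assumption went unnoticed; with one handle per op, the send is never waited on.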

Changes:

Modified _batch_p2p to return list[dist.Work] instead of popping a single element.
Added _wait_batch_p2p to call wait() on multiple dist.Work objects, consuming the result of _batch_p2p.
Updated references from dist.Work to list[dist.Work].

Testing:

pippy_bert.py from issue #132644 ("torch.distributed.pipelining hang and timeout in CPU gloo backend") now works with gloo.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot · 2025-05-06T10:19:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152938

✅ You can merge normally! (2 Unrelated Failures)

As of commit a8a1884 with merge base f2ea636:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

linux-binary-libtorch-release / libtorch-cpu-shared-with-deps-release-test / test (gh) (matched linux rule in flaky-rules.json)
The process '/usr/bin/git' failed with exit code 1

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge) (gh) (#144480)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

tom-pollak · 2025-05-06T10:23:20Z

@pytorchbot label "module: pipelining"

pytorch-bot · 2025-05-06T10:23:22Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'label:' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

tom-pollak · 2025-05-06T10:24:09Z

@pytorchbot label "module: pipelining"

tom-pollak · 2025-05-06T15:33:28Z

@pytorchbot label "topic: not user facing"

tom-pollak · 2025-05-07T16:51:23Z

@pytorchbot rebase main

pytorch-bot · 2025-05-07T16:51:26Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: main

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

tom-pollak · 2025-05-07T16:52:14Z

@pytorchbot rebase -b main

pytorch-bot · 2025-05-07T16:52:18Z

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

kwen2501

Thanks. LGTM.
Maybe @H-Huang wants to have a second look?

H-Huang · 2025-05-08T12:57:37Z

@pytorchbot rebase

pytorchmergebot · 2025-05-08T12:59:10Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-05-08T12:59:13Z

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch push -f https://github.com/graphcore/pytorch-fork.git pull/152938/head:pp-deadlock returned non-zero exit code 128

remote: Permission to graphcore/pytorch-fork.git denied to pytorchmergebot.
fatal: unable to access 'https://github.com/graphcore/pytorch-fork.git/': The requested URL returned error: 403

This is likely because the author did not allow edits from maintainers on the PR or because the repo has additional permissions settings that mergebot does not qualify.
Raised by https://github.com/pytorch/pytorch/actions/runs/14907026043

tom-pollak · 2025-05-08T14:42:17Z

@pytorchbot rebase

pytorch-bot · 2025-05-08T14:42:21Z

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

tom-pollak · 2025-05-08T14:44:34Z

Seems to be a problem with cross-org "allow edits from maintainers"?
https://github.com/orgs/community/discussions/5634

H-Huang

Overall LGTM. I suspect there are still some issues with hangs on non-NCCL backends because of the dependencies between multiple ops across ranks, versus just one op for NCCL. Probably would see this issue arise in 1f1b or interleaved schedules.

Will you be testing other schedules on gloo? Also feel free to rebase on your fork and update the PR so we can get the testing signal

tom-pollak · 2025-05-08T16:23:16Z

> Probably would see this issue arise in 1f1b or interleaved schedules.

I'll have a look at this!

tom-pollak · 2025-05-08T17:16:34Z

@pytorchbot drci

tom-pollak · 2025-05-09T09:46:40Z

@pytorchbot drci

Fixes pytorch#132644

`_batch_p2p` incorrectly assumes that `dist.batch_isend_irecv` returns a single-element list of `dist.Work`, likely due to NCCL's coalescing behaviour.

For non-NCCL backends like Gloo, multiple `dist.Work` objects are returned, causing the code to discard some operations via `.pop()`. This leads to deadlocks during pipeline parallelism.

* Modified `_batch_p2p` to return `list[dist.Work]` instead of popping a single element.
* Added `_wait_batch_p2p` to call `wait()` on multiple `dist.Work` objects, consuming the result of `_batch_p2p`.
* Updated references from `dist.Work` to `list[dist.Work]`.
* `pippy_bert.py` from pytorch#132644 now works with gloo.

H-Huang · 2025-05-10T00:56:53Z

@pytorchbot merge -i

pytorchmergebot · 2025-05-10T00:59:28Z

Merge started

Your change will be merged while ignoring the following 1 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 6, 2025

pytorch-bot bot added the module: pipelining Pipeline Parallelism label May 6, 2025

pytorchbot added the open source label May 6, 2025

pytorch-bot bot added the topic: not user facing topic category label May 6, 2025

janeyx99 requested a review from kwen2501 May 7, 2025 19:41

janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 7, 2025

kwen2501 requested a review from H-Huang May 7, 2025 22:16

kwen2501 approved these changes May 7, 2025

H-Huang approved these changes May 8, 2025

tom-pollak force-pushed the pp-deadlock branch from e62f6d4 to aabb080 Compare May 8, 2025 14:54

tom-pollak added 2 commits May 9, 2025 16:29

fix lints

db39551

fix lints 2

a8a1884

tom-pollak force-pushed the pp-deadlock branch from df7aee6 to a8a1884 Compare May 9, 2025 15:29

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 10, 2025

pytorchmergebot added the merging label May 10, 2025

pytorchmergebot added the Merged label May 10, 2025

pytorchmergebot closed this in fc7d8c6 May 10, 2025

pytorchmergebot removed the merging label May 10, 2025

tom-pollak deleted the pp-deadlock branch June 17, 2025 08:12

6 participants
