Closed
Labels
module: nccl, oncall: distributed, triaged
Description
🐛 Describe the bug
The list of work objects (works) is empty:
with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
    dist.all_reduce(a)
    dist.all_reduce(b)
print(len(cm.works)) # prints 0
cm.wait()
In this situation cm.wait() becomes a no-op, which causes read-before-write issues whenever _coalescing_manager is used.
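To illustrate why an empty works list makes wait() a no-op, here is a toy model in plain Python. It is only a sketch and not PyTorch's implementation: ToyWork, ToyCoalescingManager, and read_after_wait are hypothetical names, a background thread stands in for an asynchronous collective, and the register_works flag simulates whether the manager records the work handle.

```python
import threading
import time


class ToyWork:
    """Hypothetical stand-in for an async work handle (not a PyTorch class)."""

    def __init__(self, fn):
        self._thread = threading.Thread(target=fn)
        self._thread.start()

    def wait(self):
        # Block until the simulated async operation has finished.
        self._thread.join()


class ToyCoalescingManager:
    """Hypothetical manager whose wait() only covers the handles it recorded."""

    def __init__(self, register_works):
        self.works = []
        self._register_works = register_works

    def append(self, work):
        # Assumption for illustration: the bug is modeled as "handle never recorded".
        if self._register_works:
            self.works.append(work)

    def wait(self):
        # With an empty list this loop body never runs, so wait() returns at once.
        for work in self.works:
            work.wait()


def read_after_wait(register_works):
    buf = {"value": 0}

    def slow_write():
        time.sleep(0.2)  # simulated latency of the async operation
        buf["value"] = 42

    cm = ToyCoalescingManager(register_works)
    work = ToyWork(slow_write)
    cm.append(work)
    cm.wait()
    observed = buf["value"]  # read the buffer right after cm.wait()
    work.wait()  # cleanup so the background thread does not outlive the call
    return observed


if __name__ == "__main__":
    print(read_after_wait(True))   # handle recorded: wait() blocks until the write lands
    print(read_after_wait(False))  # handle missing: wait() is a no-op, read is stale
```

In this model the caller sees the stale value whenever the handle was not recorded, which mirrors the read-before-write hazard described above.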
Versions
PyTorch version: 2.4.0.dev20240326+cu121
### Tasks
- [ ] https://github.com/pytorch/pytorch/pull/122651
- [ ] https://github.com/pytorch/pytorch/pull/122849
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang