[c10d][NCCL] _coalescing_manager does not produce proper work handle #122807

@Aidyn-A

🐛 Describe the bug

The list of work objects (works) is empty:

    with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
        dist.all_reduce(a)
        dist.all_reduce(b)
    print(len(cm.works)) # prints 0
    cm.wait()

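Because the list is empty, cm.wait() becomes a no-op, so the tensors can be read before the collectives have finished writing them. This causes read-before-write issues whenever _coalescing_manager is used with async_ops=True.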
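To illustrate why an empty works list turns the wait into a no-op, here is a minimal pure-Python sketch. It is not the PyTorch implementation: `FakeWork` and `SketchCoalescingManager` are hypothetical stand-ins, and the sketch only assumes that the manager's wait() iterates over whatever handles were registered in `works`.

```python
class FakeWork:
    """Hypothetical stand-in for a c10d work handle (not a real PyTorch class)."""

    def __init__(self):
        self.waited = False

    def wait(self):
        # A real handle would block until the collective completes.
        self.waited = True


class SketchCoalescingManager:
    """Simplified model (assumption): wait() only covers registered handles."""

    def __init__(self):
        self.works = []

    def append(self, work):
        # Only real handles are recorded; None is ignored.
        if work is not None:
            self.works.append(work)

    def wait(self):
        # With an empty list this loop body never runs, so nothing is awaited.
        for work in self.works:
            work.wait()
        return len(self.works)


# No handle was registered: wait() returns immediately without blocking.
empty_cm = SketchCoalescingManager()
waited_on_empty = empty_cm.wait()

# A registered handle is actually waited on.
cm = SketchCoalescingManager()
handle = FakeWork()
cm.append(handle)
waited_on_one = cm.wait()
```

In this model, `waited_on_empty` is 0 and `handle.waited` is only set in the second case, which mirrors the reported symptom: nothing in the empty case orders the reads of `a` and `b` after the all-reduces.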

Versions

PyTorch version: 2.4.0.dev20240326+cu121

### Tasks
- [ ] https://github.com/pytorch/pytorch/pull/122651
- [ ] https://github.com/pytorch/pytorch/pull/122849

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

Labels: module: nccl, oncall: distributed, triaged
