Skip to content

Conversation

Aidyn-A
Copy link
Collaborator

@Aidyn-A Aidyn-A commented Mar 28, 2024

Fixes #122807
The work handle of the coalescing job will be populated:

    with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
        dist.all_reduce(a)
        dist.all_reduce(b)
    print(len(cm.works)) # prints 1
    cm.wait() # actually waits

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

Copy link

pytorch-bot bot commented Mar 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122849

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 08deacb with merge base dd8a24b (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Mar 28, 2024
@Aidyn-A Aidyn-A requested a review from kwen2501 March 28, 2024 00:12
@cpuhrsch cpuhrsch added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 28, 2024
@Aidyn-A Aidyn-A added the module: nccl Problems related to nccl support label Mar 29, 2024
@wconstab
Copy link
Contributor

i don't understand the fix, and i'd appreciate if you also add a test case, probably to test_c10d_nccl.py.

the fix looks like its just adds some assertions and updates the coalescedDevice, but how does it impact the work obj that gets returned?

@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Apr 1, 2024

how does it impact the work obj that gets returned?

Here:

if device:
# Old style of letting each coll inside the context manager to call into C++ counterpart via python binding
work = group._end_coalescing(device)

_end_coalescing was supposed to return a proper work object. However, because coalescedComm_ is nullptr it does call groupEnd, but it returns a nullptr for work object pointer:

c10::intrusive_ptr<Work> ProcessGroupNCCL::endCoalescing(OpType optype) {
if (coalescedComm_ == nullptr) {
// There is no actual work being coalesced, return here
groupEnd();
coalescing_state_ = 0;
return nullptr;
}

@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Apr 1, 2024

cc @kwen2501

@kwen2501
Copy link
Contributor

kwen2501 commented Apr 2, 2024

Thanks for finding the issue.
Hopefully #123170 explains the error.
I think the issue is really in _coalescing_manager so I am fixing it there.

Copy link
Contributor

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! LGTM!

@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Apr 2, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 2, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Fixes pytorch#122807
The work handle of the coalescing job will be populated:
```python
    with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
        dist.all_reduce(a)
        dist.all_reduce(b)
    print(len(cm.works)) # prints 1
    cm.wait() # actually waits
```

Pull Request resolved: pytorch#122849
Approved by: https://github.com/kwen2501
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged module: nccl Problems related to nccl support oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[c10d][NCCL] _coalescing_manager does not produce proper work handle

6 participants