@H-Huang H-Huang commented Aug 14, 2025

Stack from ghstack (oldest at bottom):

Admittedly I'm a noob when looking at traces, but this looked pretty off to me:
[Screenshot of the trace, 2025-08-14 at 5:27:49 PM]

  1. Why are there so many "nccl:coalesced" events on the CPU thread?
  2. Why is there a "nccl:coalesced" event on the compute stream (stream 7)?

Here is what is happening:

CPU side: In endCoalescing, we create a work object with the profiling title "nccl:coalesced"
GPU side: The CUDA kernels will inherit this profiling title

What is missing:

We forgot to call the record function callback. With this change, the CPU-side "nccl:coalesced" event finishes immediately, but the ncclDevKernel_SendRecv kernel still has the coalesced title. New trace looks like this:

[image: new trace]
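
A toy model of the failure mode described above, in plain Python. It is only a sketch: ToyProfiler, end_coalescing, and run are hypothetical names made up for illustration and are not the c10d or torch.profiler API. The one assumption is that a profiling range opens when it is created and closes only when its end callback is invoked, so a forgotten callback leaves the range open on the CPU timeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Range:
    """One profiling range on the (toy) CPU timeline."""
    name: str
    start: int
    end: Optional[int] = None  # None means the range was never closed


class ToyProfiler:
    """Hypothetical stand-in for a profiler; not PyTorch's API."""

    def __init__(self) -> None:
        self.ranges: List[Range] = []
        self._clock = 0  # deterministic fake clock

    def _tick(self) -> int:
        self._clock += 1
        return self._clock

    def begin(self, name: str) -> Callable[[], None]:
        """Open a range and return the callback that closes it."""
        r = Range(name, self._tick())
        self.ranges.append(r)

        def end_callback() -> None:
            if r.end is None:
                r.end = self._tick()

        return end_callback

    def open_ranges(self) -> List[str]:
        return [r.name for r in self.ranges if r.end is None]


def end_coalescing(profiler: ToyProfiler, call_end_callback: bool) -> None:
    """Toy version of the step that creates the "nccl:coalesced" range."""
    end_cb = profiler.begin("nccl:coalesced")
    # ... the coalesced kernels would be enqueued here ...
    if call_end_callback:
        end_cb()  # the step that was missing: close the CPU-side range


def run(call_end_callback: bool, n_batches: int = 3) -> List[str]:
    """Run several coalesced batches and report ranges left open."""
    profiler = ToyProfiler()
    for _ in range(n_batches):
        end_coalescing(profiler, call_end_callback)
    return profiler.open_ranges()


if __name__ == "__main__":
    print("without callback:", run(call_end_callback=False))
    print("with callback:   ", run(call_end_callback=True))
```

In this model, skipping the callback leaves one open range per coalesced batch, which mirrors the pile-up of "nccl:coalesced" entries on the CPU thread; calling it closes each range right away. The sketch does not model the GPU side, where the kernel keeps the coalesced title.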

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the "oncall: distributed" and "release notes: distributed (c10d)" labels Aug 14, 2025

pytorch-bot bot commented Aug 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160680

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 114cc2d with merge base 1c25871:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

H-Huang added a commit that referenced this pull request Aug 14, 2025
ghstack-source-id: 008f7c5
Pull-Request: #160680
@H-Huang H-Huang added the "ciflow/trunk" label Aug 15, 2025
@H-Huang H-Huang requested a review from fegin September 2, 2025 17:20

H-Huang commented Sep 29, 2025

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here
