@fduwjj fduwjj commented Apr 14, 2025

Stack from ghstack (oldest at bottom):

We are not consistent in how FR (Flight Recorder) records coalesced collectives. For P2P ops, we log each individual collective into FR, but for non-P2P ops we do not. This PR makes non-P2P ops also log each individual collective into FR, so that a script can check the correctness of every collective in a coalesced group.

The added unit test also addresses the unit test requested in the comment in #150863.
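
To make the before/after behavior concrete, here is a minimal toy model in Python. It is not the c10d implementation: `ToyFlightRecorder`, `record_coalesced`, the `log_individual` flag, and the `"coalesced"` summary entry are all hypothetical names chosen for illustration, and the sketch assumes the group still gets one summary entry in addition to the per-op entries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFlightRecorder:
    """Toy stand-in for FR (hypothetical; not the real recorder)."""

    entries: list = field(default_factory=list)

    def record(self, name, input_sizes, is_p2p):
        # Append one trace entry describing a single recorded op.
        self.entries.append(
            {"profiling_name": name, "input_sizes": input_sizes, "is_p2p": is_p2p}
        )


def record_coalesced(recorder, ops, is_p2p, log_individual):
    """Record a coalesced group, optionally logging each member op first."""
    if log_individual:
        for name, sizes in ops:
            recorder.record(name, sizes, is_p2p)
    # Assumed: one summary entry for the coalesced group as a whole.
    recorder.record("coalesced", [sizes for _, sizes in ops], is_p2p)


ops = [("all_reduce", [[4, 4]]), ("all_reduce", [[8]])]

# Non-P2P behavior described as the old state: only the group is visible.
before = ToyFlightRecorder()
record_coalesced(before, ops, is_p2p=False, log_individual=False)

# Behavior this PR aims for: each member op is visible as well.
after = ToyFlightRecorder()
record_coalesced(after, ops, is_p2p=False, log_individual=True)

print(len(before.entries), len(after.entries))  # 1 3
```

With per-op entries present, a checker can look at the shape and dtype of each member collective rather than only at the group.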

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Apr 14, 2025

pytorch-bot bot commented Apr 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151238

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0accc80 with merge base f1f18c7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj added a commit that referenced this pull request Apr 14, 2025
@fduwjj fduwjj requested review from kwen2501, eqy and d4l3k April 14, 2025 18:10
@fduwjj fduwjj added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 14, 2025
fduwjj added a commit that referenced this pull request Apr 14, 2025
@fduwjj fduwjj closed this Apr 14, 2025
@fduwjj fduwjj reopened this Apr 14, 2025

@d4l3k d4l3k left a comment

LGTM

  def _verify_trace(self, t, include_collectives, timing_enabled, is_json):
      ver = t["version"]
-     self.assertEqual(ver, "2.5")
+     self.assertEqual(ver, "2.7")
Member

why didn't this error when the version was incorrectly (?) 2.6 above?

Contributor Author

Oh, that comes from the rebase lagging behind. The base should be 2.6...

pgStatus_,
/*isP2P=*/false);
(void)trace_id;
}
Member

We do this so we can detect when a coalesced event wasn't triggered or had invalid shape/dtype/etc?

Contributor Author

right, that's exactly what we want to do.
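
The kind of check discussed here could look roughly like the sketch below. It assumes each rank's FR dump has already been loaded into a list of dict entries, and that every rank should see identical metadata at each position, which holds for ops like all_reduce but not necessarily for every collective. The function name `find_mismatches`, the `entries_by_rank` layout, and the sample data are hypothetical; the field names are modeled on what the tests inspect and should be treated as assumptions.

```python
def find_mismatches(
    entries_by_rank,
    fields=("profiling_name", "input_sizes", "input_dtypes"),
):
    """Compare per-rank entry lists position by position.

    entries_by_rank maps rank -> list of dict entries (hypothetical layout).
    Returns a list of human-readable problem descriptions.
    """
    if not entries_by_rank:
        return []
    problems = []
    ranks = sorted(entries_by_rank)
    longest = max(len(entries_by_rank[r]) for r in ranks)
    for i in range(longest):
        # Ranks that actually recorded an entry at this position.
        present = {
            r: entries_by_rank[r][i] for r in ranks if i < len(entries_by_rank[r])
        }
        missing = [r for r in ranks if r not in present]
        if missing:
            problems.append(f"entry {i}: missing on ranks {missing}")
        # Use the lowest rank that has the entry as the reference.
        ref_rank = min(present)
        for name in fields:
            for r, entry in present.items():
                if entry.get(name) != present[ref_rank].get(name):
                    problems.append(
                        f"entry {i}: {name} on rank {r} differs from rank {ref_rank}"
                    )
    return problems


def all_reduce(sizes, dtype):
    # Helper that builds one made-up trace entry for the example.
    return {
        "profiling_name": "nccl:all_reduce",
        "input_sizes": sizes,
        "input_dtypes": [dtype],
    }


entries_by_rank = {
    0: [all_reduce([[4, 4]], "Float"), all_reduce([[8]], "Float")],
    1: [all_reduce([[4, 4]], "Float"), all_reduce([[8]], "Half")],
    2: [all_reduce([[4, 4]], "Float")],  # second collective never recorded
}

problems = find_mismatches(entries_by_rank)
for line in problems:
    print(line)
```

In this made-up data, rank 2 never recorded the second collective and rank 1 recorded it with a different dtype. Both problems are only detectable because each member of the coalesced group has its own entry.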

fduwjj commented Apr 15, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict gh/fduwjj/126/orig returned non-zero exit code 1

warning: skipped previously applied commit 03dc140ff4d
hint: use --reapply-cherry-picks to include skipped commits
hint: Disable this message with "git config set advice.skippedCherryPicks false"
Rebasing (1/2)
Rebasing (2/2)
Auto-merging test/distributed/test_c10d_nccl.py
CONFLICT (content): Merge conflict in test/distributed/test_c10d_nccl.py
Auto-merging torch/csrc/distributed/c10d/FlightRecorder.hpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/FlightRecorder.hpp
Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
error: could not apply cbd06005993... [c10d][fr] Record each individual collective being coalesced
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply cbd06005993... [c10d][fr] Record each individual collective being coalesced

Raised by https://github.com/pytorch/pytorch/actions/runs/14475581233

fduwjj added a commit that referenced this pull request Apr 15, 2025
fduwjj commented Apr 15, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
…#151238)

Pull Request resolved: pytorch#151238
Approved by: https://github.com/d4l3k, https://github.com/wconstab
ghstack dependencies: pytorch#151247
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
…#151238)

@github-actions github-actions bot deleted the gh/fduwjj/126/head branch May 25, 2025 02:23
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category