[BE] split seq_id to collective_seq_id and p2p_seq_id #125727
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass isP2P to record. Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125727
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 8a9f70a with merge base cd3a71f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass isP2P to record. Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: ee518d978d2f03891826e4d3bb1dd13bd51e7005 Pull Request resolved: #125727
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass isP2P to record. Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass isP2P to record. Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 3808e1c860578bff32f9cb10db96ba4ea3b8dd9f Pull Request resolved: #125727
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass seqCollective_ and seqP2P_ to the recorder. Pass isP2P to record (not sure if we need this yet). Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: e2ce7773ad108dbd8a633f051e7ac74b405e31b6 Pull Request resolved: #125727
This mostly LGTM; I didn't review carefully. One thought is whether we want to (correctly) bump the major version, or just keep the old seq_id, add the new keys, and deprecate it later. I think it's worth checking whether anyone is going to be broken by this, but since not many users have developed scripts yet, it's probably fine to just bump the major version and drop the old key so we have less baggage. Does anyone else have thoughts on that?
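For script authors weighing the two options above, here is a minimal sketch of a reader that tolerates both layouts. It assumes each record is a plain dict; only the key names `seq_id`, `collective_seq_id`, and `p2p_seq_id` come from this PR, while the helper name and the dict shape are hypothetical and not part of PyTorch.

```python
from typing import Any, Dict, Optional


def collective_seq_of(entry: Dict[str, Any]) -> Optional[int]:
    """Return the collective sequence number of one record, if present.

    Hypothetical helper: prefers the new key and falls back to the old one.
    The fallback is lossy, because the old seq_id did not distinguish
    collectives from P2P operations.
    """
    if "collective_seq_id" in entry:
        return entry["collective_seq_id"]
    return entry.get("seq_id")


new_style = {"collective_seq_id": 5, "p2p_seq_id": 1}
old_style = {"seq_id": 7}
print(collective_seq_of(new_style), collective_seq_of(old_style))
```

If the old key is dropped outright, as suggested above, the fallback branch becomes dead code and can be removed from such scripts.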
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass seqCollective_ and seqP2P_ to the recorder. Pass isP2P to record (not sure if we need this yet). Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 42b2316af83a7cd16688fb792b56dd2e060c9dcb Pull Request resolved: #125727
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass seqCollective_ and seqP2P_ to the recorder. Pass isP2P to record (not sure if we need this yet). Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: ca25b1f43605c90f232900307413aa94c3df53ba Pull Request resolved: #125727
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass seqCollective_ and seqP2P_ to the recorder. Pass isP2P to record - this flag will help distinguish P2P v/s collective records. Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 8d710b148cc470ef0188b2cea3d7eadc79306ed0 Pull Request resolved: #125727
Summary: Attempt to separate out collectives and P2P for debug purposes. Pass seqCollective_ and seqP2P_ to the recorder. Pass isP2P to record - this flag will help distinguish P2P v/s collective records. Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 14b6c61dda0ad1987e3bfbbc7b8434d3240f92ba Pull Request resolved: #125727
Looks good, thanks!
Summary: Split out seq_id into collective_seq_id and p2p_seq_id. The main idea here is that collectives that go to all machines should have identical collective_seq_id and therefore it makes it easier to spot if one of machines isn't handling a collective operation. Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync. Resolves issue: #125173 ghstack-source-id: c31b3164d2e51efeab210e6a949cd4c8d1ecd3d7 Pull Request resolved: #125727
@wconstab - do you have a preference? Shall we keep the old |
Summary: Split out seq_id into collective_seq_id and p2p_seq_id. The main idea here is that collectives that go to all machines should have identical collective_seq_id and therefore it makes it easier to spot if one of machines isn't handling a collective operation. Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync. Resolves issue: #125173 ghstack-source-id: c67b8ed6bda1415b5f6a2e2006e5bec0ae8b1621 Pull Request resolved: #125727
Summary: Split out seq_id into collective_seq_id and p2p_seq_id. The main idea here is that collectives that go to all machines should have identical collective_seq_id and therefore it makes it easier to spot if one of machines isn't handling a collective operation. Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync. Resolves issue: #125173 ghstack-source-id: f392686c6e68260fd453c28f2575fcf8bc71ea7f Pull Request resolved: #125727
Summary: Split out seq_id into collective_seq_id and p2p_seq_id. The main idea here is that collectives that go to all machines should have identical collective_seq_id and therefore it makes it easier to spot if one of machines isn't handling a collective operation. Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync. Resolves issue: pytorch#125173 ghstack-source-id: cf9bb109c028d7ffe9612d2b9c4fda1df47586d7 Pull Request resolved: pytorch#125727 Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed. Reason: New commits were pushed while merging. Please rerun the merge command.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Stack from ghstack (oldest at bottom):
Summary:
Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id` values, which makes it easier to spot if one of the machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receiver(s) are in sync.
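To illustrate the debugging idea, here is a minimal sketch of how a script might use the split counters to spot a rank that has fallen behind on collectives. It assumes each rank's records are available as a list of dicts; the `is_p2p` key (mirroring the isP2P flag mentioned in the commit messages), the function names, and the sample data are all hypothetical and not the actual recorder format.

```python
from typing import Any, Dict, List

Entry = Dict[str, Any]


def last_collective_seq_id(entries: List[Entry]) -> int:
    """Highest collective_seq_id among a rank's non-P2P entries (0 if none)."""
    return max(
        (e["collective_seq_id"] for e in entries if not e["is_p2p"]),
        default=0,
    )


def find_lagging_ranks(entries_by_rank: Dict[int, List[Entry]]) -> List[int]:
    """Ranks whose last collective_seq_id is behind the furthest rank."""
    last = {rank: last_collective_seq_id(es) for rank, es in entries_by_rank.items()}
    if not last:
        return []
    newest = max(last.values())
    return sorted(rank for rank, seq in last.items() if seq < newest)


def coll(seq: int) -> Entry:
    """Build a made-up collective record for the example."""
    return {"collective_seq_id": seq, "p2p_seq_id": 0, "is_p2p": False}


# Made-up records: rank 1 also issued a P2P op, rank 2 never ran collective 3.
entries_by_rank = {
    0: [coll(1), coll(2), coll(3)],
    1: [coll(1), coll(2),
        {"collective_seq_id": 2, "p2p_seq_id": 1, "is_p2p": True},
        coll(3)],
    2: [coll(1), coll(2)],
}
lagging = find_lagging_ranks(entries_by_rank)
print(lagging)
```

Because P2P operations no longer advance the collective counter, the extra send or receive on rank 1 does not make it look out of step with rank 0; only rank 2 is reported.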
Resolves issue: #125173
Test Plan:
Unit tests.
Reviewers:
Subscribers:
Tasks:
Tags:
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k