Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BE] split seq_id to collective_seq_id and p2p_seq_id #125727

Closed
wants to merge 19 commits into from

Conversation

c-p-i-o
Copy link
Contributor

@c-p-i-o c-p-i-o commented May 8, 2024

Stack from ghstack (oldest at bottom):

Summary:
Split out seq_id into collective_seq_id and p2p_seq_id. The main idea here is that collectives that go to all machines should have identical collective_seq_id and therefore it makes it easier to spot if one of machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync.

Resolves issue: #125173

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

Summary:
Attempt to separate out collectives and P2P for debug purposes.
Pass isP2P to record.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
Copy link

pytorch-bot bot commented May 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125727

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8a9f70a with merge base cd3a71f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels May 8, 2024
c-p-i-o added a commit that referenced this pull request May 8, 2024
Summary:
Attempt to separate out collectives and P2P for debug purposes.
Pass isP2P to record.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: ee518d978d2f03891826e4d3bb1dd13bd51e7005
Pull Request resolved: #125727
@c-p-i-o c-p-i-o marked this pull request as draft May 8, 2024 00:35
Summary:
Attempt to separate out collectives and P2P for debug purposes.
Pass isP2P to record.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 8, 2024
Summary:
Attempt to separate out collectives and P2P for debug purposes.
Pass isP2P to record.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 3808e1c860578bff32f9cb10db96ba4ea3b8dd9f
Pull Request resolved: #125727
Summary:
Attempt to separate out collectives and P2P for debug purposes.
Pass isP2P to record.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 8, 2024
Summary:
Attempt to separate out collectives and P2P for debug purposes.

Pass seqCollective_ and seqP2P_ to the recorder.
Pass isP2P to record (not sure if we need this yet).

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: e2ce7773ad108dbd8a633f051e7ac74b405e31b6
Pull Request resolved: #125727
@wconstab
Copy link
Contributor

wconstab commented May 8, 2024

this mostly lgtm, i didn't review carefully

one thought is whether we want to (correctly) bump version major, or, just keep old seq_id and add new ones and later deprecate it.

I think its worth checking if anyone's gonna be broken by this, but since not many users have developed scripts yet its probably fine to just bump the major and drop the old key so we have less baggage. anyone else have a thought on that?

Summary:
Attempt to separate out collectives and P2P for debug purposes.
Pass isP2P to record.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 8, 2024
Summary:
Attempt to separate out collectives and P2P for debug purposes.

Pass seqCollective_ and seqP2P_ to the recorder.
Pass isP2P to record (not sure if we need this yet).

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 42b2316af83a7cd16688fb792b56dd2e060c9dcb
Pull Request resolved: #125727
[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 9, 2024
Summary:
Attempt to separate out collectives and P2P for debug purposes.

Pass seqCollective_ and seqP2P_ to the recorder.
Pass isP2P to record (not sure if we need this yet).

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: ca25b1f43605c90f232900307413aa94c3df53ba
Pull Request resolved: #125727
[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 10, 2024
Summary:
Attempt to separate out collectives and P2P for debug purposes.

Pass seqCollective_ and seqP2P_ to the recorder.
Pass isP2P to record - this flag will help distinguish P2P v/s
collective records.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 8d710b148cc470ef0188b2cea3d7eadc79306ed0
Pull Request resolved: #125727
[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 13, 2024
Summary:
Attempt to separate out collectives and P2P for debug purposes.

Pass seqCollective_ and seqP2P_ to the recorder.
Pass isP2P to record - this flag will help distinguish P2P v/s
collective records.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 14b6c61dda0ad1987e3bfbbc7b8434d3240f92ba
Pull Request resolved: #125727
@c-p-i-o c-p-i-o changed the title [WIP] rename seq_id to collective_seq_id [BE] split seq_id to collective_seq_id and p2p_seq_id May 13, 2024
@c-p-i-o c-p-i-o self-assigned this May 13, 2024
@c-p-i-o c-p-i-o marked this pull request as ready for review May 13, 2024 17:51
Copy link
Contributor

@zdevito zdevito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

torch/csrc/distributed/c10d/TraceUtils.h Outdated Show resolved Hide resolved
[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 13, 2024
Summary:
Split out seq_id into collective_seq_id and p2p_seq_id. The main idea
here is that collectives that go to all machines should have identical
collective_seq_id and therefore it makes it easier to spot if one of
machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the
sender(s)/receivers(s) are in sync.

Resolves issue: #125173

ghstack-source-id: c31b3164d2e51efeab210e6a949cd4c8d1ecd3d7
Pull Request resolved: #125727
@c-p-i-o
Copy link
Contributor Author

c-p-i-o commented May 13, 2024

one thought is whether we want to (correctly) bump version major, or, just keep old seq_id and add new ones and later deprecate it.

@wconstab - do you have a preference? Shall we keep the old seq_id?
The only one thing I'd have to fix up is the unit tests and keep the old field. But just wanted to make sure if that work is worth it.
How do I find out I won't break anyone?

[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 15, 2024
Summary:
Split out seq_id into collective_seq_id and p2p_seq_id. The main idea
here is that collectives that go to all machines should have identical
collective_seq_id and therefore it makes it easier to spot if one of
machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the
sender(s)/receivers(s) are in sync.

Resolves issue: #125173

ghstack-source-id: c67b8ed6bda1415b5f6a2e2006e5bec0ae8b1621
Pull Request resolved: #125727
[ghstack-poisoned]
c-p-i-o added a commit that referenced this pull request May 16, 2024
Summary:
Split out seq_id into collective_seq_id and p2p_seq_id. The main idea
here is that collectives that go to all machines should have identical
collective_seq_id and therefore it makes it easier to spot if one of
machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the
sender(s)/receivers(s) are in sync.

Resolves issue: #125173

ghstack-source-id: f392686c6e68260fd453c28f2575fcf8bc71ea7f
Pull Request resolved: #125727
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
OnlyFor pushed a commit to OnlyFor/pytorch that referenced this pull request May 18, 2024
Summary:
Split out seq_id into collective_seq_id and p2p_seq_id. The main idea
here is that collectives that go to all machines should have identical
collective_seq_id and therefore it makes it easier to spot if one of
machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the
sender(s)/receivers(s) are in sync.

Resolves issue: pytorch#125173

ghstack-source-id: cf9bb109c028d7ffe9612d2b9c4fda1df47586d7
Pull Request resolved: pytorch#125727
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
[ghstack-poisoned]
[ghstack-poisoned]
@c-p-i-o
Copy link
Contributor Author

c-p-i-o commented May 20, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 20, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

[ghstack-poisoned]
@c-p-i-o c-p-i-o mentioned this pull request May 21, 2024
@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team Raised by workflow job

@c-p-i-o
Copy link
Contributor Author

c-p-i-o commented May 21, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants