Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in process_group_name when there is duplicate pgs #100518

Closed
wants to merge 1 commit into from

Conversation

xw285cornell
Copy link
Contributor

@xw285cornell xw285cornell commented May 3, 2023

Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

@pytorch-bot
Copy link

pytorch-bot bot commented May 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100518

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 1 New Failure

As of commit 63810e8:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D45315615

xw285cornell added a commit to xw285cornell/pytorch that referenced this pull request May 3, 2023
Summary:
Pull Request resolved: pytorch#100518

with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441?env=PRODUCTION&job_name=torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

fbshipit-source-id: 38017707065fb9585ab39460848ad9f15c213db0
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D45315615

@xw285cornell xw285cornell changed the title [fairscale] Avoid creating unnecessary pgs Fix bug in process_group_name when there is duplicate pgs May 3, 2023
@@ -1417,6 +1417,43 @@ def _test_new_group_local_sync_sanity_check(self, backend):
]
self.assertEqual(output_tensor_list, expected)

def _test_new_group_local_sync_duplidate_pg(self, backend):
"""
We should support users create multiople PGs with the same set of
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/multiople/multiple/

@xw285cornell xw285cornell requested a review from kumpera May 3, 2023 18:21
xw285cornell added a commit to xw285cornell/pytorch that referenced this pull request May 3, 2023
Summary:
Pull Request resolved: pytorch#100518

with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441?env=PRODUCTION&job_name=torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

fbshipit-source-id: 0e1694bb6e84e1520dba4868fc8061060bdc220d
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D45315615

Copy link
Contributor

@kumpera kumpera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@kumpera kumpera added the topic: bug fixes topic category label May 3, 2023
@kumpera
Copy link
Contributor

kumpera commented May 3, 2023

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 3, 2023
@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator

Details for Dev Infra team Raised by workflow job

@xw285cornell
Copy link
Contributor Author

sorry - I should have a separate PR for it (this is with the integration changes in fair scale). I'll land it from inside

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D45315615

xw285cornell added a commit to xw285cornell/pytorch that referenced this pull request May 3, 2023
…0518)

Summary:
Pull Request resolved: pytorch#100518

with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441?env=PRODUCTION&job_name=torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

fbshipit-source-id: 1ce9928b3d0f8660e8637e1b408e5ac0ae0410bd
xw285cornell added a commit to xw285cornell/pytorch that referenced this pull request May 3, 2023
…0518)

Summary:
Pull Request resolved: pytorch#100518

with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441?env=PRODUCTION&job_name=torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

fbshipit-source-id: aaefe67f52bddd6e184496c1231dadb0dcca089b
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D45315615

…0518)

Summary:
Pull Request resolved: pytorch#100518

with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441?env=PRODUCTION&job_name=torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

fbshipit-source-id: b499a2e085722cc429a3ad34b9da1093decded46
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D45315615

@xw285cornell
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@xw285cornell xw285cornell deleted the export-D45315615 branch May 4, 2023 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged release notes: distributed (c10d) release notes category topic: bug fixes topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants