Fix bug in process_group_name when there is duplicate pgs #100518
Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/100518 (links to docs will display an error until the docs builds have completed).
Dr. CI (auto-generated, updates every 15 minutes): ❗ 1 active SEV may affect this PR; ❌ 1 new failure as of commit 63810e8.
This pull request was exported from Phabricator. Differential Revision: D45315615
Summary: Pull Request resolved: pytorch#100518. With the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank calls new_group just 3 times, with a local barrier among the members of the group.
Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441?env=PRODUCTION&job_name=torchx_hpc-xlformers_chinch70B_4096_xdwang_0426190441
Reviewed By: xunnanxu, eeggl
Differential Revision: D45315615
fbshipit-source-id: 38017707065fb9585ab39460848ad9f15c213db0
Force-pushed from 01eccfd to 7170407.
test/distributed/test_c10d_common.py (outdated)
@@ -1417,6 +1417,43 @@ def _test_new_group_local_sync_sanity_check(self, backend):
             ]
             self.assertEqual(output_tensor_list, expected)

+    def _test_new_group_local_sync_duplidate_pg(self, backend):
+        """
+        We should support users create multiople PGs with the same set of
s/multiople/multiple/
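The quoted test exercises the titular fix: several process groups created over the identical set of ranks must get distinct internal names rather than colliding. A hedged sketch of that idea (not the PR's actual test code; the single-process gloo setup and port are assumptions for illustration):

```python
import os
import torch
import torch.distributed as dist

def collect_over_duplicate_pgs():
    # Single-process stand-in (assumption: gloo backend available).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)
    results = []
    # Two distinct groups over the identical rank set: the fix ensures
    # their internal process_group_name values do not collide, so a
    # collective on each group works independently.
    for _ in range(2):
        pg = dist.new_group(ranks=[0], use_local_synchronization=True)
        out = [torch.zeros(1)]
        dist.all_gather(out, torch.ones(1), group=pg)
        results.append(out[0].item())
    dist.destroy_process_group()
    return results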
Force-pushed from 7170407 to a5564a3.
Thanks for the fix!
@pytorchmergebot merge
Merge failed. Reason: this PR has internal changes and must be landed via Phabricator. (Details for Dev Infra team: raised by workflow job.)
Sorry - I should have a separate PR for it (this is with the integration changes in FairScale). I'll land it from inside.
Force-pushed from a5564a3 to eb79207.
Force-pushed from eb79207 to 44aa965.
Force-pushed from 44aa965 to 63810e8.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank calls new_group just 3 times, with a local barrier among the members of the group.
Reviewed By: xunnanxu, eeggl
Differential Revision: D45315615