[cherry-pick][device_mesh] add back the private init backend option (#124780) #126147

Merged
merged 1 commit into from
May 14, 2024


wanchaol (Contributor) commented May 14, 2024

This PR adds back the private init backend option (we had _init_process_groups before) to tackle issues with submesh creation. This is a regression fix for 2.3: we removed the _init_process_groups option in 2.3, which triggers many more sub process group creations and can cause memory spikes.

Differential Revision: D56497780
Pull Request resolved: #124780
Approved by: https://github.com/awgu

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

This PR adds a private init backend option to tackle issues with submesh creation:

In device mesh slicing we don't want to create process groups again, so being able to explicitly turn group creation off is useful.

Also, I think more submesh creation functionality may be added later, so having this flag ensures that no new group is created.


pytorch-bot bot commented May 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126147

Note: Links to docs will display an error until the docs builds have been completed.

❌ 22 New Failures

As of commit 6f7698b with merge base 86a2d67 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the ci-td-distributed and oncall: distributed labels May 14, 2024
wanchaol changed the title from "[device_mesh] add back the private init backend option (#124780)" to "[cherry-pick][device_mesh] add back the private init backend option (#124780)" May 14, 2024
pytorch-bot bot added the ciflow/trunk label May 14, 2024
@atalman atalman merged commit 1d6a938 into release/2.3 May 14, 2024
98 of 107 checks passed
@atalman atalman deleted the device_mesh_fix branch May 14, 2024 17:38
PaliC (Contributor) commented May 31, 2024

Validated that this works on the 2.3 branch
