[c10d] Increase socket buffer size to allow ProcessGroup init up to 12k ranks #107878
…2k ranks

The c10d socket and gloo listener both set their buffer size to 2048, which causes connection issues at 4k scale. This diff sets the buffer size to `-1`, which uses `somaxconn` as the actual buffer size, aiming to enable 24k PG init without a crash. The experiment shows successful creation of 12k ranks without a crash.

Split the original diff for OSS vs. internal.

Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.

Differential Revision: [D48634654](https://our.internmc.facebook.com/intern/diff/D48634654/)

[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107878

✅ You can merge normally! (5 Unrelated Failures)

As of commit b5ec08c with merge base 2515ab9, the failed jobs were likely due to flakiness present on trunk (FLAKY); some of them have also been marked as unstable (UNSTABLE).

This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 198558345
Pull Request resolved: #107878
Thanks!
LGTM
@pytorchbot merge -f "unrelated failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
The c10d socket and gloo listener both set their buffer size (the backlog passed to `listen`) to 2048, which causes connection issues at 4k scale. This diff sets the buffer size to `-1`, which uses `somaxconn` as the actual buffer size, aiming to enable 24k PG init without a crash. The experiment shows successful creation of 12k ranks without a crash.

Split the original diff for OSS vs. internal.
Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.
Differential Revision: D48634654
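
For readers unfamiliar with why `-1` ends up meaning `somaxconn`: the sketch below models how Linux is understood to cap the `listen` backlog. It is an illustration and not code from this diff. The helper names `effective_backlog` and `read_somaxconn` are hypothetical, and the model assumes Linux behavior (the backlog is compared as an unsigned value against `net.core.somaxconn`), which may differ on other platforms.

```python
from pathlib import Path
from typing import Optional

# Models the C cast of `int backlog` to `unsigned int` (assumes a 32-bit int).
UINT_MASK = 0xFFFFFFFF


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Return the backlog Linux would actually use for listen(fd, requested).

    Assumption: the kernel compares the backlog as an unsigned value against
    somaxconn and silently caps it, so -1 (a huge unsigned value) becomes
    exactly somaxconn.
    """
    as_unsigned = requested & UINT_MASK
    return min(as_unsigned, somaxconn)


def read_somaxconn(path: str = "/proc/sys/net/core/somaxconn") -> Optional[int]:
    """Read the system-wide cap, or return None if it is unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # 4096 is only a placeholder used when the sysctl file cannot be read.
    limit = read_somaxconn() or 4096
    for requested in (2048, -1):
        print(requested, "->", effective_backlog(requested, limit))
```

Under this model, a hard-coded 2048 never benefits from a larger `somaxconn`, while `-1` always tracks whatever the host is configured with, so the host setting becomes the only knob. Note that Python's own `socket.listen` clamps negative values to 0, which is why this sketch models the C call instead of invoking it.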