Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c10d] Increase socket buffer size to allow ProcessGroup init up to 12k ranks #107878

Closed
wants to merge 1 commit into from

Conversation

XilunWu
Copy link
Contributor

@XilunWu XilunWu commented Aug 24, 2023

Stack from ghstack (oldest at bottom):

The c10d socket and gloo listener both set their buffer size to 2048 which causes connection issue at 4k scale. This diff sets the buffer size to -1 which uses somaxconn as the actual buffer size, aiming to enable 24k PG init without crash. The experiment shows the ability to successful creation of 12k ranks without crash.

split the original diff for OSS vs. internal.

Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.

Differential Revision: D48634654

…2k ranks

The c10d socket and gloo listener both set their buffer size to 2048 which causes connection issue at 4k scale. This diff sets the buffer size to `-1` which uses `somaxconn` as the actual buffer size, aiming to enable 24k PG init without crash. The experiment shows the ability to successful creation of 12k ranks without crash.

split the original diff for OSS vs. internal.

Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.

Differential Revision: [D48634654](https://our.internmc.facebook.com/intern/diff/D48634654/)

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Aug 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107878

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit b5ec08c with merge base 2515ab9 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Aug 24, 2023
XilunWu added a commit that referenced this pull request Aug 24, 2023
…2k ranks

The c10d socket and gloo listener both set their buffer size to 2048 which causes connection issue at 4k scale. This diff sets the buffer size to `-1` which uses `somaxconn` as the actual buffer size, aiming to enable 24k PG init without crash. The experiment shows the ability to successful creation of 12k ranks without crash.

split the original diff for OSS vs. internal.

Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.

Differential Revision: [D48634654](https://our.internmc.facebook.com/intern/diff/D48634654/)

ghstack-source-id: 198558345
Pull Request resolved: #107878
@XilunWu XilunWu added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 24, 2023
Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fduwjj fduwjj added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Aug 24, 2023
@XilunWu
Copy link
Contributor Author

XilunWu commented Aug 25, 2023

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@XilunWu XilunWu deleted the gh/XilunWu/45/head branch August 25, 2023 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants