Skip to content

Conversation

@XilunWu
Copy link
Contributor

@XilunWu XilunWu commented Feb 11, 2025

Summary:
All credit goes to original author XilunWu. I am just landing the code to unblock large training jobs.

D45740631 reduces gloo rendezvous cost for TCP backend by eliminating duplicate address publishing to TCPStore. Ben suggested "have seq_number == global_rank" to further get rid of seq_number exchange and Shawn reported why this didn't work. This diff serves a starting point for benchmarking the benefit of doing so (testbed record shows 2x speedup: 46 min ProcessGroupGloo init time on 8k ranks -> 20 min).

The feature will be enabled via env variable (GLOO_ENABLE_RANK_AS_SEQUENCE_NUMBER) disabled by default that will be controlled by justKnobs.

Reviewed By: XilunWu

Differential Revision: D48130088

Summary:
All credit goes to original author XilunWu. I am just landing the code to unblock large training jobs.

D45740631 reduces gloo rendezvous cost for TCP backend by eliminating duplicate address publishing to TCPStore. Ben suggested "have seq_number == global_rank" to further get rid of `seq_number` exchange and Shawn reported why this didn't work. This diff serves a starting point for benchmarking the benefit of doing so ([testbed record](https://docs.google.com/document/d/1_p390fx0IiaZWbt-Dkdvp9jSgCKebiuG8_a6BVjHtMU/edit) shows 2x speedup: 46 min ProcessGroupGloo init time on 8k ranks -> 20 min).

The feature will be enabled via env variable (GLOO_ENABLE_RANK_AS_SEQUENCE_NUMBER) disabled by default that will be controlled by justKnobs.

Reviewed By: XilunWu

Differential Revision: D48130088
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D48130088

@facebook-github-bot facebook-github-bot merged commit 20dc202 into pytorch:main Feb 11, 2025
7 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants