@XilunWu XilunWu commented Feb 11, 2025

Summary:
The TCP store supports API v2, so we can significantly reduce the network overhead of Gloo rendezvous by fetching keys in a batch instead of one at a time.
Initial testing shows a ~15x improvement for 4k jobs.
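
As a rough illustration of why batching helps, the sketch below counts simulated round trips for one rank collecting the addresses of its peers. `CountingStore`, the `rank_{r}` key naming, and the `addr-{r}` values are hypothetical stand-ins for this example only; this is not the TCPStore or Gloo implementation.

```python
from typing import Dict, List, Tuple


class CountingStore:
    """Toy in-memory key-value store that counts simulated round trips."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self.round_trips = 0

    def set(self, key: str, value: bytes) -> None:
        self.round_trips += 1
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # One request/response per key.
        self.round_trips += 1
        return self._data[key]

    def multi_get(self, keys: List[str]) -> List[bytes]:
        # One request/response for the whole batch of keys.
        self.round_trips += 1
        return [self._data[k] for k in keys]


def peer_keys(rank: int, world_size: int) -> List[str]:
    # Hypothetical key naming: one address key per peer rank.
    return [f"rank_{r}" for r in range(world_size) if r != rank]


def fetch_one_by_one(store: CountingStore, rank: int, world_size: int) -> List[bytes]:
    return [store.get(k) for k in peer_keys(rank, world_size)]


def fetch_batched(store: CountingStore, rank: int, world_size: int) -> List[bytes]:
    return store.multi_get(peer_keys(rank, world_size))


def demo(world_size: int = 8) -> Tuple[int, int]:
    store = CountingStore()
    for r in range(world_size):
        store.set(f"rank_{r}", f"addr-{r}".encode())

    store.round_trips = 0
    sequential_result = fetch_one_by_one(store, 0, world_size)
    sequential_trips = store.round_trips

    store.round_trips = 0
    batched_result = fetch_batched(store, 0, world_size)
    batched_trips = store.round_trips

    # Both strategies return the same addresses; only the trip count differs.
    assert sequential_result == batched_result
    return sequential_trips, batched_trips
```

In this toy model, a rank in a job of N ranks needs N-1 round trips when fetching one key at a time and a single round trip when fetching the batch, so the saving grows with job size.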

Gloo process group init:

Baseline (fbcode trunk):
2k job (https://fburl.com/mlhub/x1prxu89): ~82 sec (~1.4 min)
4k job (https://fburl.com/mlhub/v1djk4n5): ~393 sec (~6.6 min)
8k job (https://fburl.com/mlhub/cagqrs7m): ~55 min

With optimizations (D48130088 + D52083376):

2k job (https://fburl.com/mlhub/x0cskdag): ~18 sec (~5x faster)
4k job (https://fburl.com/mlhub/xzmvkm4j): ~25 sec (~15x faster)
8k job (https://fburl.com/mlhub/gdyeizv9): ~85 sec (~35x faster)
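
For reference, the speedup factors can be recomputed from the approximate timings listed above. This is only arithmetic on the rounded numbers in this summary (with the 8k baseline taken as 55 min = 3300 sec), not a new measurement.

```python
# Approximate init times in seconds, copied from the summary above.
# The 8k baseline is reported as ~55 min, converted here to seconds.
timings_sec = {
    "2k": (82, 18),
    "4k": (393, 25),
    "8k": (55 * 60, 85),
}


def speedup(baseline_sec: float, optimized_sec: float) -> float:
    """Ratio of baseline time to optimized time."""
    return baseline_sec / optimized_sec


speedups = {
    job: round(speedup(baseline, optimized), 1)
    for job, (baseline, optimized) in timings_sec.items()
}

for job, factor in speedups.items():
    print(f"{job} job: ~{factor}x faster")
```

The ratios come out to roughly 4.6x, 15.7x, and 38.8x, in line with the rounded figures quoted above.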

Reviewed By: xunnanxu

Differential Revision: D52083376

Privacy Context Container: L1156430

Summary:
All credit goes to the original author, XilunWu. I am just landing the code to unblock large Ads jobs.

D45740631 reduces the Gloo rendezvous cost for the TCP backend by eliminating duplicate address publishing to TCPStore. Ben suggested "have seq_number == global_rank" to also eliminate the `seq_number` exchange, and Shawn reported why this didn't work. This diff serves as a starting point for benchmarking the benefit of doing so (the [testbed record](https://docs.google.com/document/d/1_p390fx0IiaZWbt-Dkdvp9jSgCKebiuG8_a6BVjHtMU/edit) shows a ~2x speedup: ProcessGroupGloo init time on 8k ranks drops from 46 min to 20 min).

The feature is gated by the environment variable GLOO_ENABLE_RANK_AS_SEQUENCE_NUMBER, which is disabled by default and will be controlled by justKnobs.
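
A minimal sketch of a default-off environment-variable gate is shown below. Only the variable name comes from this diff; the helper name `rank_as_seq_enabled` and the set of values treated as "enabled" are assumptions made for illustration, and the actual parsing in Gloo may differ.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "GLOO_ENABLE_RANK_AS_SEQUENCE_NUMBER"

# Assumption: these spellings count as "enabled"; the real parsing may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def rank_as_seq_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if the flag is explicitly set to a truthy value."""
    env = os.environ if env is None else env
    # An unset variable yields "", so the feature stays disabled by default.
    value = env.get(ENV_VAR, "")
    return value.strip().lower() in _TRUTHY
```

Passing the environment as a parameter keeps the check easy to exercise in tests without mutating `os.environ`.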

Differential Revision: D48130088
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D52083376

Summary:
Pull Request resolved: pytorch#408
@facebook-github-bot facebook-github-bot merged commit 4ff6edf into pytorch:main Feb 11, 2025
7 of 9 checks passed