Conversation

@d4l3k (Member) commented Apr 26, 2024

Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.

This uses 2 approaches for different areas:

  • local rank assignment: each worker does 1 set and 1 get; local ranks are assigned on the rank 0 host in an O(n) pass, which keeps total store operations linear in the number of workers.
  • exit_barrier: use a counter plus a final flag so each worker does at most 1 set, 1 get, and 1 add (see the sketch below).

At 4000 hosts, torchelastic is able to run in as little as 10 seconds, down from 373 seconds.
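
To make the exit_barrier bullet concrete, here is a minimal sketch of the counter-plus-final-flag pattern, assuming only the standard torch.distributed Store operations (add, set, wait). The function name `barrier`, the key names, and the host/port are illustrative; this is not the actual torchelastic implementation.

```
# Minimal sketch of a counter + final-flag barrier. Not the torchelastic
# implementation; `barrier` and the key names are illustrative.
from datetime import timedelta

from torch.distributed import TCPStore


def barrier(store, world_size: int, key_prefix: str) -> None:
    """Each worker performs at most one add, one set, and one wait."""
    counter_key = f"{key_prefix}/count"
    final_key = f"{key_prefix}/final"

    # Atomically record this worker's arrival: one store op per worker.
    arrived = store.add(counter_key, 1)
    if arrived == world_size:
        # The last arrival releases everyone with a single flag write.
        store.set(final_key, "done")

    # Block on one flag instead of polling per-rank keys, so total store
    # traffic stays linear in the number of workers.
    store.wait([final_key])


if __name__ == "__main__":
    # Single-process example; host/port are placeholders.
    store = TCPStore(
        "localhost", 29500, world_size=1, is_master=True,
        timeout=timedelta(seconds=30),
    )
    barrier(store, world_size=1, key_prefix="exit_barrier/attempt_0")
```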

Test Plan:
This was tested using many small test jobs running on a remote cluster.

{D56549942}

torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1

Differential Revision: D56605193

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

pytorch-bot bot commented Apr 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124982

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c196ab9 with merge base 43069c4:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed label Apr 26, 2024
@d4l3k added the module: performance and topic: not user facing labels Apr 26, 2024
Contributor

I wonder if this assumption can be broken? This depends on the rendezvous handler, right?

Member Author

Even if it's not collocated it should still be much faster than before since it's way fewer operations

Contributor

This changes the default timeout; is that OK?

Member Author

That seems to be how these store functions operate -- I don't love it but keeping to existing patterns. We already set it in the existing synchronize call so I expect it's already at that value.

Member Author

Oh yeah -- since this eliminates synchronize, and synchronize was setting the timeout, the behavior is exactly the same.

Contributor

There will be cases where the TCPStore is shared with and used by the trainer via the master_addr/master_port args. Should we consider scoping it to just this place?

Member Author

Changed to scope it to just this with a context manager -- just a warning that this is a breaking change relative to the previous behavior.
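
For reference, a hedged sketch of what scoping the timeout with a context manager can look like, assuming the Store binding exposes set_timeout() (a client-side setting on that store object); the name scoped_store_timeout and the explicit restore argument are illustrative, not necessarily what this PR does.

```
# Hedged sketch: temporarily apply a rendezvous-specific timeout to a shared
# store, then restore a caller-provided value. `scoped_store_timeout` is an
# illustrative name, not the helper added in this PR.
from contextlib import contextmanager
from datetime import timedelta

from torch.distributed import TCPStore


@contextmanager
def scoped_store_timeout(store, timeout: timedelta, restore: timedelta):
    store.set_timeout(timeout)  # client-side: affects only this store object
    try:
        yield store
    finally:
        # The Store API has no getter for the previous timeout, so the caller
        # supplies the value to restore (an assumption of this sketch).
        store.set_timeout(restore)


if __name__ == "__main__":
    # A trainer using its own client via master_addr/master_port keeps its
    # own timeout, since this setting never leaves this client object.
    store = TCPStore("localhost", 29503, world_size=1, is_master=True)
    store.set("k", "v")
    with scoped_store_timeout(store, timedelta(seconds=300), timedelta(seconds=30)):
        print(store.get("k"))
```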

Collaborator

ideally the store API has set(key, val, timeout) and get(key, timeout) but unfortunately that's not the case today. In the case of TCPStore this is a client-side timeout, so even if we shared the "TCPStore" server between the agent and trainers this timeout won't get propagated down to the trainers.

@d4l3k force-pushed the export-D56605193 branch from 35429b3 to 214d4ec April 26, 2024 00:21
@d4l3k added the suppress-bc-linter label Apr 26, 2024
@d4l3k force-pushed the export-D56605193 branch 2 times, most recently from 6a78b9d to 3263bd4 April 26, 2024 03:17
@d4l3k requested a review from kurman April 26, 2024 03:18
@facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@d4l3k force-pushed the export-D56605193 branch from 3263bd4 to 24cb26c April 26, 2024 16:59
@kurman (Contributor) left a comment

Overall LGTM, except for a concern about altering the timeout.

@d4l3k force-pushed the export-D56605193 branch from 24cb26c to 094c229 April 26, 2024 17:29
@d4l3k requested a review from kurman April 26, 2024 17:30
Collaborator

Why not use store.multi_get() here (just as you used store.multi_set() for writes)?

Member Author

yeah, that's a fair point -- updated to use multi_get where possible
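
For illustration, a small self-contained sketch of the batched read using Store.multi_set()/multi_get(); the key names, payloads, and host/port are made up.

```
# Illustrative sketch: one multi_get round-trip instead of world_size
# individual get() calls. Keys and payloads are made up for the example.
from datetime import timedelta

from torch.distributed import TCPStore

store = TCPStore("localhost", 29501, world_size=1, is_master=True,
                 timeout=timedelta(seconds=30))
world_size = 4

keys = [f"rank_payload/{rank}" for rank in range(world_size)]
store.multi_set(keys, [f"host-{rank}".encode() for rank in range(world_size)])

# multi_get blocks until every key exists, then returns all values at once.
payloads = [value.decode() for value in store.multi_get(keys)]
print(payloads)  # ['host-0', 'host-1', 'host-2', 'host-3']
```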

@pytorch-bot bot added the ciflow/trunk label Apr 26, 2024
@d4l3k force-pushed the export-D56605193 branch from 094c229 to 5e559f4 April 26, 2024 18:58
@facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@d4l3k requested a review from c-p-i-o April 26, 2024 19:02
@d4l3k force-pushed the export-D56605193 branch from 5e559f4 to b96925a April 26, 2024 19:15

It may be worth adding a comment that get will block on the key being populated in TCPStore, as it may not be immediately obvious that this is the behavior.

Member Author

added

I think we have a 10 min timeout for this get? If other ranks hang for some reason, then this will exit? I haven't checked, but as long as we propagate the error this might be fine.

Member Author

Yes, there's a timeout on this -- it defaults to 300s (5 minutes).
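
A tiny hedged example of the blocking behavior discussed above: get() blocks until the key is set (or the client-side timeout expires; TCPStore defaults to 300 seconds). The key name and host/port are hypothetical.

```
# get() blocks until the key exists (or the client-side timeout expires),
# so no explicit polling loop is needed. Key name is hypothetical.
from datetime import timedelta

from torch.distributed import TCPStore

store = TCPStore("localhost", 29502, world_size=1, is_master=True,
                 timeout=timedelta(seconds=5))  # default would be 300s

store.set("final_flag", b"done")     # normally written by another worker
value = store.get("final_flag")      # returns as soon as the key is set
assert value == b"done"
```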

@d4l3k force-pushed the export-D56605193 branch 2 times, most recently from fcc88b2 to ebdc275 April 26, 2024 21:15
@facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@d4l3k force-pushed the export-D56605193 branch from ebdc275 to dae7cfe April 26, 2024 21:28
@d4l3k (Member Author) commented Apr 26, 2024

I've verified this with an e2e 32x8 GPU job that uses NCCL init and an allreduce. The job succeeded across all 3 retries. Merging.

@d4l3k (Member Author) commented Apr 26, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Failing merge rule: Core Maintainers

@d4l3k force-pushed the export-D56605193 branch from dae7cfe to 811918e April 26, 2024 22:22
@d4l3k requested a review from albanD as a code owner April 26, 2024 22:22
@d4l3k (Member Author) commented Apr 26, 2024

@pytorchbot merge

@facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator

…instead of O(n^2) (#124982)

Pull Request resolved: #124982

Reviewed By: kurman

Differential Revision: D56605193

Pulled By: d4l3k
@d4l3k force-pushed the export-D56605193 branch from 811918e to c196ab9 April 26, 2024 22:35
@facebook-github-bot (Contributor)

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

@d4l3k deleted the export-D56605193 branch April 29, 2024 17:49
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
…instead of O(n^2) (#124982)

Differential Revision: D56605193

Pull Request resolved: #124982
Approved by: https://github.com/kiukchung, https://github.com/kurman

Labels

ciflow/trunk, Merged, module: performance, oncall: distributed, suppress-bc-linter, topic: not user facing
