Conversation

@d4l3k (Member) commented Apr 26, 2024

Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.

This uses 2 approaches for different areas:

  • local rank assignment: each worker does 1 set and 1 get; local ranks are assigned on the rank 0 host in an O(n) pass, which keeps total store operations linear in the number of workers.
  • exit_barrier: use a counter plus a final flag so each worker does at most 1 set, 1 get, and 1 add (see the sketch below).

At 4000 hosts, torchelastic is able to run in as little as 10 seconds, down from 373 seconds.
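
To make the exit_barrier bullet concrete, here is a minimal sketch of the counter-plus-final-flag pattern, assuming only the standard torch.distributed Store operations (add, set, wait). The function name `barrier`, the key names, and the host/port are illustrative; this is not the actual torchelastic implementation.

```
# Minimal sketch of a counter + final-flag barrier. Not the torchelastic
# implementation; `barrier` and the key names are illustrative.
from datetime import timedelta

from torch.distributed import TCPStore


def barrier(store, world_size: int, key_prefix: str) -> None:
    """Each worker performs at most one add, one set, and one wait."""
    counter_key = f"{key_prefix}/count"
    final_key = f"{key_prefix}/final"

    # Atomically record this worker's arrival: one store op per worker.
    arrived = store.add(counter_key, 1)
    if arrived == world_size:
        # The last arrival releases everyone with a single flag write.
        store.set(final_key, "done")

    # Block on one flag instead of polling per-rank keys, so total store
    # traffic stays linear in the number of workers.
    store.wait([final_key])


if __name__ == "__main__":
    # Single-process example; host/port are placeholders.
    store = TCPStore(
        "localhost", 29500, world_size=1, is_master=True,
        timeout=timedelta(seconds=30),
    )
    barrier(store, world_size=1, key_prefix="exit_barrier/attempt_0")
```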

Test Plan:
This was tested using many small test jobs running on a remote cluster.

{D56549942}

torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1

Differential Revision: D56605193

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

pytorch-bot bot commented Apr 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124982

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c196ab9 with merge base 43069c4:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed label Apr 26, 2024
@d4l3k added the module: performance and topic: not user facing labels Apr 26, 2024
Contributor

I wonder if this assumption can be broken? This depends on the rendezvous handler, right?

Member Author

Even if it's not collocated it should still be much faster than before since it's way fewer operations

Contributor

This changes the default timeout; is that OK?

Member Author

That seems to be how these store functions operate -- I don't love it but keeping to existing patterns. We already set it in the existing synchronize call so I expect it's already at that value.

Member Author

Oh yeah -- since this eliminates synchronize, and synchronize was setting the timeout, the behavior is exactly the same.

Contributor

There will be cases where the TCPStore is shared with and used by the trainer via the master_addr/master_port args. Should we consider scoping it to just this place?

Member Author

Changed to scope it to just this with a context manager -- just a warning that this is a breaking change relative to the previous behavior.
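
For reference, a hedged sketch of what scoping the timeout with a context manager can look like, assuming the Store binding exposes set_timeout() (a client-side setting on that store object); the name scoped_store_timeout and the explicit restore argument are illustrative, not necessarily what this PR does.

```
# Hedged sketch: temporarily apply a rendezvous-specific timeout to a shared
# store, then restore a caller-provided value. `scoped_store_timeout` is an
# illustrative name, not the helper added in this PR.
from contextlib import contextmanager
from datetime import timedelta

from torch.distributed import TCPStore


@contextmanager
def scoped_store_timeout(store, timeout: timedelta, restore: timedelta):
    store.set_timeout(timeout)  # client-side: affects only this store object
    try:
        yield store
    finally:
        # The Store API has no getter for the previous timeout, so the caller
        # supplies the value to restore (an assumption of this sketch).
        store.set_timeout(restore)


if __name__ == "__main__":
    # A trainer using its own client via master_addr/master_port keeps its
    # own timeout, since this setting never leaves this client object.
    store = TCPStore("localhost", 29503, world_size=1, is_master=True)
    store.set("k", "v")
    with scoped_store_timeout(store, timedelta(seconds=300), timedelta(seconds=30)):
        print(store.get("k"))
```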

Collaborator

ideally the store API has set(key, val, timeout) and get(key, timeout) but unfortunately that's not the case today. In the case of TCPStore this is a client-side timeout, so even if we shared the "TCPStore" server between the agent and trainers this timeout won't get propagated down to the trainers.

@d4l3k force-pushed the export-D56605193 branch from 35429b3 to 214d4ec April 26, 2024 00:21
@d4l3k added the suppress-bc-linter label Apr 26, 2024
@d4l3k force-pushed the export-D56605193 branch 2 times, most recently from 6a78b9d to 3263bd4 April 26, 2024 03:17
@d4l3k requested a review from kurman April 26, 2024 03:18
@facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@d4l3k force-pushed the export-D56605193 branch from 3263bd4 to 24cb26c April 26, 2024 16:59
@kurman (Contributor) left a comment

Overall LGTM, except for a concern about altering the timeout.

@d4l3k force-pushed the export-D56605193 branch from 24cb26c to 094c229 April 26, 2024 17:29
@d4l3k requested a review from kurman April 26, 2024 17:30
Collaborator

Why not use store.multi_get() here (just as you used store.multi_set() for writes)?

Member Author

yeah, that's a fair point -- updated to use multi_get where possible
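
For illustration, a small self-contained sketch of the batched read using Store.multi_set()/multi_get(); the key names, payloads, and host/port are made up.

```
# Illustrative sketch: one multi_get round-trip instead of world_size
# individual get() calls. Keys and payloads are made up for the example.
from datetime import timedelta

from torch.distributed import TCPStore

store = TCPStore("localhost", 29501, world_size=1, is_master=True,
                 timeout=timedelta(seconds=30))
world_size = 4

keys = [f"rank_payload/{rank}" for rank in range(world_size)]
store.multi_set(keys, [f"host-{rank}".encode() for rank in range(world_size)])

# multi_get blocks until every key exists, then returns all values at once.
payloads = [value.decode() for value in store.multi_get(keys)]
print(payloads)  # ['host-0', 'host-1', 'host-2', 'host-3']
```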

@pytorch-bot bot added the ciflow/trunk label Apr 26, 2024
@d4l3k force-pushed the export-D56605193 branch from 094c229 to 5e559f4 April 26, 2024 18:58
@facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@d4l3k requested a review from c-p-i-o April 26, 2024 19:02
@d4l3k force-pushed the export-D56605193 branch from 5e559f4 to b96925a April 26, 2024 19:15

It may be worth adding a comment that get will block on the key being populated in TCPStore, as it may not be immediately obvious that this is the behavior.

Member Author

added

I think we have a 10 min timeout for this get? If other ranks hang for some reason, then this will exit? I haven't checked, but as long as we propagate the error this might be fine.

Member Author

Yes, there's a timeout on this -- it defaults to 300s (5 minutes).
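
A tiny hedged example of the blocking behavior discussed above: get() blocks until the key is set (or the client-side timeout expires; TCPStore defaults to 300 seconds). The key name and host/port are hypothetical.

```
# get() blocks until the key exists (or the client-side timeout expires),
# so no explicit polling loop is needed. Key name is hypothetical.
from datetime import timedelta

from torch.distributed import TCPStore

store = TCPStore("localhost", 29502, world_size=1, is_master=True,
                 timeout=timedelta(seconds=5))  # default would be 300s

store.set("final_flag", b"done")     # normally written by another worker
value = store.get("final_flag")      # returns as soon as the key is set
assert value == b"done"
```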

@d4l3k force-pushed the export-D56605193 branch 2 times, most recently from fcc88b2 to ebdc275 April 26, 2024 21:15
@facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@d4l3k force-pushed the export-D56605193 branch from ebdc275 to dae7cfe April 26, 2024 21:28
@d4l3k (Member Author) commented Apr 26, 2024

I've verified this with an e2e 32x8 GPU job that uses NCCL init and an allreduce. The job succeeded across all 3 retries. Merging.

@d4l3k (Member Author) commented Apr 26, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Failing merge rule: Core Maintainers

@d4l3k force-pushed the export-D56605193 branch from dae7cfe to 811918e April 26, 2024 22:22
@d4l3k requested a review from albanD as a code owner April 26, 2024 22:22
@d4l3k (Member Author) commented Apr 26, 2024

@pytorchbot merge

@facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator

…instead of O(n^2) (#124982)

Pull Request resolved: #124982

Reviewed By: kurman

Differential Revision: D56605193

Pulled By: d4l3k
@d4l3k force-pushed the export-D56605193 branch from 811918e to c196ab9 April 26, 2024 22:35
@facebook-github-bot (Contributor)

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

@d4l3k deleted the export-D56605193 branch April 29, 2024 17:49
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
…instead of O(n^2) (#124982)

Differential Revision: D56605193

Pull Request resolved: #124982
Approved by: https://github.com/kiukchung, https://github.com/kurman

Labels

ciflow/trunk, Merged, module: performance, oncall: distributed, suppress-bc-linter, topic: not user facing
