[TorchElastic] Option for sharing TCPStore created by rdzv handlers #125743

Closed
wants to merge 1 commit

Conversation

kurman
Contributor

@kurman kurman commented May 8, 2024

Summary:

  1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
  2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by the rdzv_handler, where `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
    • Depending on the implementation they can either:
      • point to an existing store (and are expected to set `use_agent_store` to true, per point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared (see the sketch below).
      • build args that `torch.distributed.init_process_group` can use to bootstrap a new store.

Additional points:

  • When the TCPStore is shared, it should be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.
  • The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility.
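
For illustration, a minimal sketch of how trainer-side code might consume the new return type. The attribute names (`bootstrap_store_info`, `master_addr`, `master_port`) and the `TORCHELASTIC_USE_AGENT_STORE` value convention are assumptions for this sketch, not the final API:

```python
import os

import torch.distributed as dist
from torch.distributed import PrefixStore


def bootstrap_trainer(rdzv_handler, backend: str = "gloo") -> None:
    # next_rendezvous() now returns a RendezvousInfo instead of a
    # (store, rank, world_size) tuple.
    rdzv_info = rdzv_handler.next_rendezvous()

    if os.environ.get("TORCHELASTIC_USE_AGENT_STORE") == str(True):  # value convention assumed
        # The agent's TCPStore is shared: scope it with a PrefixStore so
        # trainer keys cannot collide with rendezvous/agent keys.
        store = PrefixStore("trainer/", rdzv_info.store)
        dist.init_process_group(
            backend,
            store=store,
            rank=rdzv_info.rank,
            world_size=rdzv_info.world_size,
        )
    else:
        # Otherwise init_process_group bootstraps a new TCPStore from the
        # advertised address/port.
        store_info = rdzv_info.bootstrap_store_info  # assumed attribute name
        os.environ["MASTER_ADDR"] = store_info.master_addr
        os.environ["MASTER_PORT"] = str(store_info.master_port)
        dist.init_process_group(
            backend,
            rank=rdzv_info.rank,
            world_size=rdzv_info.world_size,
        )
```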

Why:

  • Reduce moving parts
    • easier to swap implementation
    • improve tractability
    • addressing perf/debug-ability will benefit all use cases

Test Plan: CI

Differential Revision: D57055235

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125743

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV that may affect this PR.

✅ No Failures

As of commit 12d5173 with merge base 4e6673e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 8, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D57055235

kurman added a commit to kurman/pytorch that referenced this pull request May 8, 2024
[TorchElastic] Option for sharing TCPStore created by rdzv handlers (pytorch#125743)

Summary:
Pull Request resolved: pytorch#125743

- Expose a `static` property on rdzv handlers that indicates whether they are a non-elastic type of rendezvous
- Allow rdzv handlers to share the TCPStore by setting MASTER_ADDR/MASTER_PORT in the store itself
- When the TCPStore is shared, it will be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.

With the above changes, the rdzv handler store can be exposed/shared with trainers.

Why:
- Reduce moving parts
   - easier to swap implementations
   - improve tractability
   - addressing perf/debug-ability will benefit all use cases

Test Plan: CI

Differential Revision: D57055235
kurman added a commit to kurman/pytorch that referenced this pull request May 9, 2024
kurman added a commit to kurman/pytorch that referenced this pull request May 9, 2024
kurman added a commit to kurman/pytorch that referenced this pull request May 13, 2024
kurman added a commit to kurman/pytorch that referenced this pull request May 13, 2024

@kurman kurman requested review from d4l3k and wconstab May 13, 2024 16:57

kurman added a commit to kurman/pytorch that referenced this pull request May 17, 2024
@kurman kurman requested a review from d4l3k May 17, 2024 23:54
kurman added a commit to kurman/pytorch that referenced this pull request May 20, 2024
kurman added a commit to kurman/pytorch that referenced this pull request May 20, 2024

@kurman
Contributor Author

kurman commented May 20, 2024

Some of the actions are failing due to a module visibility check issue - addressing it.

Collaborator

@d4l3k d4l3k left a comment

looking much better!

@@ -48,19 +106,26 @@ class RendezvousHandler(ABC):
     def get_backend(self) -> str:
         """Return the name of the rendezvous backend."""

+    @property
+    def use_agent_store(self) -> bool:
Collaborator

thoughts on using this vs making _bootstrap_store_info optional?
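
To make the two options concrete, a small sketch of the trade-off under discussion (neither signature is the final API; `RendezvousStoreInfo` here is a forward reference for illustration):

```python
from typing import Optional


class HandlerWithFlag:
    # (a) Explicit boolean property: the agent checks it before exporting
    # TORCHELASTIC_USE_AGENT_STORE to trainer processes.
    @property
    def use_agent_store(self) -> bool:
        return True


class HandlerWithOptionalInfo:
    # (b) Alternative raised in review: make _bootstrap_store_info optional,
    # where returning None would mean "share the agent's store".
    def _bootstrap_store_info(self) -> Optional["RendezvousStoreInfo"]:
        return None
```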

    master_port: int

    @staticmethod
    def build(rank: int, store: Store) -> "RendezvousStoreInfo":
Contributor

I'm confused about the order of operations.

I thought the build() path was used when we did not have a store to share from the elastic layer.

Actually, I guess it must be the following: we always have a TCPStore at the elastic layer, so we can use it to share information. But we call build() when we don't want to share that store with the trainer, and instead we just use the elastic layer's store for exchanging port info so the trainer can create its own store?

Contributor Author

Yes. `build` here is a factory method of `RendezvousStoreInfo` that defines the master_addr/master_port env variables that trainer code can use to start a new store server.
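
A minimal sketch of that factory behavior, assuming rank 0 publishes its address and a free port through the rendezvous store (key names and the port-picking approach are assumptions, not the exact implementation):

```python
import socket
from dataclasses import dataclass

from torch.distributed import Store


@dataclass
class RendezvousStoreInfo:
    master_addr: str
    master_port: int

    @staticmethod
    def build(rank: int, store: Store) -> "RendezvousStoreInfo":
        if rank == 0:
            addr = socket.getfqdn()
            # Bind to port 0 so the OS picks a free port. (The real code may
            # hold the socket open longer to reduce the reuse race.)
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", 0))
                port = s.getsockname()[1]
            store.set("MASTER_ADDR", addr)
            store.set("MASTER_PORT", str(port))
        # store.get blocks, so non-zero ranks wait for rank 0 to publish.
        addr = store.get("MASTER_ADDR").decode()
        port = int(store.get("MASTER_PORT").decode())
        return RendezvousStoreInfo(master_addr=addr, master_port=port)
```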

kurman added a commit to kurman/pytorch that referenced this pull request May 21, 2024

[TorchElastic] Option for sharing TCPStore created by rdzv handlers (pytorch#125743)

Summary: Pull Request resolved: pytorch#125743

Test Plan: CI

Differential Revision: D57055235

Collaborator

@d4l3k d4l3k left a comment

LGTM

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 21, 2024
@@ -298,7 +302,8 @@ def _get_record_metrics_test_calls(
return calls

     def test_rendezvous(self):
-        spec = self._get_worker_spec(max_restarts=1)
+        hostname = _get_fq_hostname()
+        spec = self._get_worker_spec(max_restarts=1, local_addr=hostname)
Collaborator

I see this test is changing - are we changing behavior here around requiring the FQDN to be passed in?

Contributor Author

Collective operations happen in the handler now, so we need an address that is resolvable.
Prior to this change, tests would just query the store for master_addr/master_port.

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status

@@ -356,37 +357,6 @@ def is_failed(self) -> bool:
         return self.state == WorkerState.FAILED


-def _get_socket_with_port() -> socket.socket:


Hi @kurman. After deleting the `_get_socket_with_port` API, can I know what the equivalent API/method is?
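
For reference, a minimal stand-in using the standard bind-to-port-0 pattern the removed helper followed (an illustration, not an official replacement API):

```python
import socket


def get_socket_with_port() -> socket.socket:
    # Try each address family for localhost; bind to port 0 so the OS
    # assigns a free port, and keep the socket open so the port stays
    # reserved until the caller hands it off.
    for addr in socket.getaddrinfo(
        host="localhost", port=None, family=socket.AF_UNSPEC, type=socket.SOCK_STREAM
    ):
        family, sock_type, proto, _, _ = addr
        s = socket.socket(family, sock_type, proto)
        try:
            s.bind(("localhost", 0))  # port 0 => OS picks a free port
            s.listen(0)
            return s
        except OSError:
            s.close()
    raise RuntimeError("Failed to create a socket")


# Usage: port = get_socket_with_port().getsockname()[1]
```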

Labels

ciflow/trunk · fb-exported · Merged · oncall: distributed · release notes: distributed (torchelastic)

6 participants