[TorchElastic] Option for sharing TCPStore created by rdzv handlers #125743
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125743
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures as of commit 12d5173 with merge base 4e6673e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D57055235
…ytorch#125743)

Summary:

Pull Request resolved: pytorch#125743

- Expose a `static` property on rdzv handlers that indicates whether they are a non-elastic type of rendezvous.
- Allow rdzv handlers to share the TCPStore by setting MASTER_ADDR/MASTER_PORT in the store itself.
- When the TCPStore is shared, it will be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.

With the above changes, the rdzv handler store can be exposed/shared with trainers.

Why:
- Reduce moving parts: easier to swap implementations, improved tractability.
- Addressing perf/debuggability will benefit all use cases.

Test Plan: CI

Differential Revision: D57055235
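The namespace-scoping idea can be illustrated with a toy sketch. `DictStore` and this minimal `PrefixStore` are stand-ins for torch.distributed's real `TCPStore`/`PrefixStore`, not the merged code: every key a tenant writes is scoped under its own prefix, so the rendezvous handler's keys and the trainer's keys cannot collide in one shared store.

```python
class DictStore:
    """Toy stand-in for a shared TCPStore: one flat key/value namespace."""
    def __init__(self):
        self._kv = {}
    def set(self, key, value):
        self._kv[key] = value
    def get(self, key):
        return self._kv[key]

class PrefixStore:
    """Minimal sketch of prefix scoping: every key lives under `prefix/`."""
    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store
    def set(self, key, value):
        self._store.set(f"{self._prefix}/{key}", value)
    def get(self, key):
        return self._store.get(f"{self._prefix}/{key}")

base = DictStore()
rdzv = PrefixStore("torchelastic", base)      # prefix names are illustrative
trainer = PrefixStore("default_pg", base)
rdzv.set("MASTER_PORT", "29500")
trainer.set("MASTER_PORT", "29501")           # no collision: different prefixes
```

Both tenants use the same underlying store, yet each sees only its own `MASTER_PORT` value.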
…ytorch#125743)

Summary:
- Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it true can share the store.
- Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated in the rdzv_handler: `RendezvousInfo` will have a `RendezvousStoreInfo` object that handlers must return. Depending on the implementation, they can either point to an existing store or just build args that `torch.distributed.init_process_group` can bootstrap from by creating a new store.
- When the TCPStore is shared, it should be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility.

With the above changes, the rdzv handler store can be exposed/shared with trainers.

Why:
- Reduce moving parts: easier to swap implementations, improved tractability.
- Addressing perf/debuggability will benefit all use cases.

Test Plan: CI

Differential Revision: D57055235
…ytorch#125743)

Summary:
1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it true can share the store.
2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated in the rdzv_handler: `RendezvousInfo` will have a `RendezvousStoreInfo` object that handlers must return. Depending on the implementation, they can either:
   - point to an existing store (and are expected to set `use_agent_store` to true, per point 1); client code relies on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared; or
   - build args that `torch.distributed.init_process_group` can bootstrap from by creating a new store.

Additional points:
- When the TCPStore is shared, it should be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility.

Why:
- Reduce moving parts: easier to swap implementations, improved tractability.
- Addressing perf/debuggability will benefit all use cases.

Test Plan: CI

Differential Revision: D57055235
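The `next_rendezvous` signature change can be sketched as follows. The dataclasses below only mirror the shape described in this PR; the field name `bootstrap_store_info` and the `FakeHandler` class are illustrative, not the merged API.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class RendezvousStoreInfo:
    master_addr: str
    master_port: int

@dataclass
class RendezvousInfo:
    store: Any                                 # store handle (possibly the agent's, prefix-scoped)
    rank: int
    world_size: int
    bootstrap_store_info: RendezvousStoreInfo  # illustrative field name

class FakeHandler:
    def next_rendezvous(self) -> RendezvousInfo:
        # The old signature returned a (store, rank, world_size) tuple;
        # returning one object lets new fields be added without breaking callers.
        return RendezvousInfo(
            store=object(),
            rank=0,
            world_size=2,
            bootstrap_store_info=RendezvousStoreInfo("node0", 29500),
        )

info = FakeHandler().next_rendezvous()
```

Callers migrate from tuple unpacking to attribute access (`info.rank`, `info.world_size`), which is the extensibility point the summary mentions.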
Some of the actions are failing due to a module visibility check issue; addressing it.
looking much better!
@@ -48,19 +106,26 @@ class RendezvousHandler(ABC):
    def get_backend(self) -> str:
        """Return the name of the rendezvous backend."""

    @property
    def use_agent_store(self) -> bool:
Thoughts on using this vs. making `_bootstrap_store_info` optional?
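As a sketch of the flag-based approach under discussion (class names here only approximate the PR, and this is not the merged code): the base handler can default to not sharing its store, with static rendezvous handlers overriding the property.

```python
from abc import ABC

class RendezvousHandler(ABC):
    @property
    def use_agent_store(self) -> bool:
        # Default: the handler's store is not shared with trainers.
        return False

class StaticRendezvousHandler(RendezvousHandler):
    @property
    def use_agent_store(self) -> bool:
        # A static (non-elastic) rendezvous can safely hand its store out.
        return True

class DynamicRendezvousHandler(RendezvousHandler):
    pass  # inherits the conservative default
```

Compared with an optional `_bootstrap_store_info`, an explicit boolean keeps the "is the store shared?" question answerable without inspecting whether another attribute happens to be populated.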
    master_port: int

    @staticmethod
    def build(rank: int, store: Store) -> "RendezvousStoreInfo":
I'm confused about the order of operations.
I thought the build() path was used when we did not have a store to share from the elastic layer.
Actually, I guess it must be the following: we always have a TCPStore at the elastic layer, so we can use it to share information. But we call build() when we don't want to share that store with the trainer, and instead we just use the elastic layer's store for exchanging port info so the trainer can create its own store?
Yes. `build` here is a factory method of `RendezvousStoreInfo` that will define the master_addr/master_port env variables that trainer code can use to start a new store server.
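A minimal sketch of what such a factory amounts to, using a plain dict as a stand-in for the agent's store (the real `RendezvousStoreInfo.build` differs in details, and the key names here are illustrative): rank 0 picks a resolvable address and a free port and publishes them; every rank then reads the agreed-upon values.

```python
import socket

def build_store_info(rank: int, store: dict) -> tuple[str, int]:
    """Sketch: rank 0 publishes master_addr/master_port through the agent's
    store; all ranks read the agreed-upon values back out."""
    if rank == 0:
        addr = socket.getfqdn()
        with socket.socket() as s:
            s.bind(("", 0))              # let the OS assign a free port
            port = s.getsockname()[1]
        store["MASTER_ADDR"] = addr
        store["MASTER_PORT"] = str(port)
    return store["MASTER_ADDR"], int(store["MASTER_PORT"])

store = {}
addr, port = build_store_info(0, store)   # rank 0 publishes
```

Note the port is released before the trainer rebinds it, so there is a small race window; the real implementation has the same fundamental constraint when it hands out a port for a store that does not exist yet.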
LGTM
@@ -298,7 +302,8 @@ def _get_record_metrics_test_calls(
        return calls

    def test_rendezvous(self):
-       spec = self._get_worker_spec(max_restarts=1)
+       hostname = _get_fq_hostname()
+       spec = self._get_worker_spec(max_restarts=1, local_addr=hostname)
I see this test is changing? Are we changing behavior here around requiring the fqdn being passed in?
Collective operations now happen in the handler, so we need an address that is resolvable.
Prior to this change, tests would just query the store for master_addr/master_port.
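For reference, a fully-qualified, resolvable hostname can be obtained with the stdlib; this is a sketch of what a helper like `_get_fq_hostname` amounts to, not necessarily its exact implementation.

```python
import socket

def get_fq_hostname() -> str:
    # getfqdn() attempts a reverse lookup to canonicalize the name;
    # it falls back to the bare hostname when no FQDN can be resolved.
    return socket.getfqdn(socket.gethostname())

name = get_fq_hostname()
```

Using the FQDN rather than a bare hostname matters once other nodes must resolve the address to join collectives.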
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since the Phabricator diff has merged; using force because this PR might not pass merge_rules.json but landed internally)
Merge started: your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@@ -356,37 +357,6 @@ def is_failed(self) -> bool:
        return self.state == WorkerState.FAILED


def _get_socket_with_port() -> socket.socket:
Hi @kurman. After the `_get_socket_with_port` API was deleted, can you tell me what the equivalent API/method is?
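For anyone hitting the same removal: the helper's behavior is easy to reproduce with the stdlib. This is a sketch, not an official replacement; check the current torch.distributed.elastic utilities for a supported API.

```python
import socket

def get_socket_with_port() -> socket.socket:
    """Return a listening socket bound to an OS-assigned free port.
    The caller reads the port via sock.getsockname()[1] and must close it."""
    for family, type_, proto, _, sockaddr in socket.getaddrinfo(
        None, 0, socket.AF_UNSPEC, socket.SOCK_STREAM, 0, socket.AI_PASSIVE
    ):
        sock = socket.socket(family, type_, proto)
        try:
            sock.bind(sockaddr)
            sock.listen(0)
            return sock
        except OSError:
            sock.close()
    raise RuntimeError("failed to create a socket bound to a free port")

sock = get_socket_with_port()
port = sock.getsockname()[1]
sock.close()
```

Keeping the socket open until the real server binds is what prevents another process from grabbing the port in the meantime, which is the point of returning the socket rather than just the port number.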
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k