[TorchElastic] Option for sharing TCPStore created by rdzv handlers #125743

Closed
wants to merge 1 commit

Conversation

kurman
Contributor

@kurman kurman commented May 8, 2024

Summary:

  1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
  2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by the rdzv_handler, where `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
    • Depending on the implementation they can either:
      • point to an existing store (and are expected to set `use_agent_store` to true, per point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared (see the sketch below).
      • build args that `torch.distributed.init_process_group` can use to bootstrap a new store.

Additional points:

  • When the TCPStore is shared, it should be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.
  • The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility.
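
For illustration, a minimal sketch of how trainer-side code might consume the new return type. The attribute names (`bootstrap_store_info`, `master_addr`, `master_port`) and the `TORCHELASTIC_USE_AGENT_STORE` value convention are assumptions for this sketch, not the final API:

```python
import os

import torch.distributed as dist
from torch.distributed import PrefixStore


def bootstrap_trainer(rdzv_handler, backend: str = "gloo") -> None:
    # next_rendezvous() now returns a RendezvousInfo instead of a
    # (store, rank, world_size) tuple.
    rdzv_info = rdzv_handler.next_rendezvous()

    if os.environ.get("TORCHELASTIC_USE_AGENT_STORE") == str(True):  # value convention assumed
        # The agent's TCPStore is shared: scope it with a PrefixStore so
        # trainer keys cannot collide with rendezvous/agent keys.
        store = PrefixStore("trainer/", rdzv_info.store)
        dist.init_process_group(
            backend,
            store=store,
            rank=rdzv_info.rank,
            world_size=rdzv_info.world_size,
        )
    else:
        # Otherwise init_process_group bootstraps a new TCPStore from the
        # advertised address/port.
        store_info = rdzv_info.bootstrap_store_info  # assumed attribute name
        os.environ["MASTER_ADDR"] = store_info.master_addr
        os.environ["MASTER_PORT"] = str(store_info.master_port)
        dist.init_process_group(
            backend,
            rank=rdzv_info.rank,
            world_size=rdzv_info.world_size,
        )
```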

Why:

  • Reduce moving parts
    • easier to swap implementation
    • improve tractability
    • addressing perf/debug-ability will benefit all use cases

Test Plan: CI

Differential Revision: D57055235

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125743

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV that may affect this PR.

✅ No Failures

As of commit 12d5173 with merge base 4e6673e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 8, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D57055235

kurman added a commit to kurman/pytorch that referenced this pull request May 8, 2024
[TorchElastic] Option for sharing TCPStore created by rdzv handlers (pytorch#125743)

Summary:
Pull Request resolved: pytorch#125743

- Expose a `static` property on rdzv handlers that indicates whether they are a non-elastic type of rendezvous
- Allow rdzv handlers to share the TCPStore by setting MASTER_ADDR/MASTER_PORT in the store itself
- When the TCPStore is shared, it will be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.

With the above changes, the rdzv handler store can be exposed/shared with trainers.

Why:
- Reduce moving parts
   - easier to swap implementations
   - improve tractability
   - addressing perf/debug-ability will benefit all use cases

Test Plan: CI

Differential Revision: D57055235
kurman added a commit to kurman/pytorch that referenced this pull request May 9, 2024
kurman added a commit to kurman/pytorch that referenced this pull request May 9, 2024
kurman added a commit to kurman/pytorch that referenced this pull request May 13, 2024
kurman added a commit to kurman/pytorch that referenced this pull request May 13, 2024

@kurman kurman requested review from d4l3k and wconstab May 13, 2024 16:57

kurman added a commit to kurman/pytorch that referenced this pull request May 17, 2024
@kurman kurman requested a review from d4l3k May 17, 2024 23:54
kurman added a commit to kurman/pytorch that referenced this pull request May 20, 2024
kurman added a commit to kurman/pytorch that referenced this pull request May 20, 2024

@kurman
Contributor Author

kurman commented May 20, 2024

Some of the actions are failing due to a module visibility check issue - addressing it.

Collaborator

@d4l3k d4l3k left a comment

looking much better!

@@ -48,19 +106,26 @@ class RendezvousHandler(ABC):
     def get_backend(self) -> str:
         """Return the name of the rendezvous backend."""

+    @property
+    def use_agent_store(self) -> bool:
Collaborator

thoughts on using this vs making _bootstrap_store_info optional?
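
To make the two options concrete, a small sketch of the trade-off under discussion (neither signature is the final API; `RendezvousStoreInfo` here is a forward reference for illustration):

```python
from typing import Optional


class HandlerWithFlag:
    # (a) Explicit boolean property: the agent checks it before exporting
    # TORCHELASTIC_USE_AGENT_STORE to trainer processes.
    @property
    def use_agent_store(self) -> bool:
        return True


class HandlerWithOptionalInfo:
    # (b) Alternative raised in review: make _bootstrap_store_info optional,
    # where returning None would mean "share the agent's store".
    def _bootstrap_store_info(self) -> Optional["RendezvousStoreInfo"]:
        return None
```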

    master_port: int

    @staticmethod
    def build(rank: int, store: Store) -> "RendezvousStoreInfo":
Contributor

I'm confused about the order of operations.

I thought the build() path was used when we did not have a store to share from the elastic layer.

Actually, I guess it must be the following: we always have a TCPStore at the elastic layer, so we can use it to share information. But we call build() when we don't want to share that store with the trainer, and instead we just use the elastic layer's store for exchanging port info so the trainer can create its own store?

Contributor Author

Yes. `build` here is a factory method of `RendezvousStoreInfo` that defines the master_addr/master_port env variables that trainer code can use to start a new store server.
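
A minimal sketch of that factory behavior, assuming rank 0 publishes its address and a free port through the rendezvous store (key names and the port-picking approach are assumptions, not the exact implementation):

```python
import socket
from dataclasses import dataclass

from torch.distributed import Store


@dataclass
class RendezvousStoreInfo:
    master_addr: str
    master_port: int

    @staticmethod
    def build(rank: int, store: Store) -> "RendezvousStoreInfo":
        if rank == 0:
            addr = socket.getfqdn()
            # Bind to port 0 so the OS picks a free port. (The real code may
            # hold the socket open longer to reduce the reuse race.)
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", 0))
                port = s.getsockname()[1]
            store.set("MASTER_ADDR", addr)
            store.set("MASTER_PORT", str(port))
        # store.get blocks, so non-zero ranks wait for rank 0 to publish.
        addr = store.get("MASTER_ADDR").decode()
        port = int(store.get("MASTER_PORT").decode())
        return RendezvousStoreInfo(master_addr=addr, master_port=port)
```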

kurman added a commit to kurman/pytorch that referenced this pull request May 21, 2024

[TorchElastic] Option for sharing TCPStore created by rdzv handlers (pytorch#125743)

Summary: Pull Request resolved: pytorch#125743

Test Plan: CI

Differential Revision: D57055235

Collaborator

@d4l3k d4l3k left a comment

LGTM

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 21, 2024
@@ -298,7 +302,8 @@ def _get_record_metrics_test_calls(
return calls

     def test_rendezvous(self):
-        spec = self._get_worker_spec(max_restarts=1)
+        hostname = _get_fq_hostname()
+        spec = self._get_worker_spec(max_restarts=1, local_addr=hostname)
Collaborator

I see this test is changing - are we changing behavior here around requiring the FQDN to be passed in?

Contributor Author

Collective operations happen in the handler now, so we need an address that is resolvable.
Prior to this change, tests would just query the store for master_addr/master_port.

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status

@@ -356,37 +357,6 @@ def is_failed(self) -> bool:
         return self.state == WorkerState.FAILED


-def _get_socket_with_port() -> socket.socket:


Hi @kurman. After deleting the `_get_socket_with_port` API, can I know what the equivalent API/method is?
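
For reference, a minimal stand-in using the standard bind-to-port-0 pattern the removed helper followed (an illustration, not an official replacement API):

```python
import socket


def get_socket_with_port() -> socket.socket:
    # Try each address family for localhost; bind to port 0 so the OS
    # assigns a free port, and keep the socket open so the port stays
    # reserved until the caller hands it off.
    for addr in socket.getaddrinfo(
        host="localhost", port=None, family=socket.AF_UNSPEC, type=socket.SOCK_STREAM
    ):
        family, sock_type, proto, _, _ = addr
        s = socket.socket(family, sock_type, proto)
        try:
            s.bind(("localhost", 0))  # port 0 => OS picks a free port
            s.listen(0)
            return s
        except OSError:
            s.close()
    raise RuntimeError("Failed to create a socket")


# Usage: port = get_socket_with_port().getsockname()[1]
```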

Labels

ciflow/trunk · fb-exported · Merged · oncall: distributed · release notes: distributed (torchelastic)

6 participants