Skip to content

Conversation

cbalioglu
Copy link
Contributor

@cbalioglu cbalioglu commented Apr 20, 2021

Stack from ghstack:

This PR introduces the _RendezvousStateHolder type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: D27892600

This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Apr 20, 2021

💊 CI failures summary and remediations

As of commit e17705c (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/2)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/docker/common/install_rocm.sh
Auto-merging .circleci/docker/common/install_rocm.sh
CONFLICT (add/add): Merge conflict in .circleci/docker/common/install_conda.sh
Auto-merging .circleci/docker/common/install_conda.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/2)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/docker/common/install_rocm.sh
Auto-merging .circleci/docker/common/install_rocm.sh
CONFLICT (add/add): Merge conflict in .circleci/docker/common/install_conda.sh
Auto-merging .circleci/docker/common/install_conda.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

cbalioglu added a commit that referenced this pull request Apr 20, 2021
This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

ghstack-source-id: 126985728
Pull Request resolved: #56538
@cbalioglu cbalioglu requested review from aivanou and kiukchung April 20, 2021 21:06
This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

[ghstack-poisoned]
cbalioglu added a commit that referenced this pull request Apr 21, 2021
Pull Request resolved: #56538

This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.
ghstack-source-id: 127046689

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)
This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

[ghstack-poisoned]
This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

[ghstack-poisoned]
cbalioglu added a commit that referenced this pull request Apr 21, 2021
Pull Request resolved: #56538

This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.
ghstack-source-id: 127078863

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

response = self.backend.set_state(state_bits, self._token)
else:
if self.cache_duration > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need cache, it would just complicate the rendezvous.

How is it going to be used? - is it going to be periodically invoked in a daemon thread?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caching (by default 1 second) actually helps quite a bit if you have separate components in your system that all use the same _RendezvousStateHolder instance to retrieve state data (as in `DynamicRendezvousHandler). As they are decoupled from each other, if they frequently query the state, this reduces the amount of traffic.

dead_nodes = [
node
for node, last_keep_alive in self.state.last_keep_alives.items()
if last_keep_alive < expire_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to code, we define a node dead if the latest state contains a node that tried to update(or get?) state earlier than now()-last_keep_alive . Should we rename "last_keep_alive" to something like "last_heartbeat_time"?

Copy link
Contributor Author

@cbalioglu cbalioglu Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the last revision of this PR. I renamed "keep-alive" to "hearbeat" in most places.

This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

[ghstack-poisoned]
This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

[ghstack-poisoned]
@cbalioglu cbalioglu changed the title [10/n] [torch/elastic] Introduce _RendezvousStateHolder [15/n] [torch/elastic] Introduce _RendezvousStateHolder Apr 28, 2021
This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

[ghstack-poisoned]
This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job.

Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 76bccfb.

@facebook-github-bot facebook-github-bot deleted the gh/cbalioglu/10/head branch May 7, 2021 14:15
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#56538

This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D27892600

fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants