[15/n] [torch/elastic] Introduce _RendezvousStateHolder #56538
This PR introduces the `_RendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state via the backend with the other nodes in the job. Differential Revision: [D27892600](https://our.internmc.facebook.com/intern/diff/D27892600/) [ghstack-poisoned]
💊 CI failures summary and remediations
As of commit e17705c (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
    response = self.backend.set_state(state_bits, self._token)
else:
    if self.cache_duration > 0:
I don't think we need a cache; it would just complicate the rendezvous.
How is it going to be used? Is it going to be invoked periodically in a daemon thread?
Caching (1 second by default) actually helps quite a bit if you have separate components in your system that all use the same `_RendezvousStateHolder` instance to retrieve state data (as in `DynamicRendezvousHandler`). Because they are decoupled from each other, frequent state queries from each of them would otherwise all reach the backend; the cache reduces that traffic.
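The caching argument above can be illustrated with a minimal sketch. This is not the code from the PR: `CachedStateReader`, its `fetch` callable, the injectable `clock`, and the `backend_reads` counter are all hypothetical names introduced for illustration. Only the idea of a `cache_duration` (1 second by default, with caching disabled when it is not positive) is taken from the discussion.

```python
import time
from typing import Callable, Optional


class CachedStateReader:
    """Hypothetical helper that reuses a fetched state for a short period."""

    def __init__(
        self,
        fetch: Callable[[], bytes],
        cache_duration: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch                    # stands in for a backend read
        self._cache_duration = cache_duration  # seconds; <= 0 disables caching
        self._clock = clock                    # injectable so tests control time
        self._cached: Optional[bytes] = None
        self._last_sync: Optional[float] = None
        self.backend_reads = 0                 # counts calls that reached the backend

    def get(self) -> Optional[bytes]:
        now = self._clock()
        fresh = (
            self._cache_duration > 0
            and self._last_sync is not None
            and now - self._last_sync < self._cache_duration
        )
        if not fresh:
            # Cache is disabled, empty, or expired: go to the backend.
            self._cached = self._fetch()
            self._last_sync = now
            self.backend_reads += 1
        return self._cached
```

With this shape, several decoupled components that share one reader and poll it within the same second cause a single backend read rather than one read per call.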
dead_nodes = [
    node
    for node, last_keep_alive in self.state.last_keep_alives.items()
    if last_keep_alive < expire_time
According to the code, we consider a node dead if the latest state shows that its last state update (or get?) happened earlier than `expire_time`. Should we rename "last_keep_alive" to something like "last_heartbeat_time"?
See the last revision of this PR. I renamed "keep-alive" to "heartbeat" in most places.
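The dead-node check discussed in this thread can be sketched as a standalone function. This is an illustration under assumptions, not the PR's implementation: the function name `find_dead_nodes`, the `timeout` parameter, the use of plain strings as node identifiers, and the computation of `expire_time` as `now - timeout` are all hypothetical. Only the comparison of each node's last heartbeat against `expire_time` comes from the reviewed hunk.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_dead_nodes(
    last_heartbeats: Dict[str, datetime],
    now: datetime,
    timeout: timedelta,
) -> List[str]:
    """Return the nodes whose last heartbeat is older than the timeout."""
    # Assumption: a heartbeat strictly older than this instant counts as expired.
    expire_time = now - timeout
    return [
        node
        for node, last_heartbeat in last_heartbeats.items()
        if last_heartbeat < expire_time
    ]
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to unit test.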
This pull request has been merged in 76bccfb.
Summary: Pull Request resolved: pytorch#56538
This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D27892600
fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962