New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
RPC should destroy/deinitialize dist autograd container when rpc is shutdown #39340
Labels
better-engineering
Relatively self-contained tasks for better engineering contributors
feature
A request for a proper, new feature.
module: rpc
Related to RPC, distributed autograd, RRef, and distributed optimizer
triaged
This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Comments
rohan-varma
added
the
module: rpc
Related to RPC, distributed autograd, RRef, and distributed optimizer
label
Jun 1, 2020
izdeby
added
feature
A request for a proper, new feature.
triaged
This issue has been looked at a team member, and triaged and prioritized into an appropriate module
labels
Jun 3, 2020
rohan-varma
added
the
better-engineering
Relatively self-contained tasks for better engineering contributors
label
Jun 16, 2020
pritamdamania87
pushed a commit
that referenced
this issue
Aug 7, 2020
This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/) [ghstack-poisoned]
pritamdamania87
pushed a commit
that referenced
this issue
Aug 7, 2020
This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/) ghstack-source-id: 109412898 Pull Request resolved: #42723
pritamdamania87
pushed a commit
that referenced
this issue
Aug 7, 2020
This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/) [ghstack-poisoned]
pritamdamania87
pushed a commit
that referenced
this issue
Aug 7, 2020
Pull Request resolved: #42723 This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. ghstack-source-id: 109478799 Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)
pritamdamania87
pushed a commit
that referenced
this issue
Aug 13, 2020
This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/) [ghstack-poisoned]
pritamdamania87
pushed a commit
that referenced
this issue
Aug 13, 2020
Pull Request resolved: #42723 This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. ghstack-source-id: 109805368 Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)
facebook-github-bot
pushed a commit
that referenced
this issue
Aug 14, 2020
Summary: Pull Request resolved: #42723 This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. ghstack-source-id: 109805368 Test Plan: waitforbuildbot Reviewed By: rohan-varma Differential Revision: D22993909 fbshipit-source-id: 9f1c1e0a58b58b97125f41090601e967f96f70c6
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
better-engineering
Relatively self-contained tasks for better engineering contributors
feature
A request for a proper, new feature.
module: rpc
Related to RPC, distributed autograd, RRef, and distributed optimizer
triaged
This issue has been looked at a team member, and triaged and prioritized into an appropriate module
馃殌 Feature/Enhnacement
During RPC initialization, each participating process initializes a singleton
DistAutogradContainer
. During shutdown, we cleanup items such as the RRef context and RPC agent, but do not reset this singleton dist autograd container. This prevents RPC from being reinitialized (i.e. a group of processes cannot callrpc.shutdown() ; dist.barrier() ; rpc.init_rpc(...)
. It will fail with the following error:Motivation
I was looking into seeing if we could reuse spawned subprocesses across our RPC tests, which would decrease the amount of time it takes to run the RPC test suite; however, this would require processes to be able to re-init RPC, which is blocked by this error. Also, I was writing a set of similar tests in #38590, and wanted to use this pattern to reduce code duplication, but couldn't (the tests need to use different initializations of RPC since they test the RRef deletion behavior triggered by the shutdown sequence).
Pitch
I don't know if we necessarily need to destroy the container, maybe we could just de-initialize it, and then re-init if RPC is being started back up again.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar
The text was updated successfully, but these errors were encountered: