Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

RPC should destroy/deinitialize dist autograd container when rpc is shutdown #39340

Closed
rohan-varma opened this issue Jun 1, 2020 · 0 comments
Labels
better-engineering Relatively self-contained tasks for better engineering contributors feature A request for a proper, new feature. module: rpc Related to RPC, distributed autograd, RRef, and distributed optimizer triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@rohan-varma
Copy link
Member

rohan-varma commented Jun 1, 2020

馃殌 Feature/Enhnacement

During RPC initialization, each participating process initializes a singleton DistAutogradContainer. During shutdown, we cleanup items such as the RRef context and RPC agent, but do not reset this singleton dist autograd container. This prevents RPC from being reinitialized (i.e. a group of processes cannot call rpc.shutdown() ; dist.barrier() ; rpc.init_rpc(...). It will fail with the following error:

/test/distributed/rpc/faulty_agent/rpc_spawn_faulty#binary,link-tree
/torch/distributed/rpc/__init__.py", line 85, in init_rpc
>     dist_autograd._init(rank)
> RuntimeError: Container is already initialized! Cannot initialize it twice!

Motivation

I was looking into seeing if we could reuse spawned subprocesses across our RPC tests, which would decrease the amount of time it takes to run the RPC test suite; however, this would require processes to be able to re-init RPC, which is blocked by this error. Also, I was writing a set of similar tests in #38590, and wanted to use this pattern to reduce code duplication, but couldn't (the tests need to use different initializations of RPC since they test the RRef deletion behavior triggered by the shutdown sequence).

Pitch

I don't know if we necessarily need to destroy the container, maybe we could just de-initialize it, and then re-init if RPC is being started back up again.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar

@rohan-varma rohan-varma added the module: rpc Related to RPC, distributed autograd, RRef, and distributed optimizer label Jun 1, 2020
@izdeby izdeby added feature A request for a proper, new feature. triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Jun 3, 2020
@rohan-varma rohan-varma added the better-engineering Relatively self-contained tasks for better engineering contributors label Jun 16, 2020
pritamdamania87 pushed a commit that referenced this issue Aug 7, 2020
This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this issue Aug 7, 2020
This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)

ghstack-source-id: 109412898
Pull Request resolved: #42723
pritamdamania87 pushed a commit that referenced this issue Aug 7, 2020
This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this issue Aug 7, 2020
Pull Request resolved: #42723

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.
ghstack-source-id: 109478799

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)
pritamdamania87 pushed a commit that referenced this issue Aug 13, 2020
This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this issue Aug 13, 2020
Pull Request resolved: #42723

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.
ghstack-source-id: 109805368

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)
facebook-github-bot pushed a commit that referenced this issue Aug 14, 2020
Summary:
Pull Request resolved: #42723

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.
ghstack-source-id: 109805368

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D22993909

fbshipit-source-id: 9f1c1e0a58b58b97125f41090601e967f96f70c6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
better-engineering Relatively self-contained tasks for better engineering contributors feature A request for a proper, new feature. module: rpc Related to RPC, distributed autograd, RRef, and distributed optimizer triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests

3 participants