-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Add master to RPC test #27776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add master to RPC test #27776
Conversation
I think it’s not worth it to equip other `RPCAgent` with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) Implemented ::barrier() and ::drain() based on RPC messaging. The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py. I think having those unit tests to have a master is simpler than equipping `RPCAgent` with collective communication capability. Differential Revision: [D5445858](https://our.internmc.facebook.com/intern/diff/D5445858/) [ghstack-poisoned]
I think it’s not worth it to equip other `RPCAgent` with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) Implemented ::barrier() and ::drain() based on RPC messaging. The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py. I think having those unit tests to have a master is simpler than equipping `RPCAgent` with collective communication capability. Differential Revision: [D5445858](https://our.internmc.facebook.com/intern/diff/D5445858/) ghstack-source-id: 91786275 Pull Request resolved: #27776
I think it’s not worth it to equip other `RPCAgent` with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) Implemented ::barrier() and ::drain() based on RPC messaging. The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py. I think having those unit tests to have a master is simpler than equipping `RPCAgent` with collective communication capability. Differential Revision: [D5445858](https://our.internmc.facebook.com/intern/diff/D5445858/) [ghstack-poisoned]
Pull Request resolved: #27776 I think it’s not worth it to equip other `RPCAgent` with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) Implemented ::barrier() and ::drain() based on RPC messaging. The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py. I think having those unit tests to have a master is simpler than equipping `RPCAgent` with collective communication capability. Differential Revision: [D5445858](https://our.internmc.facebook.com/intern/diff/D5445858/) ghstack-source-id: 91799723
I think it’s not worth it to equip other `RPCAgent` with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) Implemented ::barrier() and ::drain() based on RPC messaging. The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py. I think having those unit tests to have a master is simpler than equipping `RPCAgent` with collective communication capability. Differential Revision: [D5445858](https://our.internmc.facebook.com/intern/diff/D5445858/) [ghstack-poisoned]
I think it’s not worth it to equip other `RPCAgent` with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) Implemented ::barrier() and ::drain() based on RPC messaging. The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py. I think having those unit tests to have a master is simpler than equipping `RPCAgent` with collective communication capability. Differential Revision: [D5445858](https://our.internmc.facebook.com/intern/diff/D5445858/) [ghstack-poisoned]
Pull Request resolved: #27776 I think it’s not worth it to equip other `RPCAgent` with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) Implemented ::barrier() and ::drain() based on RPC messaging. The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py. I think having those unit tests to have a master is simpler than equipping `RPCAgent` with collective communication capability. Differential Revision: [D5445858](https://our.internmc.facebook.com/intern/diff/D5445858/) ghstack-source-id: 91988737
| assert fut.wait() is None, "Sending termination signal failed." | ||
|
|
||
| # Close RPC. | ||
| rpc.join_rpc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
join_rpc here still does termination detection, so that we don't know if the above master-based termination would work or not (I think it could still leak futures, no?). Shall we replace this join_rpc with a local shutdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to delete join_rpc implementation in this PR, just add a local shutdown and let's see if it would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my plan.
After
- this PR has merged.
- local shutdown API is ready.
I will change this to shutdown.
This PR works, because there is another RpcAgent which implements join as shutdown, and the test_asymmetric_load is failing for that RpcAgent. This PR fix the test, proving it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. That makes sense. I am a little worried if the flakiness in the other RpcAgent is due to leaking futures. Regardless, let's get this in first, then we can fix flakiness, and drop join_rpc().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
This pull request has been merged in 3523e54. |
This is a followup for #27776. Especially the comment, #27776 (comment). The plan is as follow. - Add local stop API to RpcAgent. - Change unit tests to use stop rather than join. Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/) [ghstack-poisoned]
This is a followup for #27776. Especially the comment, #27776 (comment). The plan is as follow. - Add local stop API to RpcAgent. - Change unit tests to use stop rather than join. Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/) ghstack-source-id: 92120728 Pull Request resolved: #28238
This is a followup for #27776, especially the comment, #27776 (comment). The plan is as follow. - Add local stop API to RpcAgent. - Change unit tests to use stop rather than join. Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/) [ghstack-poisoned]
Pull Request resolved: #28238 This is a followup for #27776, especially the comment, #27776 (comment). The plan is as follow. - Add local stop API to RpcAgent. - Change unit tests to use stop rather than join. ghstack-source-id: 92132935 Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)
Summary: Pull Request resolved: pytorch#27776 I think it’s not worth it to equip other `RPCAgent` with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) Implemented ::barrier() and ::drain() based on RPC messaging. The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py. I think having those unit tests to have a master is simpler than equipping `RPCAgent` with collective communication capability. Differential Revision: D5445858 fbshipit-source-id: 56ee24703abd8c5b366829430bef657e0f1dfeba
Stack from ghstack:
I think it’s not worth it to equip other
RPCAgentwith collective communication capability, i.e. 1) have GLOO contained inRPCAgent, or 2) Implemented ::barrier() and ::drain() based on RPC messaging.The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py.
I think having those unit tests to have a master is simpler than equipping
RPCAgentwith collective communication capability.Differential Revision: D5445858