@xush6528 xush6528 commented Oct 11, 2019

Stack from ghstack:

I don't think it's worth equipping the other `RPCAgent` implementations with collective communication capability, i.e. either 1) embedding Gloo in `RPCAgent`, or 2) implementing `::barrier()` and `::drain()` on top of RPC messaging.

The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py.

I think giving those unit tests a master is simpler than equipping `RPCAgent` with collective communication capability.

Differential Revision: D5445858
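
For readers unfamiliar with the pattern, the idea of a master that drives termination can be sketched without any RPC library. The following is a minimal illustration only, assuming threads and queues as stand-ins for workers and RPC messages; `TERMINATE`, `worker_loop`, and `run_master` are hypothetical names and are not taken from `rpc_test.py` or this PR's diff.

```python
import queue
import threading

TERMINATE = "TERMINATE"  # hypothetical sentinel standing in for a termination RPC


def worker_loop(inbox, served):
    """Serve requests until the master's termination signal arrives."""
    while True:
        message, reply = inbox.get()
        if message == TERMINATE:
            reply.put(None)  # acknowledge with None, then stop serving
            return
        served.append(message)
        reply.put(message * 2)  # toy "remote" computation


def run_master(num_workers=3):
    """Start workers, send each one request, then tell each one to stop."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    served = [[] for _ in range(num_workers)]
    threads = [
        threading.Thread(target=worker_loop, args=(inboxes[i], served[i]))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    # Ordinary work: one request per worker, wait for each reply.
    results = []
    for i, inbox in enumerate(inboxes):
        reply = queue.Queue()
        inbox.put((i + 1, reply))
        results.append(reply.get())

    # Master-driven shutdown: one signal per worker, no barrier among workers.
    acks = []
    for inbox in inboxes:
        reply = queue.Queue()
        inbox.put((TERMINATE, reply))
        acks.append(reply.get())

    for t in threads:
        t.join()
    return results, acks, served
```

The point of the sketch is that only the master needs to know when the work is done; the workers never coordinate with each other, so no collective primitive such as a barrier is required.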

xush6528 added a commit that referenced this pull request Oct 11, 2019
ghstack-source-id: 91786275
Pull Request resolved: #27776
@xush6528 xush6528 changed the title Add master to OSS RPC test Add master to RPC test Oct 11, 2019
@xush6528 xush6528 added the module: rpc Related to RPC, distributed autograd, RRef, and distributed optimizer label Oct 11, 2019
@xush6528 xush6528 requested review from mrshenli and satgera October 11, 2019 21:05
xush6528 added a commit that referenced this pull request Oct 11, 2019
Pull Request resolved: #27776

ghstack-source-id: 91799723
xush6528 added a commit that referenced this pull request Oct 16, 2019
Pull Request resolved: #27776

ghstack-source-id: 91988737
assert fut.wait() is None, "Sending termination signal failed."

# Close RPC.
rpc.join_rpc()

`join_rpc` here still performs termination detection, so we can't tell whether the master-based termination above actually works (I think it could still leak futures, no?). Shall we replace this `join_rpc` with a local shutdown?

We don't need to delete the `join_rpc` implementation in this PR; just add a local shutdown and let's see whether it works.

@xush6528 xush6528 Oct 16, 2019

@mrshenli

That's my plan. Once

  1. this PR has merged, and
  2. the local shutdown API is ready,

I will change this to shutdown.

This PR works: there is another `RpcAgent` that implements join as shutdown, and `test_asymmetric_load` is failing for that `RpcAgent`. This PR fixes that test, which shows that the approach works.
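
To make the join-versus-local-shutdown distinction in this thread concrete, here is a toy sketch. It assumes that "join" means a collective wait on all peers and that "local shutdown" touches only the calling agent; `ToyAgent` and `demo` are hypothetical and do not reflect the real `RpcAgent` interface or how `join_rpc` is implemented.

```python
import threading


class ToyAgent:
    """Hypothetical stand-in for an RPC agent; not the real RpcAgent API."""

    def __init__(self, name, barrier):
        self.name = name
        self._barrier = barrier  # shared by all agents, used only by join()
        self.running = True

    def join(self, timeout=1.0):
        """Collective shutdown: block until every peer has also called join()."""
        self._barrier.wait(timeout=timeout)
        self.running = False

    def shutdown(self):
        """Local shutdown: stop this agent without contacting any peer."""
        self.running = False


def demo():
    barrier = threading.Barrier(2)
    a = ToyAgent("a", barrier)
    b = ToyAgent("b", barrier)

    # Local shutdown needs no cooperation from the peer.
    a.shutdown()

    # join() on its own cannot complete: the peer never reaches the barrier,
    # so the wait times out and raises BrokenBarrierError.
    try:
        b.join(timeout=0.1)
        joined_alone = True
    except threading.BrokenBarrierError:
        joined_alone = False
    return a.running, b.running, joined_alone
```

In this toy model a test that ends with `join()` depends on every peer behaving well, whereas a test that ends with `shutdown()` relies only on whatever signal told it to stop, which is why replacing one with the other would exercise the master-based termination on its own.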

I see. That makes sense. I am a little worried that the flakiness in the other `RpcAgent` is due to leaking futures. Regardless, let's get this in first; then we can fix the flakiness and drop `join_rpc()`.

@mrshenli mrshenli left a comment

LGTM!

@facebook-github-bot

This pull request has been merged in 3523e54.

xush6528 added a commit that referenced this pull request Oct 17, 2019
This is a follow-up to #27776, especially the comment #27776 (comment).

The plan is as follows:

- Add a local stop API to `RpcAgent`.
- Change the unit tests to use stop rather than join.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

xush6528 added a commit that referenced this pull request Oct 17, 2019

ghstack-source-id: 92120728
Pull Request resolved: #28238
xush6528 added a commit that referenced this pull request Oct 17, 2019
xush6528 added a commit that referenced this pull request Oct 17, 2019
Pull Request resolved: #28238

ghstack-source-id: 92132935

@facebook-github-bot facebook-github-bot deleted the gh/xush6528/16/head branch October 28, 2019 22:22
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary:
Pull Request resolved: pytorch#27776

fbshipit-source-id: 56ee24703abd8c5b366829430bef657e0f1dfeba