Skip to content

Conversation

jjlilley
Copy link
Contributor

@jjlilley jjlilley commented Nov 6, 2019

Stack from ghstack:

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: D18339715

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 6, 2019
Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

ghstack-source-id: 93328045
Pull Request resolved: #29253
Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 6, 2019
Pull Request resolved: #29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
ghstack-source-id: 93344020

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
@pietern
Copy link
Contributor

pietern commented Nov 6, 2019

LGTM, but will defer to @mrshenli for the approval.

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 6, 2019
Pull Request resolved: #29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
ghstack-source-id: 93389336

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! Thank you!

// data outlives the scope of this function.
auto serializedPayload = new std::string(serialize(message));
enqueueRecv(RecvWork(
allWorkerInfo_[pg_->getRank()],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a RpcAgent::getWorkerInfo() for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change that. I was just copying the code a few lines below, later in this function. :)

torch::from_blob(
(void*)serializedPayload->data(),
serializedPayload->length(),
[serializedPayload](void*) { delete serializedPayload; },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is capture by value? Does this prevent the local var serializedPayload being freed before the recv is done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serializedPayload is a string*, so we capture the pointer by value.
The goal is for the deleter can delete the buffer when the torch::Tensor is done with it.
The local var won't clean itself up.

In the other torch::from_blob() cases in this file, the tensor buffer doesn't need to be persisted beyond the immediate function scope. This case is different because the Tensor's use is delayed until the other listen thread picks it up.

Maybe I'll go ahead and make it a unique_ptr<>, to make the code a bit more exception-safe, and to clarify the lifetime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and uploaded a version with a smart pointer to make memory lifetime clearer in this func, and safer with exceptions.

That said, I had to use shared_ptr<> instead of unique_ptr<> because Pytorch is supposed to avoid c++14 code (TIL there are some c++11 issues with unique_ptr<> lambda capture that were solved in c++14)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjlilley Thanks to @smessmer, we can use C++14 since 2 weeks (see #28443).

Capturing a std::unique_ptr is possible only if you don't convert it to std::function, because that requires the wrapped function to be copy-constructible. And of course, if you capture a std::unique_ptr, it won't be copy-constructible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we then want to wait for #27864?

Nov 07 01:02:43 /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/process_group_agent.cpp: In lambda function:
Nov 07 01:02:43 /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/process_group_agent.cpp:316:20: error: lambda capture initializers only available with -std=c++14 or -std=gnu++14 [-Werror]
Nov 07 01:02:43                    [payload = std::move(payload)](void*) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, if I create a unique_ptr<> and release the raw pointer into the deleter lambda captures, I think we both:

  1. are still fairly exception-safe, and mostly avoid raw pointers
  2. self-document the memory lifecycle reasonably
    I'll upload a version that does this.

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 7, 2019
Pull Request resolved: #29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
ghstack-source-id: 93413452

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 7, 2019
Pull Request resolved: #29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
ghstack-source-id: 93415357

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 7, 2019
Pull Request resolved: #29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
ghstack-source-id: 93458021

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 7, 2019
Pull Request resolved: #29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
ghstack-source-id: 93460603

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
):
rpc.rpc_sync(self_worker_name, torch.add, args=(torch.ones(2, 2), 1))
fut = rpc.rpc_async(self_worker_info, torch.add, args=(torch.ones(2, 2), 1))
ret = rpc.rpc_sync(self_worker_info, torch.add, args=(torch.ones(2, 2), 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a test for rpc.remote as well?

@jjlilley
Copy link
Contributor Author

jjlilley commented Nov 7, 2019

  • Uploading another try to deal with c++14 vs c++11 lambda issue. I think this one will pass. :)

  • re: rpc.remote - there's still additional work for this, see python_functions.cpp:142, and kind of distinct code to change.
    (I tried trivially uncommenting out the asserts and adding a test case, it seems best deferred to another PR)

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 7, 2019
Pull Request resolved: #29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
ghstack-source-id: 93474900

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
@mrshenli
Copy link
Contributor

mrshenli commented Nov 7, 2019

I tried trivially uncommenting out the asserts and adding a test case, it seems best deferred to another PR.

I see, that makes sense. Yes, besides commenting that assert out, at least the following line should create an owner RRef instead of a user RRef, if calling to self.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

Would I be correct if I assume that we will get rid of passing raw pointers to the deleter when #27864 is in?

@jjlilley
Copy link
Contributor Author

jjlilley commented Nov 7, 2019

Thanks for the review!
And happy to change the lambda capture after we transition to c++14.

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)

[ghstack-poisoned]
jjlilley pushed a commit that referenced this pull request Nov 8, 2019
Pull Request resolved: #29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
ghstack-source-id: 93518076

Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 2cd4f86.

@facebook-github-bot facebook-github-bot deleted the gh/jjlilley/19/head branch November 12, 2019 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants