-
Notifications
You must be signed in to change notification settings - Fork 25.4k
[pytorch] Support process_group_agent "sending to itself" #29253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) ghstack-source-id: 93328045 Pull Request resolved: #29253
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Pull Request resolved: #29253 Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. ghstack-source-id: 93344020 Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
LGTM, but will defer to @mrshenli for the approval. |
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Pull Request resolved: #29253 Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. ghstack-source-id: 93389336 Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! Thank you!
// data outlives the scope of this function. | ||
auto serializedPayload = new std::string(serialize(message)); | ||
enqueueRecv(RecvWork( | ||
allWorkerInfo_[pg_->getRank()], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a RpcAgent::getWorkerInfo()
for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change that. I was just copying the code a few lines below, later in this function. :)
torch::from_blob( | ||
(void*)serializedPayload->data(), | ||
serializedPayload->length(), | ||
[serializedPayload](void*) { delete serializedPayload; }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is capture by value? Does this prevent the local var serializedPayload
being freed before the recv is done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
serializedPayload is a string*, so we capture the pointer by value.
The goal is for the deleter can delete the buffer when the torch::Tensor is done with it.
The local var won't clean itself up.
In the other torch::from_blob() cases in this file, the tensor buffer doesn't need to be persisted beyond the immediate function scope. This case is different because the Tensor's use is delayed until the other listen thread picks it up.
Maybe I'll go ahead and make it a unique_ptr<>, to make the code a bit more exception-safe, and to clarify the lifetime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and uploaded a version with a smart pointer to make memory lifetime clearer in this func, and safer with exceptions.
That said, I had to use shared_ptr<> instead of unique_ptr<> because Pytorch is supposed to avoid c++14 code (TIL there are some c++11 issues with unique_ptr<> lambda capture that were solved in c++14)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jjlilley Thanks to @smessmer, we can use C++14 since 2 weeks (see #28443).
Capturing a std::unique_ptr
is possible only if you don't convert it to std::function
, because that requires the wrapped function to be copy-constructible. And of course, if you capture a std::unique_ptr
, it won't be copy-constructible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we then want to wait for #27864?
Nov 07 01:02:43 /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/process_group_agent.cpp: In lambda function:
Nov 07 01:02:43 /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/process_group_agent.cpp:316:20: error: lambda capture initializers only available with -std=c++14 or -std=gnu++14 [-Werror]
Nov 07 01:02:43 [payload = std::move(payload)](void*) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, if I create a unique_ptr<> and release the raw pointer into the deleter lambda captures, I think we both:
- are still fairly exception-safe, and mostly avoid raw pointers
- self-document the memory lifecycle reasonably
I'll upload a version that does this.
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Pull Request resolved: #29253 Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. ghstack-source-id: 93413452 Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Pull Request resolved: #29253 Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. ghstack-source-id: 93415357 Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Pull Request resolved: #29253 Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. ghstack-source-id: 93458021 Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Pull Request resolved: #29253 Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. ghstack-source-id: 93460603 Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
): | ||
rpc.rpc_sync(self_worker_name, torch.add, args=(torch.ones(2, 2), 1)) | ||
fut = rpc.rpc_async(self_worker_info, torch.add, args=(torch.ones(2, 2), 1)) | ||
ret = rpc.rpc_sync(self_worker_info, torch.add, args=(torch.ones(2, 2), 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a test for rpc.remote
as well?
|
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Pull Request resolved: #29253 Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. ghstack-source-id: 93474900 Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
I see, that makes sense. Yes, besides commenting that assert out, at least the following line should create an owner RRef instead of a user RRef, if calling to self. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
Would I be correct if I assume that we will get rid of passing raw pointers to the deleter when #27864 is in?
Thanks for the review! |
Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/) [ghstack-poisoned]
Pull Request resolved: #29253 Some operations can be simpler if a worker can send an rpc to itself. The main reason for not doing previous was that Gloo doesn't support self-sending. That said, this changes the process_group_agent to skip the assert check, and simply enqueue the rpc message in its receiving queue. ghstack-source-id: 93518076 Differential Revision: [D18339715](https://our.internmc.facebook.com/intern/diff/D18339715/)
This pull request has been merged in 2cd4f86. |
Stack from ghstack:
Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing previous was that Gloo doesn't support
self-sending.
That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
Differential Revision: D18339715