Support device map for distributed autograd while using TensorPipe. #44859
Conversation
TensorPipe's `set_device_map` option was applied during the forward pass. However, if we ran the backward pass for the graph, we would not automatically pick up the reverse device mapping. As a result, users had to specify both the forward and backward device mappings, which is very tedious. In this PR, I've added functionality such that TensorPipe automatically picks up the reverse device mapping during the backward pass. This is done by storing the appropriate device mapping in the "recv" autograd function for distributed autograd.

Closes: #44170

Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23751975/)!
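Conceptually, the fix amounts to recording the forward device map alongside the "recv" autograd function and inverting it when the backward pass runs. A minimal, self-contained sketch of that inversion (plain Python with a hypothetical helper name, independent of the actual PyTorch internals):

```python
def invert_device_map(forward_map):
    """Invert a forward device map (source device index -> destination
    device index) so gradients can flow back along the reverse mapping.

    Raises if the forward map is not one-to-one, since an ambiguous
    reverse mapping cannot be constructed.
    """
    reverse_map = {}
    for src, dst in forward_map.items():
        if dst in reverse_map:
            raise ValueError(
                f"device {dst} mapped from both {reverse_map[dst]} and {src}")
        reverse_map[dst] = src
    return reverse_map

# Forward pass: caller's cuda:0 tensors land on the callee's cuda:1.
forward_map = {0: 1}
# Backward pass: gradients on the callee's cuda:1 go back to cuda:0.
assert invert_device_map(forward_map) == {1: 0}
```

With this in place, a user only configures the forward mapping via `set_device_map`; the reverse direction is derived automatically per RPC.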
💊 CI failures summary and remediations — as of commit 6dbfacc (more details on the Dr. CI page):
Codecov Report
```diff
@@                      Coverage Diff                      @@
##   gh/pritamdamania87/163/base    #44859       +/-   ##
===============================================================
- Coverage                 80.66%    80.65%    -0.02%
===============================================================
  Files                      1913      1913
  Lines                    208058    208104       +46
===============================================================
+ Hits                     167833    167844       +11
- Misses                    40225     40260       +35
```
This PR will conflict with some work @mrshenli is doing in #44418 and the stack @beauby is about to merge.
Also, I'm wondering why you went the route of adding a per-RPC override of the device map, rather than, for example, the solution I proposed in #44170 (comment). This PR certainly introduces a more flexible approach, but one which also comes with extra complexity, so I'm wondering whether for the initial version we should go for something simpler?
```diff
@@ -157,7 +157,9 @@ class TORCH_API RpcAgent {
   virtual std::shared_ptr<FutureMessage> send(
       const WorkerInfo& to,
       Message&& message,
-      const float rpcTimeoutSeconds = kUnsetRpcTimeout) = 0;
+      const float rpcTimeoutSeconds = kUnsetRpcTimeout,
+      const std::unordered_map<c10::DeviceIndex, c10::DeviceIndex>& deviceMap =
```
I thought we had agreed that in this initial version of CUDA support we would not allow specifying a per-RPC-call mapping, but would instead always use the constant global one. It's true that this is not exposed at the Python layer, but introducing such an ability on the agent would add complexity (we'd probably need to attach the map to the message in case the receiver wants to access it and reverse it) and should probably be discussed.
Also, @mrshenli, hadn't we said that we should use c10::Device rather than c10::DeviceIndex, as the latter implicitly limits us to CUDA and won't allow us (one day) to have host-to-device maps or to handle AMD GPUs...?
The only user-facing API we support today is the Python one. The RpcAgent interface can be thought of as an internal API that we have complete control over. In this PR we do attach this map to the message, and we actually reverse it for distributed autograd.
I went with DeviceIndex here to be consistent with the rest of the device-mapping code. I agree with Shen that this should be Device, but that is a much more involved change for 1.7. We control this interface and all its implementations, so it shouldn't be a big deal to change this parameter slightly in the future.
Does this mean we are confident that we will soon add support for per-RPC device map arguments? If that's the case, adding it to recv backward LGTM. If we don't see that coming in the near future, I am not sure it would be worthwhile to introduce the additional complexity. But since the device map will be a beta feature anyway, I think it should be fine either way from a performance perspective. If we decide to keep the current version, then in order to address the code-complexity concerns we can create an issue/reminder to revisit this and see whether a global map would be enough before 1.9.
I don't think this is tied to whether or not we want to support per-RPC device map arguments. This is not the public API that users see; it is a private one for now. If we do end up building a C++ API, we can evaluate at that point what to do with this extra argument.
Regarding complexity, I'm not sure there is a simpler way to address this issue holistically. A global map for the backward pass wouldn't work in all cases. For example, if nodes 1 and 2 perform RPCs on node 3 with separate global device maps for the forward pass, there can't be a single global backward map defined on node 3 that handles both. It seems we do need to do this at a per-RPC level to handle it in a generic way.
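That scenario can be made concrete with a toy simulation (plain Python, hypothetical worker names). If node 1 maps its cuda:0 onto node 3's cuda:1, while node 2 maps its cuda:1 onto node 3's cuda:1, then node 3's reverse mapping for device 1 depends on which caller sent the RPC, so no single global reverse map can serve both:

```python
# Forward device maps used by each caller when sending RPCs to node 3.
forward_maps = {
    "node1": {0: 1},  # node1's cuda:0 -> node3's cuda:1
    "node2": {1: 1},  # node2's cuda:1 -> node3's cuda:1
}

# Per-caller reverse maps, as computed per RPC from the stored forward map.
reverse_maps = {
    caller: {dst: src for src, dst in fmap.items()}
    for caller, fmap in forward_maps.items()
}
# node3's device 1 must map back to 0 for node1 but to 1 for node2.
assert reverse_maps["node1"][1] == 0
assert reverse_maps["node2"][1] == 1

# A single global reverse map on node3 would need one value for key 1,
# which is impossible here:
merged = {}
conflict = False
for rmap in reverse_maps.values():
    for dst, src in rmap.items():
        if dst in merged and merged[dst] != src:
            conflict = True
        merged[dst] = src
assert conflict
```

Storing the forward map with each "recv" autograd function resolves the ambiguity, because the backward pass always knows exactly which mapping to invert.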
torch/csrc/distributed/rpc/utils.cpp
Outdated
```diff
@@ -537,8 +545,9 @@ std::tuple<tensorpipe::Message, TensorpipeWriteBuffers> tensorpipeSerialize(
       jit::getWriteableTensorData(tensorDataVec[i]);
   // Enforce memory copy if tensor is created from torch::from_blob, means
   // that the tensor doesn't own the memory.
-  std::string metadata =
-      deviceIndices.empty() ? "" : std::to_string(deviceIndices[i]);
+  std::string metadata = deviceIndices.empty() || deviceIndices[i] == -1
```
When would this be equal to -1?
```cpp
const auto deviceIter = deviceMap.find(tensor.device().index());
if (deviceIter == deviceMap.end()) {
  checkCPUTensor(tensor);
  deviceIndices.push_back(-1);
```
Oh, this is where the -1 is coming from! Do we actually need it? I suspect that, in principle, what we need `getDevicesForTensors` to return is the (smallest) set of devices on which the tensors reside, deduplicating devices if multiple tensors reside on them. And at that point we could also just ignore CPU tensors, rather than insert -1 for them. Right?
Not sure I completely followed the idea here, but downstream code like `tensorpipeSerialize` relies on `deviceIndices` being the same length as `tensors`, so we do need to put in some placeholder device index for CPU tensors here?
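For illustration, the length invariant being discussed can be sketched like this (plain Python with a hypothetical helper name standing in for the C++ code; `-1` is the CPU placeholder, mirroring the snippet above):

```python
def get_device_indices(tensor_devices, device_map):
    """Return one destination device index per tensor, in order.

    tensor_devices: list of source device indices (-1 denotes CPU here).
    device_map: forward map from source to destination device index.
    CPU / unmapped tensors get a -1 placeholder so the result stays
    aligned one-to-one with the tensor list, as the serialization
    code downstream expects.
    """
    indices = []
    for dev in tensor_devices:
        if dev in device_map:
            indices.append(device_map[dev])
        else:
            indices.append(-1)  # placeholder keeps lengths aligned
    return indices

devices = get_device_indices([0, -1, 0], {0: 1})
assert devices == [1, -1, 1]
assert len(devices) == 3  # same length as the tensor list
```

Dropping the placeholder would break the positional correspondence between `deviceIndices[i]` and `tensorDataVec[i]` that the serialization loop relies on.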
We can probably resolve the conflicts based on the order in which the PRs land.
I felt it wasn't too hard to implement the flexible approach, which is why I went with it. In the long term we probably want the backward pass to pick up the device mapping like this anyway, and since it wasn't too much work, I thought it would be better to go with this approach from the get-go.
```cpp
    getDevicesForRemote(clientPipe.pipe_->getRemoteName(), requestMessage);
} else {
  // If deviceMap is specified, use that instead.
  devices = getDevicesForTensors(
```
Is it possible to consolidate `getDevicesForTensors` and `getDevicesForRemote` into one by letting it take an optional device map arg?
`getDevicesForTensors` is actually a subset of `getDevicesForRemote`. `getDevicesForRemote` internally calls `getDevicesForTensors` to avoid any duplication in logic.
Yep, I noticed that. What I didn't follow is why we need two similar but slightly different utility functions whose names don't convey their difference. If all we need is to let the second utility function skip the map lookup and use the provided device map, this could potentially be done with an optional arg? But this is a nit, please feel free to ignore.
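The consolidation being suggested could look something like this (a hypothetical Python sketch with made-up names, not the actual C++ signatures in rpc/utils.cpp): a single helper that uses an explicitly supplied map when given, and otherwise falls back to the statically configured per-peer map:

```python
def get_devices_for_tensors(tensor_devices, device_map=None,
                            peer_device_maps=None, remote_name=None):
    """One helper instead of two: use `device_map` if provided,
    otherwise look up the statically configured map for `remote_name`.
    (Hypothetical names for illustration only.)
    """
    if device_map is None:
        device_map = (peer_device_maps or {}).get(remote_name, {})
    # -1 is the placeholder for CPU / unmapped tensors.
    return [device_map.get(dev, -1) for dev in tensor_devices]

peer_maps = {"worker1": {0: 1}}
# Static per-peer lookup (the getDevicesForRemote-style path):
assert get_devices_for_tensors([0], peer_device_maps=peer_maps,
                               remote_name="worker1") == [1]
# Explicit per-RPC map (the getDevicesForTensors-style path):
assert get_devices_for_tensors([0], device_map={0: 2}) == [2]
```

Whether this is worth doing in the C++ code is a style call; the two-function split keeps the call sites slightly more explicit about where the map comes from.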
LGTM! Shall we wait for #44418 before landing this one? Because if that one doesn't land in time, this one will also need to be reverted. And #44418 will need another fix in TensorPipe to detect GPUs that do not support peer access, i.e., `cudaErrorPeerAccessUnsupported`. I will wait for that fix in TP, and then rebase and land #44418. cc @lw @beauby
Sure, we can wait for #44418.
@mrshenli Could you take another look? Thanks!
LGTM! Thanks for adding this!
This pull request has been merged in 40eea6d.