Support device map for distributed autograd while using TensorPipe. #44859

Closed

Conversation

@pritamdamania87 (Contributor) commented Sep 17, 2020

Stack from ghstack:

TensorPipe's `set_device_map` option was applied only during the forward
pass. If we ran the backward pass for the graph, we would not
automatically pick up the reverse device mapping.

As a result, users had to specify both the forward and backward device
mappings, which is very tedious to do.

In this PR, I've added functionality so that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.

Closes: #44170

Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23751975/)!
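
For context, here is a minimal usage sketch of the behavior this PR enables (the worker names, ranks, and the `{0: 1}` mapping are illustrative and assume the usual `MASTER_ADDR`/`MASTER_PORT` setup with a matching `worker1` process; this is not code from the PR): only the forward device map is configured, and distributed autograd derives the reverse mapping for the backward pass on its own.

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions()
# Forward map only: worker0's cuda:0 maps to worker1's cuda:1.
options.set_device_map("worker1", {0: 1})
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=options)

with dist_autograd.context() as context_id:
    t = torch.rand(2, 2, device="cuda:0", requires_grad=True)
    # Forward pass: the tensor lands on worker1's cuda:1 per the device map.
    out = rpc.rpc_sync("worker1", torch.add, args=(t, t))
    # Backward pass: gradients come back to cuda:0 without a user-specified
    # reverse device map (previously both directions had to be configured).
    dist_autograd.backward(context_id, [out.sum()])

rpc.shutdown()
```
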

pritamdamania87 pushed a commit that referenced this pull request Sep 17, 2020

Pull Request resolved: #44859
ghstack-source-id: 112255543
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

dr-ci bot commented Sep 17, 2020

💊 CI failures summary and remediations

As of commit 6dbfacc (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)


pritamdamania87 pushed a commit that referenced this pull request Sep 18, 2020

Pull Request resolved: #44859
ghstack-source-id: 112351599
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

codecov bot commented Sep 18, 2020

Codecov Report

Merging #44859 (a5dd1d8) into gh/pritamdamania87/163/base (7f3a407) will decrease coverage by 0.01%.
The diff coverage is 55.55%.

```diff
@@                       Coverage Diff                       @@
##           gh/pritamdamania87/163/base   #44859      +/-   ##
===============================================================
- Coverage                        80.66%   80.65%   -0.02%     
===============================================================
  Files                             1913     1913              
  Lines                           208058   208104      +46     
===============================================================
+ Hits                            167833   167844      +11     
- Misses                           40225    40260      +35     
```

@lw (Contributor) left a comment

This PR will conflict with some work @mrshenli is doing in #44418 and the stack @beauby is about to merge.

Also, I'm wondering why you went the route of adding a per-RPC override of the device map, rather than, for example, the solution I proposed in #44170 (comment). This PR certainly introduces a more flexible approach, but it also comes with extra complexity, so I'm wondering whether for the initial version we should go for something simpler?

```diff
@@ -157,7 +157,9 @@ class TORCH_API RpcAgent {
   virtual std::shared_ptr<FutureMessage> send(
       const WorkerInfo& to,
       Message&& message,
-      const float rpcTimeoutSeconds = kUnsetRpcTimeout) = 0;
+      const float rpcTimeoutSeconds = kUnsetRpcTimeout,
+      const std::unordered_map<c10::DeviceIndex, c10::DeviceIndex>& deviceMap =
```
Contributor:

I thought we had agreed that in this initial version of CUDA support we would not allow specifying a per-RPC-call mapping, but would instead always use the constant global one. It's true that this is not exposed at the Python layer, but introducing such an ability on the agent would add complexity (we'd probably need to attach the map to the message in case the receiver wants to access it and reverse it) and should probably be discussed.

Contributor:

Also, @mrshenli, hadn't we said that we should use `c10::Device` rather than `c10::DeviceIndex`, as the latter implicitly limits us to CUDA and won't allow us (one day) to have host-to-device maps or handle AMD GPUs...?

Contributor Author:

The only user-facing API we support today is the Python one. The RpcAgent interface can be thought of as an internal API that we have complete control over. In this PR we do attach this map to the message and actually reverse it for distributed autograd.

I went with DeviceIndex here to be consistent with the rest of the device mapping code. I agree with Shen that this should be Device, but that is a much more involved change for 1.7. We control this interface and all its implementations, so it shouldn't be a big deal to change this parameter slightly in the future.
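
As a rough illustration of what "reverse it" means here (a conceptual Python sketch with made-up indices, not the C++ code in this PR): the device map attached to a given RPC is inverted to obtain the mapping used when gradients flow back during the backward pass.

```python
# Forward map for one RPC: caller device index -> callee device index.
forward_device_map = {0: 1, 2: 3}  # illustrative values

# Reverse map stored alongside the "recv" autograd function:
# callee device index -> caller device index.
reverse_device_map = {callee: caller for caller, callee in forward_device_map.items()}

assert reverse_device_map == {1: 0, 3: 2}
```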

Contributor:

Does this mean we are confident that we will soon add support for per-RPC device map arguments? If that's the case, adding it to recv backward LGTM. If we don't see that coming in the near future, I am not sure it is worth introducing the additional complexity. But since the device map will be a beta feature anyway, I think it should be fine either way from a perf perspective. If we decide to keep the current version, then to address the code-complexity concerns we can create an issue/reminder to revisit this and see whether a global map would be enough before 1.9.

Contributor Author:

I don't think this is tied to whether or not we want to support per-RPC device map arguments. This is not the public API that users see; it is a private one for now. If we do end up building a C++ API, we can evaluate what to do with this extra argument at that point.

Regarding complexity, I'm not sure there is a simpler way to address this issue holistically. A global map for the backward pass wouldn't work in all cases. For example, if nodes 1 and 2 perform RPCs on node 3 with separate global device maps for the forward pass, there can't be a single global backward map defined on node 3 that handles both. It seems we do need to do this at a per-RPC level to handle it in a generic way.

```diff
@@ -537,8 +545,9 @@ std::tuple<tensorpipe::Message, TensorpipeWriteBuffers> tensorpipeSerialize(
         jit::getWriteableTensorData(tensorDataVec[i]);
     // Enforce memory copy if tensor is created from torch::from_blob, means
     // that the tensor doesn't own the memory.
-    std::string metadata =
-        deviceIndices.empty() ? "" : std::to_string(deviceIndices[i]);
+    std::string metadata = deviceIndices.empty() || deviceIndices[i] == -1
```
Contributor:

When would this be equal to -1?

```cpp
    const auto deviceIter = deviceMap.find(tensor.device().index());
    if (deviceIter == deviceMap.end()) {
      checkCPUTensor(tensor);
      deviceIndices.push_back(-1);
```
Contributor:

Oh this is where the -1 is coming from! Do we actually need it? I suspect that in principle what we need getDevicesForTensors to return is the (smallest) set of devices on which the tensors reside, deduplicating the devices if multiple tensors reside on them. And at that point we could also just ignore CPU tensors, rather than insert -1 for them. Right?

Contributor Author:

Not sure I completely followed the idea here, but downstream code like `tensorpipeSerialize` relies on `deviceIndices` being the same length as `tensors`, so we do need to put in some placeholder device index for CPU tensors here.
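
To make the placeholder argument concrete, here is a conceptual Python sketch (the real helper is C++; the function name merely mirrors it for illustration): the returned list stays parallel to the tensor list, with -1 standing in for CPU tensors.

```python
from typing import Dict, List

import torch

def get_devices_for_tensors(
    tensors: List[torch.Tensor], device_map: Dict[int, int]
) -> List[int]:
    device_indices = []
    for t in tensors:
        if t.device.type != "cuda" or t.device.index not in device_map:
            device_indices.append(-1)  # placeholder so positions still line up
        else:
            device_indices.append(device_map[t.device.index])
    return device_indices

# CPU-only example so the sketch runs anywhere.
assert get_devices_for_tensors([torch.ones(2), torch.ones(2)], {0: 1}) == [-1, -1]
```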

@pritamdamania87 (Contributor Author):

> This PR will conflict with some work @mrshenli is doing in #44418 and the stack @beauby is about to merge.

We can probably resolve the conflicts based on the order in which the PRs land.

> Also, I'm wondering why you went the route of adding a per-RPC override of the device map, rather than, for example, the solution I proposed in #44170 (comment). This PR certainly introduces a more flexible approach, but it also comes with extra complexity, so I'm wondering whether for the initial version we should go for something simpler?

I felt the flexible approach wasn't too hard to implement, which is why I went with it. In the long term we probably want the backward pass to pick up the device mapping like this anyway, and since it wasn't much extra work, I thought it would be better to just go with this approach from the get-go.

pritamdamania87 pushed a commit that referenced this pull request Sep 18, 2020

Pull Request resolved: #44859
ghstack-source-id: 112417506
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

pritamdamania87 pushed a commit that referenced this pull request Sep 21, 2020

Pull Request resolved: #44859
ghstack-source-id: 112516251
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

pritamdamania87 pushed a commit that referenced this pull request Dec 24, 2020

Pull Request resolved: #44859
ghstack-source-id: 119112245
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

@pritamdamania87 (Contributor Author):

@mrshenli @lw We didn't get this into 1.7 since device maps were reverted for that release, but I feel we should land it in 1.8. Let me know your thoughts, and if it makes sense, I would love to have another round of review :)


```cpp
        getDevicesForRemote(clientPipe.pipe_->getRemoteName(), requestMessage);
  } else {
    // If deviceMap is specified, use that instead.
    devices = getDevicesForTensors(
```
Contributor:

Is it possible to consolidate getDevicesForTensors and getDevicesForRemote into one by letting it take an optional device map arg?

Contributor Author:

getDevicesForTensors is actually a subset of getDevicesForRemote. getDevicesForRemote internally calls getDevicesForTensors to avoid any duplication in logic.

Contributor:

Yep, I noticed that. What I didn't follow is why we need two similar but slightly different utility functions whose names don't convey their difference. If all we need is to let the second utility function skip the map lookup and use the provided device map, that could potentially be done with an optional arg? But this is a nit, please feel free to ignore.
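
A sketch of what that suggestion could look like (Python pseudocode for the idea only; the helper name, the `configured_maps` argument, and the signature are all hypothetical rather than the actual C++ API): a single helper that falls back to the configured per-worker map unless a per-RPC override is supplied.

```python
from typing import Dict, List, Optional

import torch

def get_devices_for_tensors(
    tensors: List[torch.Tensor],
    configured_maps: Dict[str, Dict[int, int]],  # per-destination-worker maps
    remote_name: str,
    device_map: Optional[Dict[int, int]] = None,  # optional per-RPC override
) -> List[int]:
    if device_map is None:
        device_map = configured_maps.get(remote_name, {})
    # -1 keeps the result parallel to `tensors`, as in the earlier sketch.
    return [
        device_map.get(t.device.index, -1) if t.device.type == "cuda" else -1
        for t in tensors
    ]

# With no override, the configured map for "worker1" would be used.
assert get_devices_for_tensors([torch.zeros(1)], {"worker1": {0: 1}}, "worker1") == [-1]
```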

@pritamdamania87 requested review from mrshenli and removed the review request for pietern on January 8, 2021 05:10
@mrshenli (Contributor) left a comment

LGTM! Shall we wait for #44418 before landing this one? If that one doesn't land in time, this one will also need to be reverted. And #44418 will need another fix in TensorPipe to detect GPUs that do not support peer access, i.e., `cudaErrorPeerAccessUnsupported`. I will wait for that fix in TP, and then rebase and land #44418. cc @lw @beauby


@pritamdamania87 (Contributor Author):

Sure, we can wait for #44418

pritamdamania87 pushed a commit that referenced this pull request Jan 19, 2021

Pull Request resolved: #44859
ghstack-source-id: 119950842
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

@pritamdamania87 (Contributor Author):

@mrshenli Could you take another look? Thanks!

@mrshenli (Contributor) left a comment

LGTM! Thanks for adding this!

@facebook-github-bot (Contributor):

This pull request has been merged in 40eea6d.
