Support device map for distributed autograd while using TensorPipe. #44859

Closed

Conversation

@pritamdamania87 (Contributor) commented Sep 17, 2020

Stack from ghstack:

TensorPipe's `set_device_map` option was applied only during the forward
pass. If we ran the backward pass for the graph, we would not
automatically pick up the reverse device mapping.

As a result, users had to specify both the forward and backward device
mappings, which is very tedious to do.

In this PR, I've added functionality so that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.

Closes: #44170

Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23751975/)!
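
For context, here is a minimal usage sketch of the behavior this PR enables (the worker names, ranks, and the `{0: 1}` mapping are illustrative and assume the usual `MASTER_ADDR`/`MASTER_PORT` setup with a matching `worker1` process; this is not code from the PR): only the forward device map is configured, and distributed autograd derives the reverse mapping for the backward pass on its own.

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions()
# Forward map only: worker0's cuda:0 maps to worker1's cuda:1.
options.set_device_map("worker1", {0: 1})
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=options)

with dist_autograd.context() as context_id:
    t = torch.rand(2, 2, device="cuda:0", requires_grad=True)
    # Forward pass: the tensor lands on worker1's cuda:1 per the device map.
    out = rpc.rpc_sync("worker1", torch.add, args=(t, t))
    # Backward pass: gradients come back to cuda:0 without a user-specified
    # reverse device map (previously both directions had to be configured).
    dist_autograd.backward(context_id, [out.sum()])

rpc.shutdown()
```
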

pritamdamania87 pushed a commit that referenced this pull request Sep 17, 2020

Pull Request resolved: #44859
ghstack-source-id: 112255543
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

dr-ci bot commented Sep 17, 2020

💊 CI failures summary and remediations

As of commit 6dbfacc (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)


pritamdamania87 pushed a commit that referenced this pull request Sep 18, 2020

Pull Request resolved: #44859
ghstack-source-id: 112351599
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

codecov bot commented Sep 18, 2020

Codecov Report

Merging #44859 (a5dd1d8) into gh/pritamdamania87/163/base (7f3a407) will decrease coverage by 0.01%.
The diff coverage is 55.55%.

```diff
@@                       Coverage Diff                       @@
##           gh/pritamdamania87/163/base   #44859      +/-   ##
===============================================================
- Coverage                        80.66%   80.65%   -0.02%     
===============================================================
  Files                             1913     1913              
  Lines                           208058   208104      +46     
===============================================================
+ Hits                            167833   167844      +11     
- Misses                           40225    40260      +35     
```

@lw (Contributor) left a comment

This PR will conflict with some work @mrshenli is doing in #44418 and the stack @beauby is about to merge.

Also, I'm wondering why you went the route of adding a per-RPC override of the device map, rather than, for example, the solution I proposed in #44170 (comment). This PR certainly introduces a more flexible approach, but it also comes with extra complexity, so I'm wondering whether for the initial version we should go for something simpler?

```diff
@@ -157,7 +157,9 @@ class TORCH_API RpcAgent {
   virtual std::shared_ptr<FutureMessage> send(
       const WorkerInfo& to,
       Message&& message,
-      const float rpcTimeoutSeconds = kUnsetRpcTimeout) = 0;
+      const float rpcTimeoutSeconds = kUnsetRpcTimeout,
+      const std::unordered_map<c10::DeviceIndex, c10::DeviceIndex>& deviceMap =
```
Contributor:

I thought we had agreed that in this initial version of CUDA support we would not allow specifying a per-RPC-call mapping, but would instead always use the constant global one. It's true that this is not exposed at the Python layer, but introducing such an ability on the agent would add complexity (we'd probably need to attach the map to the message in case the receiver wants to access it and reverse it) and should probably be discussed.

Contributor:

Also, @mrshenli, hadn't we said that we should use `c10::Device` rather than `c10::DeviceIndex`, as the latter implicitly limits us to CUDA and won't allow us (one day) to have host-to-device maps or handle AMD GPUs...?

Contributor Author:

The only user-facing API we support today is the Python one. The RpcAgent interface can be thought of as an internal API that we have complete control over. In this PR we do attach this map to the message and actually reverse it for distributed autograd.

I went with DeviceIndex here to be consistent with the rest of the device mapping code. I agree with Shen that this should be Device, but that is a much more involved change for 1.7. We control this interface and all its implementations, so it shouldn't be a big deal to change this parameter slightly in the future.
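
As a rough illustration of what "reverse it" means here (a conceptual Python sketch with made-up indices, not the C++ code in this PR): the device map attached to a given RPC is inverted to obtain the mapping used when gradients flow back during the backward pass.

```python
# Forward map for one RPC: caller device index -> callee device index.
forward_device_map = {0: 1, 2: 3}  # illustrative values

# Reverse map stored alongside the "recv" autograd function:
# callee device index -> caller device index.
reverse_device_map = {callee: caller for caller, callee in forward_device_map.items()}

assert reverse_device_map == {1: 0, 3: 2}
```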

Contributor:

Does this mean we are confident that we will soon add support for per-RPC device map arguments? If that's the case, adding it to recv backward LGTM. If we don't see that coming in the near future, I am not sure it is worth introducing the additional complexity. But since the device map will be a beta feature anyway, I think it should be fine either way from a perf perspective. If we decide to keep the current version, then to address the code-complexity concerns we can create an issue/reminder to revisit this and see whether a global map would be enough before 1.9.

Contributor Author:

I don't think this is tied to whether or not we want to support per-RPC device map arguments. This is not the public API that users see; it is a private one for now. If we do end up building a C++ API, we can evaluate what to do with this extra argument at that point.

Regarding complexity, I'm not sure there is a simpler way to address this issue holistically. A global map for the backward pass wouldn't work in all cases. For example, if nodes 1 and 2 perform RPCs on node 3 with separate global device maps for the forward pass, there can't be a single global backward map defined on node 3 that handles both. It seems we do need to do this at a per-RPC level to handle it in a generic way.

```diff
@@ -537,8 +545,9 @@ std::tuple<tensorpipe::Message, TensorpipeWriteBuffers> tensorpipeSerialize(
         jit::getWriteableTensorData(tensorDataVec[i]);
     // Enforce memory copy if tensor is created from torch::from_blob, means
     // that the tensor doesn't own the memory.
-    std::string metadata =
-        deviceIndices.empty() ? "" : std::to_string(deviceIndices[i]);
+    std::string metadata = deviceIndices.empty() || deviceIndices[i] == -1
```
Contributor:

When would this be equal to -1?

```cpp
    const auto deviceIter = deviceMap.find(tensor.device().index());
    if (deviceIter == deviceMap.end()) {
      checkCPUTensor(tensor);
      deviceIndices.push_back(-1);
```
Contributor:

Oh this is where the -1 is coming from! Do we actually need it? I suspect that in principle what we need getDevicesForTensors to return is the (smallest) set of devices on which the tensors reside, deduplicating the devices if multiple tensors reside on them. And at that point we could also just ignore CPU tensors, rather than insert -1 for them. Right?

Contributor Author:

Not sure I completely followed the idea here, but downstream code like `tensorpipeSerialize` relies on `deviceIndices` being the same length as `tensors`, so we do need to put in some placeholder device index for CPU tensors here.
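
To make the placeholder argument concrete, here is a conceptual Python sketch (the real helper is C++; the function name merely mirrors it for illustration): the returned list stays parallel to the tensor list, with -1 standing in for CPU tensors.

```python
from typing import Dict, List

import torch

def get_devices_for_tensors(
    tensors: List[torch.Tensor], device_map: Dict[int, int]
) -> List[int]:
    device_indices = []
    for t in tensors:
        if t.device.type != "cuda" or t.device.index not in device_map:
            device_indices.append(-1)  # placeholder so positions still line up
        else:
            device_indices.append(device_map[t.device.index])
    return device_indices

# CPU-only example so the sketch runs anywhere.
assert get_devices_for_tensors([torch.ones(2), torch.ones(2)], {0: 1}) == [-1, -1]
```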

@pritamdamania87 (Contributor Author):

> This PR will conflict with some work @mrshenli is doing in #44418 and the stack @beauby is about to merge.

We can probably resolve the conflicts based on the order in which the PRs land.

> Also, I'm wondering why you went the route of adding a per-RPC override of the device map, rather than, for example, the solution I proposed in #44170 (comment). This PR certainly introduces a more flexible approach, but it also comes with extra complexity, so I'm wondering whether for the initial version we should go for something simpler?

I felt the flexible approach wasn't too hard to implement, which is why I went with it. In the long term we probably want the backward pass to pick up the device mapping like this anyway, and since it wasn't much extra work, I thought it would be better to just go with this approach from the get-go.

pritamdamania87 pushed a commit that referenced this pull request Sep 18, 2020

Pull Request resolved: #44859
ghstack-source-id: 112417506
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

pritamdamania87 pushed a commit that referenced this pull request Sep 21, 2020

Pull Request resolved: #44859
ghstack-source-id: 112516251
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

pritamdamania87 pushed a commit that referenced this pull request Dec 24, 2020

Pull Request resolved: #44859
ghstack-source-id: 119112245
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

@pritamdamania87 (Contributor Author):

@mrshenli @lw We didn't get this into 1.7 since device maps were reverted for that release, but I feel we should land it in 1.8. Let me know your thoughts, and if it makes sense, I would love to have another round of review :)


```cpp
        getDevicesForRemote(clientPipe.pipe_->getRemoteName(), requestMessage);
  } else {
    // If deviceMap is specified, use that instead.
    devices = getDevicesForTensors(
```
Contributor:

Is it possible to consolidate getDevicesForTensors and getDevicesForRemote into one by letting it take an optional device map arg?

Contributor Author:

getDevicesForTensors is actually a subset of getDevicesForRemote. getDevicesForRemote internally calls getDevicesForTensors to avoid any duplication in logic.

Contributor:

Yep, I noticed that. What I didn't follow is why we need two similar but slightly different utility functions whose names don't convey their difference. If all we need is to let the second utility function skip the map lookup and use the provided device map, that could potentially be done with an optional arg? But this is a nit, please feel free to ignore.
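
A sketch of what that suggestion could look like (Python pseudocode for the idea only; the helper name, the `configured_maps` argument, and the signature are all hypothetical rather than the actual C++ API): a single helper that falls back to the configured per-worker map unless a per-RPC override is supplied.

```python
from typing import Dict, List, Optional

import torch

def get_devices_for_tensors(
    tensors: List[torch.Tensor],
    configured_maps: Dict[str, Dict[int, int]],  # per-destination-worker maps
    remote_name: str,
    device_map: Optional[Dict[int, int]] = None,  # optional per-RPC override
) -> List[int]:
    if device_map is None:
        device_map = configured_maps.get(remote_name, {})
    # -1 keeps the result parallel to `tensors`, as in the earlier sketch.
    return [
        device_map.get(t.device.index, -1) if t.device.type == "cuda" else -1
        for t in tensors
    ]

# With no override, the configured map for "worker1" would be used.
assert get_devices_for_tensors([torch.zeros(1)], {"worker1": {0: 1}}, "worker1") == [-1]
```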

@pritamdamania87 requested review from mrshenli and removed the review request for pietern on January 8, 2021 05:10
@mrshenli (Contributor) left a comment

LGTM! Shall we wait for #44418 before landing this one? If that one doesn't land in time, this one will also need to be reverted. And #44418 will need another fix in TensorPipe to detect GPUs that do not support peer access, i.e., `cudaErrorPeerAccessUnsupported`. I will wait for that fix in TP, and then rebase and land #44418. cc @lw @beauby


@pritamdamania87 (Contributor Author):

Sure, we can wait for #44418

pritamdamania87 pushed a commit that referenced this pull request Jan 19, 2021

Pull Request resolved: #44859
ghstack-source-id: 119950842
Differential Revision: [D23751975](https://our.internmc.facebook.com/intern/diff/D23751975/)

@pritamdamania87 (Contributor Author):

@mrshenli Could you take another look? Thanks!

@mrshenli (Contributor) left a comment

LGTM! Thanks for adding this!

@facebook-github-bot (Contributor):

This pull request has been merged in 40eea6d.
