
Avoid scatter for single-device case in DDP #46304

Closed

Wants to merge 11 commits.

Conversation

@rohan-varma (Member) commented Oct 14, 2020

Stack from ghstack:

In the case that a single process operates on only one GPU, we can avoid the input scatter and instead replace it with a recursive version of `to`, which transfers the input tensors to the correct device.

The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types do not have their tensors moved).

Differential Revision: [D24296377](https://our.internmc.facebook.com/intern/diff/D24296377/)
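For illustration, a minimal sketch of what such a recursive device-transfer helper can look like. This is not the exact `_recursive_to` added by the PR (see `torch/nn/parallel/scatter_gather.py` for that); the name and details here are assumptions, but it follows the same convention of leaving custom types untouched.

```python
import torch

def recursive_to(obj, device):
    # Hypothetical helper: move tensors, and tensors nested inside lists,
    # tuples, namedtuples, and dicts, onto `device`.
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):  # namedtuple
        return type(obj)(*(recursive_to(o, device) for o in obj))
    if isinstance(obj, (list, tuple)):
        return type(obj)(recursive_to(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    # Custom types are returned as-is, so their internal tensors are not
    # moved, matching the convention of `scatter` in scatter_gather.py.
    return obj
```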

@facebook-github-bot (Contributor) commented Oct 14, 2020

💊 CI failures summary and remediations

As of commit 5b9773d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@facebook-github-bot added the `oncall: distributed` label (add this issue/PR to distributed oncall triage queue) Oct 14, 2020
@dr-ci bot commented Oct 14, 2020

💊 CI failures summary and remediations

As of commit 0967685 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 ongoing upstream failure, probably caused by an upstream breakage that is not fixed yet.



    return _self.lin(x.t)
else:
    self.assertTrue(len(x), expected_len)
    self.assertTrue(x[0].device == x[1].device)
Contributor:
Shouldn't we pass in the expected device to the constructor of ToyModel and validate it is correct here?

Member (Author):
We could do this, but the test is basically expected to validate that the device is the current rank of the process, as we are testing single GPU per process. The next line asserts that the input is on the expected device.
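As a rough illustration of the point above (a hypothetical snippet, not the exact test code): with a single GPU per process, the expected device follows directly from the process rank.

```python
import torch
import torch.distributed as dist

def check_inputs_on_rank_device(x):
    # Hypothetical check: with one GPU per process, the inputs are expected
    # to land on the CUDA device corresponding to this process's rank.
    expected_device = torch.device("cuda", dist.get_rank())
    assert all(t.device == expected_device for t in x)
```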

Comment on lines 3827 to 3828
inp = [torch.randn(10, 10) for _ in range(expected_len)]
model(inp, list)
Contributor:
Can we add tests for dict and namedtuple as well?

Member (Author):
Added these in the latest version of the diff.
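For context, inputs for those additional container types could look roughly like this (illustrative only; the exact test code is in the diff):

```python
import collections
import torch

# Hypothetical dict and namedtuple inputs, analogous to the list input above.
TestNamedTuple = collections.namedtuple("TestNamedTuple", ["a", "b"])

dict_inp = {"a": torch.randn(10, 10), "b": torch.randn(10, 10)}
tuple_inp = TestNamedTuple(a=torch.randn(10, 10), b=torch.randn(10, 10))

# The model would then be invoked with the expected container type, e.g.:
# model(dict_inp, dict)
# model(tuple_inp, TestNamedTuple)
```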

@codecov bot commented Oct 14, 2020

Codecov Report

Merging #46304 into gh/rohan-varma/185/base will decrease coverage by 0.06%.
The diff coverage is 13.84%.


@@                     Coverage Diff                     @@
##           gh/rohan-varma/185/base   #46304      +/-   ##
===========================================================
- Coverage                    68.33%   68.27%   -0.07%     
===========================================================
  Files                          410      410              
  Lines                        53795    53856      +61     
===========================================================
+ Hits                         36760    36768       +8     
- Misses                       17035    17088      +53     
| Impacted Files | Coverage | Δ |
|---|---|---|
| torch/nn/parallel/distributed.py | 39.52% <10.00%> | -2.97% ⬇️ |
| .../testing/_internal/distributed/distributed_test.py | 29.50% <15.15%> | -0.23% ⬇️ |
| torch/nn/parallel/scatter_gather.py | 12.76% <50.00%> | ø |
| torch/testing/_internal/expecttest.py | 78.57% <0.00%> | +1.02% ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

if len(self.device_ids) == 1:
    inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
    output = self.module(*inputs[0], **kwargs[0])
Contributor:
Curious, any reason we still need to return inputs and kwargs as a list?

Member (Author):
I was mostly doing that to keep parity with the current version, but we can probably remove it in this code path.
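A hedged sketch of the parity being described (the helper name and exact signature are assumptions drawn from the snippet above, not necessarily the PR's code): wrapping the moved arguments in one-element lists keeps the `inputs[0]` / `kwargs[0]` call-site shape used by the multi-device scatter path.

```python
def to_kwargs(inputs, kwargs, device):
    # Hypothetical wrapper: move inputs and kwargs to `device` with a
    # recursive `to` (see the `recursive_to` sketch earlier), then wrap each
    # in a one-element list so the caller keeps the same indexing as the
    # scatter code path.
    moved_inputs = recursive_to(inputs, device)
    moved_kwargs = recursive_to(kwargs, device)
    return [moved_inputs], [moved_kwargs]
```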

@facebook-github-bot (Contributor)

This pull request has been merged in 7245d2c.

@ngimel (Collaborator) commented Dec 25, 2020

@rohan-varma what's the motivation behind this PR? Does scatter for a single device incur some performance penalty? As #49819 notes, it was previously possible to overlap H2D transfers with computation; now that the transfers happen on the default stream (`.to()` uses the default stream, as opposed to a side stream in `Scatter`), this overlap is not possible.
Of course, the current implementation can be fixed to use a side stream, but I'm wondering if it's worth the code complexity.
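For readers unfamiliar with the distinction, here is a rough sketch of the side-stream pattern being referenced (illustrative only, not the DDP or `Scatter` implementation): issuing the host-to-device copy on a separate CUDA stream lets it overlap with work already enqueued on the default stream.

```python
import torch

def copy_to_device_on_side_stream(tensor, device):
    # Illustrative pattern: perform the H2D copy on a side stream so it can
    # overlap with computation on the default stream. Effective overlap also
    # requires the source tensor to be in pinned (page-locked) host memory.
    copy_stream = torch.cuda.Stream(device=device)
    with torch.cuda.stream(copy_stream):
        moved = tensor.to(device, non_blocking=True)
    # Make the default stream wait for the copy before consuming the result,
    # and record the consumer stream so the memory is not reused too early.
    torch.cuda.current_stream(device).wait_stream(copy_stream)
    moved.record_stream(torch.cuda.current_stream(device))
    return moved
```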

Labels: Merged, oncall: distributed (add this issue/PR to distributed oncall triage queue)

5 participants