
[Pipe] Add a WithDevice wrapper to specify device execution for a module. #65190

Closed

Conversation


@pritamdamania87 (Contributor) commented Sep 17, 2021

Stack from ghstack:

As described in #65093, there could be modules which don't have any parameters or
buffers. In this case, Pipe determines that the module should be executed on CPU.
However, this might result in unnecessary GPU to CPU transfers when the user
actually expected the module to run on the GPU itself, keeping its inputs and
outputs on the GPU.

For this use case, we introduce a `WithDevice` wrapper which can be used to
override the device on which a particular module is executed as part of the
pipeline.
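
For illustration, a minimal sketch of the intended usage, assuming `WithDevice` is importable alongside `Pipe` from `torch.distributed.pipeline.sync`; the shapes, devices, and `chunks` value are made up, and the RPC setup that `Pipe` requires is omitted:

```python
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe, WithDevice

fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
dropout = nn.Dropout()

# Dropout has no parameters/buffers, so Pipe would otherwise place it on CPU,
# forcing a GPU -> CPU -> GPU round trip after fc2. Wrapping it in WithDevice
# pins that stage to cuda:1 instead.
model = nn.Sequential(fc1, fc2, WithDevice(dropout, 'cuda:1'))
pipe = Pipe(model, chunks=8)  # assumes the RPC framework is already initialized
```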

Closes #65093

Differential Revision: D31010027

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23

@facebook-github-bot commented Sep 17, 2021


💊 CI failures summary and remediations

As of commit 2497e1c (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)

Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-09-17T18:32:35.7495724Z processing existing schema:  alltoall_base(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor _1, Tensor _2, int[] _3, int[] _4) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7497070Z processing existing schema:  alltoall(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, Tensor[] _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7498381Z processing existing schema:  send(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7499910Z processing existing schema:  recv(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7501215Z processing existing schema:  recv_anysource(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7502483Z processing existing schema:  barrier(__torch__.torch.classes.dist_c10d.ProcessGroup _0) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7503644Z processing existing schema:  __init__(__torch__.torch.classes.dist_c10d.frontend _0) -> (NoneType _0)
2021-09-17T18:32:35.7505036Z processing existing schema:  new_process_group_helper(__torch__.torch.classes.dist_c10d.frontend _0, int _1, int _2, int[] _3, str _4, __torch__.torch.classes.dist_c10d.Store _5, str? _6, int _7) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
2021-09-17T18:32:35.7506592Z processing existing schema:  get_process_group_by_name(__torch__.torch.classes.dist_c10d.frontend _0, str _1) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
2021-09-17T18:32:35.7507966Z processing existing schema:  get_name_of_process_group(__torch__.torch.classes.dist_c10d.frontend _0, __torch__.torch.classes.dist_c10d.ProcessGroup _1) -> (str _0)
2021-09-17T18:32:35.7509120Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2021-09-17T18:32:35.7509720Z 
2021-09-17T18:32:35.7509985Z Broken ops: [
2021-09-17T18:32:35.7510670Z 	aten::meshgrid.indexing(Tensor[] tensors, *, str indexing) -> (Tensor[])
2021-09-17T18:32:35.7511119Z ]
2021-09-17T18:32:35.7511478Z + cleanup
2021-09-17T18:32:35.7511748Z + retcode=1
2021-09-17T18:32:35.7512021Z + set +x
2021-09-17T18:32:35.7512355Z =================== sccache compilation log ===================
2021-09-17T18:32:35.7569579Z =========== If your build fails, please take a look at the log above for possible reasons ===========
2021-09-17T18:32:35.7590060Z Compile requests                      0


@facebook-github-bot added the oncall: distributed and cla signed labels Sep 17, 2021
pritamdamania87 pushed a commit that referenced this pull request Sep 17, 2021
test/distributed/pipeline/sync/test_pipe.py (review thread, outdated, resolved)
torch/distributed/pipeline/sync/pipe.py (review thread, outdated, resolved)
>>> fc2 = nn.Linear(8, 4).cuda(1)
>>> dropout = nn.Dropout()
>>>
>>> # Dropout does not have any parameters/buffers, but we want to
Contributor commented:

I wonder what's the expected behavior here. If dropout does not have a device, should we just (recursively) inherit the device from the previous layer? So that users don't need to manually specify this?

@wayi1 (Contributor) commented Sep 17, 2021:

+1

The goal here is to avoid the extra data transfer, so I also think there's no need to manually specify a device or create more syntactic sugar.

Another future use case where `WithDevice` may not work well: if the preceding layer is a tensor with lazy placement (device="meta"), then the current pure-computation layer can only be placed on a meta device, and I'm afraid the heuristic of co-locating the two layers still won't be easily expressed with `WithDevice`.

@pritamdamania87 (Contributor, Author) commented:

I thought about this option as well, although it wouldn't work for the first partition in the sequence, where we don't have any previous information to rely on. Also, I do feel it's probably better to have an explicit API to deal with this rather than many implicit rules.

For example, what if the user actually wanted a GPU to CPU transfer for a particular layer? Or what if there was a sequence of layers with no params/buffers that they want to run on different devices to spread out computation evenly?

@pritamdamania87 (Contributor, Author) commented:

> Another future use case where `WithDevice` may not work well: if the preceding layer is a tensor with lazy placement (device="meta"), then the current pure-computation layer can only be placed on a meta device, and I'm afraid the heuristic of co-locating the two layers still won't be easily expressed with `WithDevice`.

When the Pipe eventually initializes everything, we would have concrete devices for each partition and not "meta" at that point.

Contributor commented:

> I thought about this option as well, although it wouldn't work for the first partition in the sequence, where we don't have any previous information to rely on. Also, I do feel it's probably better to have an explicit API to deal with this rather than many implicit rules.

Is it possible that the first layer is a pure computation layer? As long as the first layer in the first partition is not a pure computation layer, shouldn't we always be able to find a valid device from the previous layers?

> For example, what if the user actually wanted a GPU to CPU transfer for a particular layer? Or what if there was a sequence of layers with no params/buffers that they want to run on different devices to spread out computation evenly?

Then that computation must be much more expensive than H2D ops. I agree WithDevice would be more valuable in this case, but isn't this an uncommon use case? Maybe we can make the placement default to the same device as the previous layer, and provide WithDevice for extra flexibility?

@pritamdamania87 (Contributor, Author) commented:

> Is it possible that the first layer is a pure computation layer? As long as the first layer in the first partition is not a pure computation layer, shouldn't we always be able to find a valid device from the previous layers?

It could be possible; for example, we didn't anticipate that users would want a separate compute layer without any params/buffers as its own stage. There is also a more general case where we could have a sequence of compute layers that need to run on different devices, and in that case we do need WithDevice.

> Maybe we can make the placement default to the same device as the previous layer, and provide WithDevice for extra flexibility?

If we have to provide WithDevice anyway, I feel we should just have this instead of a variety of implicit rules around how devices are handled.
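
For illustration, a minimal sketch of the multi-device case discussed above, assuming `WithDevice` and `Pipe` are importable from `torch.distributed.pipeline.sync` and that RPC has already been initialized; the layers, shapes, and device assignments are made up:

```python
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe, WithDevice

fc = nn.Linear(16, 16).cuda(0)

# Two consecutive parameterless stages: an "inherit from the previous layer"
# rule would put both on cuda:0, but here the second one is deliberately
# spread to cuda:1 to balance compute, which only an explicit override can express.
model = nn.Sequential(
    fc,
    WithDevice(nn.GELU(), 'cuda:0'),
    WithDevice(nn.Softmax(dim=-1), 'cuda:1'),
)
pipe = Pipe(model, chunks=4)  # RPC initialization omitted for brevity
```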

@wayi1 requested a review from mrshenli September 17, 2021 03:44
@codecov bot commented Sep 17, 2021

Codecov Report

Merging #65190 (2497e1c) into gh/pritamdamania87/268/base (8800a8b) will increase coverage by 0.08%.
The diff coverage is 52.63%.

@@                       Coverage Diff                       @@
##           gh/pritamdamania87/268/base   #65190      +/-   ##
===============================================================
+ Coverage                        66.37%   66.45%   +0.08%     
===============================================================
  Files                              727      727              
  Lines                            93571    93588      +17     
===============================================================
+ Hits                             62109    62198      +89     
+ Misses                           31462    31390      -72     

pritamdamania87 pushed a commit that referenced this pull request Sep 17, 2021
@facebook-github-bot commented:
This pull request has been merged in 3e64c9e.

    if isinstance(module, WithDevice):
        # Explicit device override: record the device, unwrap the module and move it there.
        device = module.device
        module = module.module
        module.to(device)
    else:
        device = _retrieve_device(module)
Member commented:

nit: we can move the above into the _retrieve_device helper?
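
A rough sketch of what that suggestion might look like, assuming it lives alongside the existing helpers in torch/distributed/pipeline/sync/pipe.py; the helper name and its (module, device) return type are assumptions, not the merged code:

```python
from typing import Tuple

import torch
import torch.nn as nn

# Hypothetical refactor per the review suggestion: fold the WithDevice
# unwrapping into the device-retrieval helper so the caller gets back both the
# possibly-unwrapped module and its device in one place. The existing
# _retrieve_device in pipe.py only inspects parameters/buffers and returns a
# torch.device.
def _retrieve_device_and_module(module: nn.Module) -> Tuple[nn.Module, torch.device]:
    if isinstance(module, WithDevice):
        device = module.device
        module = module.module
        module.to(device)          # honor the explicit device override
        return module, device
    return module, _retrieve_device(module)  # existing behavior (CPU if no params/buffers)
```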

@facebook-github-bot deleted the gh/pritamdamania87/268/head branch September 24, 2021 14:18
Labels: cla signed, Merged, oncall: distributed