
[Pipe] Add a WithDevice wrapper to specify device execution for a module. #65190

Closed

Conversation


@pritamdamania87 (Contributor) commented Sep 17, 2021

Stack from ghstack:

As described in #65093, there could be modules which don't have any parameters or
buffers. In this case, Pipe determines that the module should be executed on CPU.
However, this might result in unnecessary GPU to CPU transfers when the user
actually expected the module to run on the GPU itself, keeping its inputs and
outputs on the GPU.

For this use case, we introduce a `WithDevice` wrapper which can be used to
override the device on which a particular module is executed as part of the
pipeline.
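
For illustration, a minimal sketch of the intended usage, assuming `WithDevice` is importable alongside `Pipe` from `torch.distributed.pipeline.sync`; the shapes, devices, and `chunks` value are made up, and the RPC setup that `Pipe` requires is omitted:

```python
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe, WithDevice

fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
dropout = nn.Dropout()

# Dropout has no parameters/buffers, so Pipe would otherwise place it on CPU,
# forcing a GPU -> CPU -> GPU round trip after fc2. Wrapping it in WithDevice
# pins that stage to cuda:1 instead.
model = nn.Sequential(fc1, fc2, WithDevice(dropout, 'cuda:1'))
pipe = Pipe(model, chunks=8)  # assumes the RPC framework is already initialized
```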

Closes #65093

Differential Revision: D31010027

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23

@facebook-github-bot commented Sep 17, 2021


💊 CI failures summary and remediations

As of commit 2497e1c (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)

Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-09-17T18:32:35.7495724Z processing existing schema:  alltoall_base(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor _1, Tensor _2, int[] _3, int[] _4) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7497070Z processing existing schema:  alltoall(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, Tensor[] _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7498381Z processing existing schema:  send(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7499910Z processing existing schema:  recv(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7501215Z processing existing schema:  recv_anysource(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7502483Z processing existing schema:  barrier(__torch__.torch.classes.dist_c10d.ProcessGroup _0) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-09-17T18:32:35.7503644Z processing existing schema:  __init__(__torch__.torch.classes.dist_c10d.frontend _0) -> (NoneType _0)
2021-09-17T18:32:35.7505036Z processing existing schema:  new_process_group_helper(__torch__.torch.classes.dist_c10d.frontend _0, int _1, int _2, int[] _3, str _4, __torch__.torch.classes.dist_c10d.Store _5, str? _6, int _7) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
2021-09-17T18:32:35.7506592Z processing existing schema:  get_process_group_by_name(__torch__.torch.classes.dist_c10d.frontend _0, str _1) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
2021-09-17T18:32:35.7507966Z processing existing schema:  get_name_of_process_group(__torch__.torch.classes.dist_c10d.frontend _0, __torch__.torch.classes.dist_c10d.ProcessGroup _1) -> (str _0)
2021-09-17T18:32:35.7509120Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2021-09-17T18:32:35.7509720Z 
2021-09-17T18:32:35.7509985Z Broken ops: [
2021-09-17T18:32:35.7510670Z 	aten::meshgrid.indexing(Tensor[] tensors, *, str indexing) -> (Tensor[])
2021-09-17T18:32:35.7511119Z ]
2021-09-17T18:32:35.7511478Z + cleanup
2021-09-17T18:32:35.7511748Z + retcode=1
2021-09-17T18:32:35.7512021Z + set +x
2021-09-17T18:32:35.7512355Z =================== sccache compilation log ===================
2021-09-17T18:32:35.7569579Z =========== If your build fails, please take a look at the log above for possible reasons ===========
2021-09-17T18:32:35.7590060Z Compile requests                      0


@facebook-github-bot added the oncall: distributed and cla signed labels Sep 17, 2021
pritamdamania87 pushed a commit that referenced this pull request Sep 17, 2021
test/distributed/pipeline/sync/test_pipe.py (review thread, outdated, resolved)
torch/distributed/pipeline/sync/pipe.py (review thread, outdated, resolved)
>>> fc2 = nn.Linear(8, 4).cuda(1)
>>> dropout = nn.Dropout()
>>>
>>> # Dropout does not have any parameters/buffers, but we want to
Contributor commented:

I wonder what's the expected behavior here. If dropout does not have a device, should we just (recursively) inherit the device from the previous layer? So that users don't need to manually specify this?

@wayi1 (Contributor) commented Sep 17, 2021:

+1

The goal here is to avoid the extra data transfer, so I also think there's no need to manually specify a device or create more syntactic sugar.

Another future use case where `WithDevice` may not work well: if the preceding layer is a tensor with lazy placement (device="meta"), then the current pure-computation layer can only be placed on a meta device, and I'm afraid the heuristic of co-locating the two layers still won't be easily expressed with `WithDevice`.

@pritamdamania87 (Contributor, Author) commented:

I thought about this option as well, although it wouldn't work for the first partition in the sequence, where we don't have any previous information to rely on. Also, I do feel it's probably better to have an explicit API to deal with this rather than many implicit rules.

For example, what if the user actually wanted a GPU to CPU transfer for a particular layer? Or what if there was a sequence of layers with no params/buffers that they want to run on different devices to spread out computation evenly?

@pritamdamania87 (Contributor, Author) commented:

> Another future use case where `WithDevice` may not work well: if the preceding layer is a tensor with lazy placement (device="meta"), then the current pure-computation layer can only be placed on a meta device, and I'm afraid the heuristic of co-locating the two layers still won't be easily expressed with `WithDevice`.

When the Pipe eventually initializes everything, we would have concrete devices for each partition and not "meta" at that point.

Contributor commented:

> I thought about this option as well, although it wouldn't work for the first partition in the sequence, where we don't have any previous information to rely on. Also, I do feel it's probably better to have an explicit API to deal with this rather than many implicit rules.

Is it possible that the first layer is a pure computation layer? As long as the first layer in the first partition is not a pure computation layer, shouldn't we always be able to find a valid device from the previous layers?

> For example, what if the user actually wanted a GPU to CPU transfer for a particular layer? Or what if there was a sequence of layers with no params/buffers that they want to run on different devices to spread out computation evenly?

Then that computation must be much more expensive than H2D ops. I agree WithDevice would be more valuable in this case, but isn't this an uncommon use case? Maybe we can make the placement default to the same device as the previous layer, and provide WithDevice for extra flexibility?

@pritamdamania87 (Contributor, Author) commented:

> Is it possible that the first layer is a pure computation layer? As long as the first layer in the first partition is not a pure computation layer, shouldn't we always be able to find a valid device from the previous layers?

It could be possible; for example, we didn't anticipate that users would want a separate compute layer without any params/buffers as its own stage. There is also a more general case where we could have a sequence of compute layers that need to run on different devices, and in that case we do need WithDevice.

> Maybe we can make the placement default to the same device as the previous layer, and provide WithDevice for extra flexibility?

If we have to provide WithDevice anyway, I feel we should just have this instead of a variety of implicit rules around how devices are handled.
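
For illustration, a minimal sketch of the multi-device case discussed above, assuming `WithDevice` and `Pipe` are importable from `torch.distributed.pipeline.sync` and that RPC has already been initialized; the layers, shapes, and device assignments are made up:

```python
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe, WithDevice

fc = nn.Linear(16, 16).cuda(0)

# Two consecutive parameterless stages: an "inherit from the previous layer"
# rule would put both on cuda:0, but here the second one is deliberately
# spread to cuda:1 to balance compute, which only an explicit override can express.
model = nn.Sequential(
    fc,
    WithDevice(nn.GELU(), 'cuda:0'),
    WithDevice(nn.Softmax(dim=-1), 'cuda:1'),
)
pipe = Pipe(model, chunks=4)  # RPC initialization omitted for brevity
```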

@wayi1 requested a review from mrshenli September 17, 2021 03:44
@codecov bot commented Sep 17, 2021

Codecov Report

Merging #65190 (2497e1c) into gh/pritamdamania87/268/base (8800a8b) will increase coverage by 0.08%.
The diff coverage is 52.63%.

@@                       Coverage Diff                       @@
##           gh/pritamdamania87/268/base   #65190      +/-   ##
===============================================================
+ Coverage                        66.37%   66.45%   +0.08%     
===============================================================
  Files                              727      727              
  Lines                            93571    93588      +17     
===============================================================
+ Hits                             62109    62198      +89     
+ Misses                           31462    31390      -72     

pritamdamania87 pushed a commit that referenced this pull request Sep 17, 2021
@facebook-github-bot commented:
This pull request has been merged in 3e64c9e.

    if isinstance(module, WithDevice):
        # Explicit device override: record the device, unwrap the module and move it there.
        device = module.device
        module = module.module
        module.to(device)
    else:
        device = _retrieve_device(module)
Member commented:

nit: we can move the above into the _retrieve_device helper?
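
A rough sketch of what that suggestion might look like, assuming it lives alongside the existing helpers in torch/distributed/pipeline/sync/pipe.py; the helper name and its (module, device) return type are assumptions, not the merged code:

```python
from typing import Tuple

import torch
import torch.nn as nn

# Hypothetical refactor per the review suggestion: fold the WithDevice
# unwrapping into the device-retrieval helper so the caller gets back both the
# possibly-unwrapped module and its device in one place. The existing
# _retrieve_device in pipe.py only inspects parameters/buffers and returns a
# torch.device.
def _retrieve_device_and_module(module: nn.Module) -> Tuple[nn.Module, torch.device]:
    if isinstance(module, WithDevice):
        device = module.device
        module = module.module
        module.to(device)          # honor the explicit device override
        return module, device
    return module, _retrieve_device(module)  # existing behavior (CPU if no params/buffers)
```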

@facebook-github-bot deleted the gh/pritamdamania87/268/head branch September 24, 2021 14:18
Labels: cla signed, Merged, oncall: distributed