[Pipe] Add a WithDevice wrapper to specify device execution for a module. #65190
Conversation
…odule. As described in #65093, there could be modules which don't have any parameters/buffers. In this case, Pipe determines that the module should be executed on CPU. However, this might result in unnecessary GPU to CPU transfers, whereas the user expected the module to be executed on the GPU itself by keeping its inputs and outputs on the GPU. For this use case, we introduce a `WithDevice` wrapper which can be used to override which device a particular module should be executed on as part of the pipeline. Closes: #65093 Differential Revision: [D31010027](https://our.internmc.facebook.com/intern/diff/D31010027/) [ghstack-poisoned]
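For illustration, here is a rough usage sketch of the `WithDevice` wrapper described above. The model, shapes, and device ids are made up, and the import path and single-process RPC setup reflect the `torch.distributed.pipeline.sync` API as I understand it, so treat this as a sketch rather than the exact example shipped with the PR:

```python
import os

import torch
from torch import nn
from torch.distributed import rpc
# Depending on the PyTorch version, WithDevice may need to be imported from
# torch.distributed.pipeline.sync.pipe instead.
from torch.distributed.pipeline.sync import Pipe, WithDevice

# Pipe requires the RPC framework to be initialized, even in a single process.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
rpc.init_rpc('worker', rank=0, world_size=1)

fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
dropout = nn.Dropout()

# Without WithDevice, the parameter-less dropout stage would be placed on CPU,
# forcing an unnecessary GPU -> CPU -> GPU round trip after fc2. Wrapping it
# pins the stage to cuda:1 so its inputs and outputs stay on the GPU.
model = nn.Sequential(fc1, fc2, WithDevice(dropout, torch.device('cuda:1')))
model = Pipe(model, chunks=8)

output_rref = model(torch.rand(16, 16).cuda(0))  # forward returns an RRef to the output
```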
🔗 Helpful links
💊 CI failures summary and remediations
As of commit 2497e1c (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
linux-xenial-py3.6-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)
Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)
…odule. As described in #65093, there could be modules which don't have any parameters/buffers. In this case, Pipe determines that the module should be executed on CPU. However, this might result in unnecessary GPU to CPU transfers, whereas the user expected the module to be executed on the GPU itself by keeping its inputs and outputs on the GPU. For this use case, we introduce a `WithDevice` wrapper which can be used to override which device a particular module should be executed on as part of the pipeline. Closes: #65093 Differential Revision: [D31010027](https://our.internmc.facebook.com/intern/diff/D31010027/) ghstack-source-id: 138316171 Pull Request resolved: #65190
>>> fc2 = nn.Linear(8, 4).cuda(1)
>>> dropout = nn.Dropout()
>>>
>>> # Dropout does not have any parameters/buffers, but we want to
I wonder what's the expected behavior here. If `dropout` does not have a device, should we just (recursively) inherit the device from the previous layer? So that users don't need to manually specify this?
+1

The goal here is to avoid the extra data transfer, so I also think there's no need to manually specify a device and create more syntax sugar.

Another future use case where `WithDevice` may not work well: if the precursor layer is a tensor with lazy placement (`device="meta"`), then the current pure-computation layer can only be placed on a meta device, but I am afraid that the heuristic of co-locating two layers still won't be easily expressed by `WithDevice`.
I thought about this option as well, although this wouldn't work for the first partition in the sequence, where we don't have any previous information to rely on. Also, I do feel it's probably better to have an explicit API to deal with this rather than having many implicit rules.

For example, what if the user actually wanted a GPU to CPU transfer for a particular layer? Or what if there was a sequence of layers with no params/buffers and they want them to run on different devices to spread out computation evenly?
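To make the second scenario concrete, here is a contrived sketch (the modules, shapes, and device ids are hypothetical and not from this PR) of a pipeline where parameter-free stages are deliberately placed on different devices, a placement no "inherit from the previous layer" rule could express:

```python
import torch
from torch import nn
# Depending on the PyTorch version, WithDevice may need to be imported from
# torch.distributed.pipeline.sync.pipe instead.
from torch.distributed.pipeline.sync import Pipe, WithDevice

model = nn.Sequential(
    nn.Linear(16, 16).cuda(0),                            # has parameters, so it lands on cuda:0
    WithDevice(nn.GELU(), torch.device('cuda:1')),        # compute-only stage pinned to cuda:1
    WithDevice(nn.Softmax(dim=-1), torch.device('cpu')),  # deliberate GPU to CPU hop
)
# model = Pipe(model, chunks=4)  # assumes the RPC framework was already initialized
```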
> Another future use case where `WithDevice` may not work well: if the precursor layer is a tensor with lazy placement (`device="meta"`), then the current pure-computation layer can only be placed on a meta device, but I am afraid that the heuristic of co-locating two layers still won't be easily expressed by `WithDevice`.

When the Pipe eventually initializes everything, we would have concrete devices for each partition, and not "meta" at that point.
> I thought about this option as well, although this wouldn't work for the first partition in the sequence, where we don't have any previous information to rely on. Also, I do feel it's probably better to have an explicit API to deal with this rather than having many implicit rules.
Is it possible that the first layer is a pure computation layer? As long as the first layer in the first partition is not a pure computation layer, shouldn't we always be able to find a valid device from the previous layers?
> For example, what if the user actually wanted a GPU to CPU transfer for a particular layer? Or what if there was a sequence of layers with no params/buffers and they want them to run on different devices to spread out computation evenly?
Then that computation must be much more expensive than H2D ops. I agree `WithDevice` would be more valuable in this case, but isn't this an uncommon use case? Maybe we can let the placement be the same as the previous layer by default, and provide `WithDevice` for extra flexibility?
> Is it possible that the first layer is a pure computation layer? As long as the first layer in the first partition is not a pure computation layer, shouldn't we always be able to find a valid device from the previous layers?
It could be possible; for example, we didn't anticipate that users would want a separate compute layer without any params/buffers as a different stage. Also, there is a more general case where we could have a sequence of compute layers which need to be on different devices; in that case we do need `WithDevice`.
> Maybe we can let the placement be the same as the previous layer by default, and provide `WithDevice` for extra flexibility?
If we have to provide `WithDevice` anyway, I feel we should just have this instead of having a variety of implicit rules around how devices are handled.
Codecov Report
@@ Coverage Diff @@
## gh/pritamdamania87/268/base #65190 +/- ##
===============================================================
+ Coverage 66.37% 66.45% +0.08%
===============================================================
Files 727 727
Lines 93571 93588 +17
===============================================================
+ Hits 62109 62198 +89
+ Misses 31462 31390 -72
…ion for a module." As described in #65093, there could be modules which don't have any parameters/buffers. In this case, Pipe determines that the module should be executed on CPU. However, this might result in unnecessary GPU to CPU transfers, whereas the user expected the module to be executed on the GPU itself by keeping its inputs and outputs on the GPU. For this use case, we introduce a `WithDevice` wrapper which can be used to override which device a particular module should be executed on as part of the pipeline. Closes: #65093 Differential Revision: [D31010027](https://our.internmc.facebook.com/intern/diff/D31010027/) [ghstack-poisoned]
…odule. Pull Request resolved: #65190 As described in #65093, there could be modules which don't have any parameters/buffers. In this case, Pipe determines that the module should be executed on CPU. However, this might result in unnecessary GPU to CPU transfers, whereas the user expected the module to be executed on the GPU itself by keeping its inputs and outputs on the GPU. For this use case, we introduce a `WithDevice` wrapper which can be used to override which device a particular module should be executed on as part of the pipeline. Closes: #65093 ghstack-source-id: 138376272 Differential Revision: [D31010027](https://our.internmc.facebook.com/intern/diff/D31010027/)
This pull request has been merged in 3e64c9e.
    module = module.module
    module.to(device)
else:
    device = _retrieve_device(module)
nit: we can move the above into the `_retrieve_device` helper?
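A hypothetical sketch of that suggestion (the helper's shape and the `module`/`device` properties on the wrapper are assumptions, and the parameter-based fallback is simplified; this is not the code in this PR):

```python
from typing import Tuple

import torch
from torch import nn
# Depending on the PyTorch version, WithDevice may need to be imported from
# torch.distributed.pipeline.sync.pipe instead.
from torch.distributed.pipeline.sync import WithDevice


def _retrieve_device(module: nn.Module) -> Tuple[nn.Module, torch.device]:
    """Hypothetical refactor: fold the WithDevice unwrapping into the helper
    so the caller only ever deals with plain modules."""
    if isinstance(module, WithDevice):
        device = module.device
        module = module.module
        module.to(device)
        return module, device

    # Otherwise infer the device from the module's parameters, defaulting to CPU
    # for modules without any parameters/buffers (the real helper also validates
    # that all parameters live on a single device).
    device = next((p.device for p in module.parameters()), torch.device('cpu'))
    return module, device
```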
Stack from ghstack:
[Pipe] Add a WithDevice wrapper to specify device execution for a module. #65190

As described in #65093, there could be modules which don't have any parameters/buffers. In this case, Pipe determines that the module should be executed on CPU. However, this might result in unnecessary GPU to CPU transfers, whereas the user expected the module to be executed on the GPU itself by keeping its inputs and outputs on the GPU.

For this use case, we introduce a `WithDevice` wrapper which can be used to override which device a particular module should be executed on as part of the pipeline.

Closes: #65093

Differential Revision: D31010027
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23