Allow DDP to wrap multi-GPU modules #19271
Conversation
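For context, the feature under review lets DistributedDataParallel wrap a module whose parameters already live on several GPUs. A minimal sketch of such a module (the class name, layer sizes, and device arguments are illustrative, not taken from the PR):

    import torch
    import torch.nn as nn

    class TwoGpuNet(nn.Module):
        # A module split across two devices; the point of this PR is that
        # DDP can now wrap a module like this directly.
        def __init__(self, dev0, dev1):
            super(TwoGpuNet, self).__init__()
            self.dev0, self.dev1 = dev0, dev1
            self.fc1 = nn.Linear(8, 8).to(dev0)
            self.fc2 = nn.Linear(8, 4).to(dev1)

        def forward(self, x):
            x = torch.relu(self.fc1(x.to(self.dev0)))
            return self.fc2(x.to(self.dev1))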
Some minor points. Looking good overall. Glad that we'll be able to support multi-device modules here!
model = QuadraGpuNet(gpus)

ddp_model = DistributedDataParallel(
    copy.deepcopy(model),
No need for deepcopy here?
Don't we need to make sure that `model` and `ddp_model` operate on independent params so that we can compare?
Never mind, it's needed because of the numerical equivalence testing.
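For reference, a condensed sketch of why the copy matters: the test trains the plain model and the DDP-wrapped copy side by side and then compares parameters, so the two must not share storage (`train_step` and the iteration count are placeholders, not the test's actual code):

    import copy
    import torch
    from torch.nn.parallel import DistributedDataParallel

    # Assumes torch.distributed is already initialized for this process.
    ddp_model = DistributedDataParallel(copy.deepcopy(model))

    for _ in range(4):
        train_step(model)      # plain single-process update
        train_step(ddp_model)  # DDP update, gradients averaged across ranks

    # Without the deepcopy the two would share parameter tensors and this
    # equivalence check would pass vacuously.
    for p, ddp_p in zip(model.parameters(), ddp_model.parameters()):
        assert torch.allclose(p, ddp_p)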
test/test_c10d.py (Outdated)
self.assertEqual(len(gpus), 4, "expecting 4 gpus per process")
gpus = gpus[:4]
gpu_strs = list(map(lambda i: torch.device('cuda:' + str(i)), gpus))
self._test_gloo_backend(gpus, gpu_strs, True)
These two can be factored into another helper that calls `_test_gloo_backend`. At the top level it's good to have them be separate tests so that we see the ones that get skipped.
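A possible shape for that refactor, with hypothetical helper and test names (`gpus_for_rank` is assumed to be the existing helper in this test file); the same pattern covers the NCCL variant discussed next:

    def _test_multi_device(self, backend_fn):
        # Shared body; backend_fn is self._test_gloo_backend or
        # self._test_nccl_backend.
        gpus = gpus_for_rank(self.world_size)[self.rank]
        self.assertEqual(len(gpus), 4, "expecting 4 gpus per process")
        gpus = gpus[:4]
        devices = [torch.device('cuda:' + str(i)) for i in gpus]
        backend_fn(gpus, devices, True)

    def test_gloo_backend_four_gpu_module(self):
        # Kept as its own top-level test so a skip remains visible.
        self._test_multi_device(self._test_gloo_backend)

    def test_nccl_backend_four_gpu_module(self):
        self._test_multi_device(self._test_nccl_backend)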
test/test_c10d.py (Outdated)
self.assertEqual(len(gpus), 4, "expecting 4 gpus per process")
gpus = gpus[:4]
gpu_strs = list(map(lambda i: torch.device('cuda:' + str(i)), gpus))
self._test_nccl_backend(gpus, gpu_strs, True)
These two can be factored into another helper that calls `_test_nccl_backend`.
torch/nn/parallel/distributed.py (Outdated)
@@ -153,11 +153,22 @@ class DistributedDataParallel(Module):

     Args:
         module (Module): module to be parallelized
-        device_ids (list of int or torch.device): CUDA devices (default: all devices)
-        output_device (int or torch.device): device location of output (default: device_ids[0])
+        device_ids (list of int or torch.device): CUDA devices. This should
Extra whitespace -- is this intentional?
No, not intentional. I will edit, thanks!
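For illustration, the distinction the revised docstring is drawing (the new docstring text is truncated in the diff above; this sketch reflects my reading of the new semantics, with process-group setup and model definitions elided):

    # Single-device module: device_ids pins the one device it runs on.
    ddp_single = DistributedDataParallel(single_gpu_model, device_ids=[rank])

    # Multi-device module: it already spans several GPUs, so device_ids is
    # left unset and DDP works off the parameters' existing placement.
    ddp_multi = DistributedDataParallel(multi_gpu_model)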
Summary: Pull Request resolved: pytorch#19271. Allow DDP to take multi-GPU models.
Reviewed By: pietern
Differential Revision: D14822375
fbshipit-source-id: 8c8bcd4526643be5fa44134620d58fcf2c197238
This pull request has been merged in 6732358.
Summary: Pull Request resolved: pytorch#19271. Allow DDP to take multi-GPU models.
Reviewed By: pietern
Differential Revision: D14822375
fbshipit-source-id: 1eebfaa33371766d3129f0ac6f63a573332b2f1c