
Allow DDP to wrap multi-GPU modules #19271

Closed
wants to merge 1 commit

Conversation

mrshenli (Contributor) commented:

Summary: allow DDP to take multi-gpu models

Differential Revision: D14822375
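
For context, a rough sketch of the usage this change enables. The module, device placement, and names below are illustrative, not taken from this PR's diff; the sketch assumes the process group has already been initialized and that device_ids is left unset when the module itself already spans multiple GPUs.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# Illustrative multi-GPU module: each half lives on its own device.
class TwoGpuNet(nn.Module):
    def __init__(self, dev0, dev1):
        super(TwoGpuNet, self).__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.fc1 = nn.Linear(8, 8).to(dev0)
        self.fc2 = nn.Linear(8, 4).to(dev1)

    def forward(self, x):
        x = torch.relu(self.fc1(x.to(self.dev0)))
        return self.fc2(x.to(self.dev1))

# Assumes torch.distributed.init_process_group(...) already ran in this
# process; since the module spans devices, no device_ids are passed.
model = TwoGpuNet(torch.device('cuda:0'), torch.device('cuda:1'))
ddp_model = DistributedDataParallel(model)
```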

@mrshenli changed the title from "Allow DDP to wrap multi-GPU modules" to "[WIP][Don't Review] Allow DDP to wrap multi-GPU modules" on Apr 15, 2019
@mrshenli changed the title from "[WIP][Don't Review] Allow DDP to wrap multi-GPU modules" to "Allow DDP to wrap multi-GPU modules" on Apr 16, 2019
@pietern (Contributor) left a comment:

Some minor points. Looking good overall. Glad that we'll be able to support multi-device modules here!

model = QuadraGpuNet(gpus)

ddp_model = DistributedDataParallel(
copy.deepcopy(model),
pietern (Contributor):

No need for deepcopy here?

mrshenli (Contributor, Author):

Don't we need to make sure that model and ddp_model operate on independent params so that we can compare?

pietern (Contributor):

Never mind, it's needed because of the numerical equivalence testing.
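
A minimal illustration of the point being settled here (toy module, not the actual test): copy.deepcopy gives the reference model and the DDP-wrapped model independent parameter storage, so comparing their parameters afterwards is a meaningful equivalence check.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
reference = copy.deepcopy(model)  # independent storage, identical initial values

# Mutating one copy leaves the other untouched; without the deepcopy the
# two handles would alias the same tensors and the later comparison
# between model and ddp_model would be vacuous.
with torch.no_grad():
    model.weight.add_(1.0)

print(torch.equal(model.weight, reference.weight))  # False
```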

self.assertEqual(len(gpus), 4, "expecting 4 gpus per process")
gpus = gpus[:4]
gpu_strs = list(map(lambda i: torch.device('cuda:' + str(i)), gpus))
self._test_gloo_backend(gpus, gpu_strs, True)
pietern (Contributor):

These two can be factored into another helper that calls _test_gloo_backend. At the top level it's good to have them be separate tests so that we see the ones that get skipped.

self.assertEqual(len(gpus), 4, "expecting 4 gpus per process")
gpus = gpus[:4]
gpu_strs = list(map(lambda i: torch.device('cuda:' + str(i)), gpus))
self._test_nccl_backend(gpus, gpu_strs, True)
pietern (Contributor):

These two can be factored into another helper that calls _test_nccl_backend.
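
One way the factoring suggested in the last two comments could look; this is a sketch only, and everything except _test_gloo_backend / _test_nccl_backend is hypothetical rather than part of this PR's diff.

```python
import torch

class _Sketch(object):
    def _test_gloo_backend(self, gpus, devices, multi_device):
        pass  # stand-in for the real test runner

    def _test_nccl_backend(self, gpus, devices, multi_device):
        pass  # stand-in for the real test runner

    # Shared helper: trim to four GPUs, build torch.device objects, and
    # dispatch to the backend-specific runner.
    def _test_multi_device(self, runner, gpus):
        gpus = gpus[:4]
        devices = [torch.device('cuda:' + str(i)) for i in gpus]
        runner(gpus, devices, True)

    # The top-level tests stay separate so skipped ones remain visible.
    def test_gloo_backend_4gpu_module(self):
        gpus = list(range(4))  # placeholder for the suite's GPU selection
        self._test_multi_device(self._test_gloo_backend, gpus)

    def test_nccl_backend_4gpu_module(self):
        gpus = list(range(4))  # placeholder for the suite's GPU selection
        self._test_multi_device(self._test_nccl_backend, gpus)
```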

@@ -153,11 +153,22 @@ class DistributedDataParallel(Module):

Args:
module (Module): module to be parallelized
device_ids (list of int or torch.device): CUDA devices (default: all devices)
output_device (int or torch.device): device location of output (default: device_ids[0])
device_ids (list of int or torch.device): CUDA devices. This should
pietern (Contributor):

Extra whitespace -- is this intentional?

mrshenli (Contributor, Author):

No, not intentional. I will edit, thanks!

@pietern added the labels "oncall: distributed" (add this issue/PR to the distributed oncall triage queue) and "triaged" (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Apr 17, 2019
@pietern added this to the 1.1 milestone on Apr 17, 2019
Summary:
Pull Request resolved: pytorch#19271

allow DDP to take multi-gpu models

Reviewed By: pietern

Differential Revision: D14822375

fbshipit-source-id: 8c8bcd4526643be5fa44134620d58fcf2c197238
@facebook-github-bot (Contributor) commented:

This pull request has been merged in 6732358.

zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019
Summary:
Pull Request resolved: pytorch#19271

allow DDP to take multi-gpu models

Reviewed By: pietern

Differential Revision: D14822375

fbshipit-source-id: 1eebfaa33371766d3129f0ac6f63a573332b2f1c