
DataParallel copies the model onto GPUs sequentially #51385

Open
vadimkantorov opened this issue Jan 29, 2021 · 5 comments
Labels
enhancement (Not as big of a feature, but technically not a bug; should be easy to fix) · module: data parallel · module: performance (Issues related to performance, either of kernel code or framework glue) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

vadimkantorov (Contributor) commented Jan 29, 2021

I have 8 GPUs and can watch the copies happen one after another with watch -n 0.1 nvidia-smi: memory fills up on each GPU sequentially. Could some time be saved by doing the copies asynchronously, in parallel?

The same seems to apply to scattering the input batch across the devices, though I'm less sure about that.

P.S. I know that DP is discouraged in favor of DDP, but for legacy code and for simplicity (also for easier recovery from OOMs and exceptions, and easier logging), DP still remains important.

cc @VitalyFedyunin @ngimel
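
For concreteness, here is a rough micro-benchmark sketch (not part of the issue; the tensor size, iteration count, and helper names are arbitrary) comparing the one-device-at-a-time copy pattern described above against torch.cuda.comm.broadcast, which can use NCCL / peer-to-peer copies where available:

import time
import torch
import torch.cuda.comm as comm

def bench(fn, iters=5):
    fn()  # warm-up
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)
    return (time.perf_counter() - start) / iters

devices = list(range(torch.cuda.device_count()))
params = torch.randn(64 * 1024 * 1024, device="cuda:0")  # ~256 MB stand-in for the model

# sequential, blocking copies: each GPU is filled one after another
seq = bench(lambda: [params.cuda(d) for d in devices[1:]])

# library broadcast from cuda:0 to all devices in one call
bcast = bench(lambda: comm.broadcast(params, devices))

print(f"sequential copies: {seq * 1e3:.1f} ms   comm.broadcast: {bcast * 1e3:.1f} ms")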

vadimkantorov changed the title from "DataParalell copies the model onto GPUs sequentially" to "DataParallel copies the model onto GPUs sequentially" on Jan 29, 2021
ezyang added the enhancement, module: data parallel, module: performance, and triaged labels on Feb 1, 2021
ezyang (Contributor) commented Feb 1, 2021

This sounds pretty reasonable

vadimkantorov (Contributor, Author) commented:

Maybe even a tree-style broadcast (something like scatter_all) could be used: if there isn't enough bandwidth to copy to all devices at once, then once the first destination device has received its copy, it could itself forward the copy to other devices in parallel with the first GPU, roughly halving the total copy time (and, repeating the idea, the number of replicas can double every round; see the sketch below).
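
A rough illustration of that idea (not an existing PyTorch API; tree_broadcast is made up for the sketch, and whether the per-round device-to-device copies actually overlap depends on peer access being available):

import torch

def tree_broadcast(tensor, devices):
    """`tensor` lives on devices[0]; returns one copy of it per device."""
    copies = {devices[0]: tensor}
    pending = list(devices[1:])
    while pending:
        sources = list(copies.items())
        this_round, pending = pending[:len(sources)], pending[len(sources):]
        # every device that already holds a copy feeds one new device, so the
        # number of replicas doubles each round (3 rounds for 8 GPUs instead
        # of 7 back-to-back copies out of GPU 0)
        for (src_dev, src), dst_dev in zip(sources, this_round):
            copies[dst_dev] = src.to(f"cuda:{dst_dev}", non_blocking=True)
        for d in devices:
            torch.cuda.synchronize(d)  # wait for this round's copies to land
    return [copies[d] for d in devices]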

ngimel (Collaborator) commented Feb 1, 2021

Are you talking about the initial parameter copy? During training, broadcast/reduce should not copy data to the CPU.

vadimkantorov (Contributor, Author) commented Feb 1, 2021

Yes, the initial parameter copy. But the model is already on the first GPU, since I'm doing:

model = Model().cuda()          # parameters now live on cuda:0
model = nn.DataParallel(model)  # replication to the remaining GPUs happens inside DataParallel

So the replication should already be amenable to direct inter-GPU (peer-to-peer) copies, where that's enabled.
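
As a concrete (hypothetical) sketch of what that could look like, one can check peer access between devices and broadcast all parameters from cuda:0 in a single coalesced call via the public torch.cuda.comm helpers; whether the underlying copies actually overlap presumably depends on NCCL / peer access being available on the machine. The model below is just a placeholder:

import torch
import torch.cuda.comm as comm
import torch.nn as nn

devices = list(range(torch.cuda.device_count()))

# peer access lets GPU-to-GPU copies bypass host memory entirely
for d in devices[1:]:
    print(f"cuda:0 -> cuda:{d} peer access:", torch.cuda.can_device_access_peer(0, d))

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda(0)

# broadcast all parameters living on cuda:0 to every device in one coalesced call;
# replicated[i] is the list of parameter copies placed on devices[i]
params = [p.detach() for p in model.parameters()]
replicated = comm.broadcast_coalesced(params, devices)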

ngimel (Collaborator) commented Feb 1, 2021

We will accept a PR implementing that.
