Why not data_parallel? #34

Closed · LMescheder opened this issue Oct 20, 2017 · 4 comments

@LMescheder

I wonder why you implemented multi-GPU training with a custom event loop instead of using torch.nn.DataParallel. I suppose it is for performance reasons?
If so, what is the main bottleneck in data_parallel that prevents you from using it? Do you have an estimate of the speedup compared to the (simpler) DataParallel solution?

@myleott
Contributor

myleott commented Oct 20, 2017

Yes, it's for performance reasons.

DataParallel relies on Python threading, which is slow due to the GIL [1][2]. When we initially tried nn.DataParallel, multi-GPU training was actually slower than single-GPU training (e.g., training on one GPU was faster than on four).

The custom event loop in fairseq-py uses multiprocessing (i.e., one Process per GPU), which gets around the GIL and gives much better multi-GPU performance. We typically see ~5.5-6x speedup with 8 GPUs.

[1] OpenNMT/OpenNMT-py#89 (comment)
[2] pytorch/pytorch#54
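
For readers unfamiliar with the pattern, here is a minimal sketch of the one-process-per-GPU idea, not fairseq-py's actual event loop: each worker is a separate OS process, so the GIL never serializes the per-GPU work the way the threads in nn.DataParallel do. The toy model and training loop are placeholders, and gradient synchronization across workers (e.g., an all-reduce) is deliberately omitted.

```python
# Sketch only: one Process per GPU via torch.multiprocessing.
# Gradient synchronization between workers is omitted for brevity.
import torch
import torch.multiprocessing as mp

def worker(rank, num_gpus):
    torch.cuda.set_device(rank)                      # pin one GPU per process
    device = f'cuda:{rank}'
    model = torch.nn.Linear(128, 10).to(device)      # toy model as a stand-in
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):
        # in practice each of the num_gpus processes reads its own data shard
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)

        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        # real multi-GPU training would all-reduce/average gradients here
        optimizer.step()

if __name__ == '__main__':
    num_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(num_gpus,), nprocs=num_gpus)
```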

@LMescheder
Author

Okay, I see. Thanks for the prompt reply. Have you tried DistributedDataParallel, which shouldn't have issues with the GIL? Amazing work by the way!

@myleott
Contributor

myleott commented Oct 24, 2017

I haven't tried DistributedDataParallel yet, but it looks promising. I'll look into it when I get some time. This discussion also seems relevant (albeit a little discouraging):

myleott closed this as completed Oct 24, 2017
@jekbradbury

It's definitely worked for our use cases, including speech and MT. I think it's ultimately very similar to the implementation you built into fairseq, except that the user must explicitly launch N copies of the script, and each copy should have its own data loader or data loader shard.
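
For reference, a minimal sketch of that launch-N-copies pattern with DistributedDataParallel and a per-process data shard. It assumes a single node where the process rank doubles as the GPU index, and that RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set in the environment by whatever launches the N copies of the script; the model and dataset are placeholders.

```python
# Sketch: run one copy of this script per GPU, with RANK / WORLD_SIZE /
# MASTER_ADDR / MASTER_PORT set in the environment for each copy.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend='nccl', init_method='env://')
    rank = dist.get_rank()             # single node assumed: rank == GPU index
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 10).cuda(rank)            # toy model
    ddp_model = DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # each process gets its own shard of the data via DistributedSampler
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for x, y in loader:
        x, y = x.cuda(rank), y.cuda(rank)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
        loss.backward()        # DDP all-reduces gradients during backward
        optimizer.step()

if __name__ == '__main__':
    main()
```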
