Why not data_parallel? #34

Closed · LMescheder opened this issue Oct 20, 2017 · 4 comments

@LMescheder

I wonder why you implemented multi-GPU training with a custom event loop instead of using torch.nn.DataParallel. I suppose it is for performance reasons?
If so, what is the main bottleneck in data_parallel that prevents you from using it? Do you have an estimate of the speedup compared to the (simpler) DataParallel solution?

@myleott
Contributor

myleott commented Oct 20, 2017

Yes, it's for performance reasons.

DataParallel relies on Python threading, which is slow due to the GIL [1][2]. When we initially tried nn.DataParallel, multi-GPU training was actually slower than single-GPU training (e.g., training on one GPU was faster than on four).

The custom event loop in fairseq-py uses multiprocessing (i.e., one Process per GPU), which gets around the GIL and gives much better multi-GPU performance. We typically see ~5.5-6x speedup with 8 GPUs.

[1] OpenNMT/OpenNMT-py#89 (comment)
[2] pytorch/pytorch#54
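
For readers unfamiliar with the pattern, here is a minimal sketch of the one-process-per-GPU idea, not fairseq-py's actual event loop: each worker is a separate OS process, so the GIL never serializes the per-GPU work the way the threads in nn.DataParallel do. The toy model and training loop are placeholders, and gradient synchronization across workers (e.g., an all-reduce) is deliberately omitted.

```python
# Sketch only: one Process per GPU via torch.multiprocessing.
# Gradient synchronization between workers is omitted for brevity.
import torch
import torch.multiprocessing as mp

def worker(rank, num_gpus):
    torch.cuda.set_device(rank)                      # pin one GPU per process
    device = f'cuda:{rank}'
    model = torch.nn.Linear(128, 10).to(device)      # toy model as a stand-in
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):
        # in practice each of the num_gpus processes reads its own data shard
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)

        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        # real multi-GPU training would all-reduce/average gradients here
        optimizer.step()

if __name__ == '__main__':
    num_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(num_gpus,), nprocs=num_gpus)
```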

@LMescheder
Author

Okay, I see. Thanks for the prompt reply. Have you tried DistributedDataParallel, which shouldn't have issues with the GIL? Amazing work by the way!

@myleott
Contributor

myleott commented Oct 24, 2017

I haven't tried DistributedDataParallel yet, but it looks promising. I'll look into it when I get some time. This discussion also seems relevant (albeit a little discouraging):

myleott closed this as completed Oct 24, 2017
@jekbradbury

It's definitely worked for our use cases, including speech and MT. I think it's ultimately very similar to the implementation you built into fairseq, except that the user must explicitly launch N copies of the script, and each copy should have its own data loader or data loader shard.
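
For reference, a minimal sketch of that launch-N-copies pattern with DistributedDataParallel and a per-process data shard. It assumes a single node where the process rank doubles as the GPU index, and that RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set in the environment by whatever launches the N copies of the script; the model and dataset are placeholders.

```python
# Sketch: run one copy of this script per GPU, with RANK / WORLD_SIZE /
# MASTER_ADDR / MASTER_PORT set in the environment for each copy.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend='nccl', init_method='env://')
    rank = dist.get_rank()             # single node assumed: rank == GPU index
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 10).cuda(rank)            # toy model
    ddp_model = DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # each process gets its own shard of the data via DistributedSampler
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for x, y in loader:
        x, y = x.cuda(rank), y.cuda(rank)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
        loss.backward()        # DDP all-reduces gradients during backward
        optimizer.step()

if __name__ == '__main__':
    main()
```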
