Data Parallel slows things down - ResNet 1001 #3917
Comments
I also met a similar problem before and I don't know why.
My guess would be that you're using a small batch size (64, so 16 per GPU), so the GPUs are not effectively used at 100%, but you have to pay the communication costs that make it slower. Check how the run times change if you increase the batch size (you still have plenty of free memory).
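For illustration, here is a minimal sketch of how DataParallel splits a batch (assuming a recent PyTorch and a 4-GPU machine; the layer and tensor sizes are only placeholders):

```python
import torch
import torch.nn as nn

# DataParallel scatters dim 0 of the input across the visible GPUs, so a
# batch of 64 split over 4 GPUs gives each replica only 16 samples.
model = nn.DataParallel(nn.Conv2d(3, 16, kernel_size=3, padding=1)).cuda()

batch = torch.randn(64, 3, 32, 32).cuda()  # the full batch starts out on GPU 0
output = model(batch)                      # scattered, run in parallel, gathered back on GPU 0
print(output.shape)                        # torch.Size([64, 16, 32, 32])
```

With a small per-GPU slice, each replica's kernels finish quickly and the fixed scatter/broadcast/gather costs dominate the iteration time.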
Thanks for the quick response! I used a batch size of 64 as that is what they used in the torch implementation. In fact Kaiming He has shown that, in their experiments, a minibatch size of 64 actually achieves better results than 128! (Which was obviously unexpected :) Increasing the batch size to 128 gives me roughly the same time to evaluate each batch (1.4 s) as with a batch size of 64 (but obviously will result in half the time per epoch!). I was just a bit confused as to why the speed-up was so much worse than the torch implementation - any ideas? Regards, Alex
@al3xsh on the other hand there's the "ImageNet in 1 hour" paper that shows that you can safely increase the batch size up to 40k and achieve pretty much the same accuracy, as long as you adjust the learning rate accordingly. It's hard to say what the reason for the slowdown is. It might be Python overhead, but I'd need to jump into the profiler to say that for sure. BTW, if you install the master branch from source you might see an improvement.
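For reference, the rule from that paper is usually summarized as scaling the learning rate linearly with the batch size (the numbers below are purely illustrative):

```python
# Linear scaling rule (illustrative values): if the batch size grows by a
# factor k, grow the learning rate by the same factor (with a warmup phase).
base_lr = 0.1
base_batch_size = 256

batch_size = 1024
lr = base_lr * batch_size / base_batch_size  # 0.4 for a 4x larger batch
```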
I ran your network through a profiler and identified a few bottlenecks. In general, we should be able to cut the iteration time by around a half in the multi-GPU case, and around 1/3 in the single GPU case. We have some initial plans sketched out, now we only need to implement them. Thanks for the report!
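For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using the autograd profiler (assuming a reasonably recent PyTorch; the model and sizes are stand-ins, not the network from this issue):

```python
import torch
import torch.nn as nn
from torch.autograd import profiler

model = nn.DataParallel(nn.Conv2d(3, 16, kernel_size=3, padding=1)).cuda()
batch = torch.randn(64, 3, 32, 32).cuda()

# Record a handful of iterations; use_cuda=True also captures GPU kernel times.
with profiler.profile(use_cuda=True) as prof:
    for _ in range(10):
        out = model(batch)
        out.sum().backward()

# Aggregate by op name and sort by total GPU time to spot the bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```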
@apaszke thanks for the prompt and detailed response! I have noticed that the first batch of the initial epoch takes a very long time (I assume because it's setting up and broadcasting all the initial parameters over multiple GPUs), and the time comes down after a couple of batches. Thanks for all the assistance! If you need any help with testing implementations then let me know! Regards, Alex
@apaszke Thanks for explaining DataParallel. From what I understand so far, one of the biggest advantages of DataParallel is that it allows a model to run with a larger minibatch size. Say my GPU0 is only capable of a batch size of 200; by using DataParallel on two GPUs it should be roughly 400. Am I right? But in my experiments, I got a GPU out-of-memory error immediately after increasing the batch size. My GPU memory usage is around 90% on GPU0 and 20% on the others.
@chenyangh is your code public? We've been getting such complaints, but couldn't really reproduce the problem.
@apaszke I think I just found out the reason. My code is still a mess and I am ashamed of sharing it right now. :(
If I don't delete the loss and decoder_logit, the memory usage will increase on the main GPU until it is full (with or without DataParallel). But with DataParallel, the loss and decoder_logit are actually not on the 'slave' GPUs, so the memory usage on the 'slave' GPUs is far less than on the 'master' GPU. I think I can say this issue is caused by a memory leak in the main thread.
They aren't on the other GPUs, but the graph on them should be alive as long as those outputs are. A minimal snippet that would help us reproduce it would still be helpful.
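To illustrate the pattern being discussed, here is a minimal sketch (with a synthetic model and data, not the original code) of accumulating the loss as a plain Python number so the graph behind it can be freed each iteration:

```python
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(32, 10)).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = 0.0
for _ in range(100):                              # stand-in for a real DataLoader
    inputs = torch.randn(64, 32).cuda()
    targets = torch.randint(0, 10, (64,)).cuda()

    logits = model(inputs)
    loss = criterion(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Accumulate a plain number, not the tensor itself, so the graph (and the
    # activations it keeps alive on the GPUs) can be released every iteration.
    running_loss += loss.item()   # on very old PyTorch: loss.data[0]
    del logits, loss              # optionally drop the references right away
```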
I tested CondenseNet on one 1080 Ti vs. four 1080 Tis. This is my test result: I just changed my batch size from 32 to 128; the dataset and model are the same. Why is the time 300 ms/batch on four 1080 Tis instead of 80 ms? When I increase the number of GPUs to 4, the amount of data also increases by 4 times, so shouldn't the time per batch stay about the same? Can someone help me?
This is my test code:
Closing this issue due to age. Please reopen if you're still experiencing this issue with a more recent PyTorch version. |
I am experiencing this issue currently. Long story short: an epoch using a batch size of 512 takes 4:50 on 1 Nvidia GTX 1080 Ti (without using DataParallel). A batch size of 512 takes 7:30 on 4 Nvidia GTX 1080 Tis (using DataParallel). And a batch size of 512*4=2048 takes 6:10 on 4 Nvidia GTX 1080 Tis (using DataParallel). Any larger batch size overflows CUDA memory. So, what is even the point of DataParallel if it makes training slower?
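When comparing numbers like these, it is worth warming up first and synchronizing before reading the clock, since CUDA calls are asynchronous and the first DataParallel iterations pay one-time setup costs. A minimal timing harness along those lines (the model and sizes are only placeholders):

```python
import time
import torch
import torch.nn as nn
import torchvision.models as models

model = nn.DataParallel(models.resnet50()).cuda()
batch = torch.randn(64, 3, 224, 224).cuda()

# Warm up: cuDNN autotuning and the initial parameter broadcast make the
# first iterations much slower than steady state.
for _ in range(3):
    model(batch).sum().backward()

torch.cuda.synchronize()          # kernel launches are asynchronous; sync before timing
start = time.time()
for _ in range(10):
    model(batch).sum().backward()
torch.cuda.synchronize()
print("{:.3f} s per iteration".format((time.time() - start) / 10))
```

If the per-iteration time still does not improve with more GPUs, the per-GPU batch is usually too small for the replicas to amortize the scatter/broadcast/gather overhead.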
Hi,
I am attempting to train ResNet 1001 on Cifar 10 using multiple GPUs (I have 4 x Titan X Maxwell cards) using DataParallel. However, it seems to train faster on one GPU than on multiple GPUs. I have attached screenshots below.
4 GPUs:
![4 GPUs](https://user-images.githubusercontent.com/10220372/33313461-9b8caaac-d422-11e7-882d-98c81c0fd2a4.png)
1 GPU:
![Single GPU](https://user-images.githubusercontent.com/10220372/33313474-a447228a-d422-11e7-9c53-88f3f9ade7c5.png)
(For reference, the torch implementation trains a batch in around 0.4 - 0.45s).
Using nvidia-smi I can verify that it is using multiple GPUs, but there isn't the speed-up I was expecting ...
My code uses:

```python
net = torch.nn.DataParallel(net).cuda()
```

to set up the network across all the available GPUs, and

```python
input = input.cuda(async=True)
target = target.cuda(async=True)
input_var = torch.autograd.Variable(input)
target_var = torch.autograd.Variable(target)
```

to send the variables to the GPUs.
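(Note for anyone reading this on a newer PyTorch: `async` became a reserved keyword in Python 3.7, so the argument was renamed to `non_blocking`, and wrapping tensors in `Variable` is no longer needed. A rough sketch of the equivalent setup, assuming `net` and a `DataLoader` built with `pin_memory=True`:)

```python
net = torch.nn.DataParallel(net).cuda()

for input, target in loader:
    input = input.cuda(non_blocking=True)    # replaces input.cuda(async=True)
    target = target.cuda(non_blocking=True)
    output = net(input)
```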
The full code is here: https://github.com/al3xsh/pytorch-models
Can anybody tell me what I am missing? (I'm sure it's something obvious :)
Regards,
Alex
NB: I am using version 0.2.0_4 and Python 3.6, installed via the conda instructions on the installation page.