Multi-GPU [Torch DataParallel] #6
Comments
I believe it's a CuPy thing. You'd probably have more success running with DistributedDataParallel, since each process has a totally separate CUDA handle, but maybe there's a way to fix it here? What's the error?
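For reference, a minimal sketch of the DistributedDataParallel route (one process per GPU, so each process gets its own CUDA/CuPy context). The launcher, environment variable, and tensor sizes here are assumptions for illustration, not something from this thread:

```python
# Launch with one process per GPU, e.g.: torchrun --nproc_per_node=2 train.py
import os
import torch
import torch.distributed as dist
from torchqrnn import QRNN

dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Each process builds its own model copy, so the CuPy kernel is compiled
# against that process's own device rather than being shared across GPUs.
model = QRNN(input_size=256, hidden_size=512, num_layers=2).cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

x = torch.randn(35, 8, 256, device=local_rank)  # (seq_len, batch, input_size)
output, hidden = model(x)
```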
Yep, CuPy thing for sure. Here's my stack trace: … The code is very simple. Will try DistributedDataParallel.
+1 for @jekbradbury's suggestion. My guess as to the line that needs to be updated for such support is pytorch-qrnn/torchqrnn/forget_mult.py, line 112 (as of commit 3aa5e72).
For now, setting …
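The likely shape of the fix is to compile (or cache) the CuPy kernel per device rather than once globally, so that DataParallel replicas running on other GPUs get a valid function handle. A hedged sketch using the same pynvrtc/CuPy machinery the repo relies on, with illustrative names rather than the repo's exact code:

```python
import torch
from cupy.cuda import function
from pynvrtc.compiler import Program

# Hypothetical per-device cache: device index -> compiled kernel handle.
_kernel_cache = {}

def get_forget_mult_kernel(kernel_source):
    """Compile the ForgetMult CUDA kernel for the current device, caching one
    compiled copy per GPU so each DataParallel replica gets its own handle."""
    device = torch.cuda.current_device()
    if device not in _kernel_cache:
        program = Program(kernel_source.encode(), 'recurrent_forget_mult.cu'.encode())
        ptx = program.compile()
        module = function.Module()
        module.load(bytes(ptx.encode()))
        _kernel_cache[device] = module.get_function('recurrent_forget_mult')
    return _kernel_cache[device]
```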
Actually, scratch that: I have a version that appears to be working with torch.nn.DataParallel. It's in master of https://github.com/Smerity/pytorch-qrnn, where I've updated forget_mult.py. For quite large matrices and sequences (I didn't want to go much larger, as the single GPU runs out of memory):
where "difference" is the total sum difference between the results from the single-GPU and two-GPU runs. Note: the speed-up could be even better (the single GPU sits at 100% utilization but the two GPUs sit at ~70% utilization when the batch is split), but then the experiment would take forever on a single GPU / run out of memory.
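A rough sketch of that kind of check, for anyone wanting to reproduce it (the sizes and the DataParallel dim choice are illustrative assumptions, not the exact script used above):

```python
import time
import torch
from torchqrnn import QRNN

seq_len, batch, input_size, hidden_size = 200, 128, 1024, 1024
x = torch.randn(seq_len, batch, input_size).cuda()

qrnn = QRNN(input_size, hidden_size, num_layers=1).cuda()
# dim=1 splits the batch (not the sequence) across GPUs for time-major input.
parallel_qrnn = torch.nn.DataParallel(qrnn, dim=1)

def timed_forward(model):
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        output, hidden = model(x)
    torch.cuda.synchronize()
    return output, time.time() - start

single_out, single_time = timed_forward(qrnn)
multi_out, multi_time = timed_forward(parallel_qrnn)
print('difference: %f' % (single_out - multi_out).abs().sum().item())
print('single GPU: %.3fs, two GPUs: %.3fs' % (single_time, multi_time))
```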
Sorry for the late reply @Smerity -- I pulled the changes in your branch and it works on my code! Running on multi-GPU now with torch.nn.DataParallel.
Glad you could test it -- and huzzah that it's working for you! ^_^ Any vague approximation of how much faster 4x GPU QRNN is than 4x GPU LSTM? Eh, I'll settle for "very fast" anyway -- really glad it's working for you ^_^ I'll merge this in now but will need to update the README to note completion, either if I get ahead of my paper deadline this Friday (lol) or early next week.
Will update numbers here when I get them for the 4x GPU comparison with LSTM -- had trouble getting a P100 scheduled. I also need to get QRNN working for 2+ layers; there's some size mismatch for that one, so it's not a "drop-in replacement" for LSTM yet, but I'm sure I can fix it.
Could you guys get it to work with torch.nn.DataParallel(model).cuda()? I could not, but perhaps I did not try hard enough. I can't tell if it's a wrong-GPU problem or whether CuPy won't support it. It runs pretty fast on 1x GPU though -- a bit faster than vanilla LSTM on 4x GPUs, but not by much without scaling to multiple GPUs...
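For reference, a minimal sketch of that DataParallel wrapping, assuming the per-device kernel fix from Smerity's branch is in place. One general PyTorch detail not stated in this thread: DataParallel scatters along dim=0 by default, which for QRNN's time-major (seq_len, batch, input_size) input is the sequence axis, so dim=1 is what actually splits the batch:

```python
import torch
from torchqrnn import QRNN

model = QRNN(input_size=256, hidden_size=512, num_layers=1).cuda()

# dim=1 scatters/gathers along the batch axis of (seq_len, batch, input_size)
# inputs; with a batch-first wrapper model the default dim=0 would already
# be the batch dimension.
parallel_model = torch.nn.DataParallel(model, dim=1)

x = torch.randn(35, 64, 256).cuda()  # (seq_len, batch, input_size)
output, hidden = parallel_model(x)
print(output.size())  # torch.Size([35, 64, 512])
```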