[WIP] infiniband support #2903
Conversation
(force-pushed from b031f08 to bb09176)
@pytorchbot add to whitelist |
@ailzhang What's the status on this patch? :) |
I tried to compile from source with this patch and got complaints about a missing header. However, I do have the file gloo/transport/ibverbs/device.h. |
This error is caused by a missing header file in your temp build folder; it means "pytorch/torch/lib/tmp_install/include/gloo/transport/ibverbs/device.h" is missing. Could you please check that? |
Just figured it out. I had to set WITH_IBVERBS=1 in torch/lib/build_libs.sh. |
I tried a script similar to this one: https://github.com/pytorch/examples/blob/master/imagenet/main.py |
If I use torch.distributed to average the gradients instead of torch.nn.parallel.DistributedDataParallel, the code works fine and scales well with the number of GPUs. |
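A minimal sketch of what such manual averaging typically looks like (the helper name average_gradients is illustrative, not from this thread; newer PyTorch releases spell the op enum dist.ReduceOp, while releases of this era used dist.reduce_op):

```python
import torch.distributed as dist

def average_gradients(model):
    # Illustrative helper: after backward(), sum each parameter's
    # gradient across all workers, then divide by the world size.
    # Assumes the process group has already been initialized.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```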
Hi @xqding, could you share your script? I'm trying to debug this issue. Thanks! |
@ailzhang My script is essentially the same as https://github.com/pytorch/examples/blob/master/imagenet/main.py in distributed mode, except that I use a different dataset instead of the ImageNet dataset. |
Do you need the whole script to debug this? If so, I can try to reproduce the issue using the MNIST dataset. |
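For readers following along, the distributed setup in that example script amounts to roughly this (a sketch; the address, world size, rank, and stand-in model are placeholders, since the real script reads these from command-line arguments):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Placeholder rendezvous settings; the imagenet example builds these
# from its command-line arguments.
dist.init_process_group(backend='gloo',
                        init_method='tcp://127.0.0.1:23456',
                        world_size=2,
                        rank=0)

model = torch.nn.Linear(10, 2)          # stand-in for the real model
model = DistributedDataParallel(model)  # gradients synced automatically
```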
Hi @xqding, it would be good if you could share a code snippet of how you averaged the gradients using torch.distributed instead of DistributedDataParallel, thanks! |
When epoch = 0, the code works fine. When it starts epoch = 1, it crashes when it reaches the dist.all_reduce call, with the following error message: |
Hi @ailzhang, let me know if you need any other information. |
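To make the failure mode concrete, the loop being described has roughly this shape (a sketch with synthetic data; average_gradients is the illustrative helper sketched above, and the process group is assumed to be initialized):

```python
import torch

# Stand-ins for the real model, optimizer, and data pipeline.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(2):    # reportedly fine at epoch 0, crashes at epoch 1
    for _ in range(100):
        data = torch.randn(32, 10)            # synthetic batch
        target = torch.randint(0, 2, (32,))
        optimizer.zero_grad()
        criterion(model(data), target).backward()
        average_gradients(model)  # reported crash occurs in this all_reduce
        optimizer.step()
```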
Hi @xqding, I can reproduce the problem intermittently on my machine. I wonder whether your workaround below solves it permanently. If so, could you share that part? It may help me locate which part of DistributedDataParallel causes the timeout. Thanks! |
Here is a summary of what I have tried: it works fine for the first epoch, but crashes once it starts the second epoch with the following error: |
@xqding I found that your code snippet is quite like this one. However, Gloo uses TCP by default. I wonder how to use Gloo with IBVERBS. Thanks. |
Have you solved this problem? How do you use Gloo with IBVERBS? |
Currently PyTorch supports using one IB card. By default the first one in the device list is used, or you can manually specify the device by setting .name = "mlx5_1" in Gloo's ibverbs device attributes.