Conversation

ailzhang
Contributor

Currently PyTorch supports using one IB card. By default the first one in the device list is used, or you can manually specify the device name by setting .name="mlx5_1".

@ailzhang ailzhang force-pushed the master branch 4 times, most recently from b031f08 to bb09176 on September 29, 2017 20:08
@soumith
Member

soumith commented Sep 29, 2017

@pytorchbot add to whitelist

@ezyang
Contributor

ezyang commented Oct 20, 2017

@ailzhang What's the status on this patch? :)

@xqding

xqding commented Oct 25, 2017

I tried to compile from source with this patch and got the following error:
pytorch/torch/lib/THD/base/data_channels/DataChannelGloo.cpp:7:43: fatal error: gloo/transport/ibverbs/device.h: No such file or directory
 #include "gloo/transport/ibverbs/device.h"

However, I do have the file gloo/transport/ibverbs/device.h.
Any thoughts on it?

@ailzhang
Contributor Author

This error is caused by a missing header file in your temp build folder, i.e. "pytorch/torch/lib/tmp_install/include/gloo/transport/ibverbs/device.h" is missing. Could you please check that?

@xqding

xqding commented Oct 25, 2017

Just figured it out. I had to turn on WITH_IBVERBS=1 in the file torch/lib/build_libs.sh.
Now it compiles fine. I will see if it works soon. Thanks.

@xqding

xqding commented Oct 25, 2017

I tried a script similar to this one: https://github.com/pytorch/examples/blob/master/imagenet/main.py.
Here is the error message:

Traceback (most recent call last):
  File "./script/train_dist_gloo.py", line 99, in <module>
    outputs = net(inputs)
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 263, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 156, in forward
    self._sync_params()
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 187, in _sync_params
    dist.broadcast(flat_buffers, 0)
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/distributed/__init__.py", line 198, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: [/home/yuyou/downloads/mypytorch2/pytorch/torch/lib/gloo/gloo/transport/ibverbs/buffer.cc:108] Read timeout LID: 34 QPN: 7435 PSN: 13656770
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /home/yuyou/downloads/mypytorch2/pytorch/torch/lib/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /home/yuyou/downloads/mypytorch2/pytorch/torch/lib/gloo/gloo/cuda.cu:249: driver shutting down

@xqding

xqding commented Oct 26, 2017

If I use torch.distributed to average the gradients myself instead of using torch.nn.parallel.DistributedDataParallel, the code works fine and scales well with the number of GPUs.
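
In other words, after loss.backward() each parameter's gradient is all-reduced across the workers and divided by the world size before optimizer.step(). A minimal sketch of that step, lifted from the training loop shared later in this thread (the helper name average_gradients is just for illustration):

import torch.distributed as dist

def average_gradients(model):
    # sum each parameter's gradient across all workers, then normalize
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
        param.grad.data /= world_size

# call this between loss.backward() and optimizer.step()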

@ailzhang
Contributor Author

Hi @xqding, could you share your script? I'm trying to debug this issue. Thanks!

@xqding

xqding commented Oct 28, 2017

@ailzhang My script is essentially the same as https://github.com/pytorch/examples/blob/master/imagenet/main.py in distributed mode, except that I use a different dataset instead of the ImageNet dataset.
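
For reference, the distributed setup in that example boils down to roughly the following sketch (the address and world size here are placeholders for the example's --dist-url and --world-size flags):

import torch.distributed as dist

# initialize the process group with the Gloo backend, as the example does
dist.init_process_group(backend='gloo',
                        init_method='tcp://10.0.0.1:23456',  # placeholder address
                        world_size=2)

# the model is then wrapped in DistributedDataParallel, e.g.
# net = torch.nn.parallel.DistributedDataParallel(net)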

@xqding

xqding commented Oct 28, 2017

Do you need the whole script to debug this? If so, I can try to reproduce the issue using the MNIST dataset.

@ailzhang
Contributor Author

Hi @xqding, it would be good if you could share a code snippet showing how you averaged the gradients using torch.distributed instead of DistributedDataParallel, thanks!

@xqding

xqding commented Oct 28, 2017

import numpy as np
import torch
import torch.distributed as dist
from torch.autograd import Variable
from torch.utils.data import DataLoader

# train_data, category_ids, net, criterion, optimizer and num_epoches
# are defined earlier in the full script.
train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
train_loader = DataLoader(train_data,
                          batch_size=32,
                          sampler=train_sampler,
                          num_workers=2)

for epoch in range(num_epoches):  # loop over the dataset multiple times
    running_loss = 0.0
    print("Epoch: {}".format(epoch))
    train_sampler.set_epoch(epoch)
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs = data['image']
        labels = data['category_id']
        labels = np.array([category_ids.index(l) for l in labels])
        print("i: {}".format(i))

        print("labels", labels)
        # wrap them in Variable
        inputs = inputs.cuda(async=True)
        labels = torch.from_numpy(labels).cuda(async=True)
        inputs, labels = Variable(inputs), Variable(labels)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()

        # average gradients across workers before the optimizer step
        size = float(dist.get_world_size())
        for param in net.parameters():
            dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
            param.grad.data /= size
        optimizer.step()

When epoch = 0, the code works fine. When it starts epoch = 1, it crashes when it reaches the dist.all_reduce call with the following error message:

Traceback (most recent call last):
  File "./script/train_dist_new.py", line 126, in <module>
    dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/distributed/__init__.py", line 216, in all_reduce
    return torch._C._dist_all_reduce(tensor, op, group)
RuntimeError: [/home/yuyou/downloads/mypytorch2/pytorch/torch/lib/gloo/gloo/transport/ibverbs/buffer.cc:108] Read timeout LID: 45 QPN: 29770 PSN: 11590737



@xqding

xqding commented Oct 28, 2017

Hi @ailzhang, let me know if you need any other information.

@ailzhang
Contributor Author

ailzhang commented Oct 29, 2017

Hi @xqding, I can reproduce the problem intermittently on my machine. I wonder whether your workaround below solves the problem permanently. If so, could you share that part, so it may help me locate which part of DistributedDataParallel caused the timeout? Thanks!

If I use torch.distributed to average the gradients myself instead of using torch.nn.parallel.DistributedDataParallel, the code works fine and scales well with the number of GPUs.

@xqding

xqding commented Oct 31, 2017

Here is a summary of what I have tried:

  1. Using DistributedDataParallel gives me the following error:
Traceback (most recent call last):
  File "./script/train_dist_gloo.py", line 99, in <module>
    outputs = net(inputs)
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 263, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 156, in forward
    self._sync_params()
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 187, in _sync_params
    dist.broadcast(flat_buffers, 0)
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/distributed/__init__.py", line 198, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: [/home/yuyou/downloads/mypytorch2/pytorch/torch/lib/gloo/gloo/transport/ibverbs/buffer.cc:108] Read timeout LID: 34 QPN: 7435 PSN: 13656770
  2. Instead of using DistributedDataParallel, I tried to train an independent model on each node and average the gradients using the following code:
for epoch in range(num_epoches):  # loop over the dataset multiple times
    running_loss = 0.0
    print("Epoch: {}".format(epoch))
    train_sampler.set_epoch(epoch)
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs = data['image']
        labels = data['category_id']
        labels = np.array([category_ids.index(l) for l in labels])
        print("i: {}".format(i))

        print("labels", labels)
        # wrap them in Variable
        inputs = inputs.cuda(async=True)
        labels = torch.from_numpy(labels).cuda(async=True)
        inputs, labels = Variable(inputs), Variable(labels)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()

        # average gradients across workers before the optimizer step
        size = float(dist.get_world_size())
        for param in net.parameters():
            dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
            param.grad.data /= size
        optimizer.step()

It works fine for the first epoch. It crashes once it starts the second epoch with the following error:

Traceback (most recent call last):
  File "./script/train_dist_new.py", line 126, in <module>
    dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
  File "/home/yuyou/apps/anaconda3/lib/python3.5/site-packages/torch/distributed/__init__.py", line 216, in all_reduce
    return torch._C._dist_all_reduce(tensor, op, group)
RuntimeError: [/home/yuyou/downloads/mypytorch2/pytorch/torch/lib/gloo/gloo/transport/ibverbs/buffer.cc:108] Read timeout LID: 45 QPN: 29770 PSN: 11590737
  3. My workaround for now is to combine multiple epochs of data into one epoch when I define the Dataset: looping over the DataLoader once is then the same as looping over my training data multiple times. It works pretty well so far (see the sketch below).
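
The workaround amounts to a wrapper Dataset that repeats the underlying data, so that a single pass over the DataLoader covers several logical epochs. A minimal sketch, assuming a map-style base dataset (the class name RepeatedDataset and its parameters are hypothetical, not part of the script above):

from torch.utils.data import Dataset

class RepeatedDataset(Dataset):
    """Present `base` repeated `repeats` times as one long dataset."""
    def __init__(self, base, repeats):
        self.base = base
        self.repeats = repeats

    def __len__(self):
        return len(self.base) * self.repeats

    def __getitem__(self, idx):
        # map the long index back onto the underlying dataset
        return self.base[idx % len(self.base)]

# e.g. train_data = RepeatedDataset(train_data, repeats=num_epoches)
# before constructing the DistributedSampler and DataLoader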

@zjoe zjoe mentioned this pull request Nov 20, 2017
@OnezeroW

@xqding Your code snippet looks quite similar to this example.
I know that https://github.com/pytorch/examples/blob/master/imagenet/main.py uses Gloo as the default backend, as in the following code:
parser.add_argument('--dist-backend', default='gloo', type=str, help='distributed backend')

However, Gloo uses TCP by default. I wonder how to use Gloo with IBVERBS. Thanks.

@401qingkong

@xqding Your code snippet looks quite similar to this example.
I know that https://github.com/pytorch/examples/blob/master/imagenet/main.py uses Gloo as the default backend, as in the following code:
parser.add_argument('--dist-backend', default='gloo', type=str, help='distributed backend')

However, Gloo uses TCP by default. I wonder how to use Gloo with IBVERBS. Thanks.

Have you solved this problem? How can Gloo be used with IBVERBS?
