
Data Parallel slows things down - ResNet 1001 #3917

Closed
al3xsh opened this issue Nov 28, 2017 · 15 comments

@al3xsh

al3xsh commented Nov 28, 2017

Hi,

I am attempting to train ResNet 1001 on CIFAR-10 across multiple GPUs (I have 4 x Titan X Maxwell cards) using DataParallel. However, it trains faster on one GPU than on multiple GPUs. I have attached screenshots below.

4 GPUs:
[screenshot: training output with 4 GPUs]

1 GPU:
[screenshot: training output with a single GPU]

(For reference, the torch implementation trains a batch in around 0.4 - 0.45s).

Using nvidia-smi I can verify that it is using multiple GPUs:

[screenshot: nvidia-smi output showing all four GPUs in use]

But I'm not seeing the speed-up I was expecting ...

My code uses:

net = torch.nn.DataParallel(net).cuda()

to set up the network across all the available GPUs, and

input = input.cuda(async=True)
target = target.cuda(async=True)
input_var = torch.autograd.Variable(input)
target_var = torch.autograd.Variable(target)

to send the variables to the GPUs.

The full code is here: https://github.com/al3xsh/pytorch-models
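
For reference, the core of my training loop looks roughly like this (a simplified sketch; `build_resnet_1001()` and `train_loader` stand in for the real model constructor and CIFAR-10 loader in the repo, and the hyperparameters here are only illustrative):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

net = build_resnet_1001()                 # placeholder for the actual model constructor
net = nn.DataParallel(net).cuda()         # replicate the model over all visible GPUs

criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

for input, target in train_loader:        # train_loader: the CIFAR-10 DataLoader
    input = input.cuda(async=True)        # `async=True` is the 0.2-era name for non_blocking=True
    target = target.cuda(async=True)
    input_var = Variable(input)
    target_var = Variable(target)

    output = net(input_var)               # DataParallel scatters the batch across the GPUs
    loss = criterion(output, target_var)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```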

Can anybody tell me what I'm missing? (I'm sure it's something obvious :)

Regards,

Alex

NB: I am using version 0.2.0_4 and Python 3.6, installed via the conda instructions on the installation page.

@zhengyunqq

I also ran into a similar problem before, and I don't know why.

@apaszke
Contributor

apaszke commented Nov 28, 2017

My guess would be that you're using a small batch size (64, so 16 per GPU), so the GPUs aren't being used effectively at 100%, but you still have to pay the communication costs, which makes it slower. Check how the run times change if you increase the batch size (you still have plenty of free memory).
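
Something along these lines should show whether the per-iteration time actually improves with a bigger batch (a rough sketch only; `make_loader(batch_size)` is a placeholder for however you build the CIFAR-10 loader, and `net`, `criterion`, `optimizer` are the ones from your script):

```python
import time
import torch
from torch.autograd import Variable

def time_iterations(net, loader, criterion, optimizer, n_iters=20):
    # Average seconds per training iteration, skipping the first (warm-up) one.
    times = []
    for i, (input, target) in enumerate(loader):
        if i >= n_iters:
            break
        torch.cuda.synchronize()          # don't count leftover GPU work from the previous step
        start = time.time()
        output = net(Variable(input.cuda()))
        loss = criterion(output, Variable(target.cuda()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()          # CUDA launches are asynchronous; wait before stopping the clock
        if i > 0:
            times.append(time.time() - start)
    return sum(times) / len(times)

for bs in (64, 128, 256):
    loader = make_loader(batch_size=bs)   # placeholder helper returning a DataLoader
    print("batch size %d: %.3fs per iteration"
          % (bs, time_iterations(net, loader, criterion, optimizer)))
```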

@al3xsh
Author

al3xsh commented Nov 28, 2017

@apaszke

Thanks for the quick response! I used a batch size of 64 as that is what they used in the torch implementation. In fact Kaiming He has shown that, in their experiments, a minibatch size of 64 actually achieves better results than 128!

(Which was obviously unexpected :)

Increasing the batch size to 128 gives me roughly the same time per batch (1.4 s) as a batch size of 64 (which will obviously halve the time per epoch!).

I was just a bit confused as to why the speed up was so much worse than the torch implementation - any ideas?

Regards,

Alex

@apaszke
Contributor

apaszke commented Nov 28, 2017

@al3xsh on the other hand, there's the "ImageNet in 1 hour" paper that shows you can safely increase the batch size up to ~8k and achieve pretty much the same accuracy, as long as you scale the learning rate linearly with the batch size.

It's hard to say what the reason for the slowdown is. It might be Python overhead, but I'd need to jump into the profiler to say for sure. BTW, if you install the master branch from source you might see an improvement.
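
If you want to look at it yourself in the meantime, the autograd profiler can break an iteration down per op. A minimal sketch (assuming `net`, `loader`, `criterion` and `optimizer` from your training script, and a PyTorch version recent enough to ship `torch.autograd.profiler`):

```python
import torch
from torch.autograd import profiler

with profiler.profile(use_cuda=True) as prof:
    for i, (input, target) in enumerate(loader):
        if i == 5:                        # a handful of iterations is enough for a profile
            break
        output = net(torch.autograd.Variable(input.cuda()))
        loss = criterion(output, torch.autograd.Variable(target.cuda()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Ops sorted by total CUDA time; if copies/broadcasts dominate, the overhead is
# communication rather than compute.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```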

@apaszke
Contributor

apaszke commented Nov 28, 2017

I ran your network through a profiler and identified a few bottlenecks. In general, we should be able to cut the iteration time by around a half in the multi-GPU case, and by around a third in the single-GPU case. We have some initial plans sketched out; now we only need to implement them. Thanks for the report!

@apaszke
Contributor

apaszke commented Nov 29, 2017

I opened #3929 and #3930 that should fix the problems in your code. I'll keep this issue open until they are solved, as a reference.

@al3xsh
Author

al3xsh commented Nov 29, 2017

@apaszke thanks for the prompt and detailed response!

I have noticed that the first batch of the initial epoch takes a very long time (I assume because it's setting up and broadcasting all the initial parameters over multiple GPUs), and the time comes down after a couple of batches.
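
A rough way to separate that warm-up cost from the steady-state time (the first iterations also pay for CUDA context creation and cuDNN algorithm selection, not just the parameter broadcast) is to time the first few iterations individually; a sketch, with `net`, `train_loader`, `criterion` and `optimizer` as in my script:

```python
import time
import torch
from torch.autograd import Variable

for i, (input, target) in enumerate(train_loader):
    torch.cuda.synchronize()
    start = time.time()
    output = net(Variable(input.cuda(async=True)))
    loss = criterion(output, Variable(target.cuda(async=True)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()      # wait for the queued GPU work before reading the clock
    print("iteration %d: %.3fs" % (i, time.time() - start))
    if i == 10:                   # the first one or two readings should be the outliers
        break
```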

Thanks for all the assistance! If you need any help with testing implementations then let me know!

Regards,

Alex

@soumith soumith added this to distributed/multiGPU in Issue Categories Dec 1, 2017
@chenyangh

@apaszke Thanks for explaining DataParallel. From what I understand so far, one of the biggest advantages of DataParallel is that it lets a model run with a larger minibatch size. Say my GPU0 is only capable of a batch size of 200; by using DataParallel on two GPUs, it should handle roughly 400. Am I right? But in my experiments, I got a GPU out-of-memory error immediately after increasing the batch size. My GPU memory usage is about 90% on GPU0 and 20% on the others.
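
To be concrete, this toy sketch is how I understand the scatter/gather (not my actual model, just an illustration):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

class ChunkReporter(nn.Module):
    # Toy module: report which GPU each replica runs on and how big its input chunk is.
    def forward(self, x):
        print("GPU %d got a chunk of size %s"
              % (torch.cuda.current_device(), list(x.size())))
        return x.sum(1)

model = nn.DataParallel(ChunkReporter()).cuda()
inp = Variable(torch.randn(400, 8).cuda())      # one batch of 400 samples, placed on GPU 0
out = model(inp)                                # each GPU should report roughly 400 / n_gpus samples
print("gathered output size:", list(out.size()))   # results are gathered back onto GPU 0
```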

@apaszke
Contributor

apaszke commented Dec 8, 2017

@chenyangh is your code public? We've been getting complaints like this, but couldn't really reproduce the problem.

@chenyangh

@apaszke I think I just found the reason. My code is still a mess and I am ashamed of sharing it right now. :(
The uneven GPU usage in my case is caused by some intermediate tensors in the main loop. I don't quite know how you manage variables that have lost their references, but they seem to stay in GPU memory unless I specifically `del` them.
My main loop looks like this:

        for i, (data, label) in enumerate(data_loader):
            decoder_logit = model(Variable(data).cuda())
            optimizer.zero_grad()
            loss = loss_criterion(decoder_logit, label)
            train_loss_sum += loss.data[0]   # already a Python float, so no graph is kept through it
            loss.backward()
            optimizer.step()
            # del loss, decoder_logit        # uncommenting this keeps the memory usage flat

If I don't delete `loss` and `decoder_logit`, the memory usage keeps increasing on the main GPU until it is full (with or without DataParallel). But with DataParallel, `loss` and `decoder_logit` are not on the 'slave' GPUs, so the memory usage on the slave GPUs is far lower than on the master GPU.

I think I can say this issue is caused by a memory leak in the main thread.
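
For anyone hitting the same thing on newer PyTorch versions, `torch.cuda.memory_allocated()` makes it easy to watch whether memory really grows; a sketch using the same names as the loop above (`loss.item()` plays the role of `loss.data[0]` there):

```python
import torch

train_loss_sum = 0.0
for i, (data, label) in enumerate(data_loader):
    decoder_logit = model(data.cuda())
    optimizer.zero_grad()
    loss = loss_criterion(decoder_logit, label.cuda())
    train_loss_sum += loss.item()    # accumulate a plain float, not the tensor
    loss.backward()
    optimizer.step()
    del loss, decoder_logit          # drop the last references so the graph can be freed
    if i % 50 == 0:
        print("iter %d: %.1f MB allocated on GPU 0"
              % (i, torch.cuda.memory_allocated(0) / 1024 ** 2))
```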

@apaszke
Contributor

apaszke commented Jan 10, 2018

They aren't on the other GPUs, but the graph on them should be alive as long as those outputs are. A minimal snippet that would help us reproduce it would still be helpful.

@younghe

younghe commented Oct 9, 2018

I tested CondenseNet on 1 1080 Ti vs. 4 1080 Ti. These are my results:
1 GPU:
[screenshot: timing with 1 1080 Ti]

4 GPUs:
[screenshot: timing with 4 1080 Ti]

I only changed the batch size from 32 to 128; the dataset and model are the same. Why is the time 300 ms/batch on 4 1080 Ti instead of staying around 80 ms? When I increase the number of GPUs to 4, the amount of data per batch also increases by 4x, so shouldn't the time per batch be about the same? Can someone help me?
Here's the code:
https://github.com/ShichenLiu/CondenseNet

@younghe

younghe commented Oct 9, 2018

This is my test code:
```python
def validate(val_loader, model, criterion):
    batch_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    ### Switch to evaluate mode
    model.eval()

    end = time.time()
    initial = False
    for i, (input, target) in enumerate(val_loader):
        target = target.cuda(async=True)
        input_var = torch.autograd.Variable(input, volatile=True)
        target_var = torch.autograd.Variable(target, volatile=True)

        ### Compute output
        output = model(input_var)
        loss = criterion(output, target_var)

        ### Measure accuracy and record loss
        # prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
        prec1 = accuracy(output.data, target, topk=(1,))   # (1,) so that topk is a tuple
        losses.update(loss.data[0], input.size(0))
        top1.update(prec1[0], input.size(0))
        # top5.update(prec5[0], input.size(0))

        ### Measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            print('Test: [{0}/{1}]\t'
                  'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                  'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                      i, len(val_loader), batch_time=batch_time, loss=losses,
                      top1=top1, top5=top5))

        if not initial:
            initial = True
            batch_time = AverageMeter()   # reset so the warm-up batch doesn't skew the average
            print("Initial finished......")
            total_start = time.time()

    total_time = time.time() - total_start
    print("total time: %.5f" % total_time)
    print(' * Prec@1 {top1.avg:.3f} Prec@5 {top5.avg:.3f}'
          .format(top1=top1, top5=top5))

    return 100. - top1.avg, 100. - top5.avg
```
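
One thing I'm not sure about in this timing code: CUDA kernels are launched asynchronously, so `time.time()` right after the forward pass doesn't necessarily measure when the GPU work finished; whichever later operation forces a synchronization gets billed for it, which makes single- vs. multi-GPU comparisons murky. Calling `torch.cuda.synchronize()` before reading the clock should give a more honest per-batch number (a sketch in current PyTorch style, for measurement only, since the sync itself adds a little overhead):

```python
import time
import torch

model.eval()
torch.cuda.synchronize()
end = time.time()
with torch.no_grad():                  # modern replacement for volatile=True
    for i, (input, target) in enumerate(val_loader):
        output = model(input.cuda())
        loss = criterion(output, target.cuda(non_blocking=True))
        torch.cuda.synchronize()       # wait until this batch's GPU work has actually finished
        batch_time.update(time.time() - end)
        end = time.time()
```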

@mruberry
Collaborator

Closing this issue due to age. Please reopen if you're still experiencing this issue with a more recent PyTorch version.

@syamamo1

syamamo1 commented Apr 26, 2023

I am experiencing this issue currently. Long story short: an epoch with a batch size of 512 takes 4:50 on 1 Nvidia GTX 1080 Ti (without DataParallel). A batch size of 512 takes 7:30 on 4 Nvidia GTX 1080 Ti (using DataParallel). And a batch size of 512*4 = 2048 takes 6:10 on 4 Nvidia GTX 1080 Ti (using DataParallel). Any larger batch size overflows CUDA memory. So what is even the point of DataParallel if it makes training slower?
