
Data Parallel slows things down - ResNet 1001 #3917

Closed
al3xsh opened this issue Nov 28, 2017 · 15 comments

@al3xsh

al3xsh commented Nov 28, 2017

Hi,

I am attempting to train ResNet 1001 on CIFAR-10 across multiple GPUs (I have 4 x Titan X Maxwell cards) using DataParallel. However, it trains faster on one GPU than on multiple GPUs. I have attached screenshots below.

4 GPUs:
[screenshot: training output with 4 GPUs]

1 GPU:
[screenshot: training output with a single GPU]

(For reference, the torch implementation trains a batch in around 0.4 - 0.45s).

Using nvidia-smi I can verify that it is using multiple GPUs:

[screenshot: nvidia-smi output showing all four GPUs in use]

But I'm not seeing the speed-up I was expecting ...

My code uses:

net = torch.nn.DataParallel(net).cuda()

to set up the network across all the available GPUs, and

input = input.cuda(async=True)
target = target.cuda(async=True)
input_var = torch.autograd.Variable(input)
target_var = torch.autograd.Variable(target)

to send the variables to the GPUs.

The full code is here: https://github.com/al3xsh/pytorch-models
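
For reference, the core of my training loop looks roughly like this (a simplified sketch; `build_resnet_1001()` and `train_loader` stand in for the real model constructor and CIFAR-10 loader in the repo, and the hyperparameters here are only illustrative):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

net = build_resnet_1001()                 # placeholder for the actual model constructor
net = nn.DataParallel(net).cuda()         # replicate the model over all visible GPUs

criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

for input, target in train_loader:        # train_loader: the CIFAR-10 DataLoader
    input = input.cuda(async=True)        # `async=True` is the 0.2-era name for non_blocking=True
    target = target.cuda(async=True)
    input_var = Variable(input)
    target_var = Variable(target)

    output = net(input_var)               # DataParallel scatters the batch across the GPUs
    loss = criterion(output, target_var)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```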

Can anybody tell me what I'm missing? (I'm sure it's something obvious :)

Regards,

Alex

NB: I am using version 0.2.0_4 and Python 3.6, installed via the conda instructions on the installation page.

@zhengyunqq

I also ran into a similar problem before, and I don't know why.

@apaszke
Contributor

apaszke commented Nov 28, 2017

My guess would be that you're using a small batch size (64, so 16 per GPU), so the GPUs aren't being used effectively at 100%, but you still have to pay the communication costs, which makes it slower. Check how the run times change if you increase the batch size (you still have plenty of free memory).
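
Something along these lines should show whether the per-iteration time actually improves with a bigger batch (a rough sketch only; `make_loader(batch_size)` is a placeholder for however you build the CIFAR-10 loader, and `net`, `criterion`, `optimizer` are the ones from your script):

```python
import time
import torch
from torch.autograd import Variable

def time_iterations(net, loader, criterion, optimizer, n_iters=20):
    # Average seconds per training iteration, skipping the first (warm-up) one.
    times = []
    for i, (input, target) in enumerate(loader):
        if i >= n_iters:
            break
        torch.cuda.synchronize()          # don't count leftover GPU work from the previous step
        start = time.time()
        output = net(Variable(input.cuda()))
        loss = criterion(output, Variable(target.cuda()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()          # CUDA launches are asynchronous; wait before stopping the clock
        if i > 0:
            times.append(time.time() - start)
    return sum(times) / len(times)

for bs in (64, 128, 256):
    loader = make_loader(batch_size=bs)   # placeholder helper returning a DataLoader
    print("batch size %d: %.3fs per iteration"
          % (bs, time_iterations(net, loader, criterion, optimizer)))
```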

@al3xsh
Author

al3xsh commented Nov 28, 2017

@apaszke

Thanks for the quick response! I used a batch size of 64 as that is what they used in the torch implementation. In fact Kaiming He has shown that, in their experiments, a minibatch size of 64 actually achieves better results than 128!

(Which was obviously unexpected :)

Increasing the batch size to 128 gives me roughly the same time per batch (1.4 s) as a batch size of 64 (which will obviously halve the time per epoch!).

I was just a bit confused as to why the speed up was so much worse than the torch implementation - any ideas?

Regards,

Alex

@apaszke
Contributor

apaszke commented Nov 28, 2017

@al3xsh on the other hand, there's the "ImageNet in 1 hour" paper that shows you can safely increase the batch size up to ~8k and achieve pretty much the same accuracy, as long as you scale the learning rate linearly with the batch size.

It's hard to say what the reason for the slowdown is. It might be Python overhead, but I'd need to jump into the profiler to say for sure. BTW, if you install the master branch from source you might see an improvement.
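
If you want to look at it yourself in the meantime, the autograd profiler can break an iteration down per op. A minimal sketch (assuming `net`, `loader`, `criterion` and `optimizer` from your training script, and a PyTorch version recent enough to ship `torch.autograd.profiler`):

```python
import torch
from torch.autograd import profiler

with profiler.profile(use_cuda=True) as prof:
    for i, (input, target) in enumerate(loader):
        if i == 5:                        # a handful of iterations is enough for a profile
            break
        output = net(torch.autograd.Variable(input.cuda()))
        loss = criterion(output, torch.autograd.Variable(target.cuda()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Ops sorted by total CUDA time; if copies/broadcasts dominate, the overhead is
# communication rather than compute.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```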

@apaszke
Contributor

apaszke commented Nov 28, 2017

I ran your network through a profiler and identified a few bottlenecks. In general, we should be able to cut the iteration time by around a half in the multi-GPU case, and by around a third in the single-GPU case. We have some initial plans sketched out; now we only need to implement them. Thanks for the report!

@apaszke
Contributor

apaszke commented Nov 29, 2017

I opened #3929 and #3930 that should fix the problems in your code. I'll keep this issue open until they are solved, as a reference.

@al3xsh
Author

al3xsh commented Nov 29, 2017

@apaszke thanks for the prompt and detailed response!

I have noticed that the first batch of the initial epoch takes a very long time (I assume because it's setting up and broadcasting all the initial parameters over multiple GPUs), and the time comes down after a couple of batches.
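
A rough way to separate that warm-up cost from the steady-state time (the first iterations also pay for CUDA context creation and cuDNN algorithm selection, not just the parameter broadcast) is to time the first few iterations individually; a sketch, with `net`, `train_loader`, `criterion` and `optimizer` as in my script:

```python
import time
import torch
from torch.autograd import Variable

for i, (input, target) in enumerate(train_loader):
    torch.cuda.synchronize()
    start = time.time()
    output = net(Variable(input.cuda(async=True)))
    loss = criterion(output, Variable(target.cuda(async=True)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()      # wait for the queued GPU work before reading the clock
    print("iteration %d: %.3fs" % (i, time.time() - start))
    if i == 10:                   # the first one or two readings should be the outliers
        break
```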

Thanks for all the assistance! If you need any help with testing implementations then let me know!

Regards,

Alex

@soumith soumith added this to distributed/multiGPU in Issue Categories Dec 1, 2017
@chenyangh

@apaszke Thanks for explaining DataParallel. From what I understand so far, one of the biggest advantages of DataParallel is that it lets a model run with a larger minibatch size. Say my GPU0 is only capable of a batch size of 200; by using DataParallel on two GPUs, it should handle roughly 400. Am I right? But in my experiments, I got a GPU out-of-memory error immediately after increasing the batch size. My GPU memory usage is about 90% on GPU0 and 20% on the others.
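
To be concrete, this toy sketch is how I understand the scatter/gather (not my actual model, just an illustration):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

class ChunkReporter(nn.Module):
    # Toy module: report which GPU each replica runs on and how big its input chunk is.
    def forward(self, x):
        print("GPU %d got a chunk of size %s"
              % (torch.cuda.current_device(), list(x.size())))
        return x.sum(1)

model = nn.DataParallel(ChunkReporter()).cuda()
inp = Variable(torch.randn(400, 8).cuda())      # one batch of 400 samples, placed on GPU 0
out = model(inp)                                # each GPU should report roughly 400 / n_gpus samples
print("gathered output size:", list(out.size()))   # results are gathered back onto GPU 0
```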

@apaszke
Contributor

apaszke commented Dec 8, 2017

@chenyangh is your code public? We've been getting complaints like this, but couldn't really reproduce the problem.

@chenyangh

@apaszke I think I just found the reason. My code is still a mess and I am ashamed of sharing it right now. :(
The uneven GPU usage in my case is caused by some intermediate tensors in the main loop. I don't quite know how you manage variables that have lost their references, but they seem to stay in GPU memory unless I specifically `del` them.
My main loop looks like this:

        for i, (data, label) in enumerate(data_loader):
            decoder_logit = model(Variable(data).cuda())
            optimizer.zero_grad()
            loss = loss_criterion(decoder_logit, label)
            train_loss_sum += loss.data[0]   # already a Python float, so no graph is kept through it
            loss.backward()
            optimizer.step()
            # del loss, decoder_logit        # uncommenting this keeps the memory usage flat

If I don't delete `loss` and `decoder_logit`, the memory usage keeps increasing on the main GPU until it is full (with or without DataParallel). But with DataParallel, `loss` and `decoder_logit` are not on the 'slave' GPUs, so the memory usage on the slave GPUs is far lower than on the master GPU.

I think I can say this issue is caused by a memory leak in the main thread.
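
For anyone hitting the same thing on newer PyTorch versions, `torch.cuda.memory_allocated()` makes it easy to watch whether memory really grows; a sketch using the same names as the loop above (`loss.item()` plays the role of `loss.data[0]` there):

```python
import torch

train_loss_sum = 0.0
for i, (data, label) in enumerate(data_loader):
    decoder_logit = model(data.cuda())
    optimizer.zero_grad()
    loss = loss_criterion(decoder_logit, label.cuda())
    train_loss_sum += loss.item()    # accumulate a plain float, not the tensor
    loss.backward()
    optimizer.step()
    del loss, decoder_logit          # drop the last references so the graph can be freed
    if i % 50 == 0:
        print("iter %d: %.1f MB allocated on GPU 0"
              % (i, torch.cuda.memory_allocated(0) / 1024 ** 2))
```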

@apaszke
Contributor

apaszke commented Jan 10, 2018

They aren't on the other GPUs, but the graph on them should be alive as long as those outputs are. A minimal snippet that would help us reproduce it would still be helpful.

@younghe

younghe commented Oct 9, 2018

I tested CondenseNet on 1 1080 Ti vs. 4 1080 Ti. These are my results:
1 GPU:
[screenshot: timing with 1 1080 Ti]

4 GPUs:
[screenshot: timing with 4 1080 Ti]

I only changed the batch size from 32 to 128; the dataset and model are the same. Why is the time 300 ms/batch on 4 1080 Ti instead of staying around 80 ms? When I increase the number of GPUs to 4, the amount of data per batch also increases by 4x, so shouldn't the time per batch be about the same? Can someone help me?
Here's the code:
https://github.com/ShichenLiu/CondenseNet

@younghe

younghe commented Oct 9, 2018

This is my test code:
```python
def validate(val_loader, model, criterion):
    batch_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    ### Switch to evaluate mode
    model.eval()

    end = time.time()
    initial = False
    for i, (input, target) in enumerate(val_loader):
        target = target.cuda(async=True)
        input_var = torch.autograd.Variable(input, volatile=True)
        target_var = torch.autograd.Variable(target, volatile=True)

        ### Compute output
        output = model(input_var)
        loss = criterion(output, target_var)

        ### Measure accuracy and record loss
        # prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
        prec1 = accuracy(output.data, target, topk=(1,))   # (1,) so that topk is a tuple
        losses.update(loss.data[0], input.size(0))
        top1.update(prec1[0], input.size(0))
        # top5.update(prec5[0], input.size(0))

        ### Measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            print('Test: [{0}/{1}]\t'
                  'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                  'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                      i, len(val_loader), batch_time=batch_time, loss=losses,
                      top1=top1, top5=top5))

        if not initial:
            initial = True
            batch_time = AverageMeter()   # reset so the warm-up batch doesn't skew the average
            print("Initial finished......")
            total_start = time.time()

    total_time = time.time() - total_start
    print("total time: %.5f" % total_time)
    print(' * Prec@1 {top1.avg:.3f} Prec@5 {top5.avg:.3f}'
          .format(top1=top1, top5=top5))

    return 100. - top1.avg, 100. - top5.avg
```
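
One thing I'm not sure about in this timing code: CUDA kernels are launched asynchronously, so `time.time()` right after the forward pass doesn't necessarily measure when the GPU work finished; whichever later operation forces a synchronization gets billed for it, which makes single- vs. multi-GPU comparisons murky. Calling `torch.cuda.synchronize()` before reading the clock should give a more honest per-batch number (a sketch in current PyTorch style, for measurement only, since the sync itself adds a little overhead):

```python
import time
import torch

model.eval()
torch.cuda.synchronize()
end = time.time()
with torch.no_grad():                  # modern replacement for volatile=True
    for i, (input, target) in enumerate(val_loader):
        output = model(input.cuda())
        loss = criterion(output, target.cuda(non_blocking=True))
        torch.cuda.synchronize()       # wait until this batch's GPU work has actually finished
        batch_time.update(time.time() - end)
        end = time.time()
```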

@mruberry
Collaborator

Closing this issue due to age. Please reopen if you're still experiencing this issue with a more recent PyTorch version.

@syamamo1

syamamo1 commented Apr 26, 2023

I am experiencing this issue currently. Long story short: an epoch with a batch size of 512 takes 4:50 on 1 Nvidia GTX 1080 Ti (without DataParallel). A batch size of 512 takes 7:30 on 4 Nvidia GTX 1080 Ti (using DataParallel). And a batch size of 512*4 = 2048 takes 6:10 on 4 Nvidia GTX 1080 Ti (using DataParallel). Any larger batch size overflows CUDA memory. So what is even the point of DataParallel if it makes training slower?
