
Model is moved to GPU after the optimizer is instantiated, resulting in a performance hit. #82

Closed
schlabrendorff opened this issue Nov 26, 2020 · 4 comments

Comments

@schlabrendorff

I noticed that the optimizer is instantiated before the model is moved to the GPU.

This is contrary to the PyTorch docs:

If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.

In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
-- https://pytorch.org/docs/stable/optim.html#how-to-use-an-optimizer

I noticed the problem on my machine because I had fluctuating GPU utilization (checked with nvtop): it jumped every couple of seconds from 10-20% to 80% and back. Moving the model to CUDA beforehand (in train.py) fixed the issue for me; afterwards, utilization never dropped below 70%.

model = config.init_obj('arch', module_arch)
model.cuda()
...
optimizer = config.init_obj('optimizer', torch.optim, trainable_params)
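
As a quick sanity check (a minimal sketch, not part of the template; the helper name is mine), one can assert right after constructing the optimizer that every parameter it holds already lives on the GPU:

def assert_optimizer_on_gpu(optimizer):
    # All tensors registered in the optimizer's param groups should already be CUDA tensors.
    for group in optimizer.param_groups:
        for p in group['params']:
            assert p.is_cuda, 'optimizer was constructed before the model was moved to the GPU'

# e.g. right after the optimizer line above:
# assert_optimizer_on_gpu(optimizer)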

Can you reproduce the behavior?

@SunQpark
Collaborator

Thank you for reporting this, @schlabrendorff.
I've been completely unaware of this issue for such a long time.

I'll look into it more and open a PR to handle this soon.

@SunQpark
Collaborator

SunQpark commented Nov 29, 2020

Reproduced the issue with the script below.
It seems the default LeNet model from the MNIST example is too small to reproduce it.

import time
import torch
import torch.nn as nn
from torchvision.models import resnet152
from tqdm import tqdm


device = torch.device('cuda')

def run_training(model, optimizer, steps, batch_size=10):
    data = torch.randn(batch_size, 3, 224, 224).to(device)
    target = torch.LongTensor([1]*batch_size).to(device)
    # NLLLoss on the raw resnet logits: the loss value is not meaningful,
    # but that does not matter for a timing-only benchmark.
    loss_fn = nn.NLLLoss()

    for _ in tqdm(range(steps)):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()

def optim_after_cuda(steps, batch_size):
    # Recommended order: move the model to the GPU first, then build the optimizer.
    print('optim_after_cuda')
    model = resnet152()
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())

    run_training(model, optimizer, steps, batch_size)

def optim_before_cuda(steps, batch_size):
    # Problematic order: the optimizer is built before the model is moved to the GPU.
    print('optim_before_cuda')
    model = resnet152()
    optimizer = torch.optim.Adam(model.parameters())
    model = model.to(device)

    run_training(model, optimizer, steps, batch_size)

if __name__ == '__main__':
    steps = 200
    batch_size = 16

    start = time.time()
    optim_before_cuda(steps, batch_size)    # problematic order (optimizer built before .to(device))
    # optim_after_cuda(steps, batch_size)   # recommended order
    end = time.time()

    print(end - start)

The difference in overall speed was small, but GPU utilization was noticeably unstable when the optimizer was initialized before model.to(device).
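
For a slightly tighter comparison (a hedged sketch that assumes both functions above are in scope; not part of this repo), torch.cuda.synchronize() can be called around the measurement so the timer also covers GPU work that is still queued, and both variants can be timed in one run:

def timed(fn, steps=200, batch_size=16):
    # Wait for pending GPU work before starting and stopping the timer.
    torch.cuda.synchronize()
    start = time.time()
    fn(steps, batch_size)
    torch.cuda.synchronize()
    return time.time() - start

# Note: the first call also pays one-time CUDA/cudnn initialization cost,
# so warm up or run each variant in a separate process for a fair comparison.
for fn in (optim_after_cuda, optim_before_cuda):
    print(fn.__name__, '%.2fs' % timed(fn))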

@SunQpark
Collaborator

@schlabrendorff I've opened PR #83 to handle this issue.
Could you please review the changes?

@schlabrendorff
Author

@SunQpark Looks good!! Thank you!
