
Model is moved to GPU after the optimizer is instantiated, resulting in a performance hit. #82

Closed
schlabrendorff opened this issue Nov 26, 2020 · 4 comments

Comments

@schlabrendorff

I noticed that the optimizer is instantiated before the model is moved to the GPU.

This is contrary to the PyTorch docs:

If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.

In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
-- https://pytorch.org/docs/stable/optim.html#how-to-use-an-optimizer

I noticed the problem on my machine because I had fluctuating GPU utilization (checked with nvtop): it jumped every couple of seconds from 10-20% to 80% and back. Moving the model to CUDA beforehand (in train.py) fixed the issue for me; afterwards, utilization never dropped below 70%.

model = config.init_obj('arch', module_arch)
model.cuda()
...
optimizer = config.init_obj('optimizer', torch.optim, trainable_params)
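
As a quick sanity check (a minimal sketch, not part of the template; the helper name is mine), one can assert right after constructing the optimizer that every parameter it holds already lives on the GPU:

def assert_optimizer_on_gpu(optimizer):
    # All tensors registered in the optimizer's param groups should already be CUDA tensors.
    for group in optimizer.param_groups:
        for p in group['params']:
            assert p.is_cuda, 'optimizer was constructed before the model was moved to the GPU'

# e.g. right after the optimizer line above:
# assert_optimizer_on_gpu(optimizer)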

Can you reproduce the behavior?

@SunQpark
Collaborator

Thank you for reporting this, @schlabrendorff.
I've been completely unaware of this issue for such a long time.

I'll look into it more and open a PR to handle this soon.

@SunQpark
Collaborator

SunQpark commented Nov 29, 2020

Reproduced the issue with the script below.
It seems the default LeNet model from the MNIST example is too small to reproduce it.

import time
import torch
import torch.nn as nn
from torchvision.models import resnet152
from tqdm import tqdm


device = torch.device('cuda')

def run_training(model, optimizer, steps, batch_size=10):
    data = torch.randn(batch_size, 3, 224, 224).to(device)
    target = torch.LongTensor([1]*batch_size).to(device)
    # NLLLoss on the raw resnet logits: the loss value is not meaningful,
    # but that does not matter for a timing-only benchmark.
    loss_fn = nn.NLLLoss()

    for _ in tqdm(range(steps)):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()

def optim_after_cuda(steps, batch_size):
    # Recommended order: move the model to the GPU first, then build the optimizer.
    print('optim_after_cuda')
    model = resnet152()
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())

    run_training(model, optimizer, steps, batch_size)

def optim_before_cuda(steps, batch_size):
    # Problematic order: the optimizer is built before the model is moved to the GPU.
    print('optim_before_cuda')
    model = resnet152()
    optimizer = torch.optim.Adam(model.parameters())
    model = model.to(device)

    run_training(model, optimizer, steps, batch_size)

if __name__ == '__main__':
    steps = 200
    batch_size = 16

    start = time.time()
    optim_before_cuda(steps, batch_size)    # problematic order (optimizer built before .to(device))
    # optim_after_cuda(steps, batch_size)   # recommended order
    end = time.time()

    print(end - start)

The difference in overall speed was small, but GPU utilization was noticeably unstable when the optimizer was initialized before model.to(device).
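
For a slightly tighter comparison (a hedged sketch that assumes both functions above are in scope; not part of this repo), torch.cuda.synchronize() can be called around the measurement so the timer also covers GPU work that is still queued, and both variants can be timed in one run:

def timed(fn, steps=200, batch_size=16):
    # Wait for pending GPU work before starting and stopping the timer.
    torch.cuda.synchronize()
    start = time.time()
    fn(steps, batch_size)
    torch.cuda.synchronize()
    return time.time() - start

# Note: the first call also pays one-time CUDA/cudnn initialization cost,
# so warm up or run each variant in a separate process for a fair comparison.
for fn in (optim_after_cuda, optim_before_cuda):
    print(fn.__name__, '%.2fs' % timed(fn))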

@SunQpark
Collaborator

@schlabrendorff I've opened PR #83 to handle this issue.
Could you please review the changes?

@schlabrendorff
Author

@SunQpark Looks good!! Thank you!
