Error with CUDA10 #2

jackonelli · 2019-03-05T15:32:14Z

I get a PyTorch/CUDA error when running experiments/train/run_swag.py with your setup on:
CUDA 10.0
nvidia driver: 410.79

This seems to be in no way your fault, see for instance here, but you should be aware that it affects this repo.

The proposed workaround is to change this setting:
torch.backends.cudnn.benchmark = False
With that change the error is still reported, but it no longer breaks the script and training continues.
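For reference, a minimal sketch of applying that setting before the model is built. The helper name disable_cudnn_benchmark is hypothetical and not part of this repo; only the torch.backends.cudnn.benchmark flag comes from the workaround above.

```python
def disable_cudnn_benchmark():
    """Turn off cuDNN autotuning. Returns False if PyTorch cannot be imported.

    Hypothetical helper for illustration; it is not part of this repo.
    """
    try:
        import torch
    except ImportError:
        return False
    # The flag named in the workaround above.
    torch.backends.cudnn.benchmark = False
    return True


if __name__ == "__main__":
    print("cuDNN benchmark disabled:", disable_cudnn_benchmark())
```

Presumably this has to run before the first forward pass to have any effect, for example near the top of run_swag.py.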

$ python run_swag.py --data_path ./data/cifar --dir train --use_test --model VGG16

Preparing directory train
Using model VGG16
Loading dataset CIFAR10 from ./data/cifar
Files already downloaded and verified
You are going to run models on the test set. Are you sure?
Files already downloaded and verified
Preparing model

SGD training
**THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument **
Traceback (most recent call last):
  File "run_swag.py", line 172, in <module>
    train_res = utils.train_epoch(loaders['train'], model, criterion, optimizer)
  File "/home/jakob/dev/swa_gaussian/swag/utils.py", line 69, in train_epoch
    loss, output = criterion(model, input, target)
  File "/home/jakob/dev/swa_gaussian/swag/losses.py", line 7, in cross_entropy
    output = model(input)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jakob/dev/swa_gaussian/swag/models/vgg.py", line 57, in forward
    x = self.features(x)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405


wjmaddox · 2019-03-05T18:43:30Z

Hi,

I'll take a look at reproducing this issue next week. We ran several experiments with CUDA10 using this codebase and didn't have any issues.

jackonelli · 2019-03-06T07:23:40Z

Interesting! That was my first guess since it was such a widely reported issue.
What nvidia driver are you using?

jackonelli · 2019-03-20T07:58:28Z

I have solved the issue now. The cause was that running your setup.py (at least on my system) makes pip install a PyTorch build for CUDA 9.
I had to update the torch package manually, and then the error disappeared.
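For anyone hitting the same thing, a quick way to see which CUDA version the installed torch package was built against is torch.version.cuda. The sketch below compares its major version with the one on the system. The helper names and the version strings used in the examples are illustrative assumptions, and a mismatch is only a hint, not proof of a problem.

```python
def cuda_major(version):
    """Return the major CUDA version from a string such as '9.0.176', or None."""
    if not version:
        return None
    return int(str(version).split(".")[0])


def build_matches_system(torch_cuda, system_cuda):
    """True if both version strings share the same major CUDA version."""
    built = cuda_major(torch_cuda)
    installed = cuda_major(system_cuda)
    return built is not None and built == installed


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch version:", torch.__version__)
        # torch.version.cuda is None for CPU-only builds.
        print("Built against CUDA:", torch.version.cuda)
        # "10.0" is the system CUDA version reported at the top of this issue.
        print("Matches CUDA 10.0:", build_matches_system(torch.version.cuda, "10.0"))
```

If the reported build does not match, reinstalling torch manually with a build for the right CUDA version is what resolved it here.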

wjmaddox · 2019-03-20T13:29:32Z

Gotcha, I'll close the issue for now then and add a line mentioning manual installation.

jackonelli · 2019-03-21T16:45:26Z

Very good!

wjmaddox closed this as completed Mar 20, 2019
