Error with CUDA10 #2

Closed
jackonelli opened this issue Mar 5, 2019 · 5 comments

@jackonelli

I get a PyTorch/CUDA error when running experiments/train/run_swag.py with your setup and:
CUDA 10.0
NVIDIA driver: 410.79

This seems to be in no way your fault, see for instance here, but you should be aware that it affects this repo.

The proposed workaround is to change this setting:
torch.backends.cudnn.benchmark = False
With it, the error is still reported, but it no longer breaks the script, which continues on to training.
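
For anyone wanting to try the workaround, here is a minimal sketch of how the flag could be set before training starts. The helper name `disable_cudnn_benchmark` is hypothetical and not part of this repo; the only real API used is `torch.backends.cudnn.benchmark`. The import is guarded so the snippet also runs on a machine without PyTorch.

```python
import importlib


def disable_cudnn_benchmark():
    """Turn off cuDNN autotuning if PyTorch is installed.

    Returns the resulting flag value (False), or None when torch
    cannot be imported. Hypothetical helper, for illustration only.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    # Same setting as quoted above; call this before building the model.
    torch.backends.cudnn.benchmark = False
    return torch.backends.cudnn.benchmark


flag = disable_cudnn_benchmark()
print("cudnn.benchmark:", flag)
```

Presumably this would go near the top of run_swag.py, before the model is moved to the GPU, but I have not checked where the script sets the flag itself.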

$ python run_swag.py --data_path ./data/cifar --dir train --use_test --model VGG16

Preparing directory train
Using model VGG16
Loading dataset CIFAR10 from ./data/cifar
Files already downloaded and verified
You are going to run models on the test set. Are you sure?
Files already downloaded and verified
Preparing model

SGD training
**THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument **
Traceback (most recent call last):
  File "run_swag.py", line 172, in <module>
    train_res = utils.train_epoch(loaders['train'], model, criterion, optimizer)
  File "/home/jakob/dev/swa_gaussian/swag/utils.py", line 69, in train_epoch
    loss, output = criterion(model, input, target)
  File "/home/jakob/dev/swa_gaussian/swag/losses.py", line 7, in cross_entropy
    output = model(input)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jakob/dev/swa_gaussian/swag/models/vgg.py", line 57, in forward
    x = self.features(x)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jakob/dev/swa_gaussian/venv/lib/python3.6/site-packages/torch-1.0.1.post2-py3.6-linux-x86_64.egg/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
@wjmaddox
Owner

wjmaddox commented Mar 5, 2019

Hi,

I'll take a look at reproducing this issue next week. We ran several experiments with CUDA10 using this codebase and didn't have any issues.

@jackonelli
Author

jackonelli commented Mar 6, 2019

Interesting! That was my first guess since it was such a widely reported issue.
What nvidia driver are you using?

@jackonelli
Author

I have solved the issue now. The cause was that running your setup.py (at least on my system) makes pip install a PyTorch build for CUDA 9.
I had to update the torch package manually, and then the error disappeared.
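
In case it helps others hitting the same error, a quick way to see which CUDA version the installed torch package was built for is sketched below. `describe_torch_build` is a hypothetical helper name; `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()` are the existing PyTorch attributes it relies on. The import is guarded so the snippet does not fail where torch is missing.

```python
def describe_torch_build():
    """Report the installed torch version and the CUDA version it targets.

    Returns None when torch is not installed. Hypothetical helper,
    for illustration only.
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        # CUDA toolkit version the package was compiled against
        # (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # Whether a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


info = describe_torch_build()
print(info)
```

If `built_for_cuda` reports a 9.x version on a machine with CUDA 10.0 installed, that would match the mismatch described above, and reinstalling torch manually with a CUDA 10 build is what resolved it for me.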

@wjmaddox
Owner

wjmaddox commented Mar 20, 2019

Gotcha, I'll close the issue for now then and add a line mentioning manual installation.

@jackonelli
Author

Very good!
