NVIDIA: no NVIDIA devices found (THCudaCheck FAIL file=torch/csrc/autograd/engine.cpp) #1154

Closed
woofie56 opened this issue Mar 31, 2017 · 15 comments


@woofie56

Hi,

I did the following:

  1. Installed PyTorch on Ubuntu 16.04, using Anaconda2-4.3.1, via the command from http://pytorch.org/:
    conda install pytorch torchvision -c soumith

  2. Installed CUDA for Linux Ubuntu 16.04 x86_64 from:
    https://developer.nvidia.com/cuda-downloads
    There is no CUDA-capable GPU in the machine (it is a VirtualBox virtual machine).

  3. Ran autograd_tutorial.py from:
    http://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

and I got the following error (using pytorch-0.1.11-py27_2):

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]

<torch.autograd._functions.basic_ops.AddConstant object at 0x7feddc620220>
(Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]
, Variable containing:
 27
[torch.FloatTensor of size 1]
)
NVIDIA: no NVIDIA devices found
THCudaCheck FAIL file=torch/csrc/autograd/engine.cpp line=352 error=30 : unknown error
Traceback (most recent call last):
  File "autograd_tutorial.py", line 81, in <module>
    out.backward()
  File "/home/testuser/Anaconda2-4.3.1/lib/python2.7/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
RuntimeError: cuda runtime error (30) : unknown error at torch/csrc/autograd/engine.cpp:352

How do I get around not having a CUDA GPU card? Thanks

@useryc

useryc commented Mar 31, 2017

Hello,

I have the same problem. I'm trying to run on a machine with no GPU.

I installed with pip for Python 2.7 on Linux, as instructed for the no-CUDA option:

pip install http://download.pytorch.org/whl/cu75/torch-0.1.11.post4-cp27-none-linux_x86_64.whl
pip install torchvision

Error:

THCudaCheck FAIL file=torch/csrc/autograd/engine.cpp line=353 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/autograd/engine.cpp:353

Can someone help with this issue?

Thanks a lot.

@apaszke
Contributor

apaszke commented Mar 31, 2017

Why do you have the CUDA driver installed if you don't have a GPU?

@soumith
Member

soumith commented Mar 31, 2017

I've reproduced the issue on a node that has neither the driver nor CUDA.

The check Sam introduced here is wrong: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/engine.cpp#L353
It needs access to the driver.

We first have to check isDriverSufficient, and if the driver is not sufficient, we have to skip this case.

I'll fix this and rebuild binaries. Thanks for the report @woofie56
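
A minimal Python sketch of the ordering described above, not the actual engine.cpp code. The function and parameter names (count_usable_gpus, is_driver_sufficient, get_device_count) are hypothetical stand-ins; the point is only that the driver check happens before any call that needs the driver.

```python
# Hypothetical stand-ins: these names are illustrative, not PyTorch's API.
def count_usable_gpus(is_driver_sufficient, get_device_count):
    """Return the number of GPUs the engine should plan for.

    is_driver_sufficient: callable returning True when a usable CUDA driver exists.
    get_device_count: callable that queries the CUDA runtime; it may raise
    RuntimeError when no usable driver is present.
    """
    if not is_driver_sufficient():
        # No usable driver: skip the runtime query entirely and stay on CPU.
        return 0
    return get_device_count()


def broken_query():
    # Simulates the failure reported in this thread on a machine without a GPU.
    raise RuntimeError("cuda runtime error (30) : unknown error")


print(count_usable_gpus(lambda: False, broken_query))  # prints 0
print(count_usable_gpus(lambda: True, lambda: 2))      # prints 2
```

With the guard in place, the failing query is never reached on a CPU-only machine, so the backward pass can proceed without touching CUDA.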

@woofie56
Author

@soumith Thanks. I would love to play with this over the weekend, so if you could rebuild the binaries today that would be great.

@soumith
Member

soumith commented Mar 31, 2017

For sure. On it. Should be ready in 6 to 7 hours.

@woofie56
Author

@soumith Wicked 👍 ! Thanks

@ethancaballero

ethancaballero commented Mar 31, 2017

I had this same issue when installing PyTorch 0.1.11 build 4 on any AWS Linux CPU machine via
pip install http://download.pytorch.org/whl/cu75/torch-0.1.11.post4-cp35-cp35m-linux_x86_64.whl

Build number 5 remedies it:
pip install http://download.pytorch.org/whl/cu75/torch-0.1.11.post5-cp35-cp35m-linux_x86_64.whl

@woofie56
Author

@soumith Hi, I assumed that the package had been updated, so I tried installing via conda install pytorch torchvision -c soumith and got the following error message:

The following packages will be UPDATED:

pytorch: 0.1.11-py27_2 soumith --> 0.1.11-py27_5 soumith

Proceed ([y]/n)? y



CondaError: dist_name is not a valid conda package: c

Thanks.

@soumith
Member

soumith commented Mar 31, 2017

This should now be fixed with the new binaries.

@woofie56 I'm not sure why you see that error, but please try: conda install conda to update your conda first.

@soumith soumith closed this as completed Mar 31, 2017
@woofie56
Author

woofie56 commented Apr 1, 2017

@soumith Hi, I created a new Linux user account, and this time installed Anaconda3-4.3.1-Linux-x86_64.sh (Python 3.6). However, I still get the same problem:

conda install pytorch -c soumith
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/tftest2/anaconda3:

The following NEW packages will be INSTALLED:

    pytorch: 0.1.11-py36_5 soumith

Proceed ([y]/n)? y

pytorch-0.1.11 100% |################################| Time: 0:18:07 255.07 kB/s

Then, when I run the tutorial, I get the following error message:

> python autograd_tutorial.py 

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]

<torch.autograd._functions.basic_ops.AddConstant object at 0x7fdb51c43748>
Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]
 Variable containing:
 27
[torch.FloatTensor of size 1]

NVIDIA: no NVIDIA devices found
THCudaCheck FAIL file=torch/csrc/autograd/engine.cpp line=359 error=30 : unknown error
Traceback (most recent call last):
  File "autograd_tutorial.py", line 81, in <module>
    out.backward()
  File "/home/tftest2/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
RuntimeError: cuda runtime error (30) : unknown error at torch/csrc/autograd/engine.cpp:359

I also get the same error when I install via pip:

>pip install http://download.pytorch.org/whl/cu75/torch-0.1.11.post5-cp36-cp36m-linux_x86_64.whl

Collecting torch==0.1.11.post5 from http://download.pytorch.org/whl/cu75/torch-0.1.11.post5-cp36-cp36m-linux_x86_64.whl
  Downloading http://download.pytorch.org/whl/cu75/torch-0.1.11.post5-cp36-cp36m-linux_x86_64.whl (343.0MB)
    100% |████████████████████████████████| 343.0MB 353kB/s 
Requirement already satisfied: pyyaml in /home/tftest2/anaconda3/lib/python3.6/site-packages (from torch==0.1.11.post5)
Installing collected packages: torch
Successfully installed torch-0.1.11.post5

Should I be using the Python 3.5 version?

@soumith
Member

soumith commented Apr 1, 2017

@woofie56 I've tested the binaries on two Linux machines with no GPUs, and I ran this script:
http://pytorch.org/tutorials/_downloads/autograd_tutorial.py

They seem to run fine.

I presume you did one of the following:

  • modified the script in some way to add .cuda() calls
  • proactively installed CUDA on your VM. I am not sure what happened, but it is probably a botched install that is screwing things up. For example, I don't know what happens when you install the NVIDIA driver on a machine with no NVIDIA GPUs.

Is either of these what you did? If so, can you revert it?
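
For the first case, a script can guard its .cuda() calls instead of calling them unconditionally. This is a minimal sketch: cuda_is_usable and maybe_cuda are hypothetical helper names, and it assumes torch.cuda.is_available() reports the machine correctly (with a botched driver install it may not, which is why any exception is treated as "no GPU").

```python
def cuda_is_usable():
    """Return True only if PyTorch is importable and reports a usable GPU."""
    try:
        import torch
        return bool(torch.cuda.is_available())
    except Exception:
        # Missing PyTorch or a broken driver install: treat as CPU-only.
        return False


def maybe_cuda(obj, use_cuda):
    """Move obj to the GPU via its .cuda() method only when use_cuda is True."""
    return obj.cuda() if use_cuda else obj


if __name__ == "__main__":
    use_cuda = cuda_is_usable()
    print("using GPU" if use_cuda else "using CPU")
```

In a script, each tensor or model would then be wrapped as x = maybe_cuda(x, use_cuda), so the same file runs on machines with and without a GPU.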

@woofie56
Author

woofie56 commented Apr 1, 2017

@soumith: Hi, thanks for the reply. I uninstalled the CUDA drivers and reinstalled PyTorch, but that didn't help. Maybe I will try reinstalling Anaconda from scratch (since I didn't reinstall it after removing the CUDA drivers).

If this fails, I will make a fresh ubuntu virtual machine installation and try again in that. Thanks

@woofie56
Author

woofie56 commented Apr 1, 2017

@soumith Hi, it turned out that despite uninstalling NVIDIA from Ubuntu, I still had nvidia-375 installed (maybe from an earlier attempt to install the NVIDIA drivers). When I removed this and reinstalled Anaconda and PyTorch, everything worked.

Thanks for taking the time out of your weekend to help me out.
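
For anyone checking for a leftover driver package like this, here is a small sketch that lists installed packages with "nvidia" in the name. It assumes a Debian/Ubuntu system where `dpkg -l` is available; the helper names are made up for this example.

```python
import subprocess


def installed_nvidia_packages(dpkg_listing):
    """Return names of installed packages whose name contains 'nvidia'.

    dpkg_listing is the text printed by `dpkg -l`; installed packages are the
    rows whose first column is 'ii'.
    """
    names = []
    for line in dpkg_listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ii" and "nvidia" in fields[1].lower():
            names.append(fields[1])
    return names


def read_dpkg_listing():
    """Run `dpkg -l`; return an empty string on systems without dpkg."""
    try:
        return subprocess.check_output(["dpkg", "-l"]).decode("utf-8", "replace")
    except (OSError, subprocess.CalledProcessError):
        return ""


if __name__ == "__main__":
    for name in installed_nvidia_packages(read_dpkg_listing()):
        print(name)
```

An empty result on a CPU-only machine suggests no NVIDIA driver package is still registered as installed.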

@wddabc

wddabc commented Apr 14, 2017

Hi @soumith,
I ran into the same issue. It looks like the problem in my case does come from the second case you mentioned, "install the nvidia driver on a machine with no NVIDIA GPUs". But I find it hard to work around.

In my case, I'm using the queuing system on a server; some queues are GPU queues and some are CPU queues. They share the same CUDA installation. When I submit my jobs to CPU queues with no NVIDIA GPUs, this error occurs.

@wddabc

wddabc commented Apr 14, 2017

Hello, I think I found where the issue comes from.

I'd like to ask whether there is a particular reason for using a blacklist like https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/engine.cpp#L357 to block the GPU. I understand the logic here: if the error code == 35 (CUDA driver version is insufficient for CUDA runtime version), then fall back to CPU. But shouldn't it be a whitelist, allowing only error code == 0 to pass and otherwise falling back to CPU?

The reason for my suggestion is that I have a weird situation where the error code is unknown (30), and it crashed the whole program at https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/engine.cpp#L360. I've already asked my server admin how this error code could happen (something is apparently wrong with the system setup). But might I suggest changing this to a whitelist to make it more robust? Any concerns about that?
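
To make the two policies concrete, here is a rough Python sketch of the difference. It is not the engine.cpp code; the function names are hypothetical, and the only error codes used are the ones that appear in the logs in this thread (0 for success, 30, and 35).

```python
CUDA_SUCCESS = 0
CUDA_ERROR_UNKNOWN = 30              # "unknown error" in the logs above
CUDA_ERROR_INSUFFICIENT_DRIVER = 35  # "CUDA driver version is insufficient ..."


def devices_blacklist(err, num_devices):
    """Fall back to CPU only for the one known-bad code; raise on any other error."""
    if err == CUDA_ERROR_INSUFFICIENT_DRIVER:
        return 0
    if err != CUDA_SUCCESS:
        raise RuntimeError("cuda runtime error (%d)" % err)
    return num_devices


def devices_whitelist(err, num_devices):
    """Use the GPU count only on success; treat every other code as 'no GPU'."""
    return num_devices if err == CUDA_SUCCESS else 0


for policy in (devices_blacklist, devices_whitelist):
    try:
        print(policy.__name__, "->", policy(CUDA_ERROR_UNKNOWN, 0))
    except RuntimeError as exc:
        print(policy.__name__, "raised:", exc)
```

With error 30, the blacklist version raises while the whitelist version returns 0 devices. One possible trade-off, stated as an assumption rather than a known maintainer concern: a whitelist would also silently fall back to CPU on a machine whose GPU setup is genuinely broken.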
