
Can not use .cuda() function to load the model into GPU using Pytorch 1.3 #27738

Closed
phongnhhn92 opened this issue Oct 11, 2019 · 56 comments
Labels: high priority, module: binaries (Anything related to official binaries that we release to users)

phongnhhn92 commented Oct 11, 2019

🐛 Bug

I am trying to run the Captum CIFAR10 example (link) and want to test it on the GPU, so I modified a line to net = Net().cuda() to load the model onto the GPU (I have a single RTX 2080 Ti). However, I got this error:

AssertionError: 
The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.

At the moment I am using NVIDIA driver version 410. I tried upgrading the NVIDIA GPU driver to version 435; I no longer see that error, but the code gets stuck trying to load the model onto the GPU.

To Reproduce

Steps to reproduce the behavior:

  1. Upgrade to the latest PyTorch version using this command: conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
  2. Download the Captum CIFAR10 example code and modify the line that loads the network onto the GPU (see the sketch below).
  3. Run the code and observe the error.
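
For context, a minimal sketch of the kind of modification described in step 2. The Net class below is a stand-in, not the tutorial's actual model; only the final .cuda() call matters here:

import torch
import torch.nn as nn

# Stand-in for the tutorial's Net; the real definition lives in the Captum CIFAR10 example.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3 * 32 * 32, 10)

    def forward(self, x):
        return self.fc(x.flatten(1))

net = Net().cuda()  # the modified line; this is the call that raises the driver assertion (or hangs)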

Environment

Collecting environment information...
PyTorch version: 1.3.0
Is debug build: No
CUDA used to build PyTorch: 10.1.243

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: No
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 410.104
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.0
[pip3] numpydoc==0.7.0
[conda] _tflow_select 2.3.0 mkl
[conda] blas 1.0 mkl
[conda] captum 0.1.0 0 pytorch
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.14 py37ha843d7b_0
[conda] mkl_random 1.0.2 py37hd81dba3_0
[conda] pytorch 1.3.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] tensorflow 1.14.0 mkl_py37h45c423b_0
[conda] tensorflow-base 1.14.0 mkl_py37h7ce6ba3_0
[conda] torchtext 0.4.0 pypi_0 pypi
[conda] torchvision 0.4.1 py37_cu101 pytorch

cc @ezyang @gchanan @zou3519

@mazzma12

Same issue here. I think it is related to the 1.3.0 release. Installing 1.2.0 solved it for me.

@phongnhhn92 (Author)

I think the problem is related to the CUDA version, since the new PyTorch 1.3 is built with CUDA 10.1.243 and my current CUDA version is 10.1.168 (I installed it from the conda package). I guess I have to wait until the cudatoolkit conda package is updated to the new version. Another solution is installing CUDA 10.1.243 manually.
Btw, the Captum code works perfectly with pytorch 1.2 :D
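
For anyone checking whether they are hitting the same mismatch, a minimal sketch of the relevant version checks (torch.version.cuda reports the CUDA version the binary was built with; conda list cudatoolkit shows what the environment actually provides):

import torch

print(torch.__version__)          # e.g. 1.3.0
print(torch.version.cuda)         # CUDA version the binary was built with, e.g. 10.1.243
print(torch.cuda.is_available())  # False here matches the "driver is too old" assertion above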

phongnhhn92 changed the title from "Can not use .cuda() function to load the model into GPU" to "Can not use .cuda() function to load the model into GPU using Pytorch 1.3" on Oct 11, 2019

Zehaos commented Oct 11, 2019

Same issue here. The current conda cudatoolkit version is old (10.1.168).

@bryant1410 (Contributor)

Guys, you can still use pytorch=1.3.0 with cudatoolkit=10.0

soumith (Member) commented Oct 11, 2019

@jjhelmus would you know when anaconda would upgrade to 10.1.243 if there's a plan. Also, if there's a better way to ask about anaconda cuda / cudnn upgrades let me know I'll follow it :)

soumith (Member) commented Oct 11, 2019

@ptrblck can you folks try to repro this (see the user's 2nd comment)? Basically, they are running into a hang of some sort with 10.1.243 vs 10.1.168.

huyvnphan commented Oct 12, 2019

I have the same issue. Right now, the only solution is to revert to PyTorch 1.2.

ssnl (Collaborator) commented Oct 12, 2019

@soumith I can reproduce this with system CUDA cuda_10.1.105_418.39, conda cudatoolkit pkgs/main/linux-64::cudatoolkit-10.1.168-0 and 1.3 binary.

@mazzma12

Guys, you can still use pytorch=1.3.0 with cudatoolkit=10.0

Interesting, so the CUDA version would not be the cause of the problem? Where do you find information about the CUDA version required for each torch version, and the installation steps? I couldn't find it on the torch website.

I still see CUDA unavailable when installing with pip, though.

@WenmuZhou

pytorch 1.3 with cuda 10 has the same error

leftthomas commented Oct 12, 2019

@mazzma12 here you can find the exact supported CUDA version

soumith (Member) commented Oct 12, 2019

Another issue reporter says that the startup time for 10.1 is in minutes, and we are looking into it. So it's not deadlocked, but starts after a few minutes. It looks like some PTX->SASS compilation is happening: #27807
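
For anyone unsure whether they are seeing a real hang or just this slow first-time compilation, a small timing sketch (assumes a CUDA-capable machine); on an affected setup the first transfer takes minutes while the second is near-instant:

import time
import torch

start = time.time()
x = torch.rand(10).cuda()   # first CUDA call; any PTX->SASS JIT compilation happens here
torch.cuda.synchronize()
print(f"first .cuda() call: {time.time() - start:.1f} s")

start = time.time()
y = torch.rand(10).cuda()   # subsequent calls should be fast
torch.cuda.synchronize()
print(f"second .cuda() call: {time.time() - start:.1f} s")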

soumith added the high priority and module: binaries labels on Oct 12, 2019
soumith (Member) commented Oct 12, 2019

@leftthomas @phongnhhn92 @ssnl can you confirm or deny that pip install -U torch does not produce the slow-down?

soumith (Member) commented Oct 12, 2019

In the interest of not having two threads (and not copy-pasting my comments), I am closing this in favor of #27807.

Please follow updates at #27807.
I think I have a possible solution; I'll post an update there.

soumith closed this as completed on Oct 12, 2019
soumith (Member) commented Oct 12, 2019

This issue is now fixed with newly updated binaries.
Uninstalling and reinstalling PyTorch from Anaconda will fix it.

@phongnhhn92 (Author)

I have tried creating a new conda environment and I can still see the mismatch between the cudatoolkit version (10.1.168) and the CUDA version PyTorch was built with (10.1.243).
[screenshot: conda list output]
When I check CUDA with the installed PyTorch, this is the result:
[screenshot: Python CUDA check output]

soumith (Member) commented Oct 14, 2019

@phongnhhn92 upgrade your NVIDIA driver

@phongnhhn92 (Author)

@phongnhhn92 upgrade your NVIDIA driver

It works. Thanks!

@jjhelmus

@jjhelmus would you know when anaconda would upgrade to 10.1.243 if there's a plan. Also, if there's a better way to ask about anaconda cuda / cudnn upgrades let me know I'll follow it :)

I'll add an update to the cudatoolkit and related packages to our backlog. A new sprint starts next Monday so these will likely be available sometime next week. The anaconda-issues repository is a better place for requests like these. Multiple members of Anaconda's distribution team monitor that repository's issue tracker.

@MOAboAli

@phongnhhn92 upgrade your NVIDIA driver

I get the same error, but I don't have an NVIDIA driver to upgrade; my display adapter is Intel HD Graphics 4000.

So in my case, what can I do?

phongnhhn92 (Author) commented Oct 18, 2019

In your case, you don't have an NVIDIA GPU, so you shouldn't use the .cuda() function at all.
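
For reference, the usual device-agnostic pattern, which runs the same code on CPU-only and GPU machines; a minimal sketch with a toy model (not the Captum example itself):

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)        # any nn.Module moves the same way
inputs = torch.rand(8, 4, device=device)  # create (or move) tensors on the same device
outputs = model(inputs)                   # runs on the GPU if one is available, otherwise on the CPU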

@MOAboAli

I get this error:

[screenshot: the error message]

@MOAboAli

So what is the replacement?

@isalirezag

I still see that error, although I have CUDA 10.1 and nvcc -V shows that I have CUDA 10.1.

ptrblck (Collaborator) commented Oct 21, 2019

@mohamedaboali1990 as @phongnhhn92 explained, you won't be able to use the .cuda() calls if you don't have an NVIDIA GPU.

@isalirezag did you (re)install the binaries after the fix was published?

ptrblck (Collaborator) commented Oct 24, 2019

@KoalaSheep Your local CUDA (and cudnn) installations won't be used, as the PyTorch binaries ship with their own CUDA, cudnn and other libs.

Could you create a new conda environment and reinstall the latest PyTorch version, please?
Let us know, if you still face this issue.

olix86 commented Oct 25, 2019

Is this issue specific to conda? I'm having the same issue with CUDA 10.0 and pytorch 1.3.0 but using pip... @soumith

vlad-i commented Oct 28, 2019

@phongnhhn92 upgrade your NVIDIA driver

It works. Thanks!

How do you update the Nvidia driver? Is there a one-liner? I'm on a GCP instance.

I'm also looking into this, trying to get the Nvidia driver to update via the terminal, no luck so far.

KoalaSheep commented Oct 29, 2019

@KoalaSheep Your local CUDA (and cudnn) installations won't be used, as the PyTorch binaries ship with their own CUDA, cudnn and other libs.

Could you create a new conda environment and reinstall the latest PyTorch version, please?
Let us know, if you still face this issue.
@ptrblck

Thank you for the reply. After making a new environment, I still face the same problem.
When I install PyTorch, cudatoolkit 10.1.168 is installed and it still takes too long to move to CUDA.

I reinstalled pytorch 1.2; cudatoolkit was downgraded to 10.0.130-0, and it works well for me now.
Still waiting for a way to use pytorch 1.3.

Smerity (Contributor) commented Nov 1, 2019

Just to note, I ran into a potentially related issue when upgrading to PyTorch 1.3, so in case this helps anyone:

My Titan V card had no issue but my GTX 1080 Ti reported "cuda runtime error (209): ... no kernel image is available for execution on the device" upon using something from rnn.py.

This was with conda install pytorch cudatoolkit=10.1 -c pytorch.

Whilst this may be related to an older NVIDIA driver, my system decided to make upgrading that difficult. I instead ran conda install pytorch torchvision cudatoolkit=10.0 -c pytorch and that seems to work for now.

I'll tackle broken Ubuntu / Nvidia packages another day ^_^
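
For anyone hitting the same "no kernel image is available" error, a small diagnostic sketch that prints each GPU's compute capability (the GTX 1080 Ti is 6.1 and the Titan V is 7.0, which may be why only one card was affected); whether the installed binary ships kernels for that capability is what matters:

import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")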

bryant1410 (Contributor) commented Nov 1, 2019 via email

Smerity (Contributor) commented Nov 1, 2019

Using conda install pytorch torchvision cudatoolkit=10.1.243 -c pytorch still results in a RuntimeError on my 1080 Ti sadly:

  File "/home/smerity/anaconda3/envs/pyt/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 516, in forward_impl
    dtype=input.dtype, device=input.device)
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at /opt/conda/conda-bld/pytorch_1570910687650/work/aten/src/THC/generic/THCTensorMath.cu:35

soumith (Member) commented Nov 1, 2019

OK, let me get my hands on a 1080 Ti. This shouldn't happen; it is weird.

soumith (Member) commented Nov 1, 2019

@Smerity can you please confirm that running https://github.com/pytorch/examples/tree/master/word_language_model with python main.py --cuda is also failing on your end?

soumith (Member) commented Nov 1, 2019

Also, can you give your output of nvidia-smi, and preferably open a new issue? I want to get to reproducing it; so far I haven't been able to.

jjhelmus commented Nov 4, 2019

@jjhelmus would you know when anaconda would upgrade to 10.1.243 if there's a plan

cudatoolkit 10.1.243 packages are now available in defaults for the linux-64 and win-64 platforms.

soumith (Member) commented Nov 4, 2019

thanks, noticed that over the weekend :)

@magic282

I still have this problem.
conda packages:

pytorch                   1.3.1           py3.6_cuda10.1.243_cudnn7.6.3_0    pytorch
cudatoolkit               10.1.243             h6bb024c_0

nvidia-smi output:

Mon Nov 11 14:46:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 0000A9AA:00:00.0 Off |                    0 |
| N/A   31C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 0000D120:00:00.0 Off |                    0 |
| N/A   30C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 0000E26F:00:00.0 Off |                    0 |
| N/A   30C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 0000F5EF:00:00.0 Off |                    0 |
| N/A   30C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

@magic282

Found that downgrading to pytorch 1.3.0 fixes this issue:

sudo /opt/conda/bin/conda install cudatoolkit=10.0 pytorch=1.3.0 -c pytorch -n pytorch-py3.6

But this seems to be just an ad-hoc fix.

@sayakpaul

I am facing this issue on my GCP instance, which is equipped with CUDA 10.0. Is anyone else facing the same on a GCP instance?

Also, out of curiosity, I ran !nvcc --version on a Colab notebook and found that the CUDA version there is also 10.0, yet PyTorch 1.3.1 runs successfully.

justanhduc commented Nov 12, 2019

I have a problem with .cuda(): loading to the GPU using .cuda() takes forever on an RTX 2080. I tried the following simple script:

import torch as T
foo = T.rand(10)
foo.cuda()

The system hangs (or takes an unreasonably long time) at the last command. I checked nvidia-smi and found that the memory slowly increased.

Environment:

  • Pytorch 1.3.0 installed using the default conda command.
  • OS: Ubuntu 16.04 LTS
  • Python version: 3.7
  • CUDA runtime version: 10.1.243
  • CuDNN version: 7603

Edit: Fixed in 1.3.1.

soumith (Member) commented Nov 14, 2019

@magic282 please try upgrading your NVIDIA driver to 430 or above and confirm whether that fixes things. I have just tried things on a P100 and GP100, and the binaries worked fine for me.

@sayakpaul

I am facing this issue on my GCP instance, which is equipped with CUDA 10.0. Is anyone else facing the same on a GCP instance?

Also, out of curiosity, I ran !nvcc --version on a Colab notebook and found that the CUDA version there is also 10.0, yet PyTorch 1.3.1 runs successfully.

Anything on this? :(

soumith (Member) commented Nov 15, 2019

@sayakpaul what GPU does your instance have? By "the same issue", can you expand on what you're seeing?

@sayakpaul

@soumith it's a P100. I am seeing:

The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.

soumith (Member) commented Nov 15, 2019

@sayakpaul as the error message says, upgrade your CUDA driver; you installed the CUDA 10.1-compatible pytorch package, which is the default.

sayakpaul commented Nov 16, 2019

@soumith yeah! But Colab also has CUDA 10.0, and PyTorch 1.3.1 still runs there.

soumith (Member) commented Nov 16, 2019

Colab is loaded up with special builds of PyTorch that are built against CUDA 10.0.

@magic282

@soumith Thank you. Upgrading the driver indeed solves this problem. I was actually running inside Docker on a cluster, so I didn't know that the driver on the host machine was older.
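
For anyone else running inside Docker, a small sketch to read the host driver version from within a container (nvidia-smi in an NVIDIA-runtime container reports the host driver; the query flags below are standard nvidia-smi options):

import subprocess

# Ask nvidia-smi for just the driver version; inside Docker this is the host's driver.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    stdout=subprocess.PIPE, universal_newlines=True, check=True,
)
print(out.stdout.strip())  # e.g. 410.78, which is too old for the CUDA 10.1 binaries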

@sayakpaul

@soumith thanks much for the clarification!

darolt commented Jan 11, 2020

Just to register here: I got the same error on a fresh install of pytorch 1.3 and cuda 10.1. Both showed the same CUDA version in conda list. For me, updating the driver was not an option because my card (Tesla K40) has 418.xx as its suggested driver (NVIDIA drivers page). Downgrading to pytorch 1.3.0 solved the problem for me.

AceEviliano commented Jan 21, 2020

@darolt I am new to this community, so I don't understand where the above discussion finally ended. Can you help me with this? I have a K40c as well, and I am stuck installing PyTorch. I tried all versions, and most of them take too long to load a model to the GPU, and then this pops up:

RuntimeError: cuda runtime error (209) : no kernel image is available for execution

Here is some more info:

     active environment : base
    active env location : /home/rishi/anaconda3
            shell level : 1
       user config file : /home/rishi/.condarc
 populated config files : /home/rishi/.condarc
          conda version : 4.8.1
    conda-build version : 3.18.8
         python version : 3.7.3.final.0
       virtual packages : __cuda=10.1
                          __glibc=2.23
       base environment : /home/rishi/anaconda3  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/rishi/anaconda3/pkgs
                          /home/rishi/.conda/pkgs
       envs directories : /home/rishi/anaconda3/envs
                          /home/rishi/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.1 requests/2.22.0 CPython/3.7.3 Linux/4.4.0-170-generic ubuntu/16.04 glibc/2.23
                UID:GID : 1014:1014
             netrc file : None
           offline mode : False

And here's my GPU info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 00000000:02:00.0 Off |                    0 |
| 43%   74C    P0   132W / 235W |   3609MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
