
Version 1.3 no longer supporting Tesla K40m? #30532

Open
JamesOwers opened this issue Nov 27, 2019 · 11 comments

Comments

@JamesOwers commented Nov 27, 2019

🐛 Bug

I am using a Tesla K40m with PyTorch 1.3 installed via conda, using CUDA 10.1.

To Reproduce

Steps to reproduce the behavior:

  1. Have a box with a Tesla K40m
  2. conda install pytorch cudatoolkit -c pytorch
  3. Show CUDA is available:
python -c 'import torch; print(torch.cuda.is_available());'
>>> True
  4. Instantiate a model and call .forward()
Traceback (most recent call last):
  File "./baselines/get_results.py", line 395, in <module>
    main(args)
  File "./baselines/get_results.py", line 325, in main
    log_info = eval_main(eval_args)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/baselines/eval_task.py", line 165, in main
    log_info = trainer.test(0, evaluate=True)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/mdtk/pytorch_trainers.py", line 110, in test
    evaluate=evaluate)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/mdtk/pytorch_trainers.py", line 220, in iteration
    model_output = self.model.forward(input_data, input_lengths)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/mdtk/pytorch_models.py", line 49, in forward
    self.hidden = self.init_hidden(batch_size, device=device)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/mdtk/pytorch_models.py", line 40, in init_hidden
    return (torch.randn(1, batch_size, self.hidden_dim, device=device),
RuntimeError: CUDA error: no kernel image is available for execution on the device

I first tried downgrading to cudatoolkit=10.0, which exhibited the same issue.

The code runs fine if you repeat the steps above but instead conda install pytorch=1.2 cudatoolkit=10.0 -c pytorch.

Expected behavior

If a specific GPU is no longer supported, please fail at load time with a useful error message.
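As a sketch of what such a check could look like (the function name, `MIN_SUPPORTED_CAPABILITY`, and the (3, 7) minimum are assumptions taken from this thread, not PyTorch's actual API):

```python
# Hypothetical load-time check; names are illustrative, not PyTorch API.
MIN_SUPPORTED_CAPABILITY = (3, 7)  # assumed minimum for the prebuilt binaries

def check_capability(device_capability, minimum=MIN_SUPPORTED_CAPABILITY):
    """Raise a clear, actionable error instead of a late kernel failure."""
    if device_capability < minimum:
        raise RuntimeError(
            f"GPU compute capability {device_capability[0]}.{device_capability[1]} "
            f"is below the minimum {minimum[0]}.{minimum[1]} supported by this "
            "prebuilt binary; build PyTorch from source with TORCH_CUDA_ARCH_LIST "
            "set to include your architecture."
        )

# A K40m reports (3, 5), so check_capability((3, 5)) would fail fast
# with a readable message rather than "no kernel image is available".
```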

Environment

Unfortunately I ran your script after I 'fixed' the problem, so the PyTorch version shown here is 1.2 - the issue was encountered with version 1.3.

Collecting environment information...
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Scientific Linux release 7.6 (Nitrogen)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: version 2.8.12.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K40m
Nvidia driver version: 430.50
cuDNN version: /usr/lib64/libcudnn.so.6.5.18

Versions of relevant libraries:
[pip3] numpy==1.16.3
[pip3] numpydoc==0.8.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.4                      243  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.0.15           py37ha843d7b_0  
[conda] mkl_random                1.1.0            py37hd6b4f25_0  
[conda] pytorch                   1.2.0           py3.7_cuda10.0.130_cudnn7.6.2_0    pytorch
[conda] torchvision               0.4.0                py37_cu100    pytorch

cc @ezyang @gchanan @zou3519 @jerryzh168 @ngimel

@albanD (Contributor) commented Nov 27, 2019

Just to be sure, were you using 1.3.0 or 1.3.1?

@JamesOwers (Author) commented Nov 27, 2019

1.3.1

conda list 'pytorch|cuda'
>>> # packages in environment at /home/s0816700/miniconda3/envs/mdtk:
>>> #
>>> # Name                    Version                   Build  Channel
>>> cudatoolkit               10.1.243             h6bb024c_0  
>>> pytorch                   1.3.1           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch

That was the env at the point of failure.

@albanD (Contributor) commented Nov 27, 2019

cc @ngimel

@SsnL (Collaborator) commented Nov 27, 2019

The K40m has a compute capability of 3.5, which I believe we have dropped support for.

@JamesOwers (Author) commented Nov 27, 2019

OK. Could you please implement a useful "old GPU" warning, like the one in #6529?

The error at the moment is very unclear to a casual user like me.

--- EDIT ---
It would also be great to link users:

  1. to a page detailing what compute capabilities you support (if such a page exists), and
  2. to instructions for finding out what the compute capability of your GPU is (I guess here: https://developer.nvidia.com/cuda-gpus#compute for most?)

I have struggled (and am still struggling) to find both of those things!
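For point 2, a small sketch of checking your own GPU from Python (torch.cuda.get_device_capability is a real PyTorch call; the (3, 7) minimum and the helper names are assumptions based on this thread):

```python
def meets_minimum(capability, minimum=(3, 7)):
    """Compare (major, minor) compute-capability tuples lexicographically."""
    return capability >= minimum

def report_capability(device=0):
    import torch  # imported here so meets_minimum stays torch-free
    major, minor = torch.cuda.get_device_capability(device)
    print(f"GPU {device} compute capability: {major}.{minor}, "
          f"meets assumed minimum: {meets_minimum((major, minor))}")
```

On a K40m, report_capability() would show 3.5 and False under the assumed minimum.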

As an aside, @SsnL - possibly this line needs updating if you are correct:

on an NVIDIA GPU with compute capability >= 3.0.

Where did you get your information about minimal compute capability support?

@ptrblck (Contributor) commented Nov 30, 2019

@JamesOwers If I'm not mistaken, this commit bumped the minimal compute capability to 3.7.

@xsacha (Contributor) commented Nov 30, 2019

There's no technical reason for it to be changed to 3.7, right? The code still supports 3.5 (and even 3.0 again).

Is this just for conda? It looks like the build went from 3.5 and 5.0+ to 3.7 and 5.0+, so it was always missing either 3.5 or 3.7. I suppose it takes too long, or the binaries become too large, to support more than two built architectures.

@ptrblck (Contributor) commented Dec 1, 2019

@soumith might correct me, but I think the main reason is the growing size of the binaries.

@xsacha (Contributor) commented Dec 2, 2019

@ptrblck that is the reason, but it is strange that it went from supporting the K40 (plus several consumer cards) but not the K80 to supporting the K80 but not the K40 (plus several consumer cards).

on an NVIDIA GPU with compute capability >= 3.0.

I also wish there were a way for the message to reflect the minimum CUDA arch from the arch list the binary was compiled with. This would make it easier when the minimum gets changed to 3.7, for example, or when a user supports 3.0 by compiling it themselves.
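A minimal sketch of deriving that minimum from compiled-in arch strings (the arch list shown is an example; how the binary would expose its actual list is an assumption):

```python
def parse_arch(arch):
    """Turn an arch string like 'sm_37' into a (major, minor) tuple."""
    digits = arch.split("_")[1]
    return (int(digits[:-1]), int(digits[-1]))

def minimum_capability(arch_list):
    """Lowest compute capability among the compiled-in architectures."""
    return min(parse_arch(a) for a in arch_list)

# Example list for illustration; with it, the error message could say
# "minimum supported compute capability is 3.7" instead of a hard-coded 3.0:
# minimum_capability(["sm_37", "sm_50", "sm_61"]) -> (3, 7)
```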

@ezyang (Contributor) commented Dec 3, 2019

This is also being discussed at #24205 (comment)

@jeherr commented Dec 16, 2019

I'd just like to suggest that the compatible compute capabilities for the precompiled binaries be added somewhere to the documentation, especially when providing installation instructions for the binaries. That information does not appear to be readily available anywhere.

@ngimel ngimel added the module: docs label Dec 16, 2019