invalid device ordinal in pytorch1.8+cuda11.1 #54245
This is typically a symptom of hardware problems.
I'm able to reproduce this issue. However, it seems to be resolved by @zasdfgbnm's PR to replace thrust with cub in randperm, which landed on March 12th.

Nightly from March 12th (before the fix):

```shell
$ pip install torch==1.9.0.dev20210312 torchvision==0.9.0.dev20210312 -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html
$ python -c "import torch; print(torch.__version__); import torchvision; print(torchvision.__version__)"
1.9.0.dev20210312+cu111
0.9.0.dev20210312+cu111
$ CUDA_VISIBLE_DEVICES=0 python lala.py
...
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
$ CUDA_VISIBLE_DEVICES=1 python lala.py
...
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
```

Nightly from March 13th (after the fix):

```shell
$ pip install torch==1.9.0.dev20210313 torchvision==0.9.0.dev20210313 -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html
$ python -c "import torch; print(torch.__version__); import torchvision; print(torchvision.__version__)"
1.9.0.dev20210313+cu111
0.9.0.dev20210313+cu111
$ CUDA_VISIBLE_DEVICES=0 python lala.py
using cuda:0 device.
$ CUDA_VISIBLE_DEVICES=1 python lala.py
using cuda:0 device.
```

@malfet would it be possible to pick this PR for the 1.8.1 release?
@ptrblck there are a few more operators that use thrust.
I don't know yet. I've run some unit tests on them.
I am not sure if this is a thrust problem or a build problem. I manually built PyTorch 1.8 and PyTorch master with my thrust->cub PR (#53841) reverted, and neither of them reproduces the failure. I also tried our 20.12 container, and cannot reproduce there either. I can only reproduce the error with the PyTorch wheels downloaded from the official site.

PS: minimal repro:

```python
import torch
torch.randperm(159826, device='cuda')
```
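Until a build with the fix is installed, a possible workaround (my own sketch, not something suggested in this thread; `safe_randperm` is a hypothetical helper name) is to generate the permutation on CPU, which is unaffected, and then copy it to the GPU, avoiding the broken CUDA radix_sort path:

```python
import torch

def safe_randperm(n, device=None):
    """Hypothetical workaround: build the permutation on CPU, then move it.

    The CPU path is unaffected by the failing CUDA radix_sort, so this
    sidesteps the `invalid device ordinal` error at the cost of host
    generation plus a host-to-device copy.
    """
    perm = torch.randperm(n)  # CPU generation works on the affected builds
    return perm if device is None else perm.to(device)

# Works on CPU; pass device="cuda" on a GPU machine.
perm = safe_randperm(10)
```

Note this draws from the CPU RNG stream, so the values differ from what `torch.randperm(n, device='cuda')` would produce, and the extra copy has a cost for large `n`.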
I also can't repro with source builds. There's #52663, which looks very similar and reproduces with a master source build, but it also might be related to how extensions are built.
@zasdfgbnm what version of the CUDA toolkit (up to the minor revision) are you using? And are you linking statically or not?
@malfet I am using CUDA 11.2, linking dynamically. I also tried CUDA 11.1 in our 20.12 container, with PyTorch dynamically linked, and cannot repro there either.
@zasdfgbnm, can you paste the output of
This is what I have:
Same problem here on a GTX 1660. (I also get the error on a GTX 2080.)

```python
import torch
```
The same issue encountered:

```python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.randperm(65535)
tensor([23965, 52215, 46436,  ..., 55898, 54460, 32672])
>>> torch.randperm(32, device=0)
tensor([ 4, 17, 19, 11, 16, 20,  5,  3, 21,  2, 31,  9,  6, 29, 27, 28,  0,  7,
        22, 15, 10, 12, 30,  8, 24, 26, 18,  1, 14, 13, 25, 23],
       device='cuda:0')
>>> torch.randperm(65535, device=0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
```

Here is the environment:
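The session above shows `n=32` succeeding on GPU while `n=65535` fails. To narrow down where the failure starts on an affected machine, a small probe like this can bisect the boundary (my own sketch; `fails` and `smallest_failing` are hypothetical helpers, and the bisection assumes the failure is monotone in `n`, which may not hold):

```python
import torch

def fails(n, device="cuda"):
    """Return True if torch.randperm raises a RuntimeError on `device`."""
    try:
        torch.randperm(n, device=device)
        return False
    except RuntimeError:
        return True

def smallest_failing(lo=32, hi=65535, device="cuda"):
    # Bisect for the smallest failing size; treat the result as a probe,
    # not a guarantee, since the failure may not be monotone in n.
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid, device):
            hi = mid
        else:
            lo = mid + 1
    return lo

print(fails(16, device="cpu"))  # False: the CPU path is unaffected
```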
The DGL project also encountered the same problem before. The solution is for each library to use its own Thrust/CUB namespace. The root cause is that CUB uses a static variable inside a template function for device-attribute caching (see https://github.com/NVIDIA/cub/blob/f5ef160af684fcd00c76443c42a393cae5653f2e/cub/util_device.cuh).
Reference:
This issue should be fixed in the 1.8.1 release and in the nightly binaries.
Thank you for your great work. I've seen that 1.8.1 has been released on the official website.
Is this really fixed? I still see the same issue with PyTorch 1.9.0.
Fixed here: #52663
🐛 Bug
To Reproduce
Running test.py:
I get the following info:
Expected behavior
There should be no error here.
With PyTorch 1.6/1.7 GPU builds, there is no error.
Environment
How PyTorch was installed (conda, pip, source): pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

In addition, I tried the official docker image (pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime), but still encountered the same problem.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @ngimel