Error in SVD cusolver on Linux #69203

Open
denix56 opened this issue Dec 1, 2021 · 19 comments
Labels
module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul)
needs reproduction (Someone else needs to try reproducing the issue given the instructions. No action needed from user)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

Comments

@denix56

denix56 commented Dec 1, 2021

🐛 Bug

The SVD functions (svd, svdvals) raise an error when called on certain inputs on Linux, while the same calls work on Windows:
RuntimeError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling cusolverDnSgesvdj( handle, jobz, econ, m, n, A, lda, S, U, ldu, V, ldv, static_cast<float*>(dataPtr.get()), lwork, info, params)

To Reproduce

import torch

# Load the attached checkpoint and run SVD on the stored tensor
cpt = torch.load('tensor.pth')
a = cpt['t']
torch.linalg.svd(a)

Expected behavior

No error should be raised; the function should run to completion.

Environment

Windows Environment:

PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Home
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.22.0
Libc version: N/A

Python version: 3.8.11 (default, Aug  3 2021, 06:49:12) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22000-SP0
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 466.85
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] pytorch-lightning==1.4.1
[pip3] torch==1.10.0
[pip3] torchmetrics==0.4.1
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.11.0
[pip3] torchvision==0.11.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.3.1               h59b6b97_2
[conda] mkl                       2021.3.0           haa95532_524
[conda] mkl-service               2.4.0            py38h2bbff1b_0
[conda] mkl_fft                   1.3.0            py38h277e83a_2
[conda] mkl_random                1.2.2            py38hf11a4ad_0
[conda] numpy                     1.20.3           py38ha4e8547_0
[conda] numpy-base                1.20.3           py38hc2deb75_0
[conda] pytorch                   1.10.0          py3.8_cuda11.3_cudnn8_0    pytorch
[conda] pytorch-lightning         1.4.1              pyhd8ed1ab_0    conda-forge
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchmetrics              0.4.1              pyhd8ed1ab_0    conda-forge
[conda] torchsummary              1.5.1                    pypi_0    pypi
[conda] torchtext                 0.11.0                     py38    pytorch
[conda] torchvision               0.11.0               py38_cu113    pytorch

Linux Environment:

PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux Server release 7.9 (Maipo) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB-LS
GPU 1: Tesla V100-SXM2-32GB-LS
GPU 2: Tesla V100-SXM2-32GB-LS
GPU 3: Tesla V100-SXM2-32GB-LS
GPU 4: Tesla V100-SXM2-32GB-LS
GPU 5: Tesla V100-SXM2-32GB-LS
GPU 6: Tesla V100-SXM2-32GB-LS
GPU 7: Tesla V100-SXM2-32GB-LS

Nvidia driver version: 450.156.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] pytorch-lightning==1.5.3
[pip3] torch==1.10.0
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.11.1
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libblas                   3.9.0            12_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            12_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            12_linux64_mkl    conda-forge
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] numpy                     1.20.3           py38h9894fe3_1    conda-forge
[conda] pytorch                   1.10.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-lightning         1.5.3              pyhd8ed1ab_0    conda-forge
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchmetrics              0.6.0              pyhd8ed1ab_0    conda-forge
[conda] torchvision               0.11.1               py38_cu113    pytorch

tensor.zip

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano

@denix56
Author

denix56 commented Dec 1, 2021

I just checked with a torch build on Linux with CUDA 10.2, and it works fine.

@xwang233
Collaborator

xwang233 commented Dec 1, 2021

I tried this with 1.10+cuda11.3 pip wheel, and it worked fine on a Linux V100 machine.

In case anyone is interested, the tensor has size (1152, 384) and dtype float, and it doesn't contain any NaN or inf elements.

You may see CUSOLVER_STATUS_EXECUTION_FAILED error in GPU SVD calculations if your input matrix contains NaN or inf. This is a known issue in pytorch 1.10, and is already fixed in the latest master branch.
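A quick way to rule out the NaN/inf cause mentioned above is to summarise the input before calling SVD. This is only a minimal sketch: the helper name `describe_svd_input` is hypothetical, and the small tensors stand in for the real input from the report.

```python
import torch

def describe_svd_input(t: torch.Tensor) -> dict:
    """Summarise the properties that matter when debugging a failing SVD call."""
    return {
        "shape": tuple(t.shape),
        "dtype": t.dtype,
        "device": str(t.device),
        "has_nan": bool(torch.isnan(t).any()),
        "has_inf": bool(torch.isinf(t).any()),
    }

# Illustrative inputs only; replace with the tensor that triggers the error.
ok = torch.ones(4, 3)
bad = ok.clone()
bad[0, 0] = float("nan")

print(describe_svd_input(ok))
print(describe_svd_input(bad))
```

If both `has_nan` and `has_inf` are false, as reported for the tensor in this issue, the known NaN/inf problem is unlikely to be the explanation.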

xwang233 added the module: linear algebra label Dec 1, 2021
@denix56
Author

denix56 commented Dec 1, 2021

I used the PyTorch version from the pytorch conda channel.

@lezcano
Collaborator

lezcano commented Dec 1, 2021

Could you try using the nightly version to see if this is fixed in master?

H-Huang added the triaged label Dec 1, 2021
@denix56
Author

denix56 commented Dec 3, 2021

> Could you try using the nightly version to see if this is fixed in master?

Same error.
Maybe the problem is in CUDA itself?

@denix56
Author

denix56 commented Dec 6, 2021

I will try again when a torch build with the latest CUDA 11.5 is available.

mruberry added the needs reproduction label Dec 10, 2021
@gleize

gleize commented Dec 17, 2021

Could this be related to #70122?

@lezcano
Collaborator

lezcano commented Dec 18, 2021

I don't think they are related, as solve does not use an SVD internally.

@lezcano
Collaborator

lezcano commented Feb 8, 2022

What's the state of this, @denix56? Did you try with a different CUDA version?

@DinoMan

DinoMan commented Feb 14, 2022

I also get a similar error when using torch.svd() or torch.linalg.svd(). It started after I upgraded from PyTorch 1.9 and CUDA 11.1 to 1.10.2 and CUDA 11.3.1. In my case it is even weirder, since the error only comes up the first time I run the SVD. If I run it again it works, and when I checked with some simple matrices it seemed to give correct results. In theory I can catch the exception and retry, which seems to fix it since the error only happens at the start of the program, but this is not a real solution.
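The catch-and-retry stopgap described above could look roughly like the sketch below. The names `call_with_one_retry` and `flaky_square` are hypothetical, and the flaky function only simulates a call that fails on its first invocation; it is a workaround for a first-call failure, not a fix for the underlying error.

```python
import logging

logger = logging.getLogger(__name__)

def call_with_one_retry(fn, *args, **kwargs):
    """Call fn; if it raises RuntimeError, log it and try exactly once more."""
    try:
        return fn(*args, **kwargs)
    except RuntimeError as err:
        logger.warning("first call failed, retrying once: %s", err)
        # A second failure propagates to the caller unchanged.
        return fn(*args, **kwargs)

# Stand-in for an operation that fails only on its first invocation.
calls = {"count": 0}

def flaky_square(x):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated first-call failure")
    return x * x

result = call_with_one_retry(flaky_square, 3)
print(result, calls["count"])
```

With PyTorch the wrapper would be used as `call_with_one_retry(torch.linalg.svd, b)`, since the cusolver failure in this thread surfaces as a `RuntimeError`. Retrying only once keeps a persistent failure visible instead of hiding it in a loop.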

@lezcano
Collaborator

lezcano commented Feb 14, 2022

Could you try with the nightly version to see if this still happens to you, @DinoMan?

@DinoMan

DinoMan commented Feb 14, 2022

Yes, I just checked with the nightly version and it still happens.

@lezcano
Collaborator

lezcano commented Feb 14, 2022

Could you provide further details on your set-up? GPU / versions of everything / a small repro (in case it is different to that in the OP)
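For a short summary of the set-up being asked about, something like the following sketch may be enough. The helper name `summarize_setup` is hypothetical; the calls it uses are standard PyTorch attributes. A fuller report in the format shown in the opening post can typically be produced with `python -m torch.utils.collect_env`.

```python
import platform
import torch

def summarize_setup() -> dict:
    """Collect the version details most relevant to a GPU linear algebra bug."""
    info = {
        "python": platform.python_version(),
        "torch": str(torch.__version__),
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpus": [],
    }
    if info["cuda_available"]:
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info

setup = summarize_setup()
for key, value in setup.items():
    print(f"{key}: {value}")
```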

@DinoMan

DinoMan commented Feb 14, 2022

The GPUs are A100s; the problem manifests whether I run on one or several.
PyTorch is installed through conda. Here are the versions of the packages:
pytorch: 1.12.0.dev20220214
torchvision: 0.13.0.dev20220214
cudatoolkit: 11.3.1

This small snippet was enough to get the error originally:

import torch
a = torch.Tensor([[4,1,1],[1,2,0],[1,0,1]])
b = torch.cat(58*[a.unsqueeze(0)])
torch.linalg.svd(b)

However, as I mentioned, the second time round it works without issues.

@lezcano
Collaborator

lezcano commented Feb 14, 2022

I don't have access to an A100. Could you have a look at this one @xwang233?

@xwang233
Collaborator

The script in #69203 (comment) runs SVD on CPU. I didn't get any error on that. Besides that, I added b.cuda() and tested on A100 with 1.12.0.dev20220214+cu113 nightly build. I still couldn't get any error.
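A GPU variant of that snippet might look like the sketch below. Note that `Tensor.cuda()` and `Tensor.to()` return a new tensor rather than moving the original in place, so the result has to be assigned. The fallback to CPU when no GPU is present and the reconstruction check are additions for illustration, not part of the original script.

```python
import torch

# Assumption: fall back to CPU so the sketch still runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([[4.0, 1.0, 1.0], [1.0, 2.0, 0.0], [1.0, 0.0, 1.0]])
# Batch of 58 identical 3x3 matrices, moved to the chosen device.
b = torch.cat(58 * [a.unsqueeze(0)]).to(device)

U, S, Vh = torch.linalg.svd(b)

# Sanity check: U @ diag(S) @ Vh should reproduce the input batch.
recon = U @ torch.diag_embed(S) @ Vh
print(b.device, torch.allclose(recon, b, atol=1e-4))
```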

@xwang233
Collaborator

Hi @DinoMan , to help you better debug this issue on A100, can you try to isolate a code snippet that reproduces the issue in a docker container? For example, you can try the latest version of NGC pytorch from here https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch.

@DinoMan

DinoMan commented Feb 15, 2022

Thanks for checking this out, @xwang233. I have switched back to the previous setup (PyTorch 1.9 and CUDA 11.1), which does not have this issue, but I will try to reproduce it in the container at some point and report back here.

@lezcano
Collaborator

lezcano commented Mar 24, 2022

@DinoMan any updates on this?

7 participants