Error in SVD cusolver on Linux #69203

denix56 · 2021-12-01T09:04:14Z

🐛 Bug

The SVD functions (svd, svdvals) raise an error when called on certain tensors on Linux, while the same calls work on Windows:
RuntimeError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling cusolverDnSgesvdj( handle, jobz, econ, m, n, A, lda, S, U, ldu, V, ldv, static_cast<float*>(dataPtr.get()), lwork, info, params)

To Reproduce

import torch

# Load the attached tensor and run SVD on it
cpt = torch.load('tensor.pth')
a = cpt['t']
torch.linalg.svd(a)

Expected behavior

No error should be raised; the function should run to completion.

Environment

Windows Environment:

PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Home
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.22.0
Libc version: N/A

Python version: 3.8.11 (default, Aug  3 2021, 06:49:12) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22000-SP0
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 466.85
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] pytorch-lightning==1.4.1
[pip3] torch==1.10.0
[pip3] torchmetrics==0.4.1
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.11.0
[pip3] torchvision==0.11.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.3.1               h59b6b97_2
[conda] mkl                       2021.3.0           haa95532_524
[conda] mkl-service               2.4.0            py38h2bbff1b_0
[conda] mkl_fft                   1.3.0            py38h277e83a_2
[conda] mkl_random                1.2.2            py38hf11a4ad_0
[conda] numpy                     1.20.3           py38ha4e8547_0
[conda] numpy-base                1.20.3           py38hc2deb75_0
[conda] pytorch                   1.10.0          py3.8_cuda11.3_cudnn8_0    pytorch
[conda] pytorch-lightning         1.4.1              pyhd8ed1ab_0    conda-forge
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchmetrics              0.4.1              pyhd8ed1ab_0    conda-forge
[conda] torchsummary              1.5.1                    pypi_0    pypi
[conda] torchtext                 0.11.0                     py38    pytorch
[conda] torchvision               0.11.0               py38_cu113    pytorch

Linux Environment:

PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux Server release 7.9 (Maipo) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB-LS
GPU 1: Tesla V100-SXM2-32GB-LS
GPU 2: Tesla V100-SXM2-32GB-LS
GPU 3: Tesla V100-SXM2-32GB-LS
GPU 4: Tesla V100-SXM2-32GB-LS
GPU 5: Tesla V100-SXM2-32GB-LS
GPU 6: Tesla V100-SXM2-32GB-LS
GPU 7: Tesla V100-SXM2-32GB-LS

Nvidia driver version: 450.156.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] pytorch-lightning==1.5.3
[pip3] torch==1.10.0
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.11.1
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libblas                   3.9.0            12_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            12_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            12_linux64_mkl    conda-forge
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] numpy                     1.20.3           py38h9894fe3_1    conda-forge
[conda] pytorch                   1.10.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-lightning         1.5.3              pyhd8ed1ab_0    conda-forge
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchmetrics              0.6.0              pyhd8ed1ab_0    conda-forge
[conda] torchvision               0.11.1               py38_cu113    pytorch

tensor.zip

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano

denix56 · 2021-12-01T09:23:59Z

I just checked a torch build on Linux with CUDA 10.2, and it works fine.

xwang233 · 2021-12-01T09:30:05Z

I tried this with 1.10+cuda11.3 pip wheel, and it worked fine on a Linux V100 machine.

In case anyone is interested, the tensor has size (1152, 384) and dtype float, and it does not contain any NaN or inf elements.

You may see a CUSOLVER_STATUS_EXECUTION_FAILED error in GPU SVD calculations if your input matrix contains NaN or inf values. This is a known issue in PyTorch 1.10 and is already fixed on the latest master branch.
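To rule out the NaN/inf case described above, the input can be inspected before calling SVD. The following is a minimal sketch, not part of PyTorch: the helper name `describe_tensor` is hypothetical, and it relies only on `torch.isnan`, `torch.isinf` and `torch.isfinite`.

```python
import torch

def describe_tensor(t: torch.Tensor) -> dict:
    """Summarize shape, dtype and non-finite entries (hypothetical helper)."""
    return {
        "shape": tuple(t.shape),
        "dtype": t.dtype,
        "num_nan": int(torch.isnan(t).sum()),
        "num_inf": int(torch.isinf(t).sum()),
        "all_finite": bool(torch.isfinite(t).all()),
    }

# Small CPU examples; the same calls also accept CUDA tensors.
clean = torch.ones(3, 2)
dirty = torch.tensor([[1.0, float("nan")], [float("inf"), 2.0]])

print(describe_tensor(clean))
print(describe_tensor(dirty))
```

If `all_finite` is true for the failing input, as reported for the attached tensor, the NaN/inf explanation does not apply and the cause lies elsewhere.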

denix56 · 2021-12-01T10:51:09Z

I used the PyTorch build from the pytorch conda channel.

lezcano · 2021-12-01T11:02:05Z

Could you try the nightly version to see if this is fixed in master?

denix56 · 2021-12-03T11:46:14Z

> Could you try using the nightly version see if this is fixed in master?

Same error with the nightly version.
Maybe the problem is in CUDA itself?

denix56 · 2021-12-06T14:56:24Z

I will try again when a torch build with the latest CUDA 11.5 is available.

gleize · 2021-12-17T18:36:34Z

Could this be related to #70122?

lezcano · 2021-12-18T16:08:07Z

I think they are not related, as solve does not use an SVD internally.

lezcano · 2022-02-08T18:25:46Z

What's the state of this, @denix56? Did you try with a different CUDA version?

DinoMan · 2022-02-14T16:16:54Z

I also get a similar error when using torch.svd() or torch.linalg.svd(). It started when I upgraded from PyTorch 1.9 with CUDA 11.1 to PyTorch 1.10.2 with CUDA 11.3.1. My case is even weirder, since the error only comes up the first time I run the SVD. If I run it again it works, and when I checked with some simple matrices it seemed to give correct results. In theory, I can catch the exception and retry, which seems to fix it because it only happens at the start of the program, but this is not a real solution.
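The catch-and-retry workaround described above could look roughly like the sketch below. It is a stopgap rather than a fix. The names `call_with_retry` and `flaky` are hypothetical and introduced only for illustration; the stand-in function simulates a failure on the first call so the example runs without a GPU.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retry(fn: Callable[[], T], retries: int = 1) -> T:
    """Call fn, retrying up to `retries` times if it raises RuntimeError."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries:
                raise  # out of retries: surface the original error
    raise AssertionError("unreachable for retries >= 0")

# Stand-in that fails only on its first call, mimicking the reported behavior.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED")
    return "ok"

result = call_with_retry(flaky)
print(result, calls["n"])
```

With PyTorch, the call site would be something like `call_with_retry(lambda: torch.linalg.svd(b))`, since the cusolver failure in this thread surfaces as a `RuntimeError`.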

lezcano · 2022-02-14T16:33:22Z

Could you try the nightly version to see if this still happens to you, @DinoMan?

DinoMan · 2022-02-14T17:57:51Z

Yes just checked with the nightly version and it still happens.

lezcano · 2022-02-14T18:00:46Z

Could you provide further details on your set-up? GPU / versions of everything / a small repro (in case it is different to that in the OP)
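The environment blocks in the original report appear to be output of `python -m torch.utils.collect_env`, which is the most complete way to answer this. For a quick look at the GPU-related parts only, a sketch like the following may help; the helper name `gpu_setup_summary` is hypothetical, and it uses only `torch.__version__`, `torch.version.cuda` and the `torch.cuda` query functions.

```python
import torch

def gpu_setup_summary() -> dict:
    """Collect the PyTorch/CUDA details most relevant to a cusolver report (hypothetical helper)."""
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if info["cuda_available"]:
        # One entry per visible GPU, e.g. the V100 or A100 names seen in this thread.
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info

summary = gpu_setup_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
```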

DinoMan · 2022-02-14T18:31:36Z

The GPUs are A100s; the problem manifests whether I run on one GPU or several.
PyTorch is installed through conda. Here are the package versions:
pytorch: 1.12.0.dev20220214
torchvision: 0.13.0.dev20220214
cudatoolkit: 11.3.1

This small snippet was enough to get the error originally:

import torch

# 3x3 symmetric matrix, repeated into a batch of 58
a = torch.tensor([[4., 1., 1.], [1., 2., 0.], [1., 0., 1.]])
b = torch.cat(58 * [a.unsqueeze(0)])
torch.linalg.svd(b)

However, as I mentioned, the second time round it works without issues.

lezcano · 2022-02-14T18:32:47Z

I don't have access to an A100. Could you have a look at this one @xwang233?

xwang233 · 2022-02-14T20:02:01Z

The script in the comment above runs SVD on the CPU, and I didn't get any error with it. I also added b.cuda() and tested on an A100 with the 1.12.0.dev20220214+cu113 nightly build. I still couldn't get any error.
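For reference, a GPU version of the batched snippet could be written as in the sketch below. It is an assumption about what such a test might look like, not the exact script that was run. Note that `Tensor.cuda()` and `Tensor.to()` return a new tensor rather than moving the original in place, so the result has to be assigned. The sketch falls back to the CPU when no GPU is visible.

```python
import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([[4.0, 1.0, 1.0], [1.0, 2.0, 0.0], [1.0, 0.0, 1.0]])
# Batch of 58 copies of `a`, placed on the chosen device.
b = a.unsqueeze(0).repeat(58, 1, 1).to(device)

U, S, Vh = torch.linalg.svd(b)

# Sanity check: U @ diag(S) @ Vh should reconstruct b up to float32 rounding.
recon = U @ torch.diag_embed(S) @ Vh
max_err = (recon - b).abs().max().item()
print(S.device, max_err)
```

Printing `S.device` confirms whether the decomposition actually ran on the GPU.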

xwang233 · 2022-02-14T22:59:29Z

Hi @DinoMan, to help debug this issue on A100, could you try to isolate a code snippet that reproduces it in a Docker container? For example, you could try the latest NGC PyTorch container from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch.

DinoMan · 2022-02-15T10:52:44Z

Thanks for checking this out, @xwang233. I have switched back to the previous setup (PyTorch 1.9 and CUDA 11.1), which does not have this issue, but I will try to reproduce it in the container at some point and report back here.

lezcano · 2022-03-24T23:32:33Z

@DinoMan any updates on this?

xwang233 added the module: linear algebra label on Dec 1, 2021

H-Huang added the triaged label on Dec 1, 2021

mruberry added the needs reproduction label on Dec 10, 2021