Error in SVD cusolver on Linux #69203
Comments
Just checked with a torch build on Linux with CUDA 10.2: it works fine.
I tried this with the 1.10+cuda11.3 pip wheel, and it worked fine on a Linux V100 machine. In case you're interested, the tensor has size (1152, 384), dtype float, and contains no nan or inf elements. You may see
I used the pytorch version from the pytorch conda channel.
I will try again when torch with the latest CUDA 11.5 is available.
Could this be related to #70122?
I think they are not related, as solve does not use an svd internally.
What's the state of this, @denix56? Did you try with a different CUDA version?
I also get a similar error when using torch.svd() or torch.linalg.svd(). This started when I upgraded from pytorch 1.9 and cuda 11.1 to 1.10.2 and cuda 11.3.1. In my case it is even weirder, since the error only comes up the first time I run the svd. If I run it again it works, and when I checked with some simple matrices it seemed to give correct results. In theory, I can catch the exception and retry, which seems to fix it since it only happens at the start of the program, but this is not a real solution.
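The catch-and-retry workaround mentioned above could look roughly like the sketch below. `call_with_retry` is a hypothetical helper, not part of PyTorch; it only masks a first-call failure and does nothing about the underlying cusolver error.

```python
def call_with_retry(fn, *args, retries=1, exc_type=RuntimeError, **kwargs):
    """Call fn(*args, **kwargs); if exc_type is raised, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except exc_type:
            # Re-raise the original error once the retry budget is exhausted.
            if attempt == retries:
                raise


# Hypothetical usage with the SVD call from this thread:
#   U, S, Vh = call_with_retry(torch.linalg.svd, b)
```

Retrying on every `RuntimeError` is deliberately broad here; a real workaround would probably want to inspect the message for the cusolver status before retrying.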
Could you try with the nightly version and see if this still happens to you, @DinoMan?
Yes, I just checked with the nightly version and it still happens.
Could you provide further details on your set-up? GPU / versions of everything / a small repro (in case it is different from that in the OP).
The GPUs are A100s; the problem manifests whether I run on one or multiple. This small snippet was enough to get the error originally:

import torch
a = torch.Tensor([[4,1,1],[1,2,0],[1,0,1]])
b = torch.cat(58*[a.unsqueeze(0)])
torch.linalg.svd(b)

However, as I mentioned, the second time round it works without issues.
I don't have access to an A100. Could you have a look at this one, @xwang233?
The script in #69203 (comment) runs SVD on CPU. I didn't get any error on that. Besides that, I added
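Since the snippet as posted builds its tensors on the CPU, a GPU variant might look like the sketch below. It assumes the intent was to run the batched SVD on a CUDA device; `run_gpu_repro` is a hypothetical name, and the function returns None when torch or a CUDA device is unavailable.

```python
import importlib.util


def run_gpu_repro():
    """Run the batched SVD from the thread on a CUDA device, if one is available."""
    if importlib.util.find_spec("torch") is None:
        return None  # torch is not installed
    import torch

    if not torch.cuda.is_available():
        return None  # no CUDA device to test against

    a = torch.tensor([[4.0, 1.0, 1.0], [1.0, 2.0, 0.0], [1.0, 0.0, 1.0]])
    # Stack 58 copies into a (58, 3, 3) batch and move it to the GPU,
    # so the SVD goes through the CUDA backend rather than the CPU path.
    b = torch.cat(58 * [a.unsqueeze(0)]).cuda()
    return torch.linalg.svd(b)


result = run_gpu_repro()
```

Running this in a fresh process matters for this report, because the failure was described as happening only on the first SVD call.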
Hi @DinoMan, to help you better debug this issue on A100, can you try to isolate a code snippet that reproduces the issue in a docker container? For example, you can try the latest version of NGC pytorch from here: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch.
Thanks for checking this out, @xwang233. I have switched back to the previous setup (pytorch 1.9 and cu11.1), which does not have this issue, but I will try to reproduce in the container at some point and report back here.
@DinoMan any updates on this?
🐛 Bug
The SVD functions (svd, svdvals) raise an error when called on certain tensors on Linux, while they work on Windows.
RuntimeError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling cusolverDnSgesvdj( handle, jobz, econ, m, n, A, lda, S, U, ldu, V, ldv, static_cast<float*>(dataPtr.get()), lwork, info, params)
To Reproduce
Expected behavior
No error should be raised; the function should execute normally.
Environment
Windows Environment:
Linux Environment:
tensor.zip
cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano