The return of torch.inverse sometimes contains nan #47272
Comments
I could not reproduce this (although I used CUDA 9) -- how many trials did this take for you?
@gchanan I get the error only in PyTorch 1.7.0. The loop in my code ensures that it happens every time. Almost every time, anyway.
The cusolver path is enabled only for cuda >= 10.1.243. For cuda versions lower than that, the much slower MAGMA path is used.
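For reference, a trivial way to check which toolkit a given build was compiled against (and hence which path it will take), assuming a CUDA-enabled build:

```python
import torch

print(torch.version.cuda)  # toolkit version this binary was built against,
                           # e.g. '10.2'; >= 10.1.243 enables the cusolver path
```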
Is the input matrix really singular in this case? Its determinant is not zero.
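For context, a quick way to check that on the failing batch; this is a minimal sketch where `A` stands in for the actual input, which is not shown in this thread:

```python
import torch

A = torch.randn(2, 4, 4, device='cuda')       # stand-in for the failing batch
print(torch.det(A))                            # nonzero values => not singular
print(torch.isnan(torch.inverse(A)).any())     # the reported NaN symptom
```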
Oops, sorry, I was looking at a different matrix. However, I can't reproduce this on my machine with cuda 11.0 using either a 1070 or a 2070. Update: can't reproduce on 10.2 with a V100 either.
Maybe you should try cuda 10.1 or 10.2.
Thanks, I'll try that later on cuda 10.2. Can you also try if …
No errors were encountered after setting …
Strangely, it can't be reproduced on Ubuntu now. Randomness is a problem.
Yeah, it uses CUDA multi-stream parallel execution for optimization purposes. I'll check if there is anything I can do about it. Thanks for reporting the issue.
@Lmy0217 I'm able to reproduce that on my machine with cuda 11.0 as well. The nan usually appears after 40k ~ 90k loops. Sorry for the inconvenience. This problem only occurs when you have a batch size of 2, that is, when your tensor has a shape of (2, n, n). A temporary workaround is …
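The workaround text was truncated in this capture. One stopgap consistent with the diagnosis, sketched here as an assumption rather than the original suggestion, is to route the batch-of-2 case around the affected CUDA path:

```python
import torch

A = torch.randn(2, 4, 4, device='cuda')
# Hypothetical workaround (the one suggested above was cut off):
# compute the inverse on CPU, which is unaffected by the multi-stream
# bug, then move the result back to the original device.
A_inv = torch.inverse(A.cpu()).to(A.device)
```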
cc @ngimel for visibility
@xwang233 you allocate `self_working_copy` and `self_inv_working_copy` on the main stream (`pytorch/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu`, lines 103 to 104 at 9a9383e).
Can you please submit either that fix, or disable the multi-stream operation?
Thanks @ngimel, the original reason to add multi-stream execution was better performance, because the cublas batched inverse is too slow. Since it's causing silent errors, I would prefer to disable multi-stream execution. I will submit a fix later.
You can still use multi-stream execution if you properly register all the tensors to the correct streams.
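For reference, a minimal sketch of that pattern using PyTorch's public stream API (tensor names and shapes are illustrative, not taken from the actual kernel):

```python
import torch

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

x = torch.empty(2, 512, 512, device='cuda')  # allocated on the main stream

side.wait_stream(main)        # order side-stream work after the allocation
with torch.cuda.stream(side):
    y = x @ x                 # use x on the side stream
x.record_stream(side)         # register x with the side stream so the caching
                              # allocator won't recycle its memory prematurely
main.wait_stream(side)        # main stream waits before it consumes y
```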
Is this Windows-specific, by chance? (Maybe @peterjc123 can reproduce?)
I confirmed this error on pytorch 1.7.1 on Ubuntu 20.04. Updating to pytorch 1.10 solves the issue. It seems that the …
…-stream issue (pytorch#47026)

Summary:

### test_inverse_singular for cublas failure

Related pytorch#46616 (comment)
https://app.circleci.com/pipelines/github/pytorch/pytorch/232112/workflows/4131d4ca-cd51-44e3-8e6c-b1c3555c62fa/jobs/8523970/tests

The cuda 11.1 CI container doesn't have the MAGMA library, so the cublas matrix inverse path is enabled.

```
Oct 27 23:13:47 -- MAGMA not found. Compiling without MAGMA support
```

test_inverse_singular was introduced in pytorch#46625, but I forgot to fix that functionality for the cublas path as well.

### cusolver inverse multi-stream failure

Fixes pytorch#47272. The original cuda event record/block-stream logic was wrong, which could cause NaN in the output tensor. On my machine, the original code observes NaN in about 50k~500k loops. After this change, no NaN is observed in more than 2.5m loops. The performance for batch-2 matrix inverse is still the same as in pytorch#42403.

Pull Request resolved: pytorch#47026
Reviewed By: mruberry
Differential Revision: D24838546
Pulled By: ngimel
fbshipit-source-id: 3b83e4ab8e6b47a8273cba277251765bd6d97911
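The event record/block pattern the fix refers to looks roughly like this; a conceptual sketch with PyTorch's Python API, not the actual `BatchLinearAlgebraLib.cu` code:

```python
import torch

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

a = torch.randn(2, 512, 512, device='cuda')  # produced on the main stream

ev = torch.cuda.Event()
ev.record(main)        # record the event *after* the work that produces `a`
side.wait_event(ev)    # the side stream blocks until that point is reached
with torch.cuda.stream(side):
    a_inv = torch.inverse(a)  # safe: the event ordering guarantees `a` is ready
a.record_stream(side)  # keep `a`'s memory alive for the side-stream read
```

Recording the event too early (before the tensors are fully written) is exactly the kind of ordering mistake that produces intermittent NaN outputs.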
🐛 Bug
The return of torch.inverse sometimes contains nan.
To Reproduce
Steps to reproduce the behavior:
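The original repro script was not captured in this copy of the issue; a minimal sketch consistent with the discussion (a batch-of-2 CUDA inverse in a long loop, watching for NaN) could be:

```python
import torch

torch.manual_seed(0)
for i in range(100_000):
    a = torch.randn(2, 4, 4, device='cuda')  # batch size 2 triggers the bug
    a_inv = torch.inverse(a)
    if torch.isnan(a_inv).any():
        print(f'NaN encountered at iteration {i}')
        break
```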
Expected behavior
Return accurate results.
Environment
I got the error in two environments:
cc @vishwakftw @jianyuh @nikitaved @pearu @mruberry @heitorschueroff