Fix NVRTC versioning for CUDA 11.X (X>=3), CUDA 12 and later #57204
💊 CI failures summary and remediations
As of commit 18b3b07 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚
This comment was automatically generated by Dr. CI.
Thank you for the fix, but can you please rework it (using the preprocessor or metaprogramming) to keep `libname` static for performance reasons? Otherwise `libname` would be computed every time this function is called.
Alternatively, initialize it explicitly using the `std::call_once` construct.
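The two options in this review comment can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `compute_libname`, its placeholder return value, and the `compute_count` counter are hypothetical names introduced only to show that the name is computed once.

```cpp
#include <mutex>
#include <string>

// Hypothetical counter, used only to observe how often the name is computed.
inline int& compute_count() {
  static int count = 0;
  return count;
}

// Hypothetical stand-in for the version-dependent name computation.
inline std::string compute_libname() {
  ++compute_count();
  return "libnvrtc.so";  // placeholder value, not the name PyTorch builds
}

// Option 1: a function-local static initialized by an immediately invoked
// lambda. Since C++11 this initialization is thread-safe and runs once.
inline const std::string& libname_static() {
  static const std::string libname = []() { return compute_libname(); }();
  return libname;
}

// Option 2: explicit one-time initialization with std::call_once.
inline const std::string& libname_call_once() {
  static std::once_flag flag;
  static std::string libname;
  std::call_once(flag, []() { libname = compute_libname(); });
  return libname;
}
```

Either accessor returns a reference to the same string on every call, so the cost of building the name is paid only on first use.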
@malfet Done
Codecov Report
@@ Coverage Diff @@
## master #57204 +/- ##
==========================================
- Coverage 77.58% 77.56% -0.02%
==========================================
Files 1952 1952
Lines 194466 194468 +2
==========================================
- Hits 150868 150831 -37
- Misses 43598 43637 +39
@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…#57204)

Summary:
NVRTC versioning has changed starting with CUDA 11.3, and will change again for CUDA 12.X. See the comment in the code for details. As a result, JIT on CUDA 11.3 is broken.

Also, the error message is misleading: when both `libname` and `alt_libname` are non-empty, the error message reports only `alt_libname`; it should report both.

To reproduce the error, you can use:

```python
import torch

torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_override_can_fuse_on_gpu(True)

@torch.jit.script
def jit_relu_dropout(x, prob):
    # type: (Tensor, float) -> Tensor
    x = torch.nn.functional.relu(x)
    x = torch.nn.functional.dropout(x, p=prob, training=True)
    return x

x = torch.randn((64, 40, 12, 1024), device="cuda:0", dtype=torch.float16, requires_grad=True)
y = jit_relu_dropout(x, 0.5)
```

with CUDA 11.3, and you will see

```
Traceback (most recent call last):
  File "/home/gaoxiang/misc/nvrtc-failure.py", line 16, in <module>
    y = jit_relu_dropout(x, 0.5)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Error in dlopen or dlsym: libnvrtc-8aa72235.so.11.3: cannot open shared object file: No such file or directory
```

Pull Request resolved: pytorch#57204
Reviewed By: ngimel
Differential Revision: D28122083
Pulled By: malfet
fbshipit-source-id: fd387cf79f33a6d5a5b93d54c9f21e9c23731045
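As a rough illustration of the two points in the summary, the sketch below picks a Linux NVRTC library name from a CUDA version and builds an error message that names both candidate libraries. The versioning rule is an assumption recalled from NVIDIA's NVRTC documentation (soname `MAJOR.MINOR` before 11.3, `11.2` for 11.3 and later 11.x, `MAJOR` for CUDA 12 and later); it is not quoted in this thread and should be checked against the code comment the PR refers to. The function names and the message wording are hypothetical, not PyTorch's.

```cpp
#include <string>

// Assumed rule, recalled from the NVRTC documentation (not from this thread):
//   CUDA < 11.3        -> "MAJOR.MINOR"
//   CUDA 11.x, x >= 3  -> "11.2"
//   CUDA >= 12         -> "MAJOR"
inline std::string nvrtc_soname_version(int major, int minor) {
  if (major > 11) {
    return std::to_string(major);
  }
  if (major == 11 && minor >= 3) {
    return "11.2";
  }
  return std::to_string(major) + "." + std::to_string(minor);
}

// Hypothetical helper: full Linux library name for a given CUDA version.
inline std::string nvrtc_libname(int major, int minor) {
  return "libnvrtc.so." + nvrtc_soname_version(major, minor);
}

// Hypothetical helper: an error message that reports every library name
// that was tried, rather than only the alternative one.
inline std::string dlopen_error(const std::string& libname,
                                const std::string& alt_libname,
                                const std::string& reason) {
  std::string msg = "Error in dlopen for library " + libname;
  if (!alt_libname.empty()) {
    msg += " and " + alt_libname;
  }
  return msg + ": " + reason;
}
```

Under this assumed rule, a lookup for CUDA 11.3 would try `libnvrtc.so.11.2` instead of a name ending in `.11.3` like the one that fails to open in the traceback above.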