Fix NVRTC versioning for CUDA 11.X (X>=3), CUDA 12 and later #57204

zasdfgbnm · 2021-04-28T22:25:46Z

NVRTC's library versioning scheme changed in CUDA 11.3 and will change again in CUDA 12.X; see the comment in the code for details. As a result, the JIT is broken on CUDA 11.3.
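
A minimal Python sketch of the Linux soname scheme, assuming NVIDIA's NVRTC documentation is read correctly: before CUDA 11.3 the suffix is `MAJOR.MINOR`, for 11.3 and later 11.x releases it stays at `11.2`, and from CUDA 12 on it is just `MAJOR`. The function names are hypothetical and this is not the C++ code from the PR; the input follows the `CUDA_VERSION` convention of `major * 1000 + minor * 10`.

```python
def nvrtc_version_suffix(cuda_version: int) -> str:
    """Return the assumed NVRTC soname suffix for a CUDA_VERSION-style integer."""
    major = cuda_version // 1000
    minor = (cuda_version // 10) % 100
    if major < 11 or (major == 11 and minor < 3):
        # Older toolkits: the soname tracks both major and minor version.
        return f"{major}.{minor}"
    if major == 11:
        # CUDA 11.3 and later 11.x: the soname is pinned to 11.2.
        return "11.2"
    # CUDA 12 and later: the soname carries only the major version.
    return str(major)


def nvrtc_soname(cuda_version: int) -> str:
    """Build the assumed Linux library name, e.g. libnvrtc.so.11.2."""
    return f"libnvrtc.so.{nvrtc_version_suffix(cuda_version)}"
```

Under these assumptions, a build against CUDA 11.3 should look for `libnvrtc.so.11.2` rather than a name ending in `11.3`, which is the mismatch visible in the traceback below.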

Also, the error message is misleading: when both libname and alt_libname are non-empty, it reports only alt_libname. It should report both.
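
The error-reporting point can be illustrated with a short Python sketch. It assumes a two-name fallback like the one described here; `load_with_fallback` is a hypothetical helper, and `ctypes.CDLL` stands in for the C++ `dlopen` call.

```python
import ctypes


def load_with_fallback(libname, alt_libname=None, loader=ctypes.CDLL):
    """Try libname, then alt_libname; on failure, report every name tried."""
    failures = []
    for name in (libname, alt_libname):
        if not name:
            continue  # skip empty names, mirroring the "non-empty" condition
        try:
            return loader(name)
        except OSError as exc:
            # Remember the name and the loader's reason instead of
            # overwriting the earlier failure.
            failures.append(f"{name} ({exc})")
    if not failures:
        raise ValueError("no library name given")
    raise RuntimeError("Error in dlopen for: " + "; ".join(failures))
```

With two names that cannot be loaded, the resulting message mentions both, so a missing primary library is not hidden behind the alternate name.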

To reproduce the error, you can use:

import torch

torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_override_can_fuse_on_gpu(True)

@torch.jit.script
def jit_relu_dropout(x, prob):
    # type: (Tensor, float) -> Tensor
    x = torch.nn.functional.relu(x)
    x = torch.nn.functional.dropout(x, p=prob, training=True)
    return x

x = torch.randn((64, 40, 12, 1024), device="cuda:0", dtype=torch.float16, requires_grad=True)
y = jit_relu_dropout(x, 0.5)

Running this with CUDA 11.3 produces:

Traceback (most recent call last):
  File "/home/gaoxiang/misc/nvrtc-failure.py", line 16, in <module>
    y = jit_relu_dropout(x, 0.5)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Error in dlopen or dlsym: libnvrtc-8aa72235.so.11.3: cannot open shared object file: No such file or directory

facebook-github-bot · 2021-04-28T22:25:52Z

💊 CI failures summary and remediations

As of commit 18b3b07 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

zasdfgbnm · 2021-04-29T00:12:39Z

cc: @ptrblck @mcarilli

malfet

Thank you for the fix, but can you please rework it (using the preprocessor or metaprogramming) to keep libname static for performance reasons? Otherwise libname would be recomputed every time this function is called.
Alternatively, initialize it explicitly using std::call_once.
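
The review request concerns C++ (a static or `std::call_once`); the same compute-once idea can be sketched in Python, the language of the repro script. This is an illustration only: `OnceValue` and `compute_libname` are hypothetical names, and the returned library name is a placeholder.

```python
import threading


class OnceValue:
    """Compute a value on first use and reuse it afterwards (thread-safe)."""

    def __init__(self, compute):
        self._compute = compute
        self._lock = threading.Lock()
        self._done = False
        self._value = None

    def get(self):
        # Fast path: skip the lock once the value exists.
        if not self._done:
            with self._lock:
                # Re-check under the lock so only one thread computes.
                if not self._done:
                    self._value = self._compute()
                    self._done = True
        return self._value


calls = []  # records how many times the name is actually computed


def compute_libname():
    calls.append(1)
    return "libnvrtc.so.11.2"  # placeholder value for illustration


LIBNAME = OnceValue(compute_libname)
```

Every caller goes through `LIBNAME.get()`, so the name is built on the first call and later calls only read the cached value.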

zasdfgbnm · 2021-04-29T02:45:09Z

@malfet Done

codecov · 2021-04-29T06:10:01Z

Codecov Report

Merging #57204 (18b3b07) into master (4cb534f) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #57204      +/-   ##
==========================================
- Coverage   77.58%   77.56%   -0.02%     
==========================================
  Files        1952     1952              
  Lines      194466   194468       +2     
==========================================
- Hits       150868   150831      -37     
- Misses      43598    43637      +39

facebook-github-bot · 2021-04-30T16:59:56Z

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-04-30T20:25:34Z

@malfet merged this pull request in 42b3fc2.

Pull Request resolved: pytorch#57204
Reviewed By: ngimel
Differential Revision: D28122083
Pulled By: malfet
fbshipit-source-id: fd387cf79f33a6d5a5b93d54c9f21e9c23731045

Fix NVRTC versioning for CUDA 11.3, CUDA 12 and later

4f36bc2

facebook-github-bot added the cla signed label Apr 28, 2021

fix

305b795

pytorchbot added the open source label Apr 28, 2021

zasdfgbnm added 3 commits April 28, 2021 15:35

no static

a45b503

fix error message

535b9bd

save

823194e

zasdfgbnm changed the title ~~Fix NVRTC versioning for CUDA 11.3, CUDA 12 and later~~ Fix NVRTC versioning for CUDA 11.X (X>=3), CUDA 12 and later Apr 28, 2021

zasdfgbnm marked this pull request as ready for review April 29, 2021 00:12

zasdfgbnm requested review from malfet and ngimel April 29, 2021 00:12

malfet requested changes Apr 29, 2021

callonce

7cc0841

gchanan added the triaged label Apr 29, 2021

fix

18b3b07

malfet approved these changes Apr 30, 2021

facebook-github-bot closed this in 42b3fc2 Apr 30, 2021

facebook-github-bot added the Merged label Apr 30, 2021

zasdfgbnm deleted the fix-nvrtc-versioning branch April 30, 2021 20:30
