
SDPA Tutorial - libcuda.so not found error for torch compile on Google Colab #113521

Closed
chauhang opened this issue Nov 11, 2023 · 5 comments
Labels
bug oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module upstream triton Upstream Triton Issue

Comments

@chauhang
Contributor

chauhang commented Nov 11, 2023

🐛 Describe the bug

The SDPA tutorial fails at the torch.compile step in Google Colab.

Code:

# model, embed_dimension, device, dtype, and
# benchmark_torch_function_in_microseconds are defined earlier in the tutorial
batch_size = 32
max_sequence_len = 256
x = torch.rand(batch_size, max_sequence_len,
               embed_dimension, device=device, dtype=dtype)
print(
    f"The non compiled module runs in {benchmark_torch_function_in_microseconds(model, x):.3f} microseconds")

# Compile the model; compilation is triggered on the first call
compiled_model = torch.compile(model)
compiled_model(x)
print(
    f"The compiled module runs in  {benchmark_torch_function_in_microseconds(compiled_model, x):.3f} microseconds")

Error:

BackendCompilerFailed Traceback (most recent call last)
in <cell line: 11>()
9 compiled_model = torch.compile(model)
10 # Let's compile it
---> 11 compiled_model(x)
12 print(
13 f"The compiled module runs in {benchmark_torch_function_in_microseconds(compiled_model, x):.3f} microseconds")

/usr/lib/python3.10/concurrent/futures/_base.py in __get_result(self)
401 if self._exception:
402 try:
--> 403 raise self._exception
404 finally:
405 # Break a reference cycle with the exception in self._exception

BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!

Full trace here
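One quick way to check driver visibility from the notebook (a diagnostic sketch, not part of the tutorial):

import ctypes

# Try to dlopen the CUDA driver library by its SONAME. An OSError here
# means the dynamic loader cannot find libcuda, which is consistent
# with the Inductor/Triton failure above.
try:
    ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 loaded OK")
except OSError as e:
    print(f"libcuda not loadable: {e}")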

Versions

Collecting environment information...
PyTorch version: 2.1.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.27.7
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.120+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
CPU family: 6
Model: 63
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 0
BogoMIPS: 4599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32 KiB (1 instance)
L1i cache: 32 KiB (1 instance)
L2 cache: 256 KiB (1 instance)
L3 cache: 45 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable; SMT Host state unknown
Vulnerability Meltdown: Vulnerable
Vulnerability Mmio stale data: Vulnerable
Vulnerability Retbleed: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==2.1.0+cu118
[pip3] torchaudio==2.1.0+cu118
[pip3] torchdata==0.7.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.16.0
[pip3] torchvision==0.16.0+cu118
[pip3] triton==2.1.0
[conda] Could not collect

cc: @malfet @driss

cc @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

@bhack
Contributor

bhack commented Nov 12, 2023

Is this a duplicate of #107960? We see the same issue on the official PyTorch CUDA 11.x Docker image.

@malfet
Contributor

malfet commented Nov 15, 2023

This is not a PyTorch problem but rather a Triton one; see triton-lang/triton#2507

@yf225 yf225 added upstream triton Upstream Triton Issue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed module: inductor labels Nov 17, 2023
@bhack
Contributor

bhack commented Dec 2, 2023

@malfet But in the pytorch devel container, libcuda.so is in /usr/local/cuda-11.8/compat/. Also, where is it in the pytorch runtime container?

@dbl001

dbl001 commented Dec 8, 2023

On Colab Pro, setting compile=True in train.py of the GitHub project llama2.c (https://github.com/karpathy/llama2.c.git) raises:

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!

Traceback (most recent call last):
  File "/content/llama2.c/train.py", line 263, in <module>
    losses = estimate_loss()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/llama2.c/train.py", line 222, in estimate_loss
    logits = model(X, Y)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
    super().run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2162, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py", line 857, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py", line 1009, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1568, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 1150, in compile_fx
    return aot_autograd(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py", line 3891, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py", line 3429, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py", line 2212, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py", line 2392, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py", line 1573, in aot_dispatch_base
    compiled_fw = compiler(fw_module, flat_args)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 1092, in fw_compiler_base
    return inner_compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/repro/after_aot.py", line 80, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/debug.py", line 228, in inner
    return fn(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 54, in newFunction
    return old_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 341, in compile_fx_inner
    compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 565, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py", line 970, in compile_to_fn
    return self.compile_to_module().call
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py", line 941, in compile_to_module
    mod = PyCodeCache.load_by_key_path(key, path, linemap=linemap)
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py", line 1139, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_root/kw/ckwyocnsb6bydlr7fpg6ffcotesguwnithfrui2nsma7k3hwmcew.py", line 1138, in <module>
    async_compile.wait(globals())
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py", line 1418, in wait
    scope[key] = result.result()
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py", line 1277, in result
    self.future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!


Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
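Those flags can be set from the notebook before re-running the compile (a sketch; to be safe, set them before importing torch, since the logging environment variables are read early):

import os

# Enable verbose TorchDynamo logging for the next compilation attempt.
os.environ["TORCH_LOGS"] = "+dynamo"
os.environ["TORCHDYNAMO_VERBOSE"] = "1"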

The driver library is present on the VM:

!find /usr -name libcuda.so -ls
4587540      0 lrwxrwxrwx   1 root     root           12 Sep 29  2022 /usr/local/cuda-11.8/compat/libcuda.so -> libcuda.so.1
  1186091     64 -rw-r--r--   1 root     root        62176 Sep 21  2022 /usr/local/cuda-11.8/targets/x86_64-linux/lib/stubs/libcuda.so
  3016993      0 lrwxrwxrwx   1 root     root           12 Dec  4 15:12 /usr/lib64-nvidia/libcuda.so -> libcuda.so.1

!find /usr -name libcuda.so.1 -ls
 4587540      0 lrwxrwxrwx   1 root     root           12 Sep 29  2022 /usr/local/cuda-11.8/compat/libcuda.so -> libcuda.so.1
  1186091     64 -rw-r--r--   1 root     root        62176 Sep 21  2022 /usr/local/cuda-11.8/targets/x86_64-linux/lib/stubs/libcuda.so
  3016993      0 lrwxrwxrwx   1 root     root           12 Dec  4 15:12 /usr/lib64-nvidia/libcuda.so -> libcuda.so.1

!export LD_LIBRARY_PATH=/usr/local/cuda-11.8/compat:$LD_LIBRARY_PATH
!export LD_LIBRARY_PATH=/usr/local/cuda-11.8/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
!export LD_LIBRARY_PATH=/usr/lib64-nvidia:$LD_LIBRARY_PATH
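Note that each !export line in Colab runs in its own subshell, so these assignments never reach the notebook's Python process. A workaround sketch, under the assumption that Triton discovers the driver through the ldconfig cache: refresh the cache with the directory that holds the real libcuda.so.1, per the find output above.

!ldconfig /usr/lib64-nvidia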

The installed PyTorch version:

import torch
print(torch.__version__)
2.1.0+cu118

@chauhang
Contributor Author

chauhang commented Feb 5, 2024

The Triton fix for Colab is part of the PyTorch 2.2 release. The example works after upgrading to torch>=2.2.0.
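For example, in a Colab cell (a sketch; the matching torchvision/torchaudio builds may need upgrading as well):

!pip install --upgrade "torch>=2.2.0"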

@chauhang chauhang closed this as completed Feb 5, 2024