
Shared library loading logic breaks when CUDA packages are installed in a non-standard location #101314

Open
qxcv opened this issue May 12, 2023 · 4 comments
Labels
bug module: bazel topic: build triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

qxcv commented May 12, 2023

🐛 Describe the bug

tl;dr: Some CUDA libraries are distributed alongside Torch via PyPI packages such as nvidia-cudnn-cu11, nvidia-cusparse-cu11, and so on. Torch's __init__.py uses several tricks to find and load these libraries, but one of them breaks when Torch is installed in a different location from the nvidia-* packages. This could be fixed by linking all of Torch's CUDA dependencies into libtorch_global_deps.so.


Longer version:

I'm using the Torch wheel from PyPI with the Pants build system, which creates Python environments with a slightly unusual layout: each package ends up in its own directory rather than everything landing in site-packages as it would in a virtualenv. This causes problems when I attempt to import PyTorch 2.0.0:

ImportError                               Traceback (most recent call last)
<ipython-input-20-eb42ca6e4af3> in <cell line: 1>()
----> 1 import torch

~/.cache/pants/named_caches/pex_root/installed_wheels/6befaad784004b7af357e3d87fa0863c1f642866291f12a4c2af2de435e8ac5c/torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl/torch/__init__.py in <module>
--> 239     from torch._C import *  # noqa: F403
    240 
    241 # Appease the type checker; ordinarily this binding is inserted by the

ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory

I think this may point at an issue with the shared library loading logic in Torch. Specifically, _load_global_deps() in Torch's __init__.py first attempts to load the global deps from libtorch_global_deps.so, and only tries to load any missing CUDA libraries if that CDLL() call fails:

# See Note [Global dependencies]
def _load_global_deps():
    # ... snip ...

    lib_name = 'libtorch_global_deps' + ('.dylib' if platform.system() == 'Darwin' else '.so')
    here = os.path.abspath(__file__)
    lib_path = os.path.join(os.path.dirname(here), 'lib', lib_name)

    try:
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    except OSError as err:
        cuda_libs: Dict[str, str] = {
            'cublas': 'libcublas.so.*[0-9]',
            'cudnn': 'libcudnn.so.*[0-9]',
            'cuda_nvrtc': 'libnvrtc.so.*[0-9].*[0-9]',
            'cuda_runtime': 'libcudart.so.*[0-9].*[0-9]',
            'cuda_cupti': 'libcupti.so.*[0-9].*[0-9]',
            'cufft': 'libcufft.so.*[0-9]',
            'curand': 'libcurand.so.*[0-9]',
            'cusolver': 'libcusolver.so.*[0-9]',
            'cusparse': 'libcusparse.so.*[0-9]',
            'nccl': 'libnccl.so.*[0-9]',
            'nvtx': 'libnvToolsExt.so.*[0-9]',
        }
        is_cuda_lib_err = [lib for lib in cuda_libs.values() if(lib.split('.')[0] in err.args[0])]
        # ... some more logic to load libs by looking through `sys.path` ...

On my system, the CDLL() call succeeds at loading torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl/torch/lib/libtorch_global_deps.so, so it returns immediately without attempting to load the libraries in the cuda_libs dict. However, that .so file only links to a subset of the libraries listed above:

$ ldd /long/path/to/torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl/torch/lib/libtorch_global_deps.so
        linux-vdso.so.1 (0x00007ffe3b7d1000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6d85c92000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6d85b41000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6d85b3b000)
        libcurand.so.10 => /lib/x86_64-linux-gnu/libcurand.so.10 (0x00007f6d7ff4b000)
        libcufft.so.10 => /lib/x86_64-linux-gnu/libcufft.so.10 (0x00007f6d774be000)
        libcublas.so.11 => /lib/x86_64-linux-gnu/libcublas.so.11 (0x00007f6d6dd40000)
        libcublasLt.so.11 => /lib/x86_64-linux-gnu/libcublasLt.so.11 (0x00007f6d58cda000)
        libcudart.so.11.0 => /lib/x86_64-linux-gnu/libcudart.so.11.0 (0x00007f6d58a34000)
        libnvToolsExt.so.1 => /lib/x86_64-linux-gnu/libnvToolsExt.so.1 (0x00007f6d5882a000)
        libgomp-a34b3233.so.1 => /long/path/to/torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl/torch/lib/libgomp-a34b3233.so.1 (0x00007f6d58600000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6d5840e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6d85cd9000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6d58404000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6d583e7000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6d58205000)

Some libraries from cuda_libs are missing from the ldd output. This is fine when the nvidia-* Python packages are installed in the same directory as Torch, because the dynamic linker can use Torch's RPATH to find them. Specifically, the RPATH contains a list of relative paths to the nvidia libraries, which looks like this:

$ORIGIN/../../nvidia/cublas/lib:$ORIGIN/../../nvidia/cuda_cupti/lib:$ORIGIN/../../nvidia/cuda_nvrtc/lib:$ORIGIN/../../nvidia/cuda_runtime/lib:$ORIGIN/../../nvidia/cudnn/lib:$ORIGIN/../../nvidia/cufft/lib:$ORIGIN/../../nvidia/curand/lib:$ORIGIN/../../nvidia/cusolver/lib:$ORIGIN/../../nvidia/cusparse/lib:$ORIGIN/../../nvidia/nccl/lib:$ORIGIN/../../nvidia/nvtx/lib:$ORIGIN

Unfortunately these relative paths do not work when Torch is installed in a different directory from the nvidia-* packages, which is the case for me.
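
To illustrate, here is a small check of my own (not part of Torch) that locates the installed torch package without importing it and reports whether the $ORIGIN-relative RPATH entries would actually resolve:

import importlib.util
import os

# Locate the installed torch package without importing it (importing is what fails here).
spec = importlib.util.find_spec("torch")
torch_lib_dir = os.path.join(os.path.dirname(spec.origin), "lib")  # this is $ORIGIN for torch's libs

# An RPATH entry like "$ORIGIN/../../nvidia/cudnn/lib" only resolves if the nvidia-*
# packages sit two levels above torch/lib, i.e. in the same site-packages directory.
for pkg in ("cublas", "cuda_cupti", "cuda_nvrtc", "cuda_runtime", "cudnn",
            "cufft", "curand", "cusolver", "cusparse", "nccl", "nvtx"):
    candidate = os.path.normpath(os.path.join(torch_lib_dir, "..", "..", "nvidia", pkg, "lib"))
    status = "found" if os.path.isdir(candidate) else "missing"
    print(f"{pkg:12s} {status}: {candidate}")

In the Pants layout every entry comes out as "missing", which is exactly why the relative RPATH cannot help here.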

__init__.py already has the logic necessary to fix this problem by scanning sys.path for the missing libraries. However, that logic currently only gets triggered when loading libtorch_global_deps fails. When I modify the code to always look for these libraries, I can import PyTorch again:

    try:
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
        raise OSError("libcudnn libnvrtc libcupti libcusolver libcusparse libnccl")  # always look for these libraries
    except OSError as err:
        cuda_libs: Dict[str, str] = {
            # ... etc. ...

Ideally __init__.py should use a more robust test to determine whether libcudnn and friends can be loaded. Probably the easiest fix is to link all the libs from cuda_libs into libtorch_global_deps.
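
As a sketch of what a more robust test could look like (my own approximation, not the actual Torch code): on Linux, /proc/self/maps shows which CUDA libraries the libtorch_global_deps load actually pulled in, and anything missing could then be preloaded from sys.path:

import ctypes
import glob
import os
import sys

# Same patterns as the cuda_libs dict above (abbreviated here).
CUDA_LIBS = {
    'cublas': 'libcublas.so.*[0-9]',
    'cudnn': 'libcudnn.so.*[0-9]',
    'cusolver': 'libcusolver.so.*[0-9]',
    'cusparse': 'libcusparse.so.*[0-9]',
    'nccl': 'libnccl.so.*[0-9]',
    # ... remaining entries ...
}

def preload_missing_cuda_libs() -> None:
    # /proc/self/maps lists every shared object currently mapped into the process,
    # so it tells us which CUDA libraries libtorch_global_deps actually brought in.
    with open('/proc/self/maps') as f:
        mapped = f.read()
    for folder, pattern in CUDA_LIBS.items():
        stem = pattern.split('.')[0]  # e.g. 'libcudnn' (rough substring check)
        if stem in mapped:
            continue  # already loaded via libtorch_global_deps / RPATH
        # Fall back to the nvidia-* wheel layout anywhere on sys.path.
        for entry in sys.path:
            matches = glob.glob(os.path.join(entry, 'nvidia', folder, 'lib', pattern))
            if matches:
                ctypes.CDLL(matches[0], mode=ctypes.RTLD_GLOBAL)
                break

Even so, linking all of the cuda_libs entries into libtorch_global_deps still looks like the simpler fix, since it avoids second-guessing what the dynamic linker already loaded.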

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.5 (default, Nov 23 2021, 15:27:38) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-125-generic-x86_64-with-glibc2.31
Is CUDA available: N/A
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
GPU 5: NVIDIA RTX A6000
GPU 6: NVIDIA RTX A6000
GPU 7: NVIDIA RTX A6000

Nvidia driver version: 510.60.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 3249.791
CPU max MHz: 2450.0000
CPU min MHz: 1500.0000
BogoMIPS: 4900.34
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and _user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] flake8==3.7.9
[pip3] numpy==1.17.4
[conda] No relevant packages

qxcv commented May 12, 2023

One way to fix this problem would be to link libtorch_global_deps against ALL of the CUDA dependencies. This Caffe2 CMake file shows how the linking is done right now:

if(BUILD_SHARED_LIBS)
  add_library(torch_global_deps SHARED ${TORCH_SRC_DIR}/csrc/empty.c)

  # ...snip ...

  # The CUDA libraries are linked here for a different reason: in some
  # cases we load these libraries with ctypes, and if they weren't opened
  # with RTLD_GLOBAL, we'll do the "normal" search process again (and
  # not find them, because they're usually in non-standard locations)
  if(USE_CUDA)
    target_link_libraries(torch_global_deps ${Caffe2_PUBLIC_CUDA_DEPENDENCY_LIBS})
    target_link_libraries(torch_global_deps torch::cudart torch::nvtoolsext)
  endif()

  # ... snip ...

  install(TARGETS torch_global_deps DESTINATION "${TORCH_INSTALL_LIB_DIR}")
endif()

Caffe2_PUBLIC_CUDA_DEPENDENCY_LIBS seems to get populated in this other Caffe2 CMake file.

  if(CAFFE2_USE_CUDA)
    # A helper variable recording the list of Caffe2 dependent libraries
    # torch::cudart is dealt with separately, due to CUDA_ADD_LIBRARY
    # design reason (it adds CUDA_LIBRARIES itself).
    set(Caffe2_PUBLIC_CUDA_DEPENDENCY_LIBS
      caffe2::cufft caffe2::curand caffe2::cublas)
    if(CAFFE2_USE_NVRTC)
      list(APPEND Caffe2_PUBLIC_CUDA_DEPENDENCY_LIBS caffe2::cuda caffe2::nvrtc)
    else()
      caffe2_update_option(USE_NVRTC OFF)
    endif()
    if(CAFFE2_USE_CUDNN)
      list(APPEND Caffe2_CUDA_DEPENDENCY_LIBS torch::cudnn)
    else()
      caffe2_update_option(USE_CUDNN OFF)
    endif()
    if(CAFFE2_USE_TENSORRT)
      list(APPEND Caffe2_PUBLIC_CUDA_DEPENDENCY_LIBS caffe2::tensorrt)
    else()
      caffe2_update_option(USE_TENSORRT OFF)
    endif()
  else()
    # ... snip ...
  endif()

This list is missing a few libraries, which could probably be added directly to the first CMake file by extending the target_link_libraries(torch_global_deps torch::cudart torch::nvtoolsext) directive.

@AugustKarlstedt

+1, same issue in the Bazel build system

qxcv commented May 16, 2023

As a workaround, I'm running load_cuda_deps() from this file before importing Torch.
(Note that this only helps when the corresponding nvidia deps have been installed from PyPI.)
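
For anyone who cannot use that file directly, here is a rough sketch of what such a preload helper can look like (my own approximation; the real load_cuda_deps() linked above may differ):

import ctypes
import glob
import os
import sys

def load_cuda_deps():
    """Best-effort preload of the nvidia-* PyPI libraries with RTLD_GLOBAL.

    Run this *before* `import torch` so the libraries are already mapped into
    the process by the time torch/__init__.py executes.
    """
    for entry in sys.path:
        nvidia_root = os.path.join(entry, 'nvidia')
        if not os.path.isdir(nvidia_root):
            continue
        # nvidia-cudnn-cu11 and friends ship their libraries under nvidia/<pkg>/lib/.
        for lib in sorted(glob.glob(os.path.join(nvidia_root, '*', 'lib', 'lib*.so*'))):
            try:
                ctypes.CDLL(lib, mode=ctypes.RTLD_GLOBAL)
            except OSError:
                pass  # some libraries may fail until their own deps are loaded; best effort

load_cuda_deps()
import torch  # noqa: E402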

@AugustKarlstedt

Thanks @qxcv, we can make that work in our setup too. Would love to have these paths fixed as you described.

drisspg added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module), topic: build, bug, and module: bazel labels May 17, 2023
VivekPanyam added a commit to VivekPanyam/carton that referenced this issue May 24, 2023
This PR adds string tensor support to `carton-runner-py`. It also adds a
test that packs and runs `bert-base-uncased` (with string input and
output).

Some minor additional changes were included to make running and
debugging the model easier:
- Added python tracebacks to error messages
- Added tracing support
- Added `slowlog` for copying large files

Finally, there's a bug in Torch 2.x when python packages are laid out in
independent directories
(pytorch/pytorch#101314). This PR addresses
that issue by attempting to preload cuda if nvidia libs are on
`sys.path`.