
_preload_cuda_deps does not work if cublas and cudnn are in different nvidia folders #92096

Closed
dannyjeck opened this issue Jan 12, 2023 · 8 comments
Labels: module: bazel, module: build, triaged

Comments

dannyjeck (Contributor) commented Jan 12, 2023

🐛 Describe the bug

I'm using Bazel to install PyTorch, and Bazel installs the cudnn and cublas dependencies in different folders.

The following raises an error

import torch

with the following exception

Traceback (most recent call last):
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_torch/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../main.py", line 2, in <module>
    import torch
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_torch/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_torch/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_torch/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_nvidia_cudnn_cu11/site-packages/nvidia/cublas/lib/libcublas.so.11: cannot open shared object file: No such file or directory

This is because cublas is installed in a separate folder: the library actually lives at /home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_nvidia_cublas_cu11/site-packages/nvidia/cublas/lib/libcublas.so.11, and that site-packages directory is also on my sys.path.
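
For reference, the preload logic in torch/__init__.py appears to find a single nvidia folder on sys.path and then assume both libcublas and libcudnn live underneath it. A minimal sketch of that assumption (reconstructed from the traceback above, not the exact upstream code):

import ctypes, os, sys

def _preload_cuda_deps_sketch():
    """Sketch of the failing assumption: once one 'nvidia' folder is found,
    both libcublas and libcudnn are expected inside it."""
    nvidia_path = None
    for path in sys.path:
        candidate = os.path.join(path, 'nvidia')
        if os.path.exists(candidate):
            nvidia_path = candidate  # first match wins, e.g. the cudnn wheel's folder
            break
    if nvidia_path is None:
        raise ValueError("nvidia folder not found on sys.path")
    cublas_path = os.path.join(nvidia_path, 'cublas', 'lib', 'libcublas.so.11')
    cudnn_path = os.path.join(nvidia_path, 'cudnn', 'lib', 'libcudnn.so.8')
    # Fails here when cublas was installed into a different site-packages entry:
    ctypes.CDLL(cublas_path)
    ctypes.CDLL(cudnn_path)

In my layout, the first nvidia folder found is the one from the cudnn wheel, so the derived cublas path does not exist.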

The following works without raising an error.

import os, sys, ctypes
def _preload_cuda_deps():
    """Preloads cudnn/cublas deps if they could not be found otherwise."""
    cublas_path = None
    cudnn_path = None
    for path in sys.path:
        nvidia_path = os.path.join(path, 'nvidia')
        if not os.path.exists(nvidia_path):
            continue
        candidate_cublas_path = os.path.join(nvidia_path, 'cublas', 'lib', 'libcublas.so.11')
        if os.path.exists(candidate_cublas_path) and not cublas_path:
            cublas_path = candidate_cublas_path
        candidate_cudnn_path = os.path.join(nvidia_path, 'cudnn', 'lib', 'libcudnn.so.8')
        if os.path.exists(candidate_cudnn_path) and not cudnn_path:
            cudnn_path = candidate_cudnn_path
        if cublas_path and cudnn_path:
            break
    if not cublas_path or not cudnn_path:
        raise ValueError(f"cublas and cudnn not found in the system path {sys.path}")

    ctypes.CDLL(cublas_path)
    ctypes.CDLL(cudnn_path)


_preload_cuda_deps()
import torch
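
The key difference is that each library path is resolved independently against every entry in sys.path, so cublas and cudnn no longer need to live under the same parent nvidia folder, which they do not in the Bazel layout above.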

I tried to put together a PR to modify this code https://github.com/pytorch/pytorch/blob/master/torch/__init__.py#L147 but it looks like I'm not an approved user.

Versions

Some of these are probably not correct, as the collect_env.py script doesn't pick up the Bazel-installed packages. Bazel in my setup is using Python 3.10 with the nightly PyTorch version.

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun  4 2021, 15:09:15)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.14.0-1054-oem-x86_64-with-glibc2.17
Is CUDA available: N/A
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: 
GPU 0: NVIDIA TITAN RTX
GPU 1: NVIDIA TITAN RTX

Nvidia driver version: 470.161.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] numpy-quaternion==2020.11.2.17.0.49
[pip3] numpy-stl==2.11.2
[pip3] pytorch-lightning==1.2.7
[pip3] torch==1.13.1
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.14.1
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] numpy-quaternion          2020.11.2.17.0.49          pypi_0    pypi
[conda] numpy-stl                 2.11.2                   pypi_0    pypi
[conda] pytorch-lightning         1.2.7                    pypi_0    pypi
[conda] torch                     1.13.1                   pypi_0    pypi
[conda] torchmetrics              0.6.0                    pypi_0    pypi
[conda] torchvision               0.14.1                   pypi_0    pypi

cc @malfet @seemethere

@drisspg added the module: build and triaged labels on Jan 13, 2023
malfet (Contributor) commented Jan 13, 2023

@dannyjeck, thank you very much for reporting the problem.
Your change makes sense to me; please propose a PR and I'll accept it.
Can you please elaborate on what you mean by not being an approved user? Anyone can propose a PR by creating a fork of the repo under their own account or organization, authoring a change, and then posting a PR for it.

dannyjeck (Contributor, Author) commented

Oh, got it. I was trying to push a branch to this repo. Will do.

dannyjeck (Contributor, Author) commented

@malfet see #92122

shahsmit1 commented

Can we create a new patch for this fix on 1.13? 2.0.0 is a big version upgrade for us, and we found a critical CVE on 1.13: https://nvd.nist.gov/vuln/detail/CVE-2022-45907

chakpak commented Apr 1, 2023

@atalman @malfet ^^ about the CVE fix for 1.13.1. Jumping to v2.0 is a non-starter for a few enterprises right away, so a potential 1.14 with the fix would go a long way. Thank you so very much for considering this.

malfet (Contributor) commented Apr 1, 2023

@chakpak, can you please create a new issue describing exactly which change needs to be picked and why updating to 2.0 is not an option, so that others can participate in the discussion?
IMO it should be OK to cherry-pick a few changes into a dedicated branch, but it's probably too much of a hassle to go forward with a full dot release.

shahsmit1 commented

Let me create a new issue with the relevant details.

shahsmit1 commented

@malfet filed #98115 with details.
