
_preload_cuda_deps does not work if cublas and cudnn are in different nvidia folders #92096

Closed
dannyjeck opened this issue Jan 12, 2023 · 8 comments
Labels: module: bazel, module: build, triaged

Comments

dannyjeck (Contributor) commented Jan 12, 2023

🐛 Describe the bug

I'm using Bazel to install PyTorch, and Bazel installs the cudnn and cublas dependencies in different folders.

The following raises an error

import torch

with the following exception

Traceback (most recent call last):
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_torch/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../main.py", line 2, in <module>
    import torch
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_torch/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_torch/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_torch/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/home/djeck/.cache/bazel/_bazel_djeck/.../external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_nvidia_cudnn_cu11/site-packages/nvidia/cublas/lib/libcublas.so.11: cannot open shared object file: No such file or directory

This is because cublas is installed in a separate folder: the library actually lives at /home/djeck/.cache/bazel/_bazel_djeck/.../deps_sensing_nvidia_cublas_cu11/site-packages/nvidia/cublas/lib/libcublas.so.11, and that site-packages directory is also on my sys.path.
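
For reference, the preload logic in torch/__init__.py appears to find a single nvidia folder on sys.path and then assume both libcublas and libcudnn live underneath it. A minimal sketch of that assumption (reconstructed from the traceback above, not the exact upstream code):

import ctypes, os, sys

def _preload_cuda_deps_sketch():
    """Sketch of the failing assumption: once one 'nvidia' folder is found,
    both libcublas and libcudnn are expected inside it."""
    nvidia_path = None
    for path in sys.path:
        candidate = os.path.join(path, 'nvidia')
        if os.path.exists(candidate):
            nvidia_path = candidate  # first match wins, e.g. the cudnn wheel's folder
            break
    if nvidia_path is None:
        raise ValueError("nvidia folder not found on sys.path")
    cublas_path = os.path.join(nvidia_path, 'cublas', 'lib', 'libcublas.so.11')
    cudnn_path = os.path.join(nvidia_path, 'cudnn', 'lib', 'libcudnn.so.8')
    # Fails here when cublas was installed into a different site-packages entry:
    ctypes.CDLL(cublas_path)
    ctypes.CDLL(cudnn_path)

In my layout, the first nvidia folder found is the one from the cudnn wheel, so the derived cublas path does not exist.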

The following works without raising an error.

import os, sys, ctypes
def _preload_cuda_deps():
    """Preloads cudnn/cublas deps if they could not be found otherwise."""
    cublas_path = None
    cudnn_path = None
    for path in sys.path:
        nvidia_path = os.path.join(path, 'nvidia')
        if not os.path.exists(nvidia_path):
            continue
        candidate_cublas_path = os.path.join(nvidia_path, 'cublas', 'lib', 'libcublas.so.11')
        if os.path.exists(candidate_cublas_path) and not cublas_path:
            cublas_path = candidate_cublas_path
        candidate_cudnn_path = os.path.join(nvidia_path, 'cudnn', 'lib', 'libcudnn.so.8')
        if os.path.exists(candidate_cudnn_path) and not cudnn_path:
            cudnn_path = candidate_cudnn_path
        if cublas_path and cudnn_path:
            break
    if not cublas_path or not cudnn_path:
        raise ValueError(f"cublas and cudnn not found in the system path {sys.path}")

    ctypes.CDLL(cublas_path)
    ctypes.CDLL(cudnn_path)


_preload_cuda_deps()
import torch
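
The key difference is that each library path is resolved independently against every entry in sys.path, so cublas and cudnn no longer need to live under the same parent nvidia folder, which they do not in the Bazel layout above.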

I tried to put together a PR to modify this code https://github.com/pytorch/pytorch/blob/master/torch/__init__.py#L147 but it looks like I'm not an approved user.

Versions

Some of these are probably not correct, as the collect_env.py script doesn't pick up the Bazel-installed packages. Bazel in my setup is using Python 3.10 with the nightly PyTorch version.

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun  4 2021, 15:09:15)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.14.0-1054-oem-x86_64-with-glibc2.17
Is CUDA available: N/A
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: 
GPU 0: NVIDIA TITAN RTX
GPU 1: NVIDIA TITAN RTX

Nvidia driver version: 470.161.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] numpy-quaternion==2020.11.2.17.0.49
[pip3] numpy-stl==2.11.2
[pip3] pytorch-lightning==1.2.7
[pip3] torch==1.13.1
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.14.1
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] numpy-quaternion          2020.11.2.17.0.49          pypi_0    pypi
[conda] numpy-stl                 2.11.2                   pypi_0    pypi
[conda] pytorch-lightning         1.2.7                    pypi_0    pypi
[conda] torch                     1.13.1                   pypi_0    pypi
[conda] torchmetrics              0.6.0                    pypi_0    pypi
[conda] torchvision               0.14.1                   pypi_0    pypi

cc @malfet @seemethere

@drisspg added the module: build and triaged labels on Jan 13, 2023
malfet (Contributor) commented Jan 13, 2023

@dannyjeck, thank you very much for reporting the problem.
Your change makes sense to me; please propose a PR and I'll accept it.
Can you please elaborate on what you mean by not being an approved user? Anyone can propose a PR by creating a fork of the repo under their own account or organization, authoring a change, and then posting a PR for it.

dannyjeck (Contributor, Author) commented

Oh, got it. I was trying to push a branch to this repo. Will do.

dannyjeck (Contributor, Author) commented

@malfet see #92122

shahsmit1 commented

Can we create a new patch for this fix on 1.13? 2.0.0 is a big version upgrade for us, and we found a critical CVE on 1.13: https://nvd.nist.gov/vuln/detail/CVE-2022-45907

chakpak commented Apr 1, 2023

@atalman @malfet ^^ about the CVE fix for 1.13.1. Jumping to v2.0 is a non-starter for a few enterprises right away, so a potential 1.14 with the fix would go a long way. Thank you so very much for considering this.

malfet (Contributor) commented Apr 1, 2023

@chakpak, can you please create a new issue describing exactly which change needs to be picked and why updating to 2.0 is not an option, so that others can participate in the discussion?
IMO it should be OK to cherry-pick a few changes into a dedicated branch, but it's probably too much of a hassle to go forward with a full dot release.

shahsmit1 commented

Let me create a new issue with the relevant details.

shahsmit1 commented

@malfet filed #98115 with details.
