
GPU inference in Docker container fails due to missing libdevice directory #2201

Closed
MartijnVanbiervliet opened this issue Feb 1, 2024 · 4 comments
Labels: stale (to be closed automatically if no activity), stat:awaiting response, type:bug

Comments

@MartijnVanbiervliet

Bug Report

System information

  • OS Platform and Distribution: RockyLinux 9.2
  • TensorFlow Serving installed from (source or binary): Docker image
  • TensorFlow Serving version: tensorflow/serving:2.14.1-gpu
  • Docker 24.0.5
  • NVIDIA R535 drivers 535.86.10
  • NVIDIA Container Toolkit 1.13.5

Describe the problem

With the latest Docker image tensorflow/serving:2.14.1-gpu, I cannot run GPU inference for my model. The following error appears in the logs:

2024-01-30 10:15:19.247458: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:521] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  /usr/bin/../nvidia/cuda_nvcc
  /usr/bin/../../nvidia/cuda_nvcc
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2024-01-30 10:15:19.259311: I external/org_tensorflow/tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-01-30 10:15:19.260050: I external/org_tensorflow/tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-01-30 10:15:19.260095: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:109] Couldn't get ptxas version : FAILED_PRECONDITION: Couldn't get ptxas/nvlink version string: INTERNAL: Couldn't invoke ptxas --version
...
2024-01-30 10:15:19.770155: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:559] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
error: libdevice not found at ./libdevice.10.bc

It appears that the CUDA libraries are not installed completely: the libdevice directory doesn't exist in the Docker image. I expected CUDA to be fully installed so that models can be served on GPU.

I encounter no problems with tensorflow/serving:2.11.0-gpu.

I considered the following solutions before raising this issue:

Workaround

Install the cuda-toolkit package in the Docker image.

FROM tensorflow/serving:2.14.1-gpu
RUN apt-get update && apt-get install -y cuda-toolkit-11-8

This increases the size of the Docker image by ~4GB (uncompressed).
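A smaller variant of this workaround may be possible by installing only the compiler component instead of the full toolkit. This is a hypothetical sketch, assuming NVIDIA's CUDA 11.8 apt repository is already configured in the base image and that the cuda-nvcc-11-8 package ships both ptxas and the nvvm/libdevice directory (worth verifying in the resulting image):

```dockerfile
# Hypothetical slimmer workaround: install only the nvcc package, which is
# assumed to provide bin/ptxas and nvvm/libdevice, instead of cuda-toolkit-11-8.
FROM tensorflow/serving:2.14.1-gpu
RUN apt-get update && \
    apt-get install -y --no-install-recommends cuda-nvcc-11-8 && \
    rm -rf /var/lib/apt/lists/*
```

If the package layout matches this assumption, the image growth should be a few hundred MB rather than ~4 GB.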

Alternatively, it also works with the tensorflow/serving:2.14.1-devel-gpu Docker image, but this is even larger in size.
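Another option that avoids rebuilding the image entirely, based on the hint in the XLA warning above ("setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work"), is to mount a complete CUDA install from the host and point XLA at it. A sketch, assuming the host has a full CUDA 11.8 installation (including nvvm/libdevice and bin/ptxas) at /usr/local/cuda-11.8; the model path and name are placeholders:

```shell
# Hypothetical: reuse the host's CUDA toolkit inside the stock serving image.
docker run --gpus all -p 8501:8501 \
  -v /usr/local/cuda-11.8:/usr/local/cuda-11.8:ro \
  -e XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda-11.8 \
  -v "$(pwd)/my_model:/models/my_model" \
  -e MODEL_NAME=my_model \
  tensorflow/serving:2.14.1-gpu
```

Note this only helps if the host and container CUDA versions are compatible, and ptxas may additionally need to be discoverable (e.g. on PATH) depending on how XLA locates it.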

Exact Steps to Reproduce

docker run -u root:root -ti --entrypoint bash tensorflow/serving:2.14.1-gpu

ptxas is not available:

$ ptxas --version
bash: ptxas: command not found

Searching for a directory nvvm or libdevice returns nothing:

$ find / -type d -name nvvm 2>/dev/null

When using 2.11.0, it does work:

docker run -u root:root -ti --entrypoint bash tensorflow/serving:2.11.0-gpu

ptxas is available:

$ ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:21_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

Searching for a directory nvvm returns the directory in the cuda installation directory:

$ find / -type d -name nvvm 2>/dev/null
/usr/local/cuda-11.2/nvvm
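The presence check used above can be wrapped in a small script to test any image quickly. A minimal sketch (the function name and default path are my own; the libdevice.10.bc filename comes from the error log above):

```shell
# check_libdevice: report whether a CUDA install root contains libdevice.
# Defaults to /usr/local/cuda; pass a different root as the first argument.
check_libdevice() {
  root="${1:-/usr/local/cuda}"
  if [ -f "$root/nvvm/libdevice/libdevice.10.bc" ]; then
    echo "libdevice found under $root/nvvm/libdevice"
  else
    echo "libdevice missing under $root (XLA GPU compilation may fail)"
  fi
}
check_libdevice "$@"
```

Running this inside the 2.14.1-gpu container reports libdevice missing, while 2.11.0-gpu should report it found under /usr/local/cuda-11.2.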
@singhniraj08 commented Feb 6, 2024

@MartijnVanbiervliet,

I have created and attached a PR that should resolve this issue. Thank you for bringing this to our attention.


This issue has been marked stale because it has had no activity for the past 7 days. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Feb 14, 2024

This issue was closed due to lack of activity after being marked stale for the past 7 days.

