
Inconsistency in CUDA compatibility check #3382

Closed

kpedro88 opened this issue Sep 20, 2021 · 5 comments

@kpedro88
Contributor
Description
Datacenter GPUs should be able to run the Triton server using forward compatibility drivers. However, if extra libraries are specified in LD_PRELOAD (as suggested in https://github.com/triton-inference-server/server/blob/main/docs/custom_operations.md), the compatibility check can fail because of inconsistencies between LD_LIBRARY_PATH and LD_PRELOAD.
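For context, the linked custom_operations.md recommends preloading the extra libraries when launching the server, along these lines (the library names below are only illustrative; in this container the preloads are PyTorch Geometric extension libraries):

  # Illustrative LD_PRELOAD usage per custom_operations.md; the actual library
  # names depend on which custom-op libraries the models need.
  LD_PRELOAD="libtorchscatter.so:libtorchsparse.so" tritonserver --model-repository=/models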

Triton Information
What version of Triton are you using? 2.11.0

Are you using the Triton container or did you build it yourself? Started from Triton container and added PyTorch extension libraries on top: fastml/triton-torchgeo:21.06-py3-geometric

To Reproduce
Run the container interactively and enter the following commands:

(LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real; /usr/local/bin/cudaCheck)

The output is:

/usr/local/bin/cudaCheck: error while loading shared libraries: libc10.so: cannot open shared object file: No such file or directory

This command comes from /etc/shinit_v2 in the Triton container. It replaces LD_LIBRARY_PATH entirely for the check while leaving LD_PRELOAD untouched, so the preloaded PyTorch extension libraries (which depend on libc10.so, normally resolved through the container's LD_LIBRARY_PATH) can no longer be loaded, and cudaCheck fails before it ever queries the driver:

  # Run cudaCheck with the compat library on LD_LIBRARY_PATH to see if it will initialize
  export _CUDA_COMPAT_STATUS="$(LD_LIBRARY_PATH="${_CUDA_COMPAT_REALLIB}" \
                                timeout -s KILL ${TIMEOUT} /usr/local/bin/cudaCheck 2>/dev/null)"

Expected behavior
The correct result is obtained if LD_PRELOAD is also cleared for the check:

(LD_PRELOAD=""; LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real; /usr/local/bin/cudaCheck)

The output is:

CUDA Driver OK

If the compatibility check is going to clear everything out of LD_LIBRARY_PATH, it should also clear LD_PRELOAD for consistency, as indicated above. (Or, it should append/prepend to LD_LIBRARY_PATH for the check.)
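Either fix would be a small change to the /etc/shinit_v2 line quoted above; a minimal sketch of both options, reusing the existing _CUDA_COMPAT_REALLIB and TIMEOUT variables, might look like:

  # Option 1: clear LD_PRELOAD for the check so preloaded extension libraries
  # (and their dependencies, e.g. libc10.so) cannot interfere with cudaCheck.
  export _CUDA_COMPAT_STATUS="$(LD_PRELOAD="" LD_LIBRARY_PATH="${_CUDA_COMPAT_REALLIB}" \
                                timeout -s KILL ${TIMEOUT} /usr/local/bin/cudaCheck 2>/dev/null)"

  # Option 2: prepend the compat directory instead of replacing LD_LIBRARY_PATH,
  # so the preloaded libraries can still resolve their other dependencies.
  export _CUDA_COMPAT_STATUS="$(LD_LIBRARY_PATH="${_CUDA_COMPAT_REALLIB}:${LD_LIBRARY_PATH}" \
                                timeout -s KILL ${TIMEOUT} /usr/local/bin/cudaCheck 2>/dev/null)"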

@GuanLuo
Contributor

GuanLuo commented Sep 22, 2021

Is this directly related to Triton? Or is this a check that the container runs before starting Triton?

@kpedro88
Contributor Author

It may be in the container, but I don't know where to report problems in Nvidia's containers.

It's also worth keeping in mind that Triton explicitly recommends using LD_PRELOAD for certain features, whereas non-Triton users of Nvidia containers are probably much less likely to encounter this.

@GuanLuo
Contributor

GuanLuo commented Sep 22, 2021

What is the command you use to start the container?

@kpedro88
Contributor Author

docker run -it --gpus all fastml/triton-torchgeo:21.06-py3-geometric suffices to demonstrate this issue.

@cliffwoolley
Contributor

Thanks for the report! This is now fixed for our 22.01 containers.
