
Inconsistency in CUDA compatibility check #3382

Closed

kpedro88 opened this issue Sep 20, 2021 · 5 comments

@kpedro88
Contributor
Description
Datacenter GPUs should be able to run the Triton server using forward compatibility drivers. However, if extra libraries are specified in LD_PRELOAD (as suggested in https://github.com/triton-inference-server/server/blob/main/docs/custom_operations.md), the compatibility check can fail because of inconsistencies between LD_LIBRARY_PATH and LD_PRELOAD.
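For context, the linked custom_operations.md recommends preloading the extra libraries when launching the server, along these lines (the library names below are only illustrative; in this container the preloads are PyTorch Geometric extension libraries):

  # Illustrative LD_PRELOAD usage per custom_operations.md; the actual library
  # names depend on which custom-op libraries the models need.
  LD_PRELOAD="libtorchscatter.so:libtorchsparse.so" tritonserver --model-repository=/models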

Triton Information
What version of Triton are you using? 2.11.0

Are you using the Triton container or did you build it yourself? Started from Triton container and added PyTorch extension libraries on top: fastml/triton-torchgeo:21.06-py3-geometric

To Reproduce
Run the container interactively and enter the following commands:

(LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real; /usr/local/bin/cudaCheck)

The output is:

/usr/local/bin/cudaCheck: error while loading shared libraries: libc10.so: cannot open shared object file: No such file or directory

This command comes from /etc/shinit_v2 in the Triton container. It replaces LD_LIBRARY_PATH entirely for the check while leaving LD_PRELOAD untouched, so the preloaded PyTorch extension libraries (which depend on libc10.so, normally resolved through the container's LD_LIBRARY_PATH) can no longer be loaded, and cudaCheck fails before it ever queries the driver:

  # Run cudaCheck with the compat library on LD_LIBRARY_PATH to see if it will initialize
  export _CUDA_COMPAT_STATUS="$(LD_LIBRARY_PATH="${_CUDA_COMPAT_REALLIB}" \
                                timeout -s KILL ${TIMEOUT} /usr/local/bin/cudaCheck 2>/dev/null)"

Expected behavior
The correct result is obtained if LD_PRELOAD is also cleared for the check:

(LD_PRELOAD=""; LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real; /usr/local/bin/cudaCheck)

The output is:

CUDA Driver OK

If the compatibility check is going to clear everything out of LD_LIBRARY_PATH, it should also clear LD_PRELOAD for consistency, as indicated above. (Or, it should append/prepend to LD_LIBRARY_PATH for the check.)
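Either fix would be a small change to the /etc/shinit_v2 line quoted above; a minimal sketch of both options, reusing the existing _CUDA_COMPAT_REALLIB and TIMEOUT variables, might look like:

  # Option 1: clear LD_PRELOAD for the check so preloaded extension libraries
  # (and their dependencies, e.g. libc10.so) cannot interfere with cudaCheck.
  export _CUDA_COMPAT_STATUS="$(LD_PRELOAD="" LD_LIBRARY_PATH="${_CUDA_COMPAT_REALLIB}" \
                                timeout -s KILL ${TIMEOUT} /usr/local/bin/cudaCheck 2>/dev/null)"

  # Option 2: prepend the compat directory instead of replacing LD_LIBRARY_PATH,
  # so the preloaded libraries can still resolve their other dependencies.
  export _CUDA_COMPAT_STATUS="$(LD_LIBRARY_PATH="${_CUDA_COMPAT_REALLIB}:${LD_LIBRARY_PATH}" \
                                timeout -s KILL ${TIMEOUT} /usr/local/bin/cudaCheck 2>/dev/null)"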

@GuanLuo
Contributor

GuanLuo commented Sep 22, 2021

Is this directly related to Triton? Or is this a check that the container runs before starting Triton?

@kpedro88
Contributor Author

It may be in the container, but I don't know where to report problems in Nvidia's containers.

It's also worth keeping in mind that Triton explicitly recommends using LD_PRELOAD for certain features, whereas non-Triton users of Nvidia containers are probably much less likely to encounter this.

@GuanLuo
Contributor

GuanLuo commented Sep 22, 2021

What is the command you use to start the container?

@kpedro88
Contributor Author

docker run -it --gpus all fastml/triton-torchgeo:21.06-py3-geometric suffices to demonstrate this issue.

@cliffwoolley
Contributor

Thanks for the report! This is now fixed for our 22.01 containers.
