[BUG] Error in the AzureML tests with CUBLAS #1883

Closed
miguelgfierro opened this issue Feb 20, 2023 · 3 comments

@miguelgfierro
Collaborator

Description

OSError: /azureml-envs/azureml_429f6cafd5683df5fc7618d73f22aa25/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11

See more details here: https://github.com/microsoft/recommenders/actions/runs/4201653721/jobs/7288875889
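The `OSError` above is raised by the dynamic loader, so one way to check an environment is to try loading the library directly. The following is a minimal sketch, not part of the project's test suite: `try_load` is a hypothetical helper, and the path is copied from the traceback and would need adjusting for another environment.

```python
import ctypes


def try_load(library):
    """Return None if the shared library loads, else the loader's error message."""
    try:
        ctypes.CDLL(library)
    except OSError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Path taken from the traceback in this issue; it only exists on that AzureML image.
    path = (
        "/azureml-envs/azureml_429f6cafd5683df5fc7618d73f22aa25/lib/python3.7/"
        "site-packages/nvidia/cublas/lib/libcublas.so.11"
    )
    error = try_load(path)
    print("loaded OK" if error is None else "failed: " + error)
```

A missing file and an undefined symbol both surface as `OSError`, so the returned message is what distinguishes the two cases.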

In which platform does it happen?

How do we replicate the issue?

Expected behavior (i.e. solution)

Other Comments

miguelgfierro added the bug (Something isn't working) label Feb 20, 2023
@miguelgfierro
Collaborator Author

@pradnyeshjoshi any pointers here? It seems to be a problem with some CUDA libraries. Could it be that we need to change the Docker image?

@pradnyeshjoshi
Collaborator

> @pradnyeshjoshi any pointer here? it seems it is a problem with some cuda libraries, could it be that we need to change the docker image?

According to https://stackoverflow.com/a/75095447, the latest PyTorch (version 1.13) installs CUBLAS, cudatoolkit, and cudnn by default. We also install cudnn and cudatoolkit explicitly in our Docker image, which seems to cause the CUBLAS error. Removing the explicit install should resolve the issue. PR for the fix: #1886
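To see whether an environment carries two copies of the same CUDA library, a scan along these lines might help. It is a rough sketch under stated assumptions: `find_duplicate_cuda_libs` and `library_stem` are hypothetical helpers, the pip-installed copies are assumed to live under `site-packages/nvidia/*/lib` (as in the traceback), and the explicitly installed copies are assumed to land in the environment's top-level `lib` directory.

```python
from pathlib import Path


def library_stem(filename):
    """Map 'libcublas.so.11' to 'libcublas' by cutting at the '.so' suffix."""
    return filename.split(".so")[0]


def find_duplicate_cuda_libs(env_prefix):
    """Return {stem: [paths]} for libraries found in both <env>/lib and the pip nvidia wheels."""
    prefix = Path(env_prefix)

    # Assumption: explicitly installed CUDA libraries sit directly in <env>/lib.
    env_libs = {}
    for path in (prefix / "lib").glob("lib*.so*"):
        env_libs.setdefault(library_stem(path.name), []).append(path)

    # Layout taken from the traceback: site-packages/nvidia/<component>/lib/.
    pip_libs = {}
    for path in prefix.glob("lib/python*/site-packages/nvidia/*/lib/lib*.so*"):
        pip_libs.setdefault(library_stem(path.name), []).append(path)

    shared = sorted(env_libs.keys() & pip_libs.keys())
    return {stem: sorted(env_libs[stem] + pip_libs[stem]) for stem in shared}
```

A non-empty result only flags candidates for a version clash; it does not prove that the two copies are incompatible.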

@miguelgfierro
Collaborator Author

It seems there are other CUDA libraries that are causing problems:

----------------------------- Captured stderr call -----------------------------
2023-02-21 09:09:42.748082: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-21 09:09:42.784647: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_c9661d3ceab68a8b8561495400ba5590/lib:
2023-02-21 09:09:42.784798: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_c9661d3ceab68a8b8561495400ba5590/lib:
2023-02-21 09:09:42.784909: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_c9661d3ceab68a8b8561495400ba5590/lib:
2023-02-21 09:09:42.785017: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_c9661d3ceab68a8b8561495400ba5590/lib:
2023-02-21 09:09:42.785087: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-02-21 09:09:43.669384: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

0it [00:00, ?it/s]
0it [00:01, ?it/s]

Source: https://github.com/microsoft/recommenders/actions/runs/4230807120/jobs/7348569112
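For longer logs, the missing libraries can be pulled out of the captured stderr with a small script. This is a minimal sketch that assumes the warning wording shown above; `missing_libraries` is a hypothetical helper, not a TensorFlow API.

```python
import re

# Matches the dso_loader warning format seen in the log above.
MISSING_LIBRARY = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names that failed to load, in order of first appearance."""
    seen = []
    for name in MISSING_LIBRARY.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen
```

Applied to the log above, this yields `libcufft.so.10`, `libcurand.so.10`, `libcusolver.so.11` and `libcusparse.so.11`.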
