Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory #364

Closed
apoorvagnihotri opened this issue Apr 17, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@apoorvagnihotri

apoorvagnihotri commented Apr 17, 2023

Describe the bug
I am not able to train the model, even though the preprocessing commands below run successfully. The final command, svc train -t, fails.

svc pre-resample
svc pre-config
svc pre-hubert -fm crepe
svc train -t # <- this gives me the following error.
❯ svc train -t
[07:44:07] INFO     [07:44:07] Version: 3.9.3                                                                                                                                                   __main__.py:22
[07:44:12] INFO     [07:44:12] Created a temporary directory at /tmp/tmpmhzyrk2f                                                                                                            instantiator.py:21
           INFO     [07:44:12] Writing /tmp/tmpmhzyrk2f/_remote_module_non_scriptable.py                                                                                                    instantiator.py:76
[07:44:13] INFO     [07:44:13] Server binary (from Python package v0.7.0):                                                                                                              server_ingester.py:290
                    /home/apoorvagnihotri/miniconda3/envs/so-vits/lib/python3.10/site-packages/tensorboard_data_server/bin/server
[07:44:16] WARNING  [07:44:16] Failed to communicate with data server at localhost:36617: <_InactiveRpcError of RPC that terminated with:                                               server_ingester.py:187
                            status = StatusCode.UNAVAILABLE
                            details = "DNS resolution failed for localhost:36617: C-ares status is not ARES_SUCCESS qtype=AAAA name=localhost is_balancer=0: Could not contact DNS
                    servers"
                            debug_error_string = "UNKNOWN:DNS resolution failed for localhost:36617: C-ares status is not ARES_SUCCESS qtype=AAAA name=localhost is_balancer=0: Could
                    not contact DNS servers {created_time:"2023-04-17T07:44:16.0903992+00:00", grpc_status:14}"
                    >
[07:44:19] INFO     [07:44:19] Using strategy: auto                                                                                                                                                train.py:82
INFO: GPU available: True (cuda), used: True
           INFO     [07:44:19] GPU available: True (cuda), used: True                                                                                                                          rank_zero.py:48
INFO: TPU available: False, using: 0 TPU cores
           INFO     [07:44:19] TPU available: False, using: 0 TPU cores                                                                                                                        rank_zero.py:48
INFO: IPU available: False, using: 0 IPUs
           INFO     [07:44:19] IPU available: False, using: 0 IPUs                                                                                                                             rank_zero.py:48
INFO: HPU available: False, using: 0 HPUs
           INFO     [07:44:19] HPU available: False, using: 0 HPUs                                                                                                                             rank_zero.py:48
[07:44:20] WARNING  [07:44:20] /home/apoorvagnihotri/miniconda3/envs/so-vits/lib/python3.10/site-packages/so_vits_svc_fork/modules/synthesizers.py:81: UserWarning: Unused arguments:          warnings.py:109
                    {'n_layers_q': 3, 'use_spectral_norm': False}
                      warnings.warn(f"Unused arguments: {kwargs}")

           INFO     [07:44:20] Decoder type: hifi-gan                                                                                                                                      synthesizers.py:100
[07:44:21] WARNING  [07:44:21] /home/apoorvagnihotri/miniconda3/envs/so-vits/lib/python3.10/site-packages/so_vits_svc_fork/utils.py:190: UserWarning: Keys not found in checkpoint state       warnings.py:109
                    dict:['emb_g.weight']
                      warnings.warn(f"Keys not found in checkpoint state dict:" f"{not_in_from}")

           INFO     [07:44:21] Loaded checkpoint 'logs/44k/G_0.pth' (iteration 0)                                                                                                                 utils.py:247
           INFO     [07:44:21] Loaded checkpoint 'logs/44k/D_0.pth' (iteration 0)                                                                                                                 utils.py:247
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[07:44:23] INFO     [07:44:23] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]                                                                                                                            cuda.py:57
┏━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name  ┃ Type                     ┃ Params ┃
┡━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ net_g │ SynthesizerTrn           │ 45.2 M │
│ 1 │ net_d │ MultiPeriodDiscriminator │ 46.7 M │
└───┴───────┴──────────────────────────┴────────┘
Trainable params: 91.9 M
Non-trainable params: 0
Total params: 91.9 M
Total estimated model params size (MB): 367
           WARNING  [07:44:23] /home/apoorvagnihotri/miniconda3/envs/so-vits/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:430: PossibleUserWarning: The warnings.py:109
                    dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number
                    of cpus on this machine) in the `DataLoader` init to improve performance.
                      rank_zero_warn(

Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory
[1]    18763 IOT instruction (core dumped)  svc train -t

To Reproduce
I am on the latest version of Arch Linux with the latest NVIDIA drivers installed. Running nvidia-smi gives the following output.

[image: nvidia-smi output]

Additional context
Images of the error I am getting.

[image: screenshot of the error]

[image: screenshot of the error]

I also get semaphore-related warnings after the training script fails. They are pasted below.

╰─ /home/apoorvagnihotri/miniconda3/envs/so-vits/lib/python3.10/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 8 leaked semlock objects to clean up at shutdown
  warnings.warn(
/home/apoorvagnihotri/miniconda3/envs/so-vits/lib/python3.10/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 1 leaked folder objects to clean up at shutdown
  warnings.warn(
@apoorvagnihotri apoorvagnihotri added the bug Something isn't working label Apr 17, 2023
@34j 34j changed the title from "CUDA Issues while training." to "Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory [1] 18763 IOT instruction (core dumped) svc train -t" Apr 18, 2023
@34j 34j changed the title from "Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory [1] 18763 IOT instruction (core dumped) svc train -t" to "Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory" Apr 18, 2023
@34j
Collaborator

34j commented Apr 18, 2023

Sorry, but it seems there is nothing we can do about it. If you can verify that other applications using cuDNN work, and you are confident that this is a problem with this repository, please reopen it.
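
One quick check that does not depend on this repository is whether the dynamic loader can open the two libraries named in the error message. The following is only a minimal sketch: the helper name can_load is made up for illustration, and it only tests the default loader search path (plus LD_LIBRARY_PATH), which is not necessarily where a pip or conda install of PyTorch keeps its bundled CUDA libraries. A "NOT found" result is therefore a hint, not proof of a broken install.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open `library_name`.

    Hypothetical helper for illustration; it is not part of so-vits-svc-fork
    or PyTorch.
    """
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the file cannot be found or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    # Library names copied from the error message in this issue.
    for name in ("libnvrtc.so", "libcudnn_cnn_infer.so.8"):
        status = "found" if can_load(name) else "NOT found"
        print(f"{name}: {status}")
```

If libnvrtc.so is reported as not found here as well, the problem probably lies in the environment rather than in this project.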

@34j 34j closed this as not planned Apr 18, 2023
@apoorvagnihotri
Author

apoorvagnihotri commented Apr 18, 2023

People might find this useful: I followed the instructions in the PyTorch issue comment linked below, and I am now able to train the model.

pytorch/pytorch#97041 (comment)

I think this is an issue with PyTorch 2.0.
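
For anyone hitting the same error, it may help to see which libnvrtc files the environment actually contains before applying any workaround. The linked comment's exact steps are not reproduced here. The sketch below simply lists candidate files under the active Python environment. The function name find_libraries and the glob pattern are assumptions made for illustration. The error message asks for the unversioned name libnvrtc.so, so an environment that contains only versioned or suffixed variants would be consistent with the failure.

```python
import sys
from pathlib import Path


def find_libraries(root: Path, pattern: str = "libnvrtc*.so*") -> list[Path]:
    """Recursively list files (or symlinks) under `root` matching `pattern`.

    Hypothetical helper for illustration only.
    """
    return sorted(
        p for p in root.rglob(pattern) if p.is_file() or p.is_symlink()
    )


if __name__ == "__main__":
    # sys.prefix is the root of the active conda or virtualenv environment.
    matches = find_libraries(Path(sys.prefix))
    if not matches:
        print("No libnvrtc files found under", sys.prefix)
    for path in matches:
        print(path)
```

Scanning a large environment this way can take a few seconds, since it walks every directory under the prefix.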
