Stuck after printing 'Successfully opened dynamic library libcublas.so.10.0' #1643
Comments
Description
I am having the same issue for both tensorflow version 1.14.0 and tensorflow version 1.14.1, both built using CUDA 10.1.
Environment information
For bugs: reproduction and error logs
I got stuck at the same step when trying to run LibriSpeechCleanSmall+Transformer, and I never see a log line like 'Starting optimization of tunable parameters'. It just gets stuck, stops logging, and keeps occupying the GPU without terminating. Environment:
Did anyone find a fix for this? I'm getting the exact same problem.
Description
I also have the same issue. After "Successfully opened dynamic library libcublas.so.10.0", nothing happens, even after 3 days.
Environment
OS: Ubuntu 18.04.2 LTS
mesh-tensorflow==0.0.5
Python 3.6.8
CUDA Version 10.0.130
Same here.
I think the best idea is to report this on the TF and Google Colab lists, as this does not look like an error specific to T2T.
Continued here:
@rachellim at TF was able to reproduce and resolve the hang issue (parallel_interleave_dataset_op.cc doesn't handle iterator creation errors correctly in this case). With her fix in place, the hang goes away, but training then fails with Conv2D issues. Other interesting clues (reported by @huang-haijie) involve environment settings that change the behavior. It still seems like it may be a TF issue, as the same T2T code works fine with TF 1.13.2 but fails with the Conv2D issues on TF 1.14.0. Any suggestions for next steps would be appreciated...
Continued here:
Did the issue get resolved?
The same happens for me.
@ramonemiliani93, what version of tensorflow are you running? I was not able to reproduce this issue with the following dataset:
So it doesn't seem to be an issue specific to the dataset.
Based on tensorflow/tensorflow#38100 and f90/FactorGAN#1, I suspect this may be a problem with your CUDA installation.
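One quick, framework-independent way to test the broken-CUDA-installation theory is to check whether the dynamic loader can resolve the CUDA libraries at all. This is a minimal sketch: the library names mirror the ones in the logs above, and `find_library` only reports what the loader can see, not whether the versions match what TF expects.

```python
from ctypes.util import find_library

def check_cuda_libs(names=("cudart", "cublas", "cudnn")):
    """Map each CUDA library name to whatever the loader resolves, or None."""
    return {name: find_library(name) for name in names}

if __name__ == "__main__":
    for name, resolved in check_cuda_libs().items():
        print(f"lib{name}: {resolved or 'NOT FOUND'}")
```

If any of these come back `NOT FOUND` on a machine where TF claims to open them, the environment TF runs in (e.g. inside conda or docker) is resolving different libraries than your shell does.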
+1. Same issue. It used to work, then died without warning on EC2. Later, when I reload the pretrained model, it hangs with no output.
@harishkashyap - what version of tensorflow are you using? If you use an older version, does it still work? (Trying to diagnose whether it's an issue with your CUDA installation or a regression in TF)
EC2 PyTorch AMI, tensorflow 2.3
No idea. I just used an AMI instance with Amazon Linux 2, preinstalled with PyTorch. It was working fine and now fails to load the pre-trained model.
@sanjoy, can you reassign this to someone on the GPU team to investigate?
Same issue.
I'm trying to run a TF object detection model and am getting the same issue: stuck after 'Successfully opened dynamic library libcuda.so.1'. Can someone please help?
I had this problem using the Anaconda cudatoolkit. I ended up using nvidia-docker instead for my CUDA/cuDNN installation, and now it works.
Same issue on Ubuntu 20.04 with an RTX 3090. TF was installed using Anaconda.
Same issue with Ubuntu 20.04. Python and TensorFlow versions are visible in the session below:
❯ python
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-04-30 14:32:10.400682: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
>>> tf.add(1, 2)
2021-04-30 14:32:26.991352: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-30 14:32:26.993581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-30 14:32:27.026516: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.027085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 74.65GiB/s
2021-04-30 14:32:27.027104: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-04-30 14:32:27.028731: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-04-30 14:32:27.028771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-04-30 14:32:27.030183: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-30 14:32:27.030438: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-30 14:32:27.032093: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-30 14:32:27.033044: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-04-30 14:32:27.036642: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-04-30 14:32:27.036780: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.037699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.038214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-30 14:32:27.038509: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-30 14:32:27.038916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.039462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 74.65GiB/s
2021-04-30 14:32:27.039482: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-04-30 14:32:27.039507: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-04-30 14:32:27.039522: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-04-30 14:32:27.039536: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-30 14:32:27.039550: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-30 14:32:27.039563: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-30 14:32:27.039577: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-04-30 14:32:27.039590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-04-30 14:32:27.039645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.040215: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.040651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-30 14:32:27.040677: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Same for me.
Same issue with CUDA 10.0, cuDNN 7.4, TensorFlow 1.14.0. The object detection training gets stuck (the last log line varies between runs). Anyone figure it out? REALLY NEED HELP!
Hi @sanjoy, thanks for your advice. I found the process id of the running python program and inspected what it was doing. I've also tried PyTorch on the same machine, and it shows a similar situation: it also takes a long time to load some libraries from CUDA.

Details of my situation:
GPU: GeForce RTX 3060
Tensorflow: tensorflow-gpu 1.14.0

I've run some basic TensorFlow tests and they all work. The program gets stuck at the same point as above, but after waiting for around 30 mins, the program continues running and WORKS WELL.
@alanzyt311 I missed that you're running TF 1.14. 1.14 is very old and does not have native support for your GPU (which I believe is Ampere-based). So TensorFlow blocks at startup while it JIT-compiles PTX to SASS for your GPU, which can take 30+ minutes. Can you please try running with TF 2.5?
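If upgrading isn't immediately possible, a workaround worth trying is enlarging and persisting the driver's JIT compute cache so the PTX-to-SASS compilation cost is paid only once. These are standard NVIDIA driver environment variables; the 4 GiB size and cache path here are illustrative choices, and they must be set before CUDA is initialized in the process:

```python
import os

# Enlarge the CUDA JIT compute cache; the default can be too small to hold
# the compiled kernels of a large framework, forcing recompilation each run.
os.environ["CUDA_CACHE_MAXSIZE"] = str(4 * 1024**3)  # 4 GiB
# Pin the cache to a persistent location so it survives across sessions.
os.environ["CUDA_CACHE_PATH"] = os.path.expanduser("~/.nv/ComputeCache")

# Only import TensorFlow after the environment is configured:
# import tensorflow as tf
```

The first run still pays the full JIT cost; subsequent runs should start quickly as long as the cache is large enough and not evicted.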
Thanks. I've tried TF-gpu 2.0 just now; still not working (same problem as above). Now going to try TF 2.5.
Same problem. I have to wait more than 15 minutes to see my training start, and I have to quit training to check files for another training run. Any help? I also decreased the learning rate and batch size; nothing works. I am using docker with the following details:
This is my output while waiting for training to run:
@Laudarisd Are you using TF 2.5? If not, can you please try with TF 2.5? |
Dealing with this problem here as well; tried with both TF 2.4 and 2.5. Execution continues after ~30 minutes or so. After it continues, I also run into a different error, but that seems like a version error (tf.compat.v1 -> v2) that is separate from this issue. Would appreciate any suggestions on how to deal with the hanging.
@sanjoy Right now I am using TF 2.2; I will try 2.5. But I am not sure whether TF 2.5 will use my GPU or not.
@sanjoy @michyzhu I think this is a problem caused by mismatched library versions.
Right now I am running training in Colab.
Sorry for the late update. I reset my environment (including correctly matching the GPU driver, CUDA, cuDNN, and TensorFlow versions), and the problem is solved. By the way, the TF version is 2.4.
@alanzyt311 You are right; all the settings and libraries need to match to run training quickly and smoothly. Thanks.
@sanjoy Thanks for your advice; I followed your tip and solved this problem successfully. Same issue for me, which was solved after updating tf==2.4 to tf==2.5. Environment note: only tensorflow>=2.5.0 supports CUDA 11.2, per https://tensorflow.google.cn/install/gpu
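The common thread in the reports above is a TF/CUDA/cuDNN version mismatch. A pre-flight check against the tested-configuration table from the TF install docs can catch this before launching a job. A minimal sketch: the table below is a partial, hand-copied excerpt and should be verified against the current docs for your version.

```python
# Partial copy of TensorFlow's tested GPU build configurations
# (see tensorflow.org/install/source); extend as needed.
TESTED_CONFIGS = {
    "1.14": ("10.0", "7.4"),
    "2.4":  ("11.0", "8.0"),
    "2.5":  ("11.2", "8.1"),
}

def check_compat(tf_version, cuda_version, cudnn_version):
    """Return True if (cuda, cudnn) match the tested config for tf_version."""
    key = ".".join(tf_version.split(".")[:2])  # e.g. "2.5.0" -> "2.5"
    expected = TESTED_CONFIGS.get(key)
    if expected is None:
        raise ValueError(f"no tested config recorded for TF {tf_version}")
    exp_cuda, exp_cudnn = expected
    return (cuda_version.startswith(exp_cuda)
            and cudnn_version.startswith(exp_cudnn))

print(check_compat("2.5.0", "11.2", "8.1"))  # True
print(check_compat("2.4.1", "11.2", "8.1"))  # False: TF 2.4 was tested with CUDA 11.0
```

A mismatch doesn't always fail loudly; as this thread shows, it can surface as a silent hang or a very long JIT-compilation stall instead.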
Same problem for me; the log stops at: 2022-02-28 03:38:14.118011: I tensorflow/stream_executor/plugin_registry.cc:247] Selecting default DNN plugin, cuDNN
I realized people are still getting this issue. To run training successfully, you need to match all library versions, including CUDA and cuDNN. To confirm training and versions, I suggest using Google Colab with a small dataset. I am commenting here because I get emails whenever a comment is posted here (with cc).
This error happened because of a problem in my config.
Solution: after editing the .config file (such as faster_rcnn_inception_v2_pets.config) and running the training again, it ran successfully.
I got the same problem. Did anyone solve it?
Same issue here for TF-gpu 1.14. Any fix? I would really like to use this TF version, as I want to run old code and I don't want to fiddle around with tf.compat.v1.
I had this issue, changed datasets, and the issue disappeared. So at least in my case, something was wonky with the dataset. Turns out my .tfrecord files were completely empty. Lol.
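An empty input pipeline hanging silently like this is cheap to rule out up front: the TFRecord on-disk format frames each record as an 8-byte little-endian length, a 4-byte length CRC, the payload, and a 4-byte payload CRC, so records can be counted without importing TensorFlow at all. A sketch that skips CRC verification:

```python
import struct

def count_tfrecords(path):
    """Count records in a TFRecord file by walking the length-prefixed frames."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)         # uint64 payload length
            if len(header) < 8:
                break                  # EOF (or truncated file)
            (length,) = struct.unpack("<Q", header)
            f.seek(4 + length + 4, 1)  # skip length CRC, payload, payload CRC
            count += 1
    return count
```

Running this over every shard and failing fast on a zero count would have flagged the empty .tfrecord files immediately.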
Have you solved this problem?
I am facing the same issue. Have you solved it? |
Description
I ran this command 't2t-trainer --problem=librispeech --model=transformer --data_dir=~/dataset/t2t/librispeech/ --output_dir=. --hparams_set=transformer_librispeech --worker_gpu=1' and it got stuck after printing 'Successfully opened dynamic library libcublas.so.10.0'.
Then I set TF_CPP_MIN_VLOG_LEVEL=2, and it keeps printing:
'tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 20480 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 40960 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.'
...
Environment information
For bugs: reproduction and error logs