This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Stuck after printing 'Successfully opened dynamic library libcublas.so.10.0' #1643

Open
zhez6 opened this issue Jul 24, 2019 · 45 comments

@zhez6

zhez6 commented Jul 24, 2019

Description

I run this command: 't2t-trainer --problem=librispeech --model=transformer --data_dir=~/dataset/t2t/librispeech/ --output_dir=. --hparams_set=transformer_librispeech --worker_gpu=1' and it gets stuck after printing 'Successfully opened dynamic library libcublas.so.10.0'.
Then I set TF_CPP_MIN_VLOG_LEVEL=2, and it keeps printing:
'tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 20480 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 40960 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.'

...

Environment information

OS: <your answer here>

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow-datasets==1.0.2
tensorflow-estimator==1.14.0
tensorflow-gpu==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0
$ python -V
Python 3.6.5 :: Anaconda, Inc.

For bugs: reproduction and error logs

# Steps to reproduce:
# Error logs:
See descriptions
@cantwbr

cantwbr commented Aug 2, 2019

Description

I am having the same issue with both TensorFlow 1.14.0 and 1.14.1, which were built using CUDA 10.1.

Environment information

OS: Ubuntu 16.04.6 LTS

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0


$ python -V
Python 2.7.12

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer --problem=librispeech --model=transformer --data_dir=~/datasets/t2t/librispeech/ --output_dir=~/trainoutput/librispeech/ --hparams_set=transformer_librispeech --worker_gpu=1
# Error logs:
[...]
session_manager.py:500] Running local_init_op.
session_manager.py:502] Done running local_init_op.
basic_session_run_hooks.py:606] Saving checkpoints for 0 into ~/trainoutput/librispeech/model.ckpt.
tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 20480 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 40960 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.

@tenghaha

tenghaha commented Aug 6, 2019

I get stuck at the same step when trying to run LibriSpeechCleanSmall+Transformer, and I don't see any log lines like 'Starting optimization of tunable parameters'. It just hangs, stops logging, and keeps occupying the GPU without terminating.

Environment:

OS: Ubuntu 16.04.5 LTS

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow-datasets==1.1.0
tensorflow-estimator==1.14.0rc1
tensorflow-gpu==1.14.0rc1
tensorflow-hub==0.4.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0

$ python -V
Python 3.5.2

$ cat /usr/local/cuda/version.txt
CUDA Version 10.0.130

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer --problem=librispeech_clean_small --model=transformer --hparams_set=transformer_librispeech --data_dir=/data/tensor2tensor/data/librispeech_clean_small --output_dir=/data/tensor2tensor/exp --train_steps=1000 --eval_steps=100 --verbosity=0
# Error logs:
[...]
2019-08-06 06:28:42.670483: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-06 06:28:43.090613: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-06 06:28:43.266243: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-08-06 06:28:43.328620: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-08-06 06:28:43.830164: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-08-06 06:28:44.081212: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-08-06 06:28:44.819371: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-06 06:28:44.826114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-06 06:28:44.834796: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-06 06:28:44.839837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-06 06:28:44.839860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-08-06 06:28:44.839867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-08-06 06:28:44.854920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11596 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:8c:00.0, compute capability: 3.7)
2019-08-06 06:28:49.295630: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0806 06:28:56.309820 139654260696832 session_manager.py:500] Running local_init_op.
I0806 06:28:56.744457 139654260696832 session_manager.py:502] Done running local_init_op.
I0806 06:30:06.414134 139654260696832 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /data/tensor2tensor/exp/librispeech_transformer_clean_small/model.ckpt.
2019-08-06 06:30:50.528524: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

@amin-nejad

Did anyone find a fix for this? I'm getting the exact same problem using t2t-decoder; depending on the model, it either hangs on Successfully opened dynamic library libcublas.so.10.0 or Successfully opened dynamic library libcudnn.so.7. This was not happening for me 2-3 weeks ago, and I'm not sure what's changed.

@AaronSeunghi

Description

I also have the same issue. After "Successfully opened dynamic library libcublas.so.10.0", nothing happens, even after 3 days.

Environment

OS: Ubuntu 18.04.2 LTS

mesh-tensorflow==0.0.5
tensor2tensor==1.14.0
tensorboard==1.14.0
tensorflow-datasets==1.2.0
tensorflow-estimator==1.14.0
tensorflow-gan==1.0.0.dev0
tensorflow-gpu==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0

Python 3.6.8

CUDA Version 10.0.130

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer --worker_gpu=4 --model=transformer --hparams="batch_size=32" --hparams_set=transformer_librispeech_v1 --problem=librispeech_clean_small --train_steps=100000 --eval_steps=100 --local_eval_frequency=1000 --data_dir=/home/Librispeech/data --output_dir=/tmp/t2t.work/librispeech_clean_small.20190823
# Error logs:
[...]
2019-08-23 03:15:28.608309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3
2019-08-23 03:15:28.608342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y N N
2019-08-23 03:15:28.608361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N N N
2019-08-23 03:15:28.608391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   N N N Y
2019-08-23 03:15:28.608439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   N N Y N
2019-08-23 03:15:28.614141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10619 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-08-23 03:15:28.615641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10619 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-08-23 03:15:28.617012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10619 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-08-23 03:15:28.618385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10619 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)
2019-08-23 03:15:31.222250: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0823 03:15:32.888142 140151858374464 session_manager.py:500] Running local_init_op.
I0823 03:15:33.431387 140151858374464 session_manager.py:502] Done running local_init_op.
I0823 03:15:55.611809 140151858374464 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /tmp/t2t.work/librispeech_clean_small.20190823/model.ckpt.
2019-08-23 03:16:33.482639: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

@clementbmn

clementbmn commented Aug 26, 2019

Same here,
then I get a crash with these logs:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f4c5672ee97 m=14 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7f4c5672ee97
stack: frame={sp:0x7f4c237fd800, fp:0x0} stack=[0x7f4c22ffe290,0x7f4c237fde90)
00007f4c237fd700:  0000000000000000  0000000000000000
00007f4c237fd710:  0000000000000000  0000000000000000
00007f4c237fd720:  0000000000000000  0000000000000000
00007f4c237fd730:  0000000000000000  0000000000000000
00007f4c237fd740:  0000000000000000  0000000000000000
00007f4c237fd750:  0000000000000000  0000000000000000
00007f4c237fd760:  0000000000000000  0000000000000000
00007f4c237fd770:  0000000000000000  0000000000000000
00007f4c237fd780:  0000000000000000  0000000000000000
00007f4c237fd790:  0000000000000000  0000000000000000
00007f4c237fd7a0:  0000000000000000  0000000000000000
00007f4c237fd7b0:  0000000000000000  0000000000000000
00007f4c237fd7c0:  0000000000000000  0000000000000000
00007f4c237fd7d0:  0000000000000000  0000000000000000
00007f4c237fd7e0:  0000000000000000  0000000000000000
00007f4c237fd7f0:  0000000000000000  0000000000000000
00007f4c237fd800: <0000000000000000  0000000000000000
00007f4c237fd810:  0000000000000000  0000000000000000
00007f4c237fd820:  0000000000000000  0000000000000000
00007f4c237fd830:  0000000000000000  0000000000000000
00007f4c237fd840:  0000000000000000  0000000000000000
00007f4c237fd850:  0000000000000000  0000000000000000
00007f4c237fd860:  0000000000000000  0000000000000000
00007f4c237fd870:  0000000000000000  0000000000000000
00007f4c237fd880:  fffffffe7fffffff  ffffffffffffffff
00007f4c237fd890:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8a0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8b0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8c0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8d0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8e0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8f0:  ffffffffffffffff  ffffffffffffffff
runtime: unknown pc 0x7f4c5672ee97
stack: frame={sp:0x7f4c237fd800, fp:0x0} stack=[0x7f4c22ffe290,0x7f4c237fde90)
00007f4c237fd700:  0000000000000000  0000000000000000
00007f4c237fd710:  0000000000000000  0000000000000000
00007f4c237fd720:  0000000000000000  0000000000000000
00007f4c237fd730:  0000000000000000  0000000000000000
00007f4c237fd740:  0000000000000000  0000000000000000
00007f4c237fd750:  0000000000000000  0000000000000000
00007f4c237fd760:  0000000000000000  0000000000000000
00007f4c237fd770:  0000000000000000  0000000000000000
00007f4c237fd780:  0000000000000000  0000000000000000
00007f4c237fd790:  0000000000000000  0000000000000000
00007f4c237fd7a0:  0000000000000000  0000000000000000
00007f4c237fd7b0:  0000000000000000  0000000000000000
00007f4c237fd7c0:  0000000000000000  0000000000000000
00007f4c237fd7d0:  0000000000000000  0000000000000000
00007f4c237fd7e0:  0000000000000000  0000000000000000
00007f4c237fd7f0:  0000000000000000  0000000000000000
00007f4c237fd800: <0000000000000000  0000000000000000
00007f4c237fd810:  0000000000000000  0000000000000000
00007f4c237fd820:  0000000000000000  0000000000000000
00007f4c237fd830:  0000000000000000  0000000000000000
00007f4c237fd840:  0000000000000000  0000000000000000
00007f4c237fd850:  0000000000000000  0000000000000000
00007f4c237fd860:  0000000000000000  0000000000000000
00007f4c237fd870:  0000000000000000  0000000000000000
00007f4c237fd880:  fffffffe7fffffff  ffffffffffffffff
00007f4c237fd890:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8a0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8b0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8c0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8d0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8e0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8f0:  ffffffffffffffff  ffffffffffffffff

goroutine 1 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f4c57084f00, 0x72, 0xc4205e7550)
	/usr/local/go/src/runtime/netpoll.go:173 +0x59
internal/poll.(*pollDesc).wait(0xc42020c118, 0x72, 0xffffffffffffff00, 0x560ca4e9aaa0, 0x560ca592daa8)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9d
internal/poll.(*pollDesc).waitRead(0xc42020c118, 0xc4204ed000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc42020c100, 0xc4204ed000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:157 +0x17f
net.(*netFD).Read(0xc42020c100, 0xc4204ed000, 0x1000, 0x1000, 0x9c5, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc420394878, 0xc4204ed000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:176 +0x6c
net/http.(*persistConn).Read(0xc420288fc0, 0xc4204ed000, 0x1000, 0x1000, 0xc420618050, 0xc4204ed9c3, 0x2)
	/usr/local/go/src/net/http/transport.go:1453 +0x138
bufio.(*Reader).fill(0xc420388c00)
	/usr/local/go/src/bufio/bufio.go:100 +0x120
bufio.(*Reader).ReadSlice(0xc420388c00, 0xc42000010a, 0x300000002, 0xc420000180, 0xc4205e77c0, 0x560ca31cf92b, 0xc420000180)
	/usr/local/go/src/bufio/bufio.go:341 +0x2e
net/http/internal.readChunkLine(0xc420388c00, 0x1, 0x3, 0xc420050a70, 0xc420050a00, 0xc4200aa0d8)
	/usr/local/go/src/net/http/internal/chunked.go:122 +0x36
net/http/internal.(*chunkedReader).beginChunk(0xc420618030)
	/usr/local/go/src/net/http/internal/chunked.go:48 +0x34
net/http/internal.(*chunkedReader).Read(0xc420618030, 0xc420652000, 0x8009, 0x8009, 0xc4205e78e0, 0x560ca31c56db, 0x560c00000008)
	/usr/local/go/src/net/http/internal/chunked.go:93 +0x115
net/http.(*body).readLocked(0xc42061e000, 0xc420652000, 0x8009, 0x8009, 0x0, 0x0, 0xc4205e79a0)
	/usr/local/go/src/net/http/transfer.go:778 +0x63
net/http.(*body).Read(0xc42061e000, 0xc420652000, 0x8009, 0x8009, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/transfer.go:770 +0xdf
net/http.(*bodyEOFSignal).Read(0xc42061e040, 0xc420652000, 0x8009, 0x8009, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/transport.go:2187 +0xde
github.com/docker/cli/vendor/github.com/docker/docker/pkg/stdcopy.StdCopy(0x560ca4e96240, 0xc4204371d0, 0x560ca4e98760, 0xc42000e020, 0x560ca4e984c0, 0xc42061e040, 0x560ca59b9c20, 0x0, 0x0)
	/go/src/github.com/docker/cli/vendor/github.com/docker/docker/pkg/stdcopy/stdcopy.go:108 +0xe2
github.com/docker/cli/cli/command/container.runLogs(0x560ca4ed7820, 0xc4203abb00, 0xc42003bef0, 0x0, 0x0)
	/go/src/github.com/docker/cli/cli/command/container/logs.go:77 +0x442
github.com/docker/cli/cli/command/container.NewLogsCommand.func1(0xc42040d680, 0xc4201fa900, 0x1, 0x2, 0x0, 0x0)
	/go/src/github.com/docker/cli/cli/command/container/logs.go:35 +0x6e
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).execute(0xc42040d680, 0xc42003a170, 0x2, 0x2, 0xc42040d680, 0xc42003a170)
	/go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:762 +0x46a
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4203b1680, 0xc42026bfb0, 0x560ca4b916c0, 0xc42026bfc0)
	/go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:852 +0x30c
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4203b1680, 0xc4203b1680, 0x560ca4e98760)
	/go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:800 +0x2d
main.main()
	/go/src/github.com/docker/cli/cmd/docker/docker.go:180 +0xde

goroutine 5 [syscall, 2 minutes]:
os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:139 +0xa8
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:22 +0x24
created by os/signal.init.0
	/usr/local/go/src/os/signal/signal_unix.go:28 +0x43

goroutine 40 [chan receive, 2 minutes]:
github.com/docker/cli/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x560ca599b2e0)
	/go/src/github.com/docker/cli/vendor/github.com/golang/glog/glog.go:882 +0x8d
created by github.com/docker/cli/vendor/github.com/golang/glog.init.0
	/go/src/github.com/docker/cli/vendor/github.com/golang/glog/glog.go:410 +0x205

goroutine 15 [select, 2 minutes]:
net/http.(*persistConn).readLoop(0xc420288fc0)
	/usr/local/go/src/net/http/transport.go:1717 +0x745
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1237 +0x95c

goroutine 16 [select, 2 minutes]:
net/http.(*persistConn).writeLoop(0xc420288fc0)
	/usr/local/go/src/net/http/transport.go:1822 +0x14d
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1238 +0x981

goroutine 43 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f4c57084e30, 0x72, 0xc4200889a8)
	/usr/local/go/src/runtime/netpoll.go:173 +0x59
internal/poll.(*pollDesc).wait(0xc420622198, 0x72, 0xffffffffffffff00, 0x560ca4e9aaa0, 0x560ca592daa8)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9d
internal/poll.(*pollDesc).waitRead(0xc420622198, 0xc420626000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc420622180, 0xc420626000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:157 +0x17f
net.(*netFD).Read(0xc420622180, 0xc420626000, 0x1000, 0x1000, 0x560ca31efc10, 0xc420000180, 0x4)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc42000e048, 0xc420626000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:176 +0x6c
net/http.(*persistConn).Read(0xc42028ad80, 0xc420626000, 0x1000, 0x1000, 0xc420088b98, 0x560ca319fde5, 0xc4203ca360)
	/usr/local/go/src/net/http/transport.go:1453 +0x138
bufio.(*Reader).fill(0xc420616300)
	/usr/local/go/src/bufio/bufio.go:100 +0x120
bufio.(*Reader).Peek(0xc420616300, 0x1, 0x0, 0x0, 0x0, 0xc4203ca2a0, 0x0)
	/usr/local/go/src/bufio/bufio.go:132 +0x3c
net/http.(*persistConn).readLoop(0xc42028ad80)
	/usr/local/go/src/net/http/transport.go:1601 +0x187
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1237 +0x95c

goroutine 44 [select, 2 minutes]:
net/http.(*persistConn).writeLoop(0xc42028ad80)
	/usr/local/go/src/net/http/transport.go:1822 +0x14d
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1238 +0x981

rax    0x0
rbx    0x7f4c56adc840
rcx    0x7f4c5672ee97
rdx    0x0
rdi    0x2
rsi    0x7f4c237fd800
rbp    0x560ca464f220
rsp    0x7f4c237fd800
r8     0x0
r9     0x7f4c237fd800
r10    0x8
r11    0x246
r12    0x560ca629a1b0
r13    0xf1
r14    0x11
r15    0x0
rip    0x7f4c5672ee97
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

@lukaszkaiser
Contributor

I think the best idea is to report this on the TF and Google Colab lists, as this does not look like an error specific to T2T.

@cantwbr

cantwbr commented Aug 28, 2019

Continued here:
tensorflow/tensorflow#32017

@mschonwe
Contributor

@rachellim at TF was able to reproduce and resolve the hang issue (parallel_interleave_dataset_op.cc doesn't handle iterator creation errors correctly when sloppy=True).

With her fix in place (or with sloppy=False), training now halts with Conv2D errors instead.

Other interesting clues (reported by @huang-haijie): setting audio_add_delta_deltas from True to False, OR
setting audio_preproc_in_bottom from False to True, prevents the Conv2D error.

It still seems like it may be a TF issue, as the same T2T code works fine with TF 1.13.2 but fails with the Conv2D issues on TF 1.14.0. Any suggestions for next steps would be appreciated...
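For anyone who wants to try the hparams workaround above, here is a minimal, untested sketch of how those overrides could be passed on the command line. The flag syntax follows the other t2t-trainer commands in this thread (paths copied from an earlier comment), and it assumes the librispeech hparams sets actually expose audio_add_delta_deltas and audio_preproc_in_bottom, which are only named in the comment above:

# Hypothetical invocation: same trainer flags as earlier in this thread,
# plus the two audio hparams overrides mentioned above.
t2t-trainer \
  --problem=librispeech_clean_small \
  --model=transformer \
  --hparams_set=transformer_librispeech \
  --hparams="audio_add_delta_deltas=False,audio_preproc_in_bottom=True" \
  --data_dir=/data/tensor2tensor/data/librispeech_clean_small \
  --output_dir=/data/tensor2tensor/exp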

@mschonwe
Contributor

Continued here:
tensorflow/tensorflow#32691

@Arvindia

Arvindia commented Feb 22, 2020

Description

I also have the same issue. After "Successfully opened dynamic library libcublas.so.10.0", nothing happens, even after 3 days.

Environment

OS: Ubuntu 18.04.2 LTS

mesh-tensorflow==0.0.5
tensor2tensor==1.14.0
tensorboard==1.14.0
tensorflow-datasets==1.2.0
tensorflow-estimator==1.14.0
tensorflow-gan==1.0.0.dev0
tensorflow-gpu==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0

Python 3.6.8

CUDA Version 10.0.130

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer --worker_gpu=4 --model=transformer --hparams="batch_size=32" --hparams_set=transformer_librispeech_v1 --problem=librispeech_clean_small --train_steps=100000 --eval_steps=100 --local_eval_frequency=1000 --data_dir=/home/Librispeech/data --output_dir=/tmp/t2t.work/librispeech_clean_small.20190823
# Error logs:
[...]
2019-08-23 03:15:28.608309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3
2019-08-23 03:15:28.608342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y N N
2019-08-23 03:15:28.608361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N N N
2019-08-23 03:15:28.608391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   N N N Y
2019-08-23 03:15:28.608439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   N N Y N
2019-08-23 03:15:28.614141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10619 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-08-23 03:15:28.615641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10619 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-08-23 03:15:28.617012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10619 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-08-23 03:15:28.618385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10619 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)
2019-08-23 03:15:31.222250: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0823 03:15:32.888142 140151858374464 session_manager.py:500] Running local_init_op.
I0823 03:15:33.431387 140151858374464 session_manager.py:502] Done running local_init_op.
I0823 03:15:55.611809 140151858374464 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /tmp/t2t.work/librispeech_clean_small.20190823/model.ckpt.
2019-08-23 03:16:33.482639: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

Did this issue ever get resolved?
I'm running in Google Colab with a GPU and a small (~3 GB) dataset (basic speech commands).

@ramonemiliani93

ramonemiliani93 commented Aug 8, 2020

The same happens with tf.nn.conv3d used inside a map(...) on a tf.data.Dataset. Any updates on how to solve this?

@rachellim

@ramonemiliani93, what version of tensorflow are you running? I was not able to reproduce this issue with the following dataset:

import numpy as np
import tensorflow as tf

def make_tensor(sizes):
  return np.asarray([f * 1.0 for f in range(1, np.prod(sizes) + 1)]).reshape(sizes)

filter = make_tensor([1, 1, 1, 3, 3])
x = make_tensor([10, 2, 3, 1, 3])
dataset = tf.data.Dataset.from_tensors((x, filter))
# Map a 3-D convolution over the (input, filter) pair.
dataset = dataset.map(lambda input, filter: tf.nn.conv3d(input, filter, strides=[1, 1, 1, 1, 1], padding="VALID"))
print(list(dataset))

So it doesn't seem to be an issue with using tf.nn.conv3d inside map. Can you provide a minimal repro?

@rachellim

Based on tensorflow/tensorflow#38100 and f90/FactorGAN#1, I suspect this may be a problem with your CUDA installation.

@harishkashyap

+1. Same issue. It used to work, then died without warning on EC2. Now when I reload the pretrained model, it hangs with no output.

@rachellim

@harishkashyap - what version of tensorflow are you using? If you use an older version, does it still work? (Trying to diagnose whether it's an issue with your CUDA installation or a regression in TF.)

@harishkashyap

EC2 PyTorch AMI, TensorFlow 2.3

@harishkashyap

harishkashyap commented Sep 15, 2020

No idea. I just used an AMI instance with Amazon Linux 2, preinstalled with PyTorch. It was working fine and now fails to load the pre-trained model.

@rachellim

@sanjoy, can you reassign this to someone on the GPU team to investigate?

@skaldek

skaldek commented Sep 24, 2020

Same issue.

@AzinPoshtyar

I'm trying to run a TF object detection model and am getting the same issue; it's stuck after Successfully opened dynamic library libcuda.so.1.

Please, someone help.

@lminer

lminer commented Nov 3, 2020

I have this problem using the Anaconda cudatoolkit. I ended up using nvidia-docker instead for my CUDA/cuDNN installation, and now it works.
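For reference, a minimal sketch of that kind of Docker-based setup; the image tag and the test command are illustrative rather than anything specified above, and it assumes the NVIDIA container toolkit is installed on the host:

# The TF image ships its own CUDA/cuDNN, so only the host driver has to be compatible.
docker run --gpus all --rm -it tensorflow/tensorflow:2.4.1-gpu \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"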

@flint-xf-fan

same issue on Ubuntu 20.04 with RTX3090. TF was installed using anaconda.

@reza-ebrahimi

Same issue with Ubuntu 20.04.

Python version:

Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)

Tensorflow version:

❯ conda list | grep tensorflow
tensorflow                2.4.1           gpu_py39h8236f22_0  
tensorflow-base           2.4.1           gpu_py39h29c2da4_0  
tensorflow-estimator      2.4.1              pyheb71bc4_0  
tensorflow-gpu            2.4.1                h30adc30_0 
python
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
2021-04-30 14:32:10.400682: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

>>> tf.add(1, 2)
2021-04-30 14:32:26.991352: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-30 14:32:26.993581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-30 14:32:27.026516: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.027085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 74.65GiB/s
2021-04-30 14:32:27.027104: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-04-30 14:32:27.028731: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-04-30 14:32:27.028771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-04-30 14:32:27.030183: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-30 14:32:27.030438: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-30 14:32:27.032093: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-30 14:32:27.033044: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-04-30 14:32:27.036642: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-04-30 14:32:27.036780: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.037699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.038214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-30 14:32:27.038509: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-30 14:32:27.038916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.039462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 74.65GiB/s
2021-04-30 14:32:27.039482: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-04-30 14:32:27.039507: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-04-30 14:32:27.039522: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-04-30 14:32:27.039536: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-30 14:32:27.039550: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-30 14:32:27.039563: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-30 14:32:27.039577: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-04-30 14:32:27.039590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-04-30 14:32:27.039645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.040215: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.040651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-30 14:32:27.040677: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

@ORippler

same issue on Ubuntu 20.04 with RTX3090. TF was installed using anaconda.

Same for me

@alanzyt311

Same issue with CUDA 10.0, cuDNN 7.4, TensorFlow 1.14.0. The object detection algorithm gets stuck at:
I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10

and sometimes gets stuck at:
I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

Has anyone figured it out? REALLY NEED HELP!

@sanjoy

sanjoy commented Jun 4, 2021

For such cases, it'd be useful to get a stacktrace of where TF is stuck.

You can obtain this using gdb: start gdb, attach to the hung TF process, and then get a backtrace.
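A minimal sketch of that workflow (the pgrep pattern is just an example; match whatever script you actually launched, and note that for a hung Python process most of the interesting frames will be inside the native TF/CUDA libraries):

# Find the PID of the hung trainer process.
pgrep -af t2t-trainer
# Attach gdb, dump backtraces of every thread, then exit (which detaches cleanly).
sudo gdb -p "$(pgrep -f t2t-trainer | head -n 1)" -batch -ex "thread apply all bt"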

@alanzyt311

Hi @sanjoy, thanks for your advice.

I've tried to find the process id of the running Python program and run gdb attach PID, then bt to get a backtrace. However, it returns No stack.

Then I tried ps -ef | grep tensorflow-gpu | grep -v grep
and it returns nothing.
Does that mean the problem has nothing to do with TensorFlow?

I've also tried PyTorch on the same machine, and it shows a similar symptom: it also takes a long time to load some CUDA libraries.

Below are the details of my situation:

GPU: GeForce RTX 3060
Driver Version: 460.73.01
CUDA Driver Version: 11.2

Tensorflow: tensorflow-gpu 1.14.0
CUDA Runtime Version: 10.0
cudnn: 7.4.1
(The CUDA Runtime and cuDNN versions match the guide in the TensorFlow official documentation.)

I've tried the following TensorFlow checks and they all pass:
tf.test.is_built_with_cuda() and tf.test.is_gpu_available()

My situation is that the program gets stuck at 2021-06-05 12:16:54.099778: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10 for several minutes,
and sometimes gets stuck at another loading step, 2021-06-05 12:21:22.212818: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7, for an even longer time.
You can check the attached log for details:
log.txt

After waiting for around 30 minutes, the program continues running and WORKS WELL.
So the MAJOR PROBLEM is that it takes a long time to load CUDA-related libraries (I guess),
and I don't know how to locate the problem and resolve it.

@sanjoy

sanjoy commented Jun 5, 2021

@alanzyt311 I missed that you're running TF 1.14. 1.14 is very old and does not have native support for your GPU (which I believe is Ampere based), so TensorFlow blocks at startup while it JIT-compiles PTX to SASS for your GPU, which can take 30+ minutes.

Can you please try running with TF 2.5?
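If upgrading TF is not immediately an option, one possible mitigation (a generic CUDA driver feature, not something suggested in this thread, so treat it as an untested sketch) is to enlarge the driver's JIT compilation cache so the PTX-to-SASS result from the first run is reused on later runs instead of being recompiled every time:

# Values are illustrative; the cache lives under ~/.nv/ComputeCache by default.
export CUDA_CACHE_MAXSIZE=2147483648   # allow up to 2 GiB of cached compiled kernels
export CUDA_CACHE_DISABLE=0            # make sure JIT caching is enabled
# Then launch training as usual; only the first run pays the 30+ minute JIT cost.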

@alanzyt311

Thanks. I've tried TF-GPU 2.0 just now; it still didn't work (same problem as above). Now going to try TF 2.5.

@Laudarisd

Same problem: I have to wait more than 15 minutes to see my training start,
and then it throws loss=NaN.

I have to quit training and check the files before another run,
and the same thing occurs again.
I've been stuck here for two days: 1) it takes around 10-15 minutes to start training, 2) it throws loss=NaN.

Any help?

I also decreased the learning rate and batch size; nothing works.

I am using Docker with the following details:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
 0  GeForce RTX 3060    Off  | 00000000:0B:00.0  On |                  N/A |
|  0%   60C    P2    41W / 180W |  11391MiB / 12031MiB

This is my output while waiting for training to run:

2021-06-10 07:48:07.953350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2021-06-10 07:48:07.953360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2021-06-10 07:48:07.953478: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-10 07:48:07.953996: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-10 07:48:07.954461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10923 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060, pci bus id: 0000:0b:00.0, compute capability: 8.6)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0610 07:48:07.957259 139983451281216 mirrored_strategy.py:500] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0610 07:48:07.961450 139983451281216 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0610 07:48:07.961535 139983451281216 config_util.py:552] Maybe overwriting use_bfloat16: False

@sanjoy

sanjoy commented Jun 10, 2021

@Laudarisd Are you using TF 2.5? If not, can you please try with TF 2.5?

@michyzhu

Dealing with this problem here as well; tried with both TF 2.4 and 2.5. It continues execution after ~30 minutes or so.

After it continues execution, I also run into a different error:
TypeError: Tensors are unhashable. (KerasTensor(type_spec=TensorSpec(shape=(None, 32, 32, 32, 1), dtype=tf.float32, name='input_2'), name='input_2', description="created by layer 'input_2'")) Instead, use tensor.ref() as the key.

But this seems like a version error (tf.compat.v1 -> v2) that is separate from this issue. I would appreciate any suggestions on how to deal with the hanging.

@Laudarisd

@sanjoy right now I am using TF 2.2; I will try 2.5, but I am not sure whether TF 2.5 will use my GPU or not.
Thanks for your answer.
Will be back after training.

@Laudarisd

@sanjoy @michyzhu I think this is a problem of mismatched library versions.
I upgraded TF to 2.5 but I couldn't run training.
To confirm, I trained my data in Colab and everything ran perfectly and smoothly. This is my output in Colab while training:

INFO:tensorflow:{'Loss/classification_loss': 0.7730327,
 'Loss/localization_loss': 0.2685254,
 'Loss/regularization_loss': 0.2532684,
 'Loss/total_loss': 1.2948265,
 'learning_rate': 0.001}
I0611 07:42:00.161126 140074406885248 model_lib_v2.py:701] {'Loss/classification_loss': 0.7730327,
 'Loss/localization_loss': 0.2685254,
 'Loss/regularization_loss': 0.2532684,
 'Loss/total_loss': 1.2948265,
 'learning_rate': 0.001}
.........
..........

Right now I am running training in colab.

@alanzyt311

alanzyt311 commented Jun 11, 2021

Sorry for the late update. I reset my environment (including correctly matching the GPU, CUDA, cuDNN, and TensorFlow versions), and the problem is solved. By the way, the TF version is 2.4.

@Laudarisd

@alanzyt311 You are right; you need to match all the settings and libraries to get training running quickly and smoothly. Thanks.
For me: TF 2.5.0,
CUDA 11.2

@vikinglee16

vikinglee16 commented Jul 28, 2021

@sanjoy Thanks for your advice; I followed your tip and solved this problem successfully.

Same issue for me, which was solved after updating tf==2.4 to tf==2.5.

Environment information:
OS ubuntu 18.04
python 3.7
CUDA Version 11.4.48
CuDNN Version 8.2.1
Tensorflow-gpu == 2.4

Only tensorflow>=2.5.0 supports CUDA 11.2, per https://tensorflow.google.cn/install/gpu

@AndreyOrb

AndreyOrb commented Feb 28, 2022

Same problem for me with the following configurations:
GPU: Tesla T4 | NVIDIA driver: 460.106.00 | CUDA: 11.2.152 | CUDNN: 8.1.1 | Tensorflow: 2.5.0
GPU: Tesla T4 | NVIDIA driver: 460.106.00 | CUDA: 11.2.152 | CUDNN: 8.1.1 | Tensorflow: 2.6.0
GPU: Tesla T4 | NVIDIA driver: 460.106.00 | CUDA: 11.2.152 | CUDNN: 8.1.1 | Tensorflow: 2.7.0
GPU: Tesla T4 | NVIDIA driver: 460.106.00 | CUDA: 11.2.152 | CUDNN: 8.1.1 | Tensorflow: 2.8.0

2022-02-28 03:38:14.118011: I tensorflow/stream_executor/plugin_registry.cc:247] Selecting default DNN plugin, cuDNN
2022-02-28 03:38:14.118120: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-02-28 03:38:14.867374: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:38:14.867435: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:38:14.867501: I tensorflow/core/framework/model.cc:147] Setting tunable parameter ParallelMapV2(id:1):: parallelism to 4
2022-02-28 03:38:14.867540: I tensorflow/core/framework/model.cc:1787] Optimized for 0 ms.
2022-02-28 03:38:14.867557: I tensorflow/core/framework/model.cc:1774] Waiting for 2560 ms.
2022-02-28 03:38:17.211820: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:38:17.211878: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:38:17.211934: I tensorflow/core/framework/model.cc:147] Setting tunable parameter MapAndBatch(id:2):: parallelism to 4
2022-02-28 03:38:17.211965: I tensorflow/core/framework/model.cc:1787] Optimized for 0 ms.
2022-02-28 03:38:17.211978: I tensorflow/core/framework/model.cc:1774] Waiting for 20480 ms.
2022-02-28 03:38:17.427702: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:38:17.427751: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:38:17.427807: I tensorflow/core/framework/model.cc:147] Setting tunable parameter ParallelMapV2(id:1):: parallelism to 4
2022-02-28 03:38:17.427849: I tensorflow/core/framework/model.cc:1787] Optimized for 0 ms.
2022-02-28 03:38:17.427867: I tensorflow/core/framework/model.cc:1774] Waiting for 5120 ms.
2022-02-28 03:38:22.548024: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:38:22.548080: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:38:22.548130: I tensorflow/core/framework/model.cc:147] Setting tunable parameter ParallelMapV2(id:1):: parallelism to 4
2022-02-28 03:38:22.548162: I tensorflow/core/framework/model.cc:1787] Optimized for 1 ms.
2022-02-28 03:38:22.548176: I tensorflow/core/framework/model.cc:1774] Waiting for 10240 ms.
2022-02-28 03:38:32.788310: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:38:32.788363: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:38:32.788417: I tensorflow/core/framework/model.cc:147] Setting tunable parameter ParallelMapV2(id:1):: parallelism to 4
2022-02-28 03:38:32.788448: I tensorflow/core/framework/model.cc:1787] Optimized for 0 ms.
2022-02-28 03:38:32.788478: I tensorflow/core/framework/model.cc:1774] Waiting for 20480 ms.
2022-02-28 03:38:37.692113: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:38:37.692169: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:38:37.692223: I tensorflow/core/framework/model.cc:147] Setting tunable parameter MapAndBatch(id:2):: parallelism to 4
2022-02-28 03:38:37.692252: I tensorflow/core/framework/model.cc:1787] Optimized for 0 ms.
2022-02-28 03:38:37.692264: I tensorflow/core/framework/model.cc:1774] Waiting for 40960 ms.
2022-02-28 03:38:45.229087: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101
2022-02-28 03:38:53.268617: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:38:53.268674: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:38:53.268727: I tensorflow/core/framework/model.cc:147] Setting tunable parameter ParallelMapV2(id:1):: parallelism to 4
2022-02-28 03:38:53.268758: I tensorflow/core/framework/model.cc:1787] Optimized for 0 ms.
2022-02-28 03:38:53.268786: I tensorflow/core/framework/model.cc:1774] Waiting for 40960 ms.
2022-02-28 03:39:18.652405: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:39:18.652461: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:39:18.652515: I tensorflow/core/framework/model.cc:147] Setting tunable parameter MapAndBatch(id:2):: parallelism to 4
2022-02-28 03:39:18.652544: I tensorflow/core/framework/model.cc:1787] Optimized for 0 ms.
2022-02-28 03:39:18.652555: I tensorflow/core/framework/model.cc:1774] Waiting for 60000 ms.
2022-02-28 03:39:34.228917: I tensorflow/core/framework/model.cc:1873] Starting optimization of tunable parameters with Hill Climb.
2022-02-28 03:39:34.228966: I tensorflow/core/framework/model.cc:1880] Number of tunable parameters: 1
2022-02-28 03:39:34.229020: I tensorflow/core/framework/model.cc:147] Setting tunable parameter ParallelMapV2(id:1):: parallelism to 4
2022-02-28 03:39:34.229048: I tensorflow/core/framework/model.cc:1787] Optimized for 1 ms.
2022-02-28 03:39:34.229081: I tensorflow/core/framework/model.cc:1774] Waiting for 60000 ms.
2022-02-28 03:39:37.490747: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.000798: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.000922: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.000958: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.000988: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.268880: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.268963: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.269095: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.269120: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.269139: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.269157: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.269176: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.269199: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.270723: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.270755: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1
2022-02-28 03:39:41.270776: I tensorflow/stream_executor/cuda/cuda_dnn.cc:781] Requesting grouped convolution: 1

@Laudarisd

I realized people are still getting this issue. To run training successfully, you need to match all library versions, including CUDA and cuDNN. To confirm your setup and versions, I suggest training with a small dataset in Google Colab.
I hope this will help.
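As a quick, hedged checklist for collecting the versions that need to line up (these are standard commands; exact output formats and file locations differ between systems):

nvidia-smi                                   # driver version and the highest CUDA it supports
nvcc --version                               # CUDA toolkit version, if one is installed locally
cat /usr/local/cuda/version.txt              # older CUDA installs record the runtime version here
python -c "import tensorflow as tf; print(tf.__version__, tf.test.is_built_with_cuda())"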

I am commenting here because I get an email whenever a new comment is posted here (I'm cc'd).

@nicejava

This error happens when:

  1. test.record and train.record were not created, or are 0 bytes, in object_detection/
  2. train_labels.csv and test_labels.csv are missing from object_detection/image/
  3. the train/test image folders are missing from object_detection/images

To solve it, try running:

  1. python xml_to_csv.py
  2. python generate_tfrecord.py --csv_input=images/train_labels.csv --image_dir=images/train --output_path=train.record
  3. python generate_tfrecord.py --csv_input=images/test_labels.csv --image_dir=images/test --output_path=test.record

Then edit the .config file (e.g. faster_rcnn_inception_v2_pets.config) and run training again with
python model_main.py --logtostderr --model_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config

and it will run successfully.

@hiepbk

hiepbk commented Jul 27, 2022

same issue on Ubuntu 20.04 with RTX3090. TF was installed using anaconda.

I got the same problem. Did anyone solve it?

@nicolasgoedert97

Same issue here for TF-GPU 1.14. Any fix? I would really like to use this TF version, as I want to run old code and I don't want to fiddle around with tf.compat.v1.

@iaverypadberg

iaverypadberg commented Aug 4, 2022

error this because

1. test.record and train.record not create or have 0 byte in object_detection/

2. not have train_labels.csv and test_labels.csv in object_detection/image/

3. not have images train/test folder in object_detection/images

Solving .. try run

1. python xml_to_csv.py

2. python generate_tfrecord.py --csv_input=images/train_labels.csv --image_dir=images/train --output_path=train.record

3. python generate_tfrecord.py --csv_input=images/test_labels.csv --image_dir=images/test --output_path=test.record

after edit .config file ( such as use faster_rcnn_inception_v2_pets.config) and run training model again with python model_main.py --logtostderr --model_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config

then it will run successfully ..

I had this issue, changed datasets, and the issue disappeared. So at least in my case, something was wonky with the dataset. Turns out my .tfrecord files were completely empty. Lol.
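For anyone else hitting this, a small sketch of how to check for that (the file names follow the generate_tfrecord.py commands quoted above and may differ in your setup):

# Zero-byte .record files are an immediate red flag.
ls -lh train.record test.record
# Count how many examples a record file actually contains (TF 2.x eager mode).
python -c "import tensorflow as tf; print(sum(1 for _ in tf.data.TFRecordDataset('train.record')))"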

@xiaoxusanheyi

Same issue on Ubuntu 20.04 with RTX3090. TF was installed using anaconda.

Same for me

Have you solved this problem?

@ws-x-sw

ws-x-sw commented Mar 22, 2023

Same issue on Ubuntu 20.04 with RTX3090. TF was installed using anaconda.

Same for me

Have you solved this problem?

I am facing the same issue. Have you solved it?
