hang in cuEventSynchronize in background thread #323

Open · albertz opened this issue Jul 14, 2020 · 23 comments

albertz (Member) commented Jul 14, 2020

Recent RETURNN:

RETURNN starting up, version 20200706.123141--git-1bbf93a4, date/time 2020-07-12-09-57-02 (UTC+0200), pid 1594, cwd /work/asr4/zeyer/setups-data/switchboard/2020-06-09--e2e-multi-gpu/data-train/base2.conv2l.specaug4a.wdrop03.adrop01.l2a_1e_4.ctc.devtrain.lrwa.lrt_0005.mgpu4.htd100, Python /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/bin/python3

Horovod via SGE -pe mpi 4 and then mpirun:

cluster-cn-275-pid1594: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-220-pid5101: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-238-pid14531: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-224-pid12931: use_horovod, CUDA_VISIBLE_DEVICES: 3
New maximum RSS usage: 4.5 GB
Horovod initialized. Hostname cluster-cn-238, pid 14531, rank 2 / size 4, local rank 0 / local size 1.
Horovod initialized. Hostname cluster-cn-275, pid 1594, rank 3 / size 4, local rank 0 / local size 1.
cluster-cn-238-pid14531: Local rank/size: 0 1
cluster-cn-275-pid1594: Local rank/size: 0 1
Horovod initialized. Hostname cluster-cn-224, pid 12931, rank 0 / size 4, local rank 0 / local size 1.
cluster-cn-224-pid12931: Local rank/size: 0 1
Horovod initialized. Hostname cluster-cn-220, pid 5101, rank 1 / size 4, local rank 0 / local size 1.
cluster-cn-220-pid5101: Local rank/size: 0 1 

Horovod settings:

horovod_dataset_distribution = "random_seed_offset"
horovod_reduce_type = "param"
horovod_param_sync_time_diff = 100.

This started fine, in epoch 1, and continued for many epochs, many hours, up until the end of epoch 101:

...
train epoch 101, step 2276, cost:ctc 0.38683144498930844, cost:ctc:exp 1.4723083054150774, cost:output/output_prob 0.2457886053112546, cost:output/output_prob:exp 1.2786292495379887, error:ctc 0.09923664154484868, error:output/output_prob 0.052434457466006286, loss 246.3717, max_size:bpe 61, max_size:bpe0 60, max_size:data 1372, mem_usage:GPU:0 9.8GB, num_seqs 5, 1.203 sec/step, elapsed 0:34:37, exp. remaining 0:00:00, complete 99.99%
train epoch 101, step 2277, cost:ctc 0.5404130568496726, cost:ctc:exp 1.7167158169826067, cost:output/output_prob 0.3057799498138394, cost:output/output_prob:exp 1.3576835151642892, error:ctc 0.14335664175450802, error:output/output_prob 0.06872852332890034, loss 322.9364, max_size:bpe 75, max_size:bpe0 74, max_size:data 1434, mem_usage:GPU:0 9.8GB, num_seqs 5, 1.341 sec/step, elapsed 0:34:38, exp. remaining 0:00:00, complete 99.99%
[2020-07-14 18:54:00.136244: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ranks:
3: [global_tensor_horovod_sum_have_data/HorovodAllreduce_globals_horovod_have_more_data_placeholder_0, global_tensor_horovod_sum_have_error/HorovodAllreduce_globals_horovod_have_error_placeholder_0]
[2020-07-14 18:55:00.137283: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
...

This last message then repeats every 60 seconds, and nothing else happens anymore:

[2020-07-14 22:32:00.724578: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ranks:
3: [global_tensor_horovod_sum_have_data/HorovodAllreduce_globals_horovod_have_more_data_placeholder_0, global_tensor_horovod_sum_have_error/HorovodAllreduce_globals_horovod_have_error_placeholder_0]

When I log in to the node of rank 3 and send SIGUSR1, I get this traceback (via faulthandler; other threads omitted as irrelevant), i.e. we can see that it hangs in sess.run:

Current thread 0x00007f4cf0377700 (most recent call first): 
  File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429 in _call_tf_sessionrun 
  File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341 in _run_fn 
  File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356 in _do_call 
  File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350 in _do_run 
  File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173 in _run
  File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950 in run
  File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 680 in run
  File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 1535 in train_epoch
  File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 1427 in train
  File "crnn/rnn.py", line 449 in execute_main_task
  File "crnn/rnn.py", line 639 in main
  File "crnn/rnn.py", line 651 in <module>

Note that this is the standard per-step sess.run, i.e. nothing specific to Horovod. Actually, with these settings, no Horovod op should be involved at all in this call.
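
For reference, such a signal-triggered dump can be set up with a few lines (a minimal sketch, not the actual RETURNN code):

import faulthandler
import signal

# Register faulthandler on SIGUSR1, so that `kill -USR1 <pid>` dumps the Python
# stacktraces of all threads to stderr without interrupting the process.
faulthandler.register(signal.SIGUSR1, all_threads=True, chain=False)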

Then, the C traceback (via gdb -p 1594 -ex 'thread apply all bt' -ex="set confirm off" -ex quit > gdblog.p1494.txt) is
here. Some of the (maybe interesting) threads (WaitForWork and Python threads excluded):

Thread 52 (Thread 0x7f4b857fe700 (LWP 1758)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f4cbc79ed73 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007f4cbc79e491 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007f4cbc79b752 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007f4cbc79bc63 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007f4cbb2359d2 in tensorflow::EventMgr::PollLoop() ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007f4cb22088ca in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()

Thread 48 (Thread 0x7f4bb37fe700 (LWP 1752)):
#0  0x00007ffe58bddb6d in clock_gettime ()
#1  0x00007f4cef511936 in __GI___clock_gettime (clock_id=4, tp=0x7f4bb37fb070) at ../sysdeps/unix/clock_gettime.c:115
#2  0x00007f4bfa8a1d0e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f4bfa95cc77 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f4bfa85e437 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007f4bfa7800c6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007f4bfa8f0e60 in cuEventSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007f4cbcc8e144 in stream_executor::gpu::GpuDriver::GetEventElapsedTime(stream_executor::gpu::GpuContext*, float*, CUevent_st*, CUevent_st*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8  0x00007f4cbb395b77 in stream_executor::gpu::GpuTimer::GetElapsedMilliseconds() const ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9  0x00007f4cb26d62c8 in stream_executor::gpu::CudnnSupport::DoConvolve(stream_executor::dnn::ConvolutionKind, stream_executor::dnn::DataType, stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::AlgorithmDesc, stream_executor::DeviceMemory<unsigned char>, stream_executor::dnn::ProfileResult*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#10 0x00007f4cbcf46506 in stream_executor::Stream::ThenConvolveWithAlgorithm(stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f4cba29d180 in tensorflow::LaunchConv2DOp<Eigen::GpuDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long long, std::allocator<long long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f4cba29ddd4 in tensorflow::Conv2DOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f4cb2114a0a in tensorflow::BaseGPUDevice::ComputeHelper(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#14 0x00007f4cb2115605 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#15 0x00007f4cb216f2c1 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#16 0x00007f4cb216f37f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#17 0x00007f4cb22088ca in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()

Thread 46 (Thread 0x7f4bb3fff700 (LWP 1750)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x00007f4bfa8a57d7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007f4bfa84af27 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f4bfa8a4a58 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f4cef7cd6ba in start_thread (arg=0x7f4bb3fff700) at pthread_create.c:333
#5  0x00007f4cef5034dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 42 (Thread 0x7f4c00ff9700 (LWP 1744)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f4cbcf76361 in absl::synchronization_internal::Waiter::Wait(absl::synchronization_internal::KernelTimeout) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007f4cbcf761c1 in AbslInternalPerThreadSemWait ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007f4cbcf778e5 in absl::Mutex::Block(absl::base_internal::PerThreadSynch*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007f4cbcf7887c in absl::Mutex::AwaitCommon(absl::Condition const&, absl::synchronization_internal::KernelTimeout)
    ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007f4cbcf7890d in absl::Mutex::Await(absl::Condition const&) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007f4cbce5d75c in stream_executor::host::HostStream::WorkLoop() ()

Thread 3 (Thread 0x7f4c8e1da700 (LWP 1626)):
#0  0x00007f4cef4f780d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f4c902e9e58 in ?? () from /usr/lib/libopen-pal.so.13
#2  0x00007f4c902e06fb in opal_libevent2021_event_base_loop () from /usr/lib/libopen-pal.so.13
#3  0x00007f4c9055bb8e in ?? () from /usr/lib/libopen-rte.so.12
#4  0x00007f4cef7cd6ba in start_thread (arg=0x7f4c8e1da700) at pthread_create.c:333

Thread 2 (Thread 0x7f4c90208700 (LWP 1625)):
#0  0x00007f4cef4f780d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f4c902e9e58 in ?? () from /usr/lib/libopen-pal.so.13
#2  0x00007f4c902e06fb in opal_libevent2021_event_base_loop () from /usr/lib/libopen-pal.so.13
#3  0x00007f4c902aa238 in opal_progress () from /usr/lib/libopen-pal.so.13
#4  0x00007f4c909eef65 in ompi_request_default_wait_all () from /usr/lib/libmpi.so.12
#5  0x00007f4c843f9426 in ompi_coll_tuned_allreduce_intra_recursivedoubling ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#6  0x00007f4c909fef23 in PMPI_Allreduce () from /usr/lib/libmpi.so.12
#7  0x00007f4c90f3eadc in horovod::common::MPIController::CrossRankBitwiseAnd (this=<optimized out>, bitvector=..., 
    count=<optimized out>) at horovod/common/mpi/mpi_controller.cc:90
#8  0x00007f4c90f08cd2 in horovod::common::CacheCoordinator::sync (this=this@entry=0x7f4c90206fe0, controller=
    std::shared_ptr (count 2, weak 1) 0x26158e0, timeline_enabled=<optimized out>) at horovod/common/response_cache.cc:390
#9  0x00007f4c90ed8c6b in horovod::common::Controller::CoordinateCacheAndState (this=this@entry=0x26158e0, 
    cache_coordinator=...) at horovod/common/controller.cc:615
#10 0x00007f4c90ee002a in horovod::common::Controller::ComputeResponseList (this=0x26158e0, shut_down=..., state=...)
    at horovod/common/controller.cc:137
#11 0x00007f4c90ef949b in horovod::common::(anonymous namespace)::RunLoopOnce (state=...)
    at horovod/common/operations.cc:568
#12 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at horovod/common/operations.cc:509

Thread 1 (Thread 0x7f4cf0377700 (LWP 1594)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f4cbc79ed73 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007f4cbc79e491 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007f4cbc79b752 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007f4cbc79bc63 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007f4cb7c1026b in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007f4cb7c102cb in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x00007f4cb7c18351 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8  0x00007f4cb7c2395b in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9  0x00007f4cb54383aa in tensorflow::SessionRef::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f4cb7c55a69 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Tensor**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Buffer*, TF_Status*) [clone .constprop.672] ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f4cb7c562ed in TF_SessionRun ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f4cb5434301 in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f4cb54343a2 in tensorflow::TF_SessionRun_wrapper(TF_Session*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f4cb53f2923 in _wrap_TF_SessionRun_wrapper ()
   from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x00007f4cefafa1b5 in _PyMethodDef_RawFastCallKeywords () at ../Objects/call.c:694

  • I wonder why it hangs in CudnnSupport::DoConvolve, or more specifically in GpuDriver::GetEventElapsedTime, or even more specifically in cuEventSynchronize, which looks like a CUDA issue? Or maybe I misunderstand how this GpuTimer works, or how it is used in DoConvolve. Maybe this issue, or this? Also, if this is really a CUDA-related issue, why has it never occurred so far without Horovod?
  • I also wonder why I see opal_libevent2021_event_base_loop in two different threads.
  • As mentioned above, the sess.run of the main thread should not involve any Horovod op (with the given settings), so no MPI communication should be running currently. However, there is PMPI_Allreduce in thread 2. Why?

Note that this is not the first time I'm seeing this; I have already seen it a couple of times. Originally I hoped this was some temporary issue in our cluster, but there seems to be a real problem or bug. I don't really know whether it is on our side, in OpenMPI (we have quite an old version), or in Horovod.

albertz (Member Author) commented Jul 15, 2020

I just realized that the hang itself is in a sess.run which is not related at all to Horovod.
However:

  • Searching for similar TensorFlow-related problems (hangs in DoConvolve, or cuEventSynchronize, etc.) yields nothing, which is a bit strange.
  • I only see this together with Horovod. Without Horovod, I have never had such a problem. But then, I also have not run non-Horovod experiments for a while. Maybe the GPU driver or so was updated in the meantime. I don't think this is actually related to Horovod.

Spotlight0xff (Contributor) commented Aug 27, 2020

Another instance of (probably) the same problem, this time without Horovod, single GPU, with TF 2.3 and CUDA 10.1.
The gdb + Python log can be found here.

So this problem seems to be disconnected from multi-GPU training.

I just looked at the training dirs and found several more instances where the training was hanging:

Log: /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-d5-tf1-14/qdir/q.log/train.o1425449.1:

RETURNN starting up, version 20200630.143255--git-8fc19e7-dirty, date/time 2020-07-25-11-44-37 (UTC+0200), pid 6349, cwd /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-d5-tf1-14, Python /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/bin/python3
Hostname: cluster-cn-254
TensorFlow: 1.14.0 (v1.14.0-0-g87989f6959) (<site-package> in /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow)
[...]
2020-07-25 11:44:41.348288: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1

(killed after three days near the end of epoch 41, 93.81%)


Log: /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-b/qdir/q.log/train.o1357539.1

RETURNN starting up, version 20200630.143255--git-8fc19e7, date/time 2020-07-06-16-24-31 (UTC+0200), pid 29917, cwd /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-
b, Python /usr/bin/python3
RETURNN command line options: ['config-train/rna-transducer.fs_pretrain-b.config']
Hostname: cluster-cn-273
TensorFlow: 1.15.2 (v1.15.0-92-g5d80e1e) (<site-package> in /u/merboldt/.local/lib/python3.5/site-packages/tensorflow)
[...]
2020-07-06 16:24:32.224258: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1

(killed after three days in epoch 40, at 11.49%)
In this case, the maximum RSS usage went up (watched with some script) from 23.2 GB to 75.7 GB, and then it was killed.
So this is maybe interesting.


Log: /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-e/qdir/q.log/train.o1379626.1

RETURNN starting up, version 20200630.143255--git-8fc19e7, date/time 2020-07-12-00-01-23 (UTC+0200), pid 30716, cwd /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-e, Python /usr/bin/python3
RETURNN command line options: ['config-train/rna-transducer.fs_pretrain-e.config']
Hostname: cluster-cn-269
TensorFlow: 1.15.2 (v1.15.0-92-g5d80e1e) (<site-package> in /u/merboldt/.local/lib/python3.5/site-packages/tensorflow)
[...]
2020-07-12 00:01:24.010922: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1

Log: /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-gather.w_retrain.pt24.tf2-3/qdir/q.log/train.o1541649.1

RETURNN starting up, version 20200728.184724--git-126ebac-dirty, date/time 2020-08-10-17-24-59 (UTC+0200), pid 9723, cwd /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pret
rain-gather.w_retrain.pt24.tf2-3, Python /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/bin/python3
RETURNN command line options: ['config-train/rna-transducer.fs_pretrain-gather.w_retrain.pt24.tf2-3.config']
Hostname: cluster-cn-257
TensorFlow: 2.3.0 (v2.3.0-rc2-23-gb36436b087) (<site-package> in /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow)
Use num_threads=4 (but min 2) via SGE num_proc.
Setup TF inter and intra global thread pools, num_threads 4, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 4, 'inter_op_parallelism_threads': 4}.

(killed after three days and 24/24 GB memory usage, at epoch 46, step 4206, 89.39% of the epoch)

So this issue happened to me with TF 1.14, 1.15 and 2.3.

Spotlight0xff (Contributor):

Also, the following CUDA/nvidia libs were loaded:

/usr/lib/x86_64-linux-gnu/coreutils/libstdbuf.so
/usr/lib/x86_64-linux-gnu/libcuda.so.418.67
/usr/lib/x86_64-linux-gnu/libmpdec.so.2.4.2
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.67
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.67
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
/usr/local/cuda-10.0/lib64/libcublas.so.10.0.130
/usr/local/cuda-10.0/lib64/libcudart.so.10.0.130
/usr/local/cuda-10.0/lib64/libcufft.so.10.0.145
/usr/local/cuda-10.0/lib64/libcurand.so.10.0.130
/usr/local/cuda-10.0/lib64/libcusolver.so.10.0.130
/usr/local/cuda-10.0/lib64/libcusparse.so.10.0.130
/usr/local/cudnn-10.1-v7.6/lib64/libcudnn.so.7.6.3

albertz changed the title from "Horovod hang with stalled rank, after some time" to "hang in cuEventSynchronize in background thread" on Aug 27, 2020
albertz (Member Author) commented Aug 27, 2020

To add, dmesg shows these messages, which might be related:

[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0: AER: Corrected error received: id=8018
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=8018(Transmitter ID)
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0:   device [8086:6f08] error status/mask=00001000/00002000
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0:    [12] Replay Timer Timeout  

albertz (Member Author) commented Nov 6, 2020

@Spotlight0xff reported that this might be related to OMP_NUM_THREADS. Specifically, he observed the hangs with OMP_NUM_THREADS=6 but not anymore with OMP_NUM_THREADS=1.

RETURNN also might use this value for intra_op_parallelism_threads/inter_op_parallelism_threads (there is SGE num_proc: FileNotFoundError(2, 'No such file or directory') in the log, i.e. it failed to use the SGE num_proc value, and OMP_NUM_THREADS is then usually the fallback).

It is currently not clear whether this is really due to OMP, or due to the TF intra_op_parallelism_threads/inter_op_parallelism_threads settings.
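
To disentangle this, one could pin both knobs independently in a small standalone test, roughly like this (just a sketch, not RETURNN code):

import os
# Pin OpenMP/OpenBLAS threading before any heavy imports; this does not affect
# the TF thread pools.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import tensorflow as tf

# Set the TF thread pools explicitly, independent of OMP_NUM_THREADS.
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=4,
    inter_op_parallelism_threads=4)
session = tf.compat.v1.Session(config=config)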

@Spotlight0xff do you have more insights on this?

Spotlight0xff (Contributor):

Unfortunately I still sometimes get this stuck behavior, but only on Switchboard, so maybe this has something to do with ExternSprintDataset?
But setting OMP_NUM_THREADS helped to lower the frequency of these hangs.

albertz (Member Author) commented Nov 7, 2020

@Spotlight0xff Is that still used for intra_op_parallelism_threads/inter_op_parallelism_threads as well, or is it now independent (because it correctly uses your SGE num_proc setting)? In the first case, it means this is somehow related to multithreading in TF, and this might still be a CUDA-related problem. In the latter case, it means it is related to something with OMP (but what exactly uses OMP?).

ExternSprintDataset should have almost no influence at all on the main proc, as Sprint runs in an independent background proc, and the main proc just reads the data (received via an OS pipe, with pickle + raw NumPy for serialization). But of course, with multithreading involved, timings will be different, and this could trigger some race condition.
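
For illustration, a minimal sketch of this kind of pipe-based transfer (hypothetical helper names, not the actual ExternSprintDataset code):

import pickle
import numpy

# Sketch: small metadata goes through pickle, the raw array bytes are written
# directly afterwards. pipe_out/pipe_in are binary file objects wrapping the
# pipe fds (e.g. via os.fdopen(fd, "wb") / os.fdopen(fd, "rb")).

def send_array(pipe_out, arr):
    arr = numpy.ascontiguousarray(arr)
    pickle.dump((arr.dtype.str, arr.shape), pipe_out)
    pipe_out.write(arr.tobytes())
    pipe_out.flush()

def recv_array(pipe_in):
    dtype, shape = pickle.load(pipe_in)
    nbytes = int(numpy.prod(shape)) * numpy.dtype(dtype).itemsize
    return numpy.frombuffer(pipe_in.read(nbytes), dtype=dtype).reshape(shape)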

Spotlight0xff (Contributor) commented Nov 11, 2020

I have encountered a similar issue during search (on LibriSpeech).
TF 2.3.0, CUDA 10.1, cuDNN 7.6.3.
Loaded libs (only CUDA/cuDNN):

/usr/lib/x86_64-linux-gnu/libcuda.so.418.67
/usr/local/cuda-10.1/lib64/libcublasLt.so.10
/usr/local/cuda-10.1/lib64/libcublas.so.10
/usr/local/cuda-10.1/lib64/libcudart.so.10.1
/usr/local/cuda-10.1/lib64/libcufft.so.10
/usr/local/cuda-10.1/lib64/libcurand.so.10
/usr/local/cuda-10.1/lib64/libcusolver.so.10
/usr/local/cuda-10.1/lib64/libcusparse.so.10
/usr/local/cudnn-10.1-v7.6/lib64/libcudnn.so.7.6.3
/work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-34a18dc3.3.7.so

RETURNN options:

TensorFlow: 2.3.0 (v2.3.0-rc2-23-gb36436b087) (<site-package> in /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow)
Use num_threads=2 (but min 2) via SGE num_proc.
Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}.

The GDB stacktrace is here.
The corresponding Python stacktrace is here.

Relevant bits from the stacktrace:

Thread 23 (Thread 0x150b5fe21700 (LWP 20749)):
#0  0x0000150f51f168e7 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000150f4bdf3abb in inner_thread () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#2  0x0000150f4bf1cc59 in exec_blas () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#3  0x0000150f4bdf2f58 in gemm_driver () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#4  0x0000150f4bdf3bc6 in dgemm_thread_tn () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#5  0x0000150f4bd05609 in cblas_dgemm () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#6  0x0000150f4dc5cb90 in gemm (order=CblasRowMajor, R=0x150b5f945080, ldb=<optimized out>, B=<optimized out>, lda=<optimized out>, A=0x150b5fa05b70, k=<optimized out>, n=<optimized out>, m=<optimized out>, 
    transB=<optimized out>, transA=<optimized out>, typenum=12) at numpy/core/src/common/cblasfuncs.c:36
#7  cblas_matrixproduct (typenum=typenum@entry=12, ap1=ap1@entry=0x150b5fa05b70, ap2=<optimized out>, ap2@entry=0x150b5fa05da0, out=out@entry=0x0) at numpy/core/src/common/cblasfuncs.c:678
#8  0x0000150f4dac2737 in PyArray_MatrixProduct2 (op1=<optimized out>, op2=0x150b5fa05da0, out=0x0) at numpy/core/src/multiarray/multiarraymodule.c:942
#9  0x0000150f4dac308a in array_matrixproduct (__NPY_UNUSED_TAGGEDdummy=<optimized out>, args=<optimized out>, kwds=<optimized out>) at numpy/core/src/multiarray/multiarraymodule.c:2226
#10 0x0000150f5253d350 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x150f4df93f40) at ../Objects/call.c:742
#11 PyObject_Call () at ../Objects/call.c:232
#12 0x0000150f4da135d5 in array_implement_array_function (__NPY_UNUSED_TAGGEDdummy=<optimized out>, positional_args=<optimized out>) at numpy/core/src/multiarray/arrayfunction_override.c:265
#13 0x0000150f5253cd37 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x150f4df9b8b0) at ../Objects/call.c:757
#14 PyCFunction_Call () at ../Objects/call.c:772
#15 0x0000150f5253b634 in _PyObject_MakeTpCall () at ../Objects/call.c:159
...

...

Thread 10 (Thread 0x150f32ef9700 (LWP 19802)):
#0  0x0000150f521fe98d in pthread_join (threadid=23152700049152, thread_return=0x0) at pthread_join.c:90
#1  0x0000150f4bf1cfaa in blas_thread_shutdown_ () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#2  0x0000150f51ef8425 in __libc_fork () at ../sysdeps/nptl/fork.c:96
#3  0x0000150ee3aec187 in tensorflow::SubProcess::Start() () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#4  0x0000150ef167dc2f in stream_executor::CompileGpuAsm(int, int, char const*, stream_executor::GpuAsmOpts) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x0000150ef167f5ce in stream_executor::CompileGpuAsm(int, char const*, stream_executor::GpuAsmOpts) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x0000150ef167fed4 in stream_executor::CompileGpuAsmOrGetCached(int, char const*, stream_executor::GpuAsmOpts) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x0000150ef167a755 in stream_executor::RedzoneAllocator::CheckRedzones() const ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8  0x0000150ef1676b7a in tensorflow::CheckRedzones(stream_executor::RedzoneAllocator const&, tensorflow::AutotuneResult*) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9  0x0000150eee5f9db0 in tensorflow::LaunchConv2DOp<Eigen::GpuDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long long, std::allocator<long long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x0000150eee5fa42c in tensorflow::Conv2DOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x0000150ee3987923 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#12 0x0000150ee3a572f9 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::Process(tensorflow::PropagatorState::TaggedNode, long long) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#13 0x0000150ee3a584cf in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::ScheduleReady(absl::lts_2020_02_25::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::PropagatorState::TaggedNode> >*, tensorflow::PropagatorState::TaggedNodeReadyQueue*)::{lambda()#2}>::_M_invoke(std::_Any_data const&) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#14 0x0000150ee3b0a591 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#15 0x0000150ee3b072a3 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#16 0x0000150ee3af63e5 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#17 0x0000150f521fd6ba in start_thread (arg=0x150f32ef9700) at pthread_create.c:333
#18 0x0000150f51f334dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 9 (Thread 0x150f32cf8700 (LWP 19801)):
#0  0x0000150f51f34988 in accept4 (fd=6, addr=..., addr_len=0x150f32cf7e58, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40
#1  0x0000150eb70a438a in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x0000150eb709668d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x0000150eb70a5a58 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x0000150f521fd6ba in start_thread (arg=0x150f32cf8700) at pthread_create.c:333
#5  0x0000150f51f334dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

...

Thread 1 (Thread 0x150f52dbc700 (LWP 19750)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000150ef23b832b in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x0000150ef23b7949 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x0000150ef23b4f0b in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x0000150ef23b53e3 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x0000150ef18e540c in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) ()
   from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
...
#8  0x0000150ef18f7101 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
...
#12 0x0000150ee8161c2a in TF_SessionRun () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
...
#17 0x0000150f5253cca6 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x150f3a349450) at ../Objects/call.c:742
#18 PyCFunction_Call () at ../Objects/call.c:772
#19 0x0000150f5253b634 in _PyObject_MakeTpCall () at ../Objects/call.c:159
...

Thread 23 in exec_blas comes from NumPy, via Librosa, via our background MFCC feature extraction.
The first thread is just the canonical session.run(...).

albertz (Member Author) commented Nov 11, 2020

@Spotlight0xff very interesting. I'm not sure if this is the same hang, though. This TF issue (deadlock in fork() because of OpenBLAS) sounds very related. (I wonder now why I have not seen this problem myself anymore, because I think I don't use the workaround described there.) You might want to try setting patch_atfork = True in your config.
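
I.e. in the config this is just one line (sketch):

# RETURNN config sketch: enable the atfork workaround mentioned above.
patch_atfork = True

As an untested assumption on my side, additionally limiting OpenBLAS threading (e.g. OPENBLAS_NUM_THREADS=1 in the environment, set before NumPy is imported) might also avoid the blas_thread_shutdown_ path seen in the trace above, since fork() then does not have to shut down BLAS worker threads.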

In the other stacktraces reported here, there is never anything with fork() or BLAS. Although, this was on Switchboard, so I used ExternSprintDataset, i.e. there were/are also fork()s (before every epoch). Maybe they did not cause an immediate hang but somehow trigger this delayed hang?

mmz33 (Member) commented Mar 20, 2021

I am also getting this issue. Here is a gist link with the threads' stacktrace: https://gist.github.com/mmz33/dfe0eddf8ecab9777ac63715547e516e

This is with single GPU, no Horovod.

albertz (Member Author) commented Oct 27, 2021

To debug this, it would be helpful if we could somehow investigate what the GPU is doing at that time (similar to how we can see what the CPU is doing).

There should be tools for this.

E.g. there is Nvidia Nsight.
Some examples here for Attaching to a Running CUDA Process and Inspecting State (Visual Studio / Windows specific, but this should also be possible on Linux).

There is also CUDA-GDB. Some docs here, here.
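
As a coarse first check, something like pynvml (an assumption on my side that it is available on the nodes; this only shows utilization and memory, not which kernel or event the driver is stuck on) could at least tell whether the GPU is busy or idle while the process hangs:

import pynvml

pynvml.nvmlInit()
# Index 0 here; with CUDA_VISIBLE_DEVICES set, pick the matching physical index.
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("GPU util %d%%, mem %d/%d MB" % (util.gpu, mem.used >> 20, mem.total >> 20))
pynvml.nvmlShutdown()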

albertz (Member Author) commented Oct 27, 2021

@JackTemaki I heard you also got this? You did not comment here.
@Marvin84 You as well? You also did not comment here.
(It would have been useful to know that you also get this. Also for single GPU.)

JackTemaki (Collaborator):

@albertz yes, correct, I got this.

  • Always single GPU (I never did multi-GPU training)
  • Probably only attention ASR/TTS; I do not remember any hybrid training crashing, but those were only a few experiments...
  • Definitely no RASR processes involved
  • Always TF 2.3 / CUDA 10.1 (I did not run any older versions from last year on)

albertz (Member Author) commented Oct 27, 2021

Did all of you have this problem only when 2D convolution was involved, i.e. in the typical VGG-style prenet we now have in many models?

albertz (Member Author) commented Oct 27, 2021

Also, do you always get this only after at least one day of runtime? Or even longer, more like 2.5 days? Is there some common pattern? Or is it more related to the absolute number of steps?

albertz (Member Author) commented Oct 27, 2021

All the reports here are also for training, right?

JackTemaki (Collaborator) commented Oct 27, 2021

> Or even longer, more like 2.5 days

At any point during training; it can even be on the last day out of a week of training.

> All of you always had this problem only when 2D convolution was involved?

Good point, TTS only has 1D convolutions and I do not remember a hang there...

Marvin84 commented Oct 27, 2021

> @Marvin84 You as well?

No, I did not experience this during my trainings. I do single GPU, no conv layers. However, I was interested to know whether this is related to Horovod or not, since I plan on doing multi-GPU training.

albertz (Member Author) commented Nov 12, 2021

@curufinwe pointed out that tf_session_opts = {'inter_op_parallelism_threads': 1} can potentially avoid the problem.

This is obviously not a real solution. Maybe it just hides the problem. But it is good to know whether this hides it, to understand it further.

inter_op_parallelism_threads=1 could also make it much slower in certain situations, so you would never really want to use that.
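
For context, this is roughly what the setting boils down to at the TF level (a sketch with the plain TF1-style API, not the RETURNN session setup):

import tensorflow as tf

# With inter_op_parallelism_threads=1, independent graph ops are executed one at a
# time instead of concurrently, which may hide the suspected race but costs speed.
config = tf.compat.v1.ConfigProto(inter_op_parallelism_threads=1)
session = tf.compat.v1.Session(config=config)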

mmz33 (Member) commented Dec 6, 2022

I am getting frozen jobs again with single-GPU training. Here is the threads stacktrace log.
Again CudnnSupport::DoConvolve, GpuTimer::GetElapsedMilliseconds, and then it hangs in libcuda.so.1, in clock_gettime.

Environment:
TF 2.3.4 and CUDA 10.1

source /work/tools/asr/python/3.8.0_tf_2.3.4-haswell+cuda10.1+mkl/bin/activate
export LD_LIBRARY_PATH="/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cudnn-10.1-v7.6/lib64:/usr/lib/nvidia-450" 
export CUDA_HOME=/usr/local/cuda-10.1/ 
export CUDNN=/usr/local/cudnn-10.1-v7.6/ 

export OMP_NUM_THREADS=4

albertz (Member Author) commented Dec 6, 2022

> Environment:
> TF 2.3.4 and CUDA 10.1

Did we compile this TF? What GCC did we use?

JackTemaki (Collaborator):

> Environment:
> TF 2.3.4 and CUDA 10.1
>
> Did we compile this TF? What GCC did we use?

This question is the reason why I think it is not worth looking into this (the answer is yes, and GCC 5.4). If this problem persists with newer TF and GCC versions, then we should look at this again.

mmz33 (Member) commented Dec 6, 2022

I updated my comment above since I had the path of the generic-compiled TF version, but I was actually using the one compiled for Haswell. However, I had this hanging issue before also when using the generic-compiled TF version.
