hang in cuEventSynchronize in background thread #323
I just realized that the hang itself is in a background thread.
Another instance of the same problem (probably), so this problem seems to be disconnected from multi-GPU training. I just looked at the training dirs and found several more instances where the training was hanging:
- Log: (killed after three days near the end of epoch 41, at 93.81%)
- Log: (killed after three days in epoch 40, at 11.49%)
- Log:
- Log: (killed after three days and 24/24 GB usage, at epoch 46, step 4206, 89.39% of the epoch)

So this issue happened to me with TF 1.14, 1.15 and 2.3.
Also, the following CUDA/nvidia libs were loaded:
To add:
@Spotlight0xff reported that this might be related to OMP settings. RETURNN also might use this value. It is not clear currently whether this is really due to OMP, or due to TF. @Spotlight0xff, do you have more insights on this?
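For reference, a minimal sketch of the thread-related knobs under discussion, assuming the usual `OMP_NUM_THREADS` environment variable and the TF session thread options (the exact values used in these setups are not shown in the thread):

```python
import os
import tensorflow as tf

# OMP_NUM_THREADS controls OpenMP-based CPU kernels (e.g. via MKL);
# it must be set before the libraries that read it are initialized.
os.environ.setdefault("OMP_NUM_THREADS", "1")  # value here is only an example

# TF has its own thread pools, configured via the session options.
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=2,  # threads within a single op
    inter_op_parallelism_threads=2,  # threads across independent ops
)
session = tf.compat.v1.Session(config=config)
```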
Unfortunately I still sometimes get the stuck error, but only on Switchboard, so this must have something to do with that setup.
@Spotlight0xff Is that still used for ...?
I have encountered a similar issue during search (on LibriSpeech).
RETURNN options:
The GDB stacktrace is here. Relevant bits from the stacktrace: Thread 23 in ...
@Spotlight0xff very interesting. I'm not sure if this is the same hang, though. That TF issue is about a deadlock in a specific spot, and in the other stacktraces reported here, there is never anything like that.
I am also getting this issue. Here is a gist link of an strace of the threads: https://gist.github.com/mmz33/dfe0eddf8ecab9777ac63715547e516e. This is with a single GPU, no Horovod.
To debug this, it would be helpful if we could somehow investigate what the GPU is doing at that time (similar to how we can see what the CPU is doing). There should be tools for this, e.g. Nvidia Nsight.
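Until someone sets up Nsight, a cheap first check (just an idea, not something from this thread) is to poll the GPU from a separate process via NVML, to see whether it reports any activity while the job is stuck; a minimal sketch using the `pynvml` bindings:

```python
import time
import pynvml

# Poll GPU utilization once per second from a separate process while the
# training job is hanging. This only shows whether the GPU is busy at all;
# a profiler like Nsight would be needed to see which kernels are running.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU index 0 is an example
try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"gpu {util.gpu}%  mem {mem.used / 2**20:.0f} MiB")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```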
@JackTemaki I heard you also got this? You did not comment here.
@albertz yes, correct, I got this.
All of you always had this problem only when 2D convolution was involved, i.e. in the typical VGG-style prenet that we now have in many models?
Also, you always get this only after at least one day of runtime? Or even longer, more like 2.5 days? Is there some common pattern? Or is it more related to the absolute number of steps?
All the reports here are also for training, right?
It can happen at any point during training; it can even be the last day out of a week of training.
Good point, TTS only has 1D convolutions and I do not remember a hang there...
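For reference, this is roughly what a 2D-convolution (VGG-style) prenet looks like as a RETURNN network dict; the layer names and option values here are purely illustrative and not taken from any of the configs in this thread:

```python
# Rough sketch of a VGG-style 2D-conv prenet block as a RETURNN network dict.
# Names and hyperparameters are illustrative only.
network = {
    "conv1": {"class": "conv", "from": "data", "filter_size": (3, 3),
              "n_out": 32, "padding": "same", "activation": "relu"},
    "conv2": {"class": "conv", "from": "conv1", "filter_size": (3, 3),
              "n_out": 32, "padding": "same", "activation": "relu"},
    "pool1": {"class": "pool", "from": "conv2", "mode": "max",
              "pool_size": (2, 2)},
    # ... followed by the encoder (e.g. BLSTM layers) and the output layer.
}
```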
No, I did not experience this during my training; I do single-GPU training with no conv layers. However, I was interested to know whether this was related to Horovod or not, since I plan on doing multi-GPU training.
@curufinwe pointed out a possible workaround. This is obviously not a real solution; maybe it just hides the problem. But it is good to know whether this would hide it, to further understand it.
I am getting frozen jobs again with single-GPU training. Here is the threads stacktrace log. Environment:
Did we compile this TF? What GCC did we use?
This question is the reason why I think it is not worth looking into this (the answer is yes, and GCC 5.4). If this problem persists with newer TF and GCC versions, then we should look at this again.
I updated my comment above since I had the path for the generic compiled TF version, but I was using the one compiled for Haswell. However, I had this hanging issue also when using the generic compiled TF version before.
Recent RETURNN:
Horovod via SGE `-pe mpi 4` and then `mpirun`:
Horovod settings:
This started fine, in epoch 1, and continued for many epochs, many hours, up until the end of epoch 101:
The last message then repeats every 60 seconds, and nothing else happens anymore:
When I log in to the node of rank 3 and send a SIGUSR1, I get this traceback (via `faulthandler`; other threads omitted because irrelevant), i.e. we can see that it hangs in `sess.run`:
Note that this is the standard step `sess.run`, i.e. nothing specific about Horovod. Actually, with these settings, there should be no Horovod op involved at all in this call.
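For reference, the signal-based traceback dump used above can be reproduced with the stdlib `faulthandler` module; a minimal standalone sketch (not RETURNN's actual code) looks like this:

```python
import faulthandler
import signal
import sys

# Dump the Python traceback of all threads to stderr whenever the process
# receives SIGUSR1 (e.g. via `kill -USR1 <pid>`), without killing the process.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```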
Then, the C traceback (via `gdb -p 1594 -ex 'thread apply all bt' -ex="set confirm off" -ex quit > gdblog.p1494.txt`) is here. Some of the (maybe interesting) threads (excluded are `WaitForWork` or Python threads):
- It hangs in `CudnnSupport::DoConvolve`, or more specifically in `GpuDriver::GetEventElapsedTime`, or even more specifically in `cuEventSynchronize`, which looks like a CUDA issue? Or maybe I misunderstand how this `GpuTimer` works, or how it is used in `DoConvolve` (see the event-timing sketch below). Maybe this issue, or this?
- Also, if this is really a CUDA-related issue, why has it never occurred so far without Horovod?
- There is `opal_libevent2021_event_base_loop` in two different threads?
- The `sess.run` of the main thread actually should not involve any Horovod op (with the given settings), so no MPI communication should be running currently. However, there is `PMPI_Allreduce` in thread 2. Why?

Note that this is not the first time I'm seeing this. I already saw it a couple of times. Originally I hoped that this is some temporary issue in our cluster, but there seems to be a real problem or bug. I don't really know whether it is on our side, or on OpenMPI (we have a quite old version), or on Horovod.
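To make the `GpuTimer` point above more concrete: a CUDA event timer works roughly like the sketch below (shown with CuPy purely for illustration; this is not RETURNN or TF code). `cuEventSynchronize` corresponds to the `synchronize()` call: it blocks the CPU thread until the recorded event has completed on the GPU, so a hang there means the GPU never finished (or never reached) the work before the stop event.

```python
import cupy as cp

# Illustration of the CUDA event-timing pattern: record start/stop events
# around some GPU work, synchronize on the stop event, read the elapsed time.
start = cp.cuda.Event()
stop = cp.cuda.Event()

x = cp.ones((1024, 1024), dtype=cp.float32)

start.record()
y = x @ x  # stand-in for the convolution that DoConvolve would be timing
stop.record()

stop.synchronize()  # blocks like cuEventSynchronize in the stacktraces above
print("elapsed ms:", cp.cuda.get_elapsed_time(start, stop))
```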