Horovod single-node multi-GPU training, hangs on crash #314

Open
albertz opened this issue Jun 24, 2020 · 1 comment

albertz commented Jun 24, 2020

When the training crashes (e.g. GPU out-of-memory, or got inf/nan, or whatever), it often happens that the process (SGE job) is just hanging and not exiting.
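
A minimal sketch of one possible workaround (hypothetical helper, not the actual RETURNN code): install a global excepthook that hard-exits the process after printing the traceback, so that lingering Horovod/MPI background threads or atexit handlers cannot keep a crashed worker alive:

```python
import os
import sys
import traceback

def _crash_exit_hook(exc_type, exc_value, exc_tb):
    # Print the traceback as usual, then terminate immediately.
    traceback.print_exception(exc_type, exc_value, exc_tb)
    sys.stderr.flush()
    # os._exit() skips atexit handlers and does not wait for other threads,
    # so a blocked MPI/Horovod finalization cannot keep the process hanging.
    os._exit(1)

sys.excepthook = _crash_exit_hook
```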

albertz commented Jun 29, 2020

Commit a31f683 might have improved things, but I'm not sure. With that commit, all the procs seem to reach the exit code. I see Trainer not finalized, quitting. (pid ...) four times in the log (for the 4 GPUs). However, it still hangs. The last message in the log:

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

When I log in on the node, I also see that the procs are still running (via pstree -p):

           ├─sge_execd(2030)─┬─load_sensor.sh(2268)
           │                 ├─sge_shepherd(15966)───python3(16362)───bash(16529)─┬─mpirun(16554)───{mpirun}(16738)
           │                 │                                                    ├─python3(16530)
           │                 │                                                    ├─python3(16531)
           │                 │                                                    └─python3(16555)─┬─{python3}(16708)
           │                 │                                                                     └─{python3}(16710)
           │                 ├─{sge_execd}(2031)
           │                 ├─{sge_execd}(2032)
           │                 ├─{sge_execd}(2033)
           │                 └─{sge_execd}(2034)

I assume they hang at quit, maybe in the atexit handler of Horovod or so. Once I send SIGUSR1 to them, they immediately quit.
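
A minimal sketch of how such a hang could be avoided (hypothetical helper name, not the actual RETURNN/Horovod code): instead of a plain sys.exit() in the crash path, arm a hard deadline via SIGALRM, so the kernel terminates the process even if an atexit handler (e.g. MPI finalization) blocks during shutdown:

```python
import signal
import sys

def exit_with_deadline(status=1, timeout=30):
    # The default SIGALRM action terminates the process; this works even if
    # the interpreter is stuck in a blocking C call during shutdown.
    signal.signal(signal.SIGALRM, signal.SIG_DFL)
    signal.alarm(timeout)  # hard deadline in seconds
    sys.exit(status)
```

If the normal exit path finishes within the timeout, the alarm never fires; otherwise the process is killed without needing a manual SIGUSR1.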

Maybe OpenMPI #3380 is related to that now?
