When the training crashes (e.g. GPU out-of-memory, inf/nan, or any other error), the process (SGE job) often just hangs instead of exiting.
Commit a31f683 might have improved things, but I'm not sure. With that commit, all the procs seem to reach the exit code: I see "Trainer not finalized, quitting. (pid ...)" 4 times in the log (for 4 GPUs). However, it still hangs. The last message in the log:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
When I log in to the node, I also see that the procs are still running (via pstree -p):