Failed jobs make SoS hang on cluster #1078
Updates:
Still trying to reproduce the issue with my VM, please be patient.
Sure, great! Yes, we should do it one at a time -- try to reproduce and fix the "failed" hang first (with my example running as is); then I'll test the patch on my end for the "submitted" hang I mentioned in the 2nd thread and see if it still exists, then worry about reproducing and fixing that.
I can reproduce some problems, but solutions are hard. Generally speaking, we have focused on errors in scripts and tasks, but not on errors in job submission etc. Whereas such errors would easily error out sos before, the current zmq implementation requires all processes to shut down properly, so the error has to be handled at a global level...
I see. Indeed another type of hang is also very easy to reproduce, but I am reporting it here anyway: when the run time exceeds the requested resource, the job will be aborted and cause a hang.
Previously, we ... With zmq, the master cannot shut down completely while workers are still running.
Should be better now, but not sure if it solves the job submission problem.
The hang test example script now quits properly! That's an error from the script, right? It would be nice to display the task ID for the failed tasks though, so that people can use ...
I'm not sure what you meant by "I can reproduce some problems" and "Should be better now". I just messed up my job template to create a syntax error for my ...
This patch handles errors from the job submission command better.
Okay, the behavior of a failed job is a lot better. It currently points to the exact sub-step that went wrong, the problematic script, and the task ID of it. Also, instead of quitting altogether, it continues to run the other steps independent of the failed step and reports the failure at the very end. This is very good. I believe this is already the case for non-task steps anyway; good to have them consistent. However, I do not think the problem is fixed for the messed-up template. It still hangs. Can you reproduce it? How about trying to change the folder for ...
and the hang: (screenshot not preserved)
This is because I changed the ...
OK. The messed-up template is still not fixed because I only fixed the case where the submission command fails to generate a job ID... I am working on a binder example ...
This does not look fixable. The shell scripts generated by the template are executed by the scheduler and ...
Well, indeed the job scheduler itself will quit (or complete) silently if a bad script is submitted. It does leave behind some stderr files that help me understand what's going on -- this is true even when submitted via SoS. But at least for the hanging behavior: shouldn't SoS check the current queue, keep submitting jobs until all jobs finish (success or fail), then quit? It does not matter whether SoS can capture an error in that scenario. What matters is whether it hangs. The fact that it hangs instead of continuing to submit seems to suggest some error has been identified, right?
SoS is simply waiting for the queue to free up, which unfortunately in this case is filled with submitted but actually failed tasks. This is one thing we can potentially do, though. We allow the user to specify a status command (e.g. ...)
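For context, a queue definition of the kind discussed here lives in the `yml` configuration. As I understand the SoS queue-configuration convention, it looks roughly like the sketch below; the host name, values, and template are placeholders, not the actual configuration from this issue:

```yaml
# Sketch of a PBS-style queue definition (key names follow the SoS
# queue-configuration convention; all values here are placeholders).
hosts:
  my_cluster:                   # hypothetical queue name
    address: localhost          # submitted from the headnode
    queue_type: pbs
    max_running_jobs: 10        # the "queue" that SoS waits to free up
    submit_cmd: qsub {job_file}
    status_cmd: qstat {job_id}  # the status command mentioned above
    kill_cmd: qdel {job_id}
    job_template: |
      #!/bin/bash
      #PBS -l walltime={walltime}
      cd {cur_dir}
      {command}
```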
Oh, that explains it! I always thought SoS relies on ...
So the zmq architecture actually provides a good mechanism for the program to be "peeked" at from outside. The idea was described in #942 and #1060. Basically, when sos runs, a lot of information is sent to a controller thread through sockets. As long as the IP address of the thread is exposed, an outside program can tap into the port and communicate with the controller. In this way it would be easy for users to check how many tasks are being queued, in what status, etc. For example, for #1060, instead of passing verbosity levels around and letting workers output their own logging messages, it is possible to send all logging information to the controller. The controller would output these messages to the terminal according to the verbosity level.
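To make that concrete, here is a minimal sketch of the "tap in from outside" idea in Python with pyzmq. The ports, message format, and REQ/REP query protocol are all invented for illustration -- this is not SoS's actual internal protocol:

```python
# Workers push status updates to a controller; an outside monitor can
# connect to a second, exposed port to query the controller's view.
import threading
import time

import zmq

def controller(ctx):
    collector = ctx.socket(zmq.PULL)   # workers push status updates here
    collector.bind("tcp://127.0.0.1:5555")
    status = ctx.socket(zmq.REP)       # outside programs query here
    status.bind("tcp://127.0.0.1:5556")
    tasks = {}                         # controller's view of task states
    poller = zmq.Poller()
    poller.register(collector, zmq.POLLIN)
    poller.register(status, zmq.POLLIN)
    while True:
        for sock, _ in poller.poll():
            if sock is collector:
                task_id, state = collector.recv_json()
                tasks[task_id] = state
            else:
                status.recv()          # any request means "what's queued?"
                status.send_json(tasks)

ctx = zmq.Context()
threading.Thread(target=controller, args=(ctx,), daemon=True).start()

# A worker reporting its state:
worker = ctx.socket(zmq.PUSH)
worker.connect("tcp://127.0.0.1:5555")
worker.send_json(["task_1", "submitted"])
worker.send_json(["task_2", "failed"])
time.sleep(0.2)                        # let the controller catch up

# An *outside* monitor tapping into the exposed port:
monitor = ctx.socket(zmq.REQ)
monitor.connect("tcp://127.0.0.1:5556")
monitor.send(b"status?")
print(monitor.recv_json())             # {'task_1': 'submitted', 'task_2': 'failed'}
```

The point of the design is that the controller is the single owner of the task-state view, so an external monitor needs nothing more than the exposed address to query how many tasks are queued and in what state.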
It seems there is a new problem with the `zmq` queue, but I cannot (yet) come up with a good MWE for it. In short, I have some tasks submitted to the cluster that failed due to errors in the script not related to SoS. But instead of quitting properly (committing to the signature DB and quitting), it hangs. I have to `ctrl-c`, try to figure out what is wrong, and rerun them. When I rerun, I noticed some previously successful substeps are not ignored, possibly due to 46e6f94. I've been wrestling with it the whole day trying to fix issues on my end, but I might have to eventually give up and revert to the pre-zmq version to get my work done ... @BoPeng I do not have a MWE, but with a modification to our favorite `hang_test.sos` example here, hang_test_cluster.tar.gz, I was able to reproduce it reliably.
The hang-test example
It contains a `sos` file and a `yml` file. The `sos` file has on line 36 a variable, `not_exist_foo`, that has not been defined; it should create a failure. The `yml` file defines my cluster configuration. The key is that I submitted the job from the headnode. To run the example:
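The actual files are in the tarball above. For readers without it, the failing step presumably has roughly the following shape -- a hypothetical sketch, not the attached `hang_test.sos`, with placeholder names and resource values:

```
[1]
output: 'out.rds'
task: walltime = '10m', mem = '1G'
# not_exist_foo is never defined, so the R script fails on the compute node
R:
    saveRDS(not_exist_foo, 'out.rds')
```

With a configuration file like the `yml` sketched earlier, the run command would then be along the lines of `sos run hang_test.sos -c hang_test.yml -q my_cluster` (`-c` pointing at the configuration, `-q` at the queue defined in it; file and queue names here are placeholders).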
The output files
You can see the time stamps: 21:52 is when the last output `rds` file was generated, which is more than 10 minutes ago.
The hang screen
It has been hanging for quite a while. My task queue is currently empty.
Other comments
I recall that previously the problem was that SoS reported failures and quit the workflow, but the jobs in question were in fact successful. We decided to look into that with `zmq`, but now the hang behavior is worse ...