
Failed jobs make SoS hang on cluster #1078

Closed · gaow opened this issue Oct 8, 2018 · 16 comments

@gaow (Member) commented Oct 8, 2018

It seems to be a new problem with the zmq queue, but I cannot (yet) come up with a good MWE for it. In short, I have some tasks submitted to the cluster that failed due to errors in the script unrelated to SoS. But instead of quitting properly (committing to the signature DB and exiting), SoS hangs. I have to ctrl-c, try to figure out what is wrong, and rerun. When I rerun, I notice that some previously successful substeps are not skipped, possibly due to 46e6f94. I've been wrestling with this all day trying to fix issues on my end, but I might eventually have to give up and revert to the pre-zmq version to get my work done ...

@BoPeng I do not have an MWE, but with a modification to our favorite hang_test.sos example here:

hang_test_cluster.tar.gz

I was able to reproduce it reliably.

The hang-test example

It contains a sos file and a yml file.

  • The sos file uses, on line 36, a variable not_exist_foo that has not been defined. This should cause a failure (see the sketch after the run command below).
  • The yml file defines my cluster configuration. The key point is that I submit the job from the head node.

To run the example:

sos run hang_test.sos.txt -c hang_test.localhost.yml
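
For reference, a minimal sketch of what the failing step could look like. This is not copied from the tarball; the step name, output, and task options are made up. The essential part is that not_exist_foo is never defined, so the R script submitted as a task fails on the cluster:

    [simulate]
    output: 'result.rds'
    task: queue = 'midway2_head', walltime = '5m', mem = '1G'
    R: expand = True
        # not_exist_foo is never defined anywhere, so the task fails when it runs on the cluster
        saveRDS(not_exist_foo, {_output:r})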

The output files

(screenshot: 2018-10-07-22-08-24_scrot)

You can see from the time stamps that the last output rds file was generated at 21:52, more than 10 minutes ago.

The hang screen

(screenshot: 2018-10-07-22-09-52_scrot)

It has been hanging for quite a while. My task queue is currently empty.

Other comments

I recall that previously the problem was that SoS reported failures and quit the workflow even though the jobs in question were in fact successful. We decided to look into that with zmq, but now the hang behavior is worse ...

@gaow (Member, Author) commented Oct 8, 2018

Updates:

  1. If it helps: I ran another workflow in which every single substep succeeded. SoS quits nicely in this case, without the false alarms about failed steps that it used to report.
  2. Another, simpler way to reproduce the hang is to mess up the job template in the yml configuration, e.g. collapse it into a single line or remove the # -- whatever breaks it (a hypothetical intact template is sketched below). Then, at least on my end, all jobs report submitted and it hangs there. For my example above, however, sos status did report failed, so it is probably better to test that example first.
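
For context, a hypothetical multi-line Slurm-style host configuration, roughly in the shape the yml in the tarball takes. The host name, keys, and template variables here are from memory and may not match the actual file; collapsing the job_template block into one line is the kind of change that triggers the hang:

    hosts:
      midway2_head:
        address: localhost
        queue_type: pbs
        submit_cmd: sbatch {job_file}
        status_cmd: squeue --job {job_id}
        kill_cmd: scancel {job_id}
        max_running_jobs: 50
        job_template: |
          #!/bin/bash
          #SBATCH --time={walltime}
          #SBATCH --job-name={task}
          #SBATCH --output={cur_dir}/{task}.out
          #SBATCH --error={cur_dir}/{task}.err
          cd {cur_dir}
          sos execute {task} -v {verbosity} -s {sig_mode}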

@BoPeng (Contributor) commented Oct 8, 2018

Still trying to reproduce the issue with my VM, please be patient.

@gaow (Member, Author) commented Oct 8, 2018

Sure, great! Yes, we should do it one at a time: first try to reproduce and fix the "failed" hang (with my example running as is); then I'll test the patch on my end against the "submitted" hang I mentioned in my second comment and see if it still exists, and only then worry about reproducing and fixing that.

@BoPeng (Contributor) commented Oct 9, 2018

I can reproduce some problems, but solutions are hard. Generally speaking, we have focused on errors in scripts and tasks, not on errors in job submission etc. Whereas such errors would previously just error out sos, the current zmq implementation requires all processes to shut down properly, so the error has to be handled at a global level...

@gaow (Member, Author) commented Oct 9, 2018

I see. Another type of hang is also very easy to reproduce, so I am reporting it here anyway: when the run time exceeds the requested resources, the job is aborted by the scheduler and this causes a hang.

@BoPeng (Contributor) commented Oct 9, 2018

Previously, the flow was:

  1. a step sends its tasks to the master
  2. the master submits them and collects the results
  3. if all tasks pass, the master sends the results back to the step and allows it to complete
  4. if some tasks fail, the master quits and the step workers die silently

With zmq, the master cannot shut down completely while workers are still running.
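
A conceptual sketch of that pre-zmq flow, not actual SoS code; every name in it is made up for illustration:

    # Conceptual sketch only -- all names are hypothetical.
    import sys
    from dataclasses import dataclass

    @dataclass
    class TaskResult:
        task_id: str
        succeeded: bool

    def submit_and_collect(tasks):
        """Stand-in for 'master submits tasks to the cluster and polls until done'."""
        return [TaskResult(t, not t.startswith("bad")) for t in tasks]

    def master(tasks, notify_step):
        results = submit_and_collect(tasks)
        if all(r.succeeded for r in results):
            notify_step(results)   # steps 1-3: step worker gets the results and completes
        else:
            sys.exit(1)            # step 4: master quits; pre-zmq the step workers died
                                   # silently, but with zmq they keep waiting on their
                                   # sockets, which is the hang observed in this issue

    if __name__ == "__main__":
        master(["good_task", "bad_task"], notify_step=print)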

BoPeng pushed a commit that referenced this issue Oct 9, 2018
@BoPeng (Contributor) commented Oct 9, 2018

Should be better now, but not sure if it solves the job submission problem.

@gaow (Member, Author) commented Oct 9, 2018

The hang test example script now quits properly! That is an error from the script, right? It would be nice to display the task IDs of the failed tasks, though, so that people can track them with sos status. Currently it only says:

ERROR: Workflow failed due to error

I'm not sure what you meant by "I can reproduce some problems" and "Should be better now". I just messed up my job template to create a syntax error in my sbatch command, and again I got a hanging screen. Messing up the template is straightforward: just remove all the line breaks and make the job template a one-liner, and you should see the hang.

BoPeng added a commit to vatlab/sos-pbs that referenced this issue Oct 9, 2018
@BoPeng (Contributor) commented Oct 9, 2018

This patch handles errors from the job submission command better.

@gaow (Member, Author) commented Oct 9, 2018

Okay, the behavior for failed jobs is a lot better. It now points to the exact sub-step that went wrong, the problematic script, and its task ID. Also, instead of quitting altogether, it continues to run other steps independent of the failed step and reports the failure at the very end. This is very good. I believe this is already the behavior for non-task steps anyway, so it is good to have them consistent.

However, I do not think the problem is fixed for the messed-up template. It still hangs. Can you reproduce it? How about changing the folder for stdout and stderr to a non-existing directory? On my system this is another reliable way to reproduce a hang. E.g., my showq is empty, but

 sos status M5_fe671fedde2a7928 
M5_fe671fedde2a7928     3646e05e00ca3c5a normal Ran for 1 day   submitted

and the hang:

INFO: M5_32d2ad5b00fab26e submitted to midway2_head with job id 50877470
INFO: M5_fe671fedde2a7928 submitted to midway2_head with job id 50877471
INFO: M5_be2d1662fc65d995 submitted to midway2_head with job id 50877473
INFO: M5_0401a7b7a8477c59 submitted to midway2_head with job id 50877474
INFO: M5_16efb430ecc21c8d submitted to midway2_head with job id 50877475
INFO: M5_3f1b18244d1729ea submitted to midway2_head with job id 50877476

This is because I changed the stdout and stderr paths in the job template to a non-existent directory.
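
For concreteness, the kind of edit meant here is pointing the template's log paths at a directory that does not exist (the paths below are made up):

    #SBATCH --output=/no/such/dir/{task}.out
    #SBATCH --error=/no/such/dir/{task}.err

The scheduler accepts the job and assigns a job id, but the job disappears when the log files cannot be created, so from SoS's point of view the task stays in submitted forever.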

@BoPeng (Contributor) commented Oct 9, 2018

OK. The messed-up template case is still not fixed, because I only fixed the case where the submission command fails to generate a job ID... I am working on a binder example.

@BoPeng (Contributor) commented Oct 10, 2018

This does not look fixable. The shell scripts generated by the template are executed by the scheduler, and sos itself is never run if the shell script is problematic, so there is no chance for sos to do anything.

@gaow (Member, Author) commented Oct 10, 2018

Well, indeed the job scheduler itself will quit (or complete) silently if a bad script is submitted. It does leave behind some stderr files that help me understand what is going on -- this is true even when the job is submitted via SoS.

But at least for the hanging behavior: shouldn't SoS check the current queue, keep submitting jobs until all jobs finish (success or fail), and then quit? It does not matter much whether SoS can capture an error in that scenario; what matters is whether it hangs. The fact that it hangs instead of continuing to submit seems to suggest that some error has been identified, right?

@BoPeng (Contributor) commented Oct 10, 2018

SoS is simply waiting for the queue to free up, which unfortunately in this case is filled with tasks that are marked submitted but have actually failed. There is one thing we can potentially do, though. We allow the user to specify a status command (e.g. qstat). It is not used now because its output can be inconsistent with sos terms (failed etc.), but we could use it to check whether a job is still in the queue. That is to say, after a job has been in submitted status for a while (how long is hard to define), sos could run the command and see whether it returns success (0). A rough sketch of the idea is below.
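
This is hypothetical, not SoS code; it assumes a Slurm-style status command ("squeue --job <id>") whose non-zero exit status, or output that no longer mentions the job, means the job has left the queue, and an arbitrary grace period:

    import subprocess
    import time

    SUBMITTED_GRACE = 600  # seconds a task may sit in "submitted" before we double-check

    def vanished_from_queue(job_id, submitted_at):
        if time.time() - submitted_at < SUBMITTED_GRACE:
            return False           # too early to tell; keep waiting
        proc = subprocess.run(["squeue", "--job", job_id],
                              capture_output=True, text=True)
        return proc.returncode != 0 or job_id not in proc.stdout

A task that sos still shows as "submitted" but that the scheduler no longer knows about could then be marked failed instead of blocking the queue forever.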

@gaow (Member, Author) commented Oct 10, 2018

Oh, that explains it! I always thought SoS relied on qstat, so I was always a bit confused when I noticed inconsistencies between qstat and what SoS determines. It seems SoS works reasonably well on its own without information from qstat, until this issue comes up! Maybe, as you suggest, we can make the status command a "secondary" (hard to define) mechanism to help determine task status.

@BoPeng (Contributor) commented Oct 10, 2018

So the zmq architecture actually provides a good mechanism for the program to be "peeked" at from outside. The idea was described in #942 and #1060. Basically, when sos runs, a lot of information is sent to a controller thread through sockets. As long as the address and port of that thread are exposed, an outside program can tap into the port and communicate with the controller. In this way it would be easy for users to check how many tasks are queued, in what status, etc. (a sketch is below).

For example, for #1060, instead of passing verbosity levels around and letting workers output their own logging messages, it is possible to send all logging information to the controller. The controller would output these messages to the terminal according to the -v flag and also send them to the tapping socket, where detailed debug information could be displayed in a browser, possibly categorized by topic.
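
For illustration, an outside monitor tapping such a controller could look roughly like this; the port, topic names, and message format are invented, only the pyzmq calls are real:

    import zmq

    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://127.0.0.1:5556")        # hypothetical exposed controller port
    sub.setsockopt_string(zmq.SUBSCRIBE, "")   # subscribe to every topic

    while True:
        msg = sub.recv_json()                  # e.g. {"topic": "task", "id": "...", "status": "..."}
        if msg.get("topic") == "task":
            print(msg["id"], msg["status"])    # count queued / running / failed tasks
        elif msg.get("topic") == "log":
            print(msg["level"], msg["text"])   # detailed debug output, independent of -v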

BoPeng closed this as completed Oct 30, 2018