
Failed jobs make SoS hang on cluster #1078

Closed · gaow opened this issue Oct 8, 2018 · 16 comments

@gaow (Member) commented Oct 8, 2018

It seems to be a new problem with the zmq queue, but I cannot (yet) come up with a good MWE for it. In short, I have some tasks submitted to the cluster that failed due to errors in the script unrelated to SoS. But instead of quitting properly (committing to the signature DB and exiting), SoS hangs. I have to ctrl-c, try to figure out what is wrong, and rerun. When I rerun, I notice that some previously successful substeps are not skipped, possibly due to 46e6f94. I've been wrestling with this all day trying to fix issues on my end, but I might eventually have to give up and revert to the pre-zmq version to get my work done ...

@BoPeng I do not have an MWE, but with a modification to our favorite hang_test.sos example here:

hang_test_cluster.tar.gz

I was able to reproduce it reliably.

The hang-test example

It contains a sos file and a yml file.

  • The sos file uses, on line 36, a variable not_exist_foo that has not been defined. This should cause a failure (see the sketch after the run command below).
  • The yml file defines my cluster configuration. The key point is that I submit the job from the head node.

To run the example:

sos run hang_test.sos.txt -c hang_test.localhost.yml
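
For reference, a minimal sketch of what the failing step could look like. This is not copied from the tarball; the step name, output, and task options are made up. The essential part is that not_exist_foo is never defined, so the R script submitted as a task fails on the cluster:

    [simulate]
    output: 'result.rds'
    task: queue = 'midway2_head', walltime = '5m', mem = '1G'
    R: expand = True
        # not_exist_foo is never defined anywhere, so the task fails when it runs on the cluster
        saveRDS(not_exist_foo, {_output:r})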

The output files

(screenshot: 2018-10-07-22-08-24_scrot)

You can see from the time stamps that the last output rds file was generated at 21:52, more than 10 minutes ago.

The hang screen

(screenshot: 2018-10-07-22-09-52_scrot)

It has been hanging for quite a while. My task queue is currently empty.

Other comments

I recall that previously the problem was that SoS reported failures and quit the workflow even though the jobs in question were in fact successful. We decided to look into that with zmq, but now the hang behavior is worse ...

@gaow (Member, Author) commented Oct 8, 2018

Updates:

  1. If it helps: I ran another workflow in which every single substep succeeded. SoS quits nicely in this case, without the false alarms about failed steps that it used to report.
  2. Another, simpler way to reproduce the hang is to mess up the job template in the yml configuration, e.g. collapse it into a single line or remove the # -- whatever breaks it (a hypothetical intact template is sketched below). Then, at least on my end, all jobs report submitted and it hangs there. For my example above, however, sos status did report failed, so it is probably better to test that example first.
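
For context, a hypothetical multi-line Slurm-style host configuration, roughly in the shape the yml in the tarball takes. The host name, keys, and template variables here are from memory and may not match the actual file; collapsing the job_template block into one line is the kind of change that triggers the hang:

    hosts:
      midway2_head:
        address: localhost
        queue_type: pbs
        submit_cmd: sbatch {job_file}
        status_cmd: squeue --job {job_id}
        kill_cmd: scancel {job_id}
        max_running_jobs: 50
        job_template: |
          #!/bin/bash
          #SBATCH --time={walltime}
          #SBATCH --job-name={task}
          #SBATCH --output={cur_dir}/{task}.out
          #SBATCH --error={cur_dir}/{task}.err
          cd {cur_dir}
          sos execute {task} -v {verbosity} -s {sig_mode}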

@BoPeng (Contributor) commented Oct 8, 2018

Still trying to reproduce the issue with my VM, please be patient.

@gaow (Member, Author) commented Oct 8, 2018

Sure, great! Yes, we should do it one at a time: first try to reproduce and fix the "failed" hang (with my example running as is); then I'll test the patch on my end against the "submitted" hang I mentioned in my second comment and see if it still exists, and only then worry about reproducing and fixing that.

@BoPeng (Contributor) commented Oct 9, 2018

I can reproduce some problems, but solutions are hard. Generally speaking, we have focused on errors in scripts and tasks, not on errors in job submission etc. Whereas such errors would previously just error out sos, the current zmq implementation requires all processes to shut down properly, so the error has to be handled at a global level...

@gaow (Member, Author) commented Oct 9, 2018

I see. Another type of hang is also very easy to reproduce, so I am reporting it here anyway: when the run time exceeds the requested resources, the job is aborted by the scheduler and this causes a hang.

@BoPeng (Contributor) commented Oct 9, 2018

Previously, the flow was:

  1. a step sends its tasks to the master
  2. the master submits them and collects the results
  3. if all tasks pass, the master sends the results back to the step and allows it to complete
  4. if some tasks fail, the master quits and the step workers die silently

With zmq, the master cannot shut down completely while workers are still running.
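
A conceptual sketch of that pre-zmq flow, not actual SoS code; every name in it is made up for illustration:

    # Conceptual sketch only -- all names are hypothetical.
    import sys
    from dataclasses import dataclass

    @dataclass
    class TaskResult:
        task_id: str
        succeeded: bool

    def submit_and_collect(tasks):
        """Stand-in for 'master submits tasks to the cluster and polls until done'."""
        return [TaskResult(t, not t.startswith("bad")) for t in tasks]

    def master(tasks, notify_step):
        results = submit_and_collect(tasks)
        if all(r.succeeded for r in results):
            notify_step(results)   # steps 1-3: step worker gets the results and completes
        else:
            sys.exit(1)            # step 4: master quits; pre-zmq the step workers died
                                   # silently, but with zmq they keep waiting on their
                                   # sockets, which is the hang observed in this issue

    if __name__ == "__main__":
        master(["good_task", "bad_task"], notify_step=print)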

BoPeng pushed a commit that referenced this issue Oct 9, 2018
@BoPeng (Contributor) commented Oct 9, 2018

Should be better now, but not sure if it solves the job submission problem.

@gaow (Member, Author) commented Oct 9, 2018

The hang test example script now quits properly! That is an error from the script, right? It would be nice to display the task IDs of the failed tasks, though, so that people can track them with sos status. Currently it only says:

ERROR: Workflow failed due to error

I'm not sure what you meant by "I can reproduce some problems" and "Should be better now". I just messed up my job template to create a syntax error in my sbatch command, and again I got a hanging screen. Messing up the template is straightforward: just remove all the line breaks and make the job template a one-liner, and you should see the hang.

BoPeng added a commit to vatlab/sos-pbs that referenced this issue Oct 9, 2018
@BoPeng (Contributor) commented Oct 9, 2018

This patch handles errors from the job submission command better.

@gaow (Member, Author) commented Oct 9, 2018

Okay, the behavior for failed jobs is a lot better. It now points to the exact sub-step that went wrong, the problematic script, and its task ID. Also, instead of quitting altogether, it continues to run other steps independent of the failed step and reports the failure at the very end. This is very good. I believe this is already the behavior for non-task steps anyway, so it is good to have them consistent.

However, I do not think the problem is fixed for the messed-up template. It still hangs. Can you reproduce it? How about changing the folder for stdout and stderr to a non-existing directory? On my system this is another reliable way to reproduce a hang. E.g., my showq is empty, but

 sos status M5_fe671fedde2a7928 
M5_fe671fedde2a7928     3646e05e00ca3c5a normal Ran for 1 day   submitted

and the hang:

INFO: M5_32d2ad5b00fab26e submitted to midway2_head with job id 50877470
INFO: M5_fe671fedde2a7928 submitted to midway2_head with job id 50877471
INFO: M5_be2d1662fc65d995 submitted to midway2_head with job id 50877473
INFO: M5_0401a7b7a8477c59 submitted to midway2_head with job id 50877474
INFO: M5_16efb430ecc21c8d submitted to midway2_head with job id 50877475
INFO: M5_3f1b18244d1729ea submitted to midway2_head with job id 50877476

This is because I changed the stdout and stderr paths in the job template to a non-existent directory.
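
For concreteness, the kind of edit meant here is pointing the template's log paths at a directory that does not exist (the paths below are made up):

    #SBATCH --output=/no/such/dir/{task}.out
    #SBATCH --error=/no/such/dir/{task}.err

The scheduler accepts the job and assigns a job id, but the job disappears when the log files cannot be created, so from SoS's point of view the task stays in submitted forever.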

@BoPeng (Contributor) commented Oct 9, 2018

OK. The messed-up template case is still not fixed, because I only fixed the case where the submission command fails to generate a job ID... I am working on a binder example.

@BoPeng (Contributor) commented Oct 10, 2018

This does not look fixable. The shell scripts generated by the template are executed by the scheduler, and sos itself is never run if the shell script is problematic, so there is no chance for sos to do anything.

@gaow (Member, Author) commented Oct 10, 2018

Well, indeed the job scheduler itself will quit (or complete) silently if a bad script is submitted. It does leave behind some stderr files that help me understand what is going on -- this is true even when the job is submitted via SoS.

But at least for the hanging behavior: shouldn't SoS check the current queue, keep submitting jobs until all jobs finish (success or fail), and then quit? It does not matter much whether SoS can capture an error in that scenario; what matters is whether it hangs. The fact that it hangs instead of continuing to submit seems to suggest that some error has been identified, right?

@BoPeng (Contributor) commented Oct 10, 2018

SoS is simply waiting for the queue to free up, which unfortunately in this case is filled with tasks that are marked submitted but have actually failed. There is one thing we can potentially do, though. We allow the user to specify a status command (e.g. qstat). It is not used now because its output can be inconsistent with sos terms (failed etc.), but we could use it to check whether a job is still in the queue. That is to say, after a job has been in submitted status for a while (how long is hard to define), sos could run the command and see whether it returns success (0). A rough sketch of the idea is below.
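
This is hypothetical, not SoS code; it assumes a Slurm-style status command ("squeue --job <id>") whose non-zero exit status, or output that no longer mentions the job, means the job has left the queue, and an arbitrary grace period:

    import subprocess
    import time

    SUBMITTED_GRACE = 600  # seconds a task may sit in "submitted" before we double-check

    def vanished_from_queue(job_id, submitted_at):
        if time.time() - submitted_at < SUBMITTED_GRACE:
            return False           # too early to tell; keep waiting
        proc = subprocess.run(["squeue", "--job", job_id],
                              capture_output=True, text=True)
        return proc.returncode != 0 or job_id not in proc.stdout

A task that sos still shows as "submitted" but that the scheduler no longer knows about could then be marked failed instead of blocking the queue forever.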

@gaow (Member, Author) commented Oct 10, 2018

Oh, that explains it! I always thought SoS relied on qstat, so I was always a bit confused when I noticed inconsistencies between qstat and what SoS determines. It seems SoS works reasonably well on its own without information from qstat, until this issue comes up! Maybe, as you suggest, we can make the status command a "secondary" (hard to define) mechanism to help determine task status.

@BoPeng (Contributor) commented Oct 10, 2018

So the zmq architecture actually provides a good mechanism for the program to be "peeked" at from outside. The idea was described in #942 and #1060. Basically, when sos runs, a lot of information is sent to a controller thread through sockets. As long as the address and port of that thread are exposed, an outside program can tap into the port and communicate with the controller. In this way it would be easy for users to check how many tasks are queued, in what status, etc. (a sketch is below).

For example, for #1060, instead of passing verbosity levels around and letting workers output their own logging messages, it is possible to send all logging information to the controller. The controller would output these messages to the terminal according to the -v flag and also send them to the tapping socket, where detailed debug information could be displayed in a browser, possibly categorized by topic.
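
For illustration, an outside monitor tapping such a controller could look roughly like this; the port, topic names, and message format are invented, only the pyzmq calls are real:

    import zmq

    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://127.0.0.1:5556")        # hypothetical exposed controller port
    sub.setsockopt_string(zmq.SUBSCRIBE, "")   # subscribe to every topic

    while True:
        msg = sub.recv_json()                  # e.g. {"topic": "task", "id": "...", "status": "..."}
        if msg.get("topic") == "task":
            print(msg["id"], msg["status"])    # count queued / running / failed tasks
        elif msg.get("topic") == "log":
            print(msg["level"], msg["text"])   # detailed debug output, independent of -v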

BoPeng closed this as completed Oct 30, 2018