Failure to save results and signatures of completed subtasks when master tasks are killed #1323

Closed
gaow opened this issue Nov 19, 2019 · 19 comments

gaow commented Nov 19, 2019

For my earlier submission of thousands of substeps (about 20K) as roughly 1K tasks, each with 20 substeps, I got some failures. When I fixed my code and resubmitted, I saw more submitted jobs than expected:

INFO: M20_8250b19e9a9b5435 submitted to midway2 with job id 63825899
INFO: M20_7c665000478155e5 submitted to midway2 with job id 63825901
INFO: M20_c2dbff87ce2436c9 submitted to midway2 with job id 63825904
INFO: Waiting for the completion of 50 tasks before submitting 1521 pending ones.
INFO: M20_366084e8befddcbb submitted to midway2 with job id 63825930
INFO: M20_2e099a81d3f2c361 submitted to midway2 with job id 63825932
INFO: M20_612b2a2a35f4d851 submitted to midway2 with job id 63825933
INFO: Waiting for the completion of 50 tasks before submitting 1518 pending ones.
INFO: M20_06aec99666ae8966 submitted to midway2 with job id 63825984
INFO: Waiting for the completion of 50 tasks before submitting 1517 pending ones.
INFO: M20_ee9a017c3c6a6af7 submitted to midway2 with job id 63826018
INFO: Waiting for the completion of 50 tasks before submitting 1516 pending ones.

As you can see, the same failed tasks are still resubmitted even though perhaps 19 of the 20 substeps in each were successful; those substeps are only skipped once the M20 task actually executes. That means they claim the same resources and sit in the queue, so it takes a lot of time to recover from maybe only a handful of failed runs. What I have to do now is remove all tasks from ~/.sos and start with -s build in order to skip existing valid output. I think it would make sense to regroup substeps and generate new tasks to avoid such overhead.


gaow commented Nov 19, 2019

Hmm, actually I think in this case, for some reason, the previously completed tasks were also submitted, because I see this in the .err file:

INFO: All 20 tasks in M20_2b4c01d441cc279f ignored or skipped

This means the job M20_2b4c01d441cc279f was submitted even though all of its sub-tasks had previously been completed. It went through the queue system as one of the 1.5K jobs, waited in the queue, and got allocated resources, only to find that its 20 tasks had previously been completed. I wonder why it did not skip them when the outputs are there. Better yet, I think we should regroup and submit only the failed substeps -- I thought that was once the case, at least? This is a pretty major issue that I should have noticed before!

My current solution is to remove ~/.sos and .sos and use --touch to rebuild signatures for existing outputs, then create new jobs from the failed ones (essentially a regroup). This resulted in the submission of a very reasonable number of jobs: 75 instead of 1.5K! Although it takes a while to rebuild the signatures.


BoPeng commented Nov 22, 2019

The reason is that the signatures are kept in the master task file, so sos has to re-execute the entire master task to decide which subtasks have been completed. I am not sure how to fix this right now.


gaow commented Nov 22, 2019

Was this also the behavior in the past? And what if we don't use tasks but just send workflow scripts to the cluster, as proposed in #1321? Somehow I remember there was some regroup behavior in the past. My current "fix" is to remove ~/.sos and .sos and use -s build to resubmit; otherwise it is impossible to efficiently salvage a few failed substeps out of thousands.


BoPeng commented Nov 22, 2019

The regroup behavior was caused by ignored substeps. When there is no output, we have to rely on task signatures to ignore subtasks, and as I said, subtask signatures are saved with the master task.


gaow commented Nov 22, 2019

When there is no output

What do you mean by this? For the 20 tasks in M20_2b4c01d441cc279f ignored or skipped, the output files do exist -- that is why I can use -s build to skip them. It is true that substeps can be ignored when they appear to have run successfully, e.g., when you rerun a successful workflow all of them will be ignored or skipped. But when a step is only partially successful, even tasks whose substeps all succeeded, like above where the 20 substeps in M20_2b4c01d441cc279f were good and generated valid output, still get resubmitted when I rerun to finish up only a few failed jobs.


BoPeng commented Nov 22, 2019

I meant that when there is no _output, there is no substep signature so sos cannot ignore substeps before tasks are generated.


gaow commented Nov 22, 2019

Okay, my point is that in my case there is _output. I edited my post above. Maybe it is a bug?


BoPeng commented Nov 22, 2019

Then there is a bug... In theory, sos should check the substep signatures, no task should be generated for a substep that is ignored, and only the failed substeps should be grouped into tasks.
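
For concreteness, here is a minimal sketch of that intended behavior, assuming a signature check that simply asks whether a substep's declared output exists and validates (the function names are illustrative, not SoS internals):

def pending_substeps(substeps, signature_valid):
    # keep only substeps whose signature does not validate, i.e. the ones that still need to run
    return [s for s in substeps if not signature_valid(s)]

def regroup(substeps, trunk_size):
    # pack the remaining substeps into new master tasks of at most trunk_size each
    return [substeps[i:i + trunk_size] for i in range(0, len(substeps), trunk_size)]

# For example, if only 1500 of the original 20000 substeps still need to run,
# regroup(pending, 20) yields 75 master tasks instead of resubmitting all 1000.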


gaow commented Nov 22, 2019

Okay, let me put together an MWE to see if we can reproduce it.


gaow commented Nov 24, 2019

@BoPeng here is an MWE:

[1]
reps = [x+1 for x in range(10)]
input: for_each = 'reps'
output: f'{_reps}.rds'
task: trunk_workers = 1, queue = 'midway2_head', walltime = '1m', trunk_size = 10, mem = '2G', cores = 1, workdir = './'
R: expand = True
	Sys.sleep({'60' if _index > 3 else '120'})
	saveRDS({_reps}, {_output:r})

You can see that it has output defined. You can also see that I asked for 1 minute of walltime per substep but have some substeps run for 2 minutes, so this script will fail due to the time limit.
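
For reference, assuming the master task is granted trunk_size x walltime = 10 minutes in total and runs its 10 substeps sequentially (trunk_workers = 1), a quick back-of-the-envelope check shows why exactly the first 6 substeps can finish before the job is killed:

walltime = 10 * 60                                   # assumed total seconds granted to M10_*
sleeps = [120 if i <= 3 else 60 for i in range(10)]  # per-substep Sys.sleep() from the script above

elapsed, completed = 0, 0
for s in sleeps:
    if elapsed + s > walltime:
        break
    elapsed += s
    completed += 1

print(completed)  # 6 -- the remaining substeps would run past the 10-minute limit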

On the UChicago cluster, with this template, I submit the job:

sos run issue_1232.sos -c issue_6.yml

output:

[MW] sos run issue_1232.sos -c susie_z.bm.yml 
INFO: Running 1: 
INFO: M10_bc4b359d3a4b95c0 submitted to midway2_head with job id 63959811
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task. 
...
WARNING: Task M10_bc4b359d3a4b95c0 inactive for more than 185 seconds, might have been killed.
ERROR: [1]: [1]: Failed to get results for tasks bc4b359d3a4b95c0, 34b5d4b0abc56bb7, 79ea373e8c17676a, f6c23ec102e8d38d, 6565b00aba5e4d30, 28a570891074db6e, 29d2ee265317e65c, f48e6e53e14dbe5e, 00ef86d02637fe76, 0d3ef962e492442f

As you can see, it claims failure to get results for all 10 tasks, but in fact the first 6 substeps have successfully completed with output.

I submit it again, running exactly the same command as above:

[MW] sos run issue_1232.sos -c issue_6.yml 
INFO: Running 1: 
INFO: M10_bc4b359d3a4b95c0 restart from status aborted
INFO: M10_bc4b359d3a4b95c0 submitted to midway2_head with job id 63960183

It reruns everything, even though outputs 1.rds through 6.rds do in fact exist.

I expect it to recognize that the first 6 substeps were successful and only work on the remaining 4.


gaow commented Nov 24, 2019

Also, in both runs I get this in the .err file:

INFO: M10_bc4b359d3a4b95c0 started
slurmstepd-midway2-0018: error: *** JOB 63960183 ON midway2-0018 CANCELLED AT 2019-11-23T18:45:32 DUE TO TIME LIMIT ***

but with sos status ... -v4 it says:

standard error:
================
Task M10_bc4b359d3a4b95c0 inactive for more than 151 seconds, might have been killed.

It would be nice to incorporate the PBS .err output into sos status -- I thought we had this, but it is not recording the more useful information in this case? I suspect it has something to do with #1303.


BoPeng commented Dec 4, 2019

It would be nice to incorporate the PBS err into sos status

sos has been updated, but your template has

      #SBATCH --output={job_name}.out

which creates the .out file in the working directory, while SoS only recognizes the system .out file under ~/.sos/tasks. After changing the line to

      #SBATCH --output=~/.sos/tasks/{job_name}.out

sos should now be able to pick up the message, absorb it into the .task file, and display it.


gaow commented Dec 4, 2019

I see ... but the problem with --output=~/.sos/tasks/{job_name}.out is that it will leave numerous files under the ~/.sos/tasks folder that we might forget to delete. Also, sometimes it is more straightforward to cat *.out | grep ... than to check each task using sos status, so it makes sense to put *.out and *.err somewhere more obvious than ~/.sos/tasks.

Is there a way for SoS to still pick up these files and absorb them into .task regardless of where they are?

BoPeng pushed a commit that referenced this issue Dec 4, 2019

BoPeng commented Dec 4, 2019

sos status will absorb or remove these files, so they will not stay long in ~/.sos/tasks... Currently SoS only checks ~/.sos/tasks; this could be made an option in the hosts.yml file, but I would rather avoid excessive options.

The task has been waiting for a while after being submitted. I will stop here and leave it to you to confirm the updated behavior (showing the cluster error message from both the command line and sos status).


BoPeng commented Dec 4, 2019

Now, back to the root of this problem: when the task is killed, it gets no chance to save the signatures of the completed subtasks. This is a hard problem to solve unless we save signatures of completed subtasks to files as they finish, and have them picked up when sos notices that the task has been killed. Even in that case, as I have said, the master task has to be re-executed before it can ignore completed subtasks.


gaow commented Dec 4, 2019

I see. So this is a rather particular "bug" that occurs when a task itself is killed (due to walltime in this case). But in ordinary cases, if some substeps in a task fail (script return code != 0), then the next time SoS reruns it should be able to regroup and focus on the failed ones, right? That was the behavior in the past; I hope it is not broken.

Now back to this problem: it seems the -q none mode from #1321 can help, because it saves file signatures as substeps complete rather than involving task signatures?


BoPeng commented Dec 4, 2019

-q none is different since the tasks are merged into substeps...

Yes, this is a particular case when running tasks are killed. If the tasks are allowed to complete, SoS will be able to get results from completed subtasks.

There are solutions, as always, but none will be trivial. Right now all information on subtasks is kept in memory and written to the task file when all subtasks are completed. What would be required is writing the information to disk as soon as each subtask is done. One way is to change the task file format to an incremental one, at least for master tasks. Another is to cache subtask results to separate files and merge them into the master task either when everything is done, or when the task is killed and sos status is called to collect the residues.
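
A minimal sketch of that second idea, assuming a per-subtask cache file under ~/.sos/tasks and a pickled result dict (the file naming and format below are illustrative, not the actual task file format):

import os
import pickle

CACHE_DIR = os.path.expanduser('~/.sos/tasks')

def cache_subtask_result(master_id, sub_id, result):
    # write one subtask's result to its own file as soon as it finishes,
    # so nothing is lost if the master task is later killed
    with open(os.path.join(CACHE_DIR, f'{master_id}.{sub_id}.res'), 'wb') as f:
        pickle.dump(result, f)

def merge_cached_results(master_id, sub_ids):
    # collect whatever cached results exist, e.g. when sos status notices
    # that the master task was killed before it could combine them itself
    results = {}
    for sub_id in sub_ids:
        path = os.path.join(CACHE_DIR, f'{master_id}.{sub_id}.res')
        if os.path.exists(path):
            with open(path, 'rb') as f:
                results[sub_id] = pickle.load(f)
    return results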

BoPeng changed the title from "Regroup failed substeps before submitting tasks" to "Failure to save results and signatures of completed subtasks when master tasks are killed" on Dec 4, 2019

BoPeng commented Dec 4, 2019

Just to prepare for the new file format:

  1. When task files are retrieved, only the result is of interest: https://github.com/vatlab/sos/blob/master/src/sos/hosts.py#L272

  2. The results are combined from sub-results at the end, in def _combine_results(self, task_id, results), from a list of single results.

  3. The result for an individual task is collected in def _collect_task_result(...).

So the least intrusive way would be to remove the combine-result part and leave combining to when the results are needed.
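
A rough sketch of what "combine only when needed" could look like, keeping sub-results separate until a caller asks for the combined result (the class and method names below are illustrative and only loosely mirror the functions linked above):

class MasterTaskResult:
    def __init__(self, task_id):
        self.task_id = task_id
        self._sub_results = {}   # sub_task_id -> result dict
        self._combined = None    # built lazily

    def add_sub_result(self, sub_task_id, result):
        # record the result as soon as the subtask finishes; no combining yet
        self._sub_results[sub_task_id] = result
        self._combined = None    # invalidate any cached combined view

    def combined(self):
        # combine only when the caller actually needs the overall result
        if self._combined is None:
            ret_codes = [r.get('ret_code', 0) for r in self._sub_results.values()]
            self._combined = {
                'ret_code': max(ret_codes, default=0),
                'subtasks': dict(self._sub_results),
            }
        return self._combined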

BoPeng pushed a commit that referenced this issue Dec 6, 2019
BoPeng pushed a commit that referenced this issue Dec 6, 2019
BoPeng pushed a commit that referenced this issue Dec 6, 2019

BoPeng commented Dec 7, 2019

Seems to work now.

BoPeng closed this as completed on Dec 7, 2019