Failure to save results and signatures of completed subtasks when master tasks are killed #1323

Closed
gaow opened this issue Nov 19, 2019 · 19 comments

gaow commented Nov 19, 2019

For my earlier submission of thousands of substeps (about 20K) as roughly 1K tasks, each with 20 substeps, I got some failures. When I fixed my code and resubmitted, I saw more submitted jobs than expected:

INFO: M20_8250b19e9a9b5435 submitted to midway2 with job id 63825899
INFO: M20_7c665000478155e5 submitted to midway2 with job id 63825901
INFO: M20_c2dbff87ce2436c9 submitted to midway2 with job id 63825904
INFO: Waiting for the completion of 50 tasks before submitting 1521 pending ones.
INFO: M20_366084e8befddcbb submitted to midway2 with job id 63825930
INFO: M20_2e099a81d3f2c361 submitted to midway2 with job id 63825932
INFO: M20_612b2a2a35f4d851 submitted to midway2 with job id 63825933
INFO: Waiting for the completion of 50 tasks before submitting 1518 pending ones.
INFO: M20_06aec99666ae8966 submitted to midway2 with job id 63825984
INFO: Waiting for the completion of 50 tasks before submitting 1517 pending ones.
INFO: M20_ee9a017c3c6a6af7 submitted to midway2 with job id 63826018
INFO: Waiting for the completion of 50 tasks before submitting 1516 pending ones.

As you can see, the same failed tasks are still resubmitted even though perhaps 19 of the 20 substeps in each were successful; those substeps are only skipped once the M20 task actually executes. That means they claim the same resources and sit in the queue, so it takes a lot of time to recover from maybe only a handful of failed runs. What I have to do now is remove all tasks from ~/.sos and start with -s build in order to skip existing valid output. I think it would make sense to regroup substeps and generate new tasks to avoid such overhead.


gaow commented Nov 19, 2019

Hmm, actually I think in this case, for some reason, the previously completed tasks were also submitted, because I see this in the .err file:

INFO: All 20 tasks in M20_2b4c01d441cc279f ignored or skipped

This means the job M20_2b4c01d441cc279f was submitted even though all of its sub-tasks had previously been completed. It went through the queue system as one of the 1.5K jobs, waited in the queue, and got allocated resources, only to find that its 20 tasks had previously been completed. I wonder why it did not skip them when the outputs are there. Better yet, I think we should regroup and submit only the failed substeps -- I thought that was once the case, at least? This is a pretty major issue that I should have noticed before!

My current solution is to remove ~/.sos and .sos and use --touch to rebuild signatures for existing outputs, then create new jobs from the failed ones (essentially a regroup). This resulted in the submission of a very reasonable number of jobs: 75 instead of 1.5K! Although it takes a while to rebuild the signatures.


BoPeng commented Nov 22, 2019

The reason is that the signatures are kept in the master task file, so sos has to re-execute the entire master task to decide which subtasks have been completed. I am not sure how to fix this right now.


gaow commented Nov 22, 2019

Was this also the behavior in the past? And what if we don't use tasks but just send workflow scripts to the cluster, as proposed in #1321? Somehow I remember there was some regroup behavior in the past. My current "fix" is to remove ~/.sos and .sos and use -s build to resubmit; otherwise it is impossible to efficiently salvage a few failed substeps out of thousands.


BoPeng commented Nov 22, 2019

The regroup behavior was caused by ignored substeps. When there is no output, we have to rely on task signatures to ignore subtasks, and as I said, subtask signatures are saved with the master task.


gaow commented Nov 22, 2019

When there is no output

What do you mean by this? For the 20 tasks in M20_2b4c01d441cc279f ignored or skipped, the output files do exist -- that is why I can use -s build to skip them. It is true that substeps can be ignored when they appear to have run successfully, e.g., when you rerun a successful workflow all of them will be ignored or skipped. But when a step is only partially successful, even tasks whose substeps all succeeded, like above where the 20 substeps in M20_2b4c01d441cc279f were good and generated valid output, still get resubmitted when I rerun to finish up only a few failed jobs.


BoPeng commented Nov 22, 2019

I meant that when there is no _output, there is no substep signature so sos cannot ignore substeps before tasks are generated.


gaow commented Nov 22, 2019

Okay, my point is that in my case there is _output. I edited my post above. Maybe it is a bug?


BoPeng commented Nov 22, 2019

Then there is a bug... In theory, sos should check the substep signatures, no task should be generated for a substep that is ignored, and only the failed substeps should be grouped into tasks.
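
For concreteness, here is a minimal sketch of that intended behavior, assuming a signature check that simply asks whether a substep's declared output exists and validates (the function names are illustrative, not SoS internals):

def pending_substeps(substeps, signature_valid):
    # keep only substeps whose signature does not validate, i.e. the ones that still need to run
    return [s for s in substeps if not signature_valid(s)]

def regroup(substeps, trunk_size):
    # pack the remaining substeps into new master tasks of at most trunk_size each
    return [substeps[i:i + trunk_size] for i in range(0, len(substeps), trunk_size)]

# For example, if only 1500 of the original 20000 substeps still need to run,
# regroup(pending, 20) yields 75 master tasks instead of resubmitting all 1000.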


gaow commented Nov 22, 2019

Okay, let me put together an MWE to see if we can reproduce it.


gaow commented Nov 24, 2019

@BoPeng here is an MWE:

[1]
reps = [x+1 for x in range(10)]
input: for_each = 'reps'
output: f'{_reps}.rds'
task: trunk_workers = 1, queue = 'midway2_head', walltime = '1m', trunk_size = 10, mem = '2G', cores = 1, workdir = './'
R: expand = True
	Sys.sleep({'60' if _index > 3 else '120'})
	saveRDS({_reps}, {_output:r})

You can see that it has output defined. You can also see that I asked for 1 minute of walltime per substep but have some substeps run for 2 minutes, so this script will fail due to the time limit.
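
For reference, assuming the master task is granted trunk_size x walltime = 10 minutes in total and runs its 10 substeps sequentially (trunk_workers = 1), a quick back-of-the-envelope check shows why exactly the first 6 substeps can finish before the job is killed:

walltime = 10 * 60                                   # assumed total seconds granted to M10_*
sleeps = [120 if i <= 3 else 60 for i in range(10)]  # per-substep Sys.sleep() from the script above

elapsed, completed = 0, 0
for s in sleeps:
    if elapsed + s > walltime:
        break
    elapsed += s
    completed += 1

print(completed)  # 6 -- the remaining substeps would run past the 10-minute limit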

On the UChicago cluster, with this template, I submit the job:

sos run issue_1232.sos -c issue_6.yml

output:

[MW] sos run issue_1232.sos -c susie_z.bm.yml 
INFO: Running 1: 
INFO: M10_bc4b359d3a4b95c0 submitted to midway2_head with job id 63959811
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task. 
...
WARNING: Task M10_bc4b359d3a4b95c0 inactive for more than 185 seconds, might have been killed.
ERROR: [1]: [1]: Failed to get results for tasks bc4b359d3a4b95c0, 34b5d4b0abc56bb7, 79ea373e8c17676a, f6c23ec102e8d38d, 6565b00aba5e4d30, 28a570891074db6e, 29d2ee265317e65c, f48e6e53e14dbe5e, 00ef86d02637fe76, 0d3ef962e492442f

As you can see, it claims failure to get results for all 10 tasks, but in fact the first 6 substeps have successfully completed with output.

I submit it again, running exactly the same command as above:

[MW] sos run issue_1232.sos -c issue_6.yml 
INFO: Running 1: 
INFO: M10_bc4b359d3a4b95c0 restart from status aborted
INFO: M10_bc4b359d3a4b95c0 submitted to midway2_head with job id 63960183

It reruns everything, even though outputs 1.rds through 6.rds do in fact exist.

I expect it to recognize that the first 6 substeps were successful and only work on the remaining 4.


gaow commented Nov 24, 2019

Also, in both runs I get this in the .err file:

INFO: M10_bc4b359d3a4b95c0 started
slurmstepd-midway2-0018: error: *** JOB 63960183 ON midway2-0018 CANCELLED AT 2019-11-23T18:45:32 DUE TO TIME LIMIT ***

but with sos status ... -v4 it says:

standard error:
================
Task M10_bc4b359d3a4b95c0 inactive for more than 151 seconds, might have been killed.

It would be nice to incorporate the PBS .err output into sos status -- I thought we had this, but it is not recording the more useful information in this case? I suspect it has something to do with #1303.


BoPeng commented Dec 4, 2019

It would be nice to incorporate the PBS err into sos status

sos has been updated, but your template has

      #SBATCH --output={job_name}.out

which creates the .out file in the working directory, while SoS only recognizes the system .out file under ~/.sos/tasks. After changing the line to

      #SBATCH --output=~/.sos/tasks/{job_name}.out

sos should now be able to pick up the message, absorb it into the .task file, and display it.


gaow commented Dec 4, 2019

I see ... but the problem with --output=~/.sos/tasks/{job_name}.out is that it will leave numerous files under the ~/.sos/tasks folder that we might forget to delete. Also, sometimes it is more straightforward to cat *.out | grep ... than to check each task using sos status, so it makes sense to put *.out and *.err somewhere more obvious than ~/.sos/tasks.

Is there a way for SoS to still pick up these files and absorb them into .task regardless of where they are?

BoPeng pushed a commit that referenced this issue Dec 4, 2019

BoPeng commented Dec 4, 2019

sos status will absorb or remove these files, so they will not stay long in ~/.sos/tasks... Currently SoS only checks ~/.sos/tasks; this could be made an option in the hosts.yml file, but I would rather avoid excessive options.

The task has been waiting for a while after being submitted. I will stop here and leave it to you to confirm the updated behavior (showing the cluster error message from both the command line and sos status).


BoPeng commented Dec 4, 2019

Now, back to the root of this problem: when the task is killed, it gets no chance to save the signatures of the completed subtasks. This is a hard problem to solve unless we save signatures of completed subtasks to files as they finish, and have them picked up when sos notices that the task has been killed. Even in that case, as I have said, the master task has to be re-executed before it can ignore completed subtasks.


gaow commented Dec 4, 2019

I see. So this is a rather particular "bug" that occurs when a task itself is killed (due to walltime in this case). But in ordinary cases, if some substeps in a task fail (script return code != 0), then the next time SoS reruns it should be able to regroup and focus on the failed ones, right? That was the behavior in the past; I hope it is not broken.

Now back to this problem: it seems the -q none mode from #1321 can help, because it saves file signatures as substeps complete rather than involving task signatures?


BoPeng commented Dec 4, 2019

-q none is different since the tasks are merged into substeps...

Yes, this is a particular case when running tasks are killed. If the tasks are allowed to complete, SoS will be able to get results from completed subtasks.

There are solutions, as always, but none will be trivial. Right now all information on subtasks is kept in memory and written to the task file when all subtasks are completed. What would be required is writing the information to disk as soon as each subtask is done. One way is to change the task file format to an incremental one, at least for master tasks. Another is to cache subtask results to separate files and merge them into the master task either when everything is done, or when the task is killed and sos status is called to collect the residues.
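
A minimal sketch of that second idea, assuming a per-subtask cache file under ~/.sos/tasks and a pickled result dict (the file naming and format below are illustrative, not the actual task file format):

import os
import pickle

CACHE_DIR = os.path.expanduser('~/.sos/tasks')

def cache_subtask_result(master_id, sub_id, result):
    # write one subtask's result to its own file as soon as it finishes,
    # so nothing is lost if the master task is later killed
    with open(os.path.join(CACHE_DIR, f'{master_id}.{sub_id}.res'), 'wb') as f:
        pickle.dump(result, f)

def merge_cached_results(master_id, sub_ids):
    # collect whatever cached results exist, e.g. when sos status notices
    # that the master task was killed before it could combine them itself
    results = {}
    for sub_id in sub_ids:
        path = os.path.join(CACHE_DIR, f'{master_id}.{sub_id}.res')
        if os.path.exists(path):
            with open(path, 'rb') as f:
                results[sub_id] = pickle.load(f)
    return results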

BoPeng changed the title from "Regroup failed substeps before submitting tasks" to "Failure to save results and signatures of completed subtasks when master tasks are killed" on Dec 4, 2019

BoPeng commented Dec 4, 2019

Just to prepare for the new file format:

  1. When task files are retrieved, only the result is of interest: https://github.com/vatlab/sos/blob/master/src/sos/hosts.py#L272

  2. The results are combined from sub-results at the end, in def _combine_results(self, task_id, results), from a list of single results.

  3. The result for an individual task is collected in def _collect_task_result(...).

So the least intrusive way would be to remove the combine-result part and leave combining to when the results are needed.
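
A rough sketch of what "combine only when needed" could look like, keeping sub-results separate until a caller asks for the combined result (the class and method names below are illustrative and only loosely mirror the functions linked above):

class MasterTaskResult:
    def __init__(self, task_id):
        self.task_id = task_id
        self._sub_results = {}   # sub_task_id -> result dict
        self._combined = None    # built lazily

    def add_sub_result(self, sub_task_id, result):
        # record the result as soon as the subtask finishes; no combining yet
        self._sub_results[sub_task_id] = result
        self._combined = None    # invalidate any cached combined view

    def combined(self):
        # combine only when the caller actually needs the overall result
        if self._combined is None:
            ret_codes = [r.get('ret_code', 0) for r in self._sub_results.values()]
            self._combined = {
                'ret_code': max(ret_codes, default=0),
                'subtasks': dict(self._sub_results),
            }
        return self._combined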

BoPeng pushed a commit that referenced this issue Dec 6, 2019
BoPeng pushed a commit that referenced this issue Dec 6, 2019
BoPeng pushed a commit that referenced this issue Dec 6, 2019

BoPeng commented Dec 7, 2019

Seems to work now.

BoPeng closed this as completed on Dec 7, 2019