Failure to save results and signatures of completed subtasks when master tasks are killed #1323
Hmm, actually I think for this case, for some reason the previously completed tasks were also submitted, because I see this in the
This means a job. My current solution is to remove
The reason is that the signatures are kept in the master task file, so sos has to re-execute the entire master task to decide which subtasks have been completed. Not sure how to fix this right now.
Was this also the behavior in the past? And what if we don't use
The regroup behavior was caused by ignored substeps. When there is no
What do you mean by this? For the
I meant that when there is no
Okay, my point is that for my case there is
Then there is a bug... In theory, sos should check the substep signature, no task will be generated if the substep is ignored, and the failed substeps (tasks) will then be grouped.
Okay, let me put together an MWE to see if we can reproduce it.
@BoPeng here is an MWE:
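The exact script is not shown here, but a minimal sketch of this kind of workflow, with hypothetical step and file names, might look like:

```
# mwe.sos -- illustrative sketch only; names, sizes, and options are assumptions
[default]
input: for_each={'i': range(10)}
output: f'result_{i}.txt'
task: walltime='00:01:00', trunk_size=10, mem='1G', cores=1
python: expand=True
    import time
    # substeps 6-9 sleep for 2 minutes and thus exceed the 1-minute walltime
    time.sleep(120 if {i} >= 6 else 5)
    with open('result_{i}.txt', 'w') as f:
        f.write('substep {i} done')
```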
You can see it has output defined. Also, I asked for 1 minute of wall time per substep but have some substeps run for 2 minutes, so this script will fail due to the time limit. On the UChicago cluster with this template, I submit the job:
output:
As you can see, it claims failure to get tasks for all 10 tasks. But actually the first 6 substeps have successfully completed with output. I submit it again by running exactly the same command as above:
It reruns everything, even though the output in fact exists. I expect that here it would recognize that the first 6 substeps are successful and only work on the remaining 4 substeps.
Also, in both runs I get in the
but with
It would be nice to incorporate the PBS err into
which creates
sos should now be able to pick up the message and absorb it into
I see ... but there is a problem with this. Is there a way that SoS still picks up these files and absorbs them into
The task has been waiting for a while after being submitted. I will stop and leave it to you to confirm the updated behavior (showing the cluster error message from the command line and
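A sketch of what "absorbing" the scheduler error could look like (the function, paths, and naming are hypothetical, not the actual SoS implementation):

```python
# Hypothetical sketch: append the scheduler's error file (e.g. the PBS
# walltime-exceeded message) to the task's own .err file so a status query
# on the task can show it.
from pathlib import Path

def absorb_scheduler_err(task_id: str, task_dir: Path, scheduler_err: Path) -> None:
    if not scheduler_err.exists():
        return
    task_err = task_dir / f'{task_id}.err'
    with open(task_err, 'a') as out, open(scheduler_err) as src:
        out.write('\n# --- scheduler error output ---\n')
        out.write(src.read())
```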
Now, back to the root of this problem: when the task is killed, it gets no chance to save the signatures of the completed subtasks. This is a hard problem to solve unless we save signatures of completed subtasks to files, which get "picked up" when sos notices the tasks being killed. Even in this case, as I have said, the master tasks have to be re-executed before they can ignore completed subtasks.
I see. So this is a rather particular "bug" when a task itself is killed (due to walltime in this case). But in ordinary cases, if some substeps in a task fail (script return code != 0), then the next time SoS reruns it should be able to regroup and focus on the failed ones, right? That was the behavior in the past; I hope it is not broken. Now back to this problem: it seems then that #1321
Yes, this is a particular case when running tasks are killed. If the tasks are allowed to complete, SoS will be able to get results from completed subtasks. There are solutions, as always, but none will be trivial. Right now all information on subtasks is saved in memory and written to the task file when all subtasks are completed. What would be required is writing information to disk as soon as subtasks are done. One way is to change the format to an incremental one, at least for master tasks. Another way is to cache subtask results to separate files and merge them into the master task either when everything is done, or when the task is killed but
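For illustration, the second option could look roughly like this (file layout and function names are made up for the sketch, not the actual SoS internals):

```python
# Hypothetical sketch of caching per-subtask results to separate files and
# merging them into the master task record later.
import pickle
from pathlib import Path

CACHE_DIR = Path('~/.sos/subtask_cache').expanduser()

def save_subtask_result(master_id: str, subtask_id: str, result: dict) -> None:
    # write each subtask's result/signature to disk as soon as it finishes,
    # so a killed master task does not lose the work of completed subtasks
    cache = CACHE_DIR / master_id
    cache.mkdir(parents=True, exist_ok=True)
    with open(cache / f'{subtask_id}.pkl', 'wb') as f:
        pickle.dump(result, f)

def merge_cached_results(master_id: str) -> dict:
    # merge cached subtask results back into the master task, either when
    # everything is done or when a killed task is inspected
    results = {}
    for path in sorted((CACHE_DIR / master_id).glob('*.pkl')):
        with open(path, 'rb') as f:
            results[path.stem] = pickle.load(f)
    return results
```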
Just to prepare for the new file format,
So the least intrusive way would be to remove the combine-results part and leave it to when the results are needed.
Seems to work now.
For my earlier submission of thousands of substeps (about 20K) as about 1K tasks, each having 20 substeps, I got some failures. When I fix my code and resubmit, I see more submitted jobs than expected:
As you can see, what happens is that the same failed task will still be resubmitted; even though maybe 19/20 of its substeps were successful, they will only be skipped when the M20 task is executed. That means they claim the same resources and stay in the queue, so it takes lots of time to recover maybe only a handful of failed runs. What I have to do now is to remove all tasks from ~/.sos and start with -s build in order to skip existing valid output. I think it would make sense to regroup substeps and generate new tasks to avoid such overhead.
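For what it's worth, a sketch of the regrouping idea (the helper and its inputs are hypothetical): keep only the substeps whose output is missing and pack them into new, smaller master tasks instead of resubmitting every original M20 task.

```python
# Hypothetical sketch of regrouping failed substeps into new master tasks.
from pathlib import Path
from typing import List

def regroup_failed_substeps(outputs: List[str], trunk_size: int) -> List[List[str]]:
    # keep only substeps whose output is missing, then group them into
    # new tasks of up to trunk_size substeps each
    pending = [out for out in outputs if not Path(out).exists()]
    return [pending[i:i + trunk_size] for i in range(0, len(pending), trunk_size)]

# With ~20K substeps of which only a handful failed, this would produce a few
# small tasks instead of resubmitting all ~1K M20 master tasks.
```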