Another common approach is to "daisy chain" jobs by having the job script submit another job that is dependent on itself. For example, in train.slurm you'd have a line like:
# when train.slurm executes, have it submit another job that depends on itself
# (afterany: the next job starts once this one ends, however it ends)
sbatch --dependency=afterany:$SLURM_JOB_ID train.slurm
This is usually done near the top of the script, before the command that actually launches the run.
One might also pair that with some logic to stop the chaining when the job is done. For example, the application or the user might touch a "run.done" file when it completes. Then the script can check for that file.
# exit right away if "run.done" file is detected
if [ -f run.done ] ; then
    exit 0
fi
# otherwise chain up another job that starts once this one ends
sbatch --dependency=afterany:$SLURM_JOB_ID train.slurm
# then launch the run
<<launch run>>
Additionally, one could check for the "run.done" file after the run and attempt to cancel any already daisy-chained job.
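Here is a rough sketch of that post-run check, in case it helps. It assumes the chained job is submitted with `sbatch --parsable`, so that its job ID can be captured and later passed to `scancel`. The function names `chain_next_job` and `cancel_chained_job_if_done` are made up for illustration; `train.slurm` and `run.done` are the names used above.

```shell
#!/bin/bash
# Hypothetical helpers for a self-chaining job script (names are illustrative).

DONE_FILE="run.done"

# Submit the follow-up job and print its job ID on stdout.
# --parsable makes sbatch print "jobid" or "jobid;cluster" instead of a sentence.
chain_next_job() {
    local out
    out=$(sbatch --parsable --dependency=afterany:"$SLURM_JOB_ID" train.slurm) || return 1
    # Keep only the job ID in case a ";cluster" suffix is present.
    echo "${out%%;*}"
}

# Cancel the already-queued follow-up job if the run has marked itself done.
cancel_chained_job_if_done() {
    local next_job_id="$1"
    if [ -f "$DONE_FILE" ] && [ -n "$next_job_id" ]; then
        scancel "$next_job_id"
    fi
}

# Intended use inside train.slurm (assumed layout):
#   [ -f "$DONE_FILE" ] && exit 0
#   next_job_id=$(chain_next_job)
#   <<launch run>>
#   cancel_chained_job_if_done "$next_job_id"
```

Since the follow-up job is still pending on the dependency while the current job runs, cancelling it at this point should remove it from the queue before it ever starts.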
I don't have a list of pros/cons vs the job array, but it's one more method I see in practice.
The job array works well to queue up multiple jobs:
https://github.com/stas00/ml-engineering/tree/master/fault-tolerance#queue-up-multiple-training-jobs