Daisy chain batch jobs #13

Closed
adammoody opened this issue Dec 3, 2023 · 1 comment

@adammoody

The job array works well to queue up multiple jobs:

https://github.com/stas00/ml-engineering/tree/master/fault-tolerance#queue-up-multiple-training-jobs

Another common approach is to "daisy chain" jobs by having the job script submit another job that is dependent on itself. For example, in train.slurm you'd have a line like:

# when train.slurm executes, have it submit another job dependent on itself
sbatch --dependency=afterany:$SLURM_JOBID train.slurm

This is usually done near the top of the script, before the command that actually launches the run.

One might also pair that with some logic to stop the chaining when the job is done. For example, the application or the user might touch a "run.done" file when it completes. Then the script can check for that file.

# exit right away if "run.done" file is detected
if [ -f run.done ] ; then
  exit 0
fi

# otherwise chain up another job
sbatch --dependency=afterany:$SLURM_JOBID train.slurm

# then launch the run
<<launch run>>

Additionally, one could check for the "run.done" file after the run and attempt to cancel any already daisy-chained job.
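A rough sketch of that last idea, assuming the script records the ID of the job it chains up. The helper names `chain_next_job` and `cancel_if_done` and the `DONE_FILE` variable are made up for illustration; `sbatch --parsable`, the `afterany` dependency type, and `scancel` are standard SLURM.

```shell
#!/bin/bash
# Hypothetical helpers for a daisy-chained train.slurm (names are illustrative).

# Marker file that the application or the user touches when training is finished.
DONE_FILE="${DONE_FILE:-run.done}"

# Submit the follow-up job and print its job ID.
# --parsable makes sbatch print only the job ID, optionally followed by ";cluster".
chain_next_job() {
  local out
  out=$(sbatch --parsable --dependency="afterany:${SLURM_JOB_ID}" train.slurm) || return 1
  echo "${out%%;*}"   # drop the ";cluster" suffix if present
}

# Cancel the already chained job if the marker file exists.
cancel_if_done() {
  local next_jobid="$1"
  if [ -f "$DONE_FILE" ] && [ -n "$next_jobid" ]; then
    scancel "$next_jobid"
  fi
}

# Intended use inside train.slurm:
#   [ -f "$DONE_FILE" ] && exit 0        # nothing left to do
#   next_jobid=$(chain_next_job)         # chain up another job
#   <<launch run>>                       # then launch the run
#   cancel_if_done "$next_jobid"         # drop the chained job if the run finished
```

Even without the cancel step the chain still terminates, since the next job exits right away once it sees the marker file; cancelling just avoids waiting in the queue for a job that would do nothing.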

I don't have a list of pros/cons vs the job array, but it's one more method I see in practice.

@stas00
Owner

stas00 commented Dec 4, 2023

Thank you for these suggestions, Adam.

--dependency is already covered in the SLURM guide, but you're making a good connection with the fault-tolerance section.

As for the file dropping, I already used that concept for the kill switch, so it's there already.

So I combined both of your suggestions and pushed this:

a6e0f21

Closing this for now, but please don't hesitate to continue if more things can be improved.

@stas00 stas00 closed this as completed Dec 4, 2023