daisy chain jobs
stas00 committed Dec 4, 2023
1 parent fca0e85 commit a6e0f21
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions fault-tolerance/README.md
@@ -85,6 +85,9 @@ Here we have 47 nodes being used (`alloc`), 23 available (`idle`) and 4 unavaila

The sysadmin is expected to periodically check the drained nodes, fix or replace them, and then make them available again by changing their state to `idle`.

The other approach is to daisy-chain jobs via `--dependency` as explained [here](../slurm/users.md#request-allocation-via-dependency). Both of these approaches could also be combined.
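As a rough illustration of daisy-chaining, here is a minimal Python sketch that submits the same script several times, making each job depend on the previous one. The helper names (`sbatch_cmd`, `submit`, `submit_chain`) and the script name are hypothetical, and the `afterany` condition is an assumption; pick whichever dependency condition suits your setup.

```python
import subprocess


def sbatch_cmd(script, after_job_id=None):
    """Build the sbatch argv; --parsable makes sbatch print the job id."""
    cmd = ["sbatch", "--parsable"]
    if after_job_id is not None:
        # afterany: start once the previous job has ended, whatever its exit status
        cmd.append(f"--dependency=afterany:{after_job_id}")
    cmd.append(script)
    return cmd


def submit(cmd):
    """Run sbatch and return the new job id as a string."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    # with --parsable the output is "jobid" or "jobid;cluster"
    return out.strip().split(";")[0]


def submit_chain(script, n_jobs, submit_fn=submit):
    """Submit n_jobs copies of script, each depending on the one before it."""
    job_ids = []
    prev = None
    for _ in range(n_jobs):
        prev = submit_fn(sbatch_cmd(script, prev))
        job_ids.append(prev)
    return job_ids
```

Calling `submit_chain("train.slurm", 10)` would queue ten jobs that run one after another. `submit_fn` is injectable so the chain logic can be exercised without a SLURM cluster.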

How do you know when a job array or a daisy chain should not resume? Normally the training loop exits immediately if it detects that the job is done. You could also add features like a [kill switch](#kill-switch), which makes it even easier to prevent a job array from running.
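The start-up check described above could look roughly like the following sketch. The function name, its parameters and the kill-switch file path are all hypothetical; the only idea taken from the text is that a resumed job should exit right away if training is already complete or a kill switch is present.

```python
from pathlib import Path


def should_exit(iteration, total_iterations, kill_switch_path):
    """Return True if a freshly (re)started job has nothing left to do."""
    # training already reached its final iteration in an earlier job
    if iteration >= total_iterations:
        return True
    # an operator created the kill-switch file to stop the chain or array
    return Path(kill_switch_path).exists()
```

A training script would call this right after loading the latest checkpoint and exit with status `0` if it returns `True`, so the remaining jobs in the array or chain finish almost instantly.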


## Frequent checkpoint saving
@@ -414,3 +417,9 @@ Signal handler called with signal 10
which means the job had pid `58307`, caught `SIGUSR1` (`10`) and exited.

Now that you understand how this machinery works, instead of calling `exit(0)` immediately you can set an exit-asap flag, finish the current iteration, check whether the flag is set, save the checkpoint and exit. This is very similar to the code shown in Approach A above.
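A minimal sketch of that flag-based flow is shown below. It is not the code from Approach A: `ExitAsap`, `install`, `train` and the `train_step` / `save_checkpoint` callables are hypothetical names standing in for your own training code.

```python
import signal


class ExitAsap:
    """Signal handler that only records that an exit was requested."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        # do no real work here: just raise the flag and return
        self.requested = True


def install(flag):
    """Register the flag as the SIGUSR1 handler (Unix, main thread only)."""
    signal.signal(signal.SIGUSR1, flag)


def train(total_iterations, train_step, save_checkpoint, exit_asap):
    """Run the loop; return the last completed iteration."""
    for iteration in range(1, total_iterations + 1):
        train_step(iteration)  # always finish the current iteration
        if exit_asap.requested:
            # the flag is up: save a checkpoint and leave the loop cleanly
            save_checkpoint(iteration)
            return iteration
    return total_iterations
```

The handler stays trivial so that the checkpoint is never written in the middle of an iteration; the loop itself decides when it is safe to save and exit.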



## Contributors

[Adam Moody](https://github.com/adammoody),
