## Description

### My workflow
I'm using DVC with CLI tools that provide their own checkpointing solutions (e.g., LAMMPS or CP2K). The main motivation for checkpointing is not to go back to an earlier model, as is often the case in ML, but to continue a simulation and append to existing output files (which might be > 100 GB). This is often required because simulations are carried out on HPC resources with strict time limits (hours to a few days). Reaching the time limit means that the job is killed immediately, so the DVC process won't have time to do anything. (Some clusters send a signal before stopping, but this is not guaranteed.)
My current solution to this problem is to run the simulation in a temporary directory and move the outputs to the correct location only after the process has finished. This way I can call `dvc repro` multiple times without DVC removing the created checkpoint files.
The corresponding `dvc.yaml` file could look something like this:
```yaml
stages:
  cp2k:
    cmd: mpirun cp2k.psmp -in cp2k.inp && mv tmp/ output/
    deps:
      - cp2k.inp
    outs:
      - output/
```

Unfortunately, this doesn't work with `dvc exp` and feels a bit hacky as well.
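To make the workaround a bit more concrete, the stage command itself can be made resume-aware: if a restart file survived a killed run, continue from it; otherwise start fresh. This is only a rough sketch under assumptions — the restart file name `tmp/cp2k-1.restart` is made up for the example (CP2K derives the actual name from the project name), and I haven't tested this exact invocation:

```yaml
stages:
  cp2k:
    # Sketch: resume from a restart file left behind by a killed run,
    # otherwise start from the original input. The restart file name
    # below is an assumption for this example.
    cmd: >-
      if [ -f tmp/cp2k-1.restart ];
      then mpirun cp2k.psmp -in tmp/cp2k-1.restart;
      else mpirun cp2k.psmp -in cp2k.inp;
      fi &&
      mv tmp/ output/
    deps:
      - cp2k.inp
    outs:
      - output/
```

Because `tmp/` is not a DVC output, it survives the next `dvc repro`, so a leftover restart file selects the continued run.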
### Potential Solutions
There are a few things that would help me (and potentially many others in HPC who haven't used DVC yet):
- An option for `dvc queue start --keep-failed`. This way, after the experiment was killed, one could do `dvc exp apply <id>` and start a new experiment from the checkpoint files that would otherwise be removed. The data would then only be removed using `dvc exp remove`.
- Because the processes are killed while DVC's lock file is still in place, DVC knows on the next call that the process was killed (this was changed some time ago; previously one had to remove the `rwlock` manually if that happened). So it might be possible to have a `restart_if_killed` option per stage where DVC, instead of removing all files the next time it is called, would call `cmd` again in the workspace. Only if that fails with some exception would the command end. (See the sketch after this list.)
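To sketch what that second idea could look like in `dvc.yaml` — the option name and its placement are hypothetical, this is not existing DVC syntax:

```yaml
stages:
  cp2k:
    cmd: mpirun cp2k.psmp -in cp2k.inp && mv tmp/ output/
    # Hypothetical flag: if the previous run was killed (stale lock),
    # rerun `cmd` in the existing workspace instead of cleaning outputs.
    restart_if_killed: true
    deps:
      - cp2k.inp
    outs:
      - output/
```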
### Additional Thoughts
Here are some ideas that might also be really helpful to the HPC community. We have our own queuing system, so running multiple experiments in parallel using something like #8121 would be really powerful.
I also saw that the DVC queue is written in a modular way and might in principle support more than Celery. In HPC, the Dask distributed package is often used because it supports many HPC platforms. I've written dask4dvc for some compatibility, but implementing that in DVC itself (maybe as `pip install dvc[dask]` or `dvc-dask` as an extra dependency) could also be really powerful (maybe even for parallelizing `dvc repro`). This could also be one possible solution for #755.