HPC and non-python checkpointing #9235

@PythonFZ

Description

My workflow

I'm using DVC with CLI tools that provide their own checkpointing solutions (e.g. LAMMPS or CP2K). The main motivation for checkpointing is not to go back to an earlier model, as is often the case in ML, but to continue a simulation and append to existing output files (which can be > 100 GB). This is often required because simulations are carried out on HPC resources with strict time limits (hours to a few days). Reaching the time limit means the job is killed immediately, so the DVC process won't have time to do anything. (Some clusters send a signal before stopping the job, but this is not guaranteed.)

My current solution to this problem is to run the simulation in a temporary directory and move the outputs to their final location only after the process has finished. This way I can call dvc repro multiple times without DVC removing the checkpoint files that were created.
The corresponding dvc.yaml could look something like this:

stages:
  cp2k:
    cmd: mpirun cp2k.psmp -in cp2k.inp && mv tmp/ output/
    deps:
      - cp2k.inp
    outs:
      - output/

Unfortunately, this doesn't work with dvc exp and feels a bit hacky as well.

Potential Solutions

There are a few things that would help me (and potentially many others in the field of HPC who haven't used DVC yet):

  • An option such as dvc queue start --keep-failed. After an experiment was killed, one could then run dvc exp apply <id> and start a new experiment from the checkpoint files that would otherwise be removed. The data would only be removed via dvc exp remove.
  • Because DVC writes a lock file, it knows whether the process was killed (this was changed some time ago; previously one had to remove the rwlock manually if that happened). So it might be possible to have a per-stage restart_if_killed option: instead of removing all outputs on the next invocation, DVC would call cmd again in the existing workspace, and only end the stage if cmd itself fails with an error. A rough sketch of the resume logic such an option would have to trigger is shown after this list.
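
For illustration, here is a minimal stand-alone sketch of the kind of resume logic such a restart_if_killed option would have to trigger for a stage. The restart-file name and the assumption that the solver can consume it as its input are made up for this example; nothing here exists in DVC today.

# resume_or_start.py -- illustrative sketch only.
# The restart file name and the way the solver consumes it are assumptions
# made for this example; they are not provided by DVC.
import os
import subprocess
import sys

RESTART_FILE = "output/cp2k-1.restart"  # hypothetical checkpoint written by the solver

if os.path.exists(RESTART_FILE):
    # A previous run was killed by the scheduler: continue from the checkpoint.
    cmd = ["mpirun", "cp2k.psmp", "-in", RESTART_FILE]
else:
    # Fresh start from the original input file.
    cmd = ["mpirun", "cp2k.psmp", "-in", "cp2k.inp"]

sys.exit(subprocess.run(cmd).returncode)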

Additional Thoughts

Some ideas that might also be really helpful to the HPC community: we have our own queuing system, so running multiple experiments in parallel using something like #8121 would be really powerful.
I also saw that the DVC queue is written in a modular way and might in principle support more than Celery. For HPC, the dask.distributed package is often used because it supports many HPC platforms. I've written dask4dvc for some compatibility, but integrating that into DVC (maybe as pip install dvc[dask] or a dvc-dask extra dependency) could also be really powerful (maybe even for parallelizing dvc repro). This could also be one possible solution for #755.
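
As a rough sketch of the idea (not how dask4dvc is actually implemented), independent stages could be farmed out to a Dask cluster. The stage names below are made up, and concurrent dvc repro calls in the same workspace would still need the kind of coordination an integrated backend would have to provide:

# dask_repro.py -- rough illustration only.
import subprocess

from dask.distributed import Client


def repro_stage(name: str) -> int:
    # --single-item reproduces just this stage, without its upstream dependencies.
    return subprocess.run(["dvc", "repro", "--single-item", name]).returncode


if __name__ == "__main__":
    # Locally this starts a small cluster; on HPC one would typically use a
    # dask_jobqueue cluster (SLURMCluster, PBSCluster, ...) instead.
    client = Client()
    stages = ["cp2k_a", "cp2k_b"]  # hypothetical independent stages
    print(client.gather(client.map(repro_stage, stages)))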
