HPC and non-python checkpointing #9235

@PythonFZ

Description

My workflow

I'm using DVC with CLI tools that provide their own checkpointing solutions (e.g. LAMMPS or CP2K). The main motivation for checkpointing is not to go back to an earlier model, as is often the case in ML, but to continue a simulation and append to existing output files (which can be > 100 GB). This is often required because simulations are carried out on HPC resources with strict time limits (hours to a few days). Reaching the time limit means the job is killed immediately, so the DVC process won't have time to do anything. (Some clusters send a signal before stopping the job, but this is not guaranteed.)

My current solution to this problem is to run the simulation in a temporary directory and move the outputs to their final location only after the process has finished. This way I can call dvc repro multiple times without DVC removing the checkpoint files that were created.
The corresponding dvc.yaml could look something like this:

stages:
  cp2k:
    cmd: mpirun cp2k.psmp -in cp2k.inp && mv tmp/ output/
    deps:
      - cp2k.inp
    outs:
      - output/

Unfortunately, this doesn't work with dvc exp and feels a bit hacky as well.

Potential Solutions

There are a few things that would help me (and potentially many others in the field of HPC who haven't used DVC yet):

  • An option such as dvc queue start --keep-failed. After an experiment was killed, one could then run dvc exp apply <id> and start a new experiment from the checkpoint files that would otherwise be removed. The data would only be removed via dvc exp remove.
  • Because DVC writes a lock file, it knows whether the process was killed (this was changed some time ago; previously one had to remove the rwlock manually if that happened). So it might be possible to have a per-stage restart_if_killed option: instead of removing all outputs on the next invocation, DVC would call cmd again in the existing workspace, and only end the stage if cmd itself fails with an error. A rough sketch of the resume logic such an option would have to trigger is shown after this list.
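
For illustration, here is a minimal stand-alone sketch of the kind of resume logic such a restart_if_killed option would have to trigger for a stage. The restart-file name and the assumption that the solver can consume it as its input are made up for this example; nothing here exists in DVC today.

# resume_or_start.py -- illustrative sketch only.
# The restart file name and the way the solver consumes it are assumptions
# made for this example; they are not provided by DVC.
import os
import subprocess
import sys

RESTART_FILE = "output/cp2k-1.restart"  # hypothetical checkpoint written by the solver

if os.path.exists(RESTART_FILE):
    # A previous run was killed by the scheduler: continue from the checkpoint.
    cmd = ["mpirun", "cp2k.psmp", "-in", RESTART_FILE]
else:
    # Fresh start from the original input file.
    cmd = ["mpirun", "cp2k.psmp", "-in", "cp2k.inp"]

sys.exit(subprocess.run(cmd).returncode)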

Additional Thoughts

Some ideas that might also be really helpful to the HPC community: we have our own queuing system, so running multiple experiments in parallel using something like #8121 would be really powerful.
I also saw that the DVC queue is written in a modular way and might in principle support more than Celery. For HPC, the dask.distributed package is often used because it supports many HPC platforms. I've written dask4dvc for some compatibility, but integrating that into DVC (maybe as pip install dvc[dask] or a dvc-dask extra dependency) could also be really powerful (maybe even for parallelizing dvc repro). This could also be one possible solution for #755.
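
As a rough sketch of the idea (not how dask4dvc is actually implemented), independent stages could be farmed out to a Dask cluster. The stage names below are made up, and concurrent dvc repro calls in the same workspace would still need the kind of coordination an integrated backend would have to provide:

# dask_repro.py -- rough illustration only.
import subprocess

from dask.distributed import Client


def repro_stage(name: str) -> int:
    # --single-item reproduces just this stage, without its upstream dependencies.
    return subprocess.run(["dvc", "repro", "--single-item", name]).returncode


if __name__ == "__main__":
    # Locally this starts a small cluster; on HPC one would typically use a
    # dask_jobqueue cluster (SLURMCluster, PBSCluster, ...) instead.
    client = Client()
    stages = ["cp2k_a", "cp2k_b"]  # hypothetical independent stages
    print(client.gather(client.map(repro_stage, stages)))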
