Checkpoint: separate model tuning and redesign behavior by a new command #4821

@dmpetrov

The current checkpoint behavior works well in the model tuning stage, but it is very painful when you need to redesign the code or model and start training from scratch. It seems we need to separate these two kinds of user activity into two different commands: dvc exp run and dvc exp continue.

Another source of confusion in the current design: it is not correct to require checkpoints to be --always-changed. They should be reproducible by default. dvc exp run should start from scratch unless --continue is explicitly requested. Also, I'm not 100% sure that --persist is the right behavior. It feels like a checkpoint is a special type of output.

What checkpoint reproducibility should look like:

$ dvc exp run # runs (let's say) 27 epochs, then Ctrl+C
$ dvc exp run # it should recognize checkpoints and say something like:
Stage 'train' didn't change. To continue training checkpoints use 'dvc exp continue' command.

What can happen next, after canceling:

  1. Continue training with no changes. dvc exp continue starts from the last step. It should add new checkpoints to the same experiment (the same level of the hierarchy).
  2. Change the workspace. vi train.py && dvc exp run will start training from scratch (in the regular way, removing all outputs).
  3. Change the workspace and continue. vi train.py && dvc exp continue should fail and say that continuing is not possible because the workspace was changed; please run another experiment with dvc exp run.
  4. Manual checkout of a checkpoint. Run dvc exp checkout e4253be, change code/params (vi train.py), and run it as a separate experiment: either a new training with dvc exp run, or a continuation from the latest checkpoint with dvc exp continue. In either case, it creates a new experiment (it would be amazing to show it as a new level in the hierarchy, since it is inherited from another experiment; nice-to-have priority).
  5. Automatic checkpoint change. The same as the previous scenario, but automated with hyperparameters: dvc exp continue --params train.dropout 0.18 e4253be, or starting from scratch with dvc exp run --params train.dropout 0.18 e4253be
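
To make the five scenarios easier to discuss, here is a rough Python sketch of the decision table they imply. It is not DVC code: the names Action and plan, and the three inputs (workspace_changed, has_checkpoints, base_rev), are hypothetical and exist only for illustration. The behavior of dvc exp continue when there are no checkpoints is not covered above, so treating it as an error is an assumption.

```python
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    """Hypothetical outcomes; these names are illustrative, not DVC's."""
    NOTHING_TO_DO = auto()                  # "Stage 'train' didn't change..." message
    START_FROM_SCRATCH = auto()             # scenario 2: remove outputs and retrain
    RESUME_SAME_EXPERIMENT = auto()         # scenario 1: append checkpoints
    ERROR_WORKSPACE_CHANGED = auto()        # scenario 3: refuse to continue
    ERROR_NOTHING_TO_CONTINUE = auto()      # assumption: not specified in the proposal
    NEW_EXPERIMENT_FROM_SCRATCH = auto()    # scenarios 4/5 with `dvc exp run`
    NEW_EXPERIMENT_FROM_CHECKPOINT = auto() # scenarios 4/5 with `dvc exp continue`


def plan(command: str,
         workspace_changed: bool,
         has_checkpoints: bool,
         base_rev: Optional[str] = None) -> Action:
    """Map a proposed command and workspace state to the described behavior.

    `base_rev` stands for a checkpoint that was checked out manually or
    passed on the command line (for example "e4253be").
    """
    if command not in ("run", "continue"):
        raise ValueError(f"unknown command: {command!r}")

    # Scenarios 4 and 5: starting from an explicit checkpoint always
    # creates a new experiment, whether or not the workspace changed.
    if base_rev is not None:
        if command == "run":
            return Action.NEW_EXPERIMENT_FROM_SCRATCH
        return Action.NEW_EXPERIMENT_FROM_CHECKPOINT

    if command == "run":
        # Reproducible by default: an unchanged stage with checkpoints is a no-op.
        if has_checkpoints and not workspace_changed:
            return Action.NOTHING_TO_DO
        return Action.START_FROM_SCRATCH

    # command == "continue" without an explicit checkpoint
    if workspace_changed:
        return Action.ERROR_WORKSPACE_CHANGED
    if not has_checkpoints:
        return Action.ERROR_NOTHING_TO_CONTINUE  # assumption, see note above
    return Action.RESUME_SAME_EXPERIMENT
```

For example, plan("continue", workspace_changed=True, has_checkpoints=True) corresponds to scenario 3 and yields Action.ERROR_WORKSPACE_CHANGED, while the same call with base_rev="e4253be" corresponds to scenario 4 and yields Action.NEW_EXPERIMENT_FROM_CHECKPOINT.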

UPDATE 11/2 1am PST: I forgot to specify the SHA e4253be in the last scenario. Updated.

Any feedback is highly appreciated. CC @iterative/engineering
