Description
The current checkpoint behavior works nicely in the model-tuning stage, but it is very painful to make code/model redesign changes and start training from scratch. It seems like we need to separate these two types of user activity into two different commands: `dvc exp run` and `dvc exp continue`.
Another source of confusion with the current design: it is not correct to require checkpoint stages to be `--always-changed`. They should be reproducible by default, and `dvc exp run` should start from scratch unless the user explicitly asks to `--continue`. Also, I'm not 100% sure that `--persist` is the right behavior. It feels like a checkpoint is a special type of output.
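To make the "special type of output" idea concrete, a `dvc.yaml` declaration might look like the sketch below. This is purely hypothetical: the `checkpoint: true` field name is an assumption for illustration, not confirmed syntax.

```yaml
# Hypothetical sketch: a checkpoint declared as a special output type,
# instead of marking the whole stage --always-changed.
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    params:
      - train.dropout
    outs:
      - model.ckpt:
          checkpoint: true   # assumed field name, not current DVC syntax
```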
How checkpoint reproducibility should look:
```
$ dvc exp run    # runs (let's say) 27 epochs, then Ctrl+C
$ dvc exp run    # it should recognize the checkpoints and say something like:
Stage 'train' didn't change. To continue training checkpoints, use the 'dvc exp continue' command.
```
What comes next after canceling:
- Continue training with no changes. `dvc exp continue` starts from the last step. It should add new checkpoints to the same experiment (the same level of the hierarchy).
- Change workspace. `vi train.py && dvc exp run` will start training from scratch (in the regular way, removing all outputs).
- Change workspace and continue. `vi train.py && dvc exp continue` should fail and explain that this is not possible since the workspace was changed - please run another experiment with `dvc exp run`.
- Manual checkout of a checkpoint. `dvc exp checkout e4253be`, change code/params (`vi train.py`), and run it as a separate experiment: it can be a new training (`dvc exp run`) or continue from the latest checkpoint (`dvc exp continue`). In either case it creates a new experiment (it would be amazing to show it as a new level in the hierarchy since it was inherited from another experiment - nice-to-have priority).
- Automatic checkpoint change. The same as the previous one, but automated with hyperparameters, like `dvc exp continue --params train.dropout 0.18 e4253be`, or starting from scratch with `dvc exp run --params train.dropout 0.18 e4253be`.
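For illustration, here is a minimal sketch of a `train.py` that supports both behaviors. The checkpoint file name and the `resume` flag are assumptions standing in for whatever `dvc exp run` (fresh start) and `dvc exp continue` (pick up from the last checkpoint) would arrange:

```python
import json
import os

CKPT = "model.ckpt"  # hypothetical checkpoint file tracked by DVC


def train(total_epochs, resume=False):
    """Toy training loop that can start fresh or resume from a checkpoint.

    `resume=False` simulates `dvc exp run` (train from scratch);
    `resume=True` simulates `dvc exp continue` (start from the last step).
    Returns the epoch the run started from.
    """
    start = 0
    if resume and os.path.exists(CKPT):
        with open(CKPT) as f:
            start = json.load(f)["epoch"] + 1
    for epoch in range(start, total_epochs):
        # ... one epoch of real training would happen here ...
        with open(CKPT, "w") as f:
            json.dump({"epoch": epoch}, f)
    return start


fresh_start = train(5)                  # simulates `dvc exp run`
resumed_from = train(10, resume=True)   # simulates `dvc exp continue`
```

The fresh run always begins at epoch 0 regardless of any leftover checkpoint, matching the "reproducible by default" behavior proposed above; only an explicit continue reads the checkpoint.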
UPDATE 11/2, 1am PST: I forgot to specify the SHA e4253be in the last scenario - updated.
Any feedback is highly appreciated. CC @iterative/engineering