Description
The current checkpoint behavior works nicely in the model-tuning stage, but it is very painful to make code/model redesign changes and start training from scratch. It seems like we need to separate these two types of user activity into two different commands: `dvc exp run` and `dvc exp continue`.
Another source of confusion with the current design: it is not correct to require checkpoint stages to be `--always-changed`. They should be reproducible by default, and `dvc exp run` should start from scratch unless the user explicitly asks to `--continue`. Also, I'm not 100% sure that `--persist` is the right behavior. It feels like a checkpoint is a special type of output.
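To make the "special type of output" idea concrete, a `dvc.yaml` declaration might look like the sketch below. This is purely hypothetical: the `checkpoint: true` field name is an assumption for illustration, not confirmed syntax.

```yaml
# Hypothetical sketch: a checkpoint declared as a special output type,
# instead of marking the whole stage --always-changed.
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    params:
      - train.dropout
    outs:
      - model.ckpt:
          checkpoint: true   # assumed field name, not current DVC syntax
```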
How checkpoint reproducibility should look:
```
$ dvc exp run    # runs (let's say) 27 epochs, then Ctrl+C
$ dvc exp run    # it should recognize the checkpoints and say something like:
Stage 'train' didn't change. To continue training checkpoints, use the 'dvc exp continue' command.
```
What comes next after canceling:
- Continue training with no changes. `dvc exp continue` starts from the last step. It should add new checkpoints to the same experiment (the same level of the hierarchy).
- Change workspace. `vi train.py && dvc exp run` will start training from scratch (in the regular way, removing all outputs).
- Change workspace and continue. `vi train.py && dvc exp continue` should fail and explain that this is not possible since the workspace was changed - please run another experiment with `dvc exp run`.
- Manual checkout of a checkpoint. `dvc exp checkout e4253be`, change code/params (`vi train.py`), and run it as a separate experiment: it can be a new training (`dvc exp run`) or continue from the latest checkpoint (`dvc exp continue`). In either case it creates a new experiment (it would be amazing to show it as a new level in the hierarchy since it was inherited from another experiment - nice-to-have priority).
- Automatic checkpoint change. The same as the previous one, but automated with hyperparameters, like `dvc exp continue --params train.dropout 0.18 e4253be`, or starting from scratch with `dvc exp run --params train.dropout 0.18 e4253be`.
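For illustration, here is a minimal sketch of a `train.py` that supports both behaviors. The checkpoint file name and the `resume` flag are assumptions standing in for whatever `dvc exp run` (fresh start) and `dvc exp continue` (pick up from the last checkpoint) would arrange:

```python
import json
import os

CKPT = "model.ckpt"  # hypothetical checkpoint file tracked by DVC


def train(total_epochs, resume=False):
    """Toy training loop that can start fresh or resume from a checkpoint.

    `resume=False` simulates `dvc exp run` (train from scratch);
    `resume=True` simulates `dvc exp continue` (start from the last step).
    Returns the epoch the run started from.
    """
    start = 0
    if resume and os.path.exists(CKPT):
        with open(CKPT) as f:
            start = json.load(f)["epoch"] + 1
    for epoch in range(start, total_epochs):
        # ... one epoch of real training would happen here ...
        with open(CKPT, "w") as f:
            json.dump({"epoch": epoch}, f)
    return start


fresh_start = train(5)                  # simulates `dvc exp run`
resumed_from = train(10, resume=True)   # simulates `dvc exp continue`
```

The fresh run always begins at epoch 0 regardless of any leftover checkpoint, matching the "reproducible by default" behavior proposed above; only an explicit continue reads the checkpoint.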
UPDATE 11/2, 1am PST: I forgot to specify the SHA e4253be in the last scenario - updated.
Any feedback is highly appreciated. CC @iterative/engineering