Are we in agreement that we don't support checkpoints anymore? (I'm personally still not convinced, primarily because I'm not sure we have a decent replacement; I think one would be needed before removing this.)
Originally posted by @shcheklein in treeverse/dvc.org#4415 (comment)
I don't really think we need a built-in replacement/solution in DVC to handle the checkpoints use case (which, frankly, I'm still not sure is well defined).
People should handle interruption and resuming through the ML framework, and DVC already provides convenient tools to wrap that (params, `persist`, run-cache).
My main points about dropping checkpoints are:
- The current solution provides little value while carrying a significant code/docs maintenance cost.
- It nudges users into incorrect workflows, and the code changes it requires are not properly explained anywhere.
- It introduces ambiguous/unexpected behavior in more complex, realistic pipelines (e.g. how are downstream stages after a `checkpoint: true` stage supposed to be executed? What about `foreach`, or a `dvc.yaml` with more than one model?)
As an example of the second point, here are the things that are "incorrect" in this repo (the same applies to the example in https://dvc.org/doc/user-guide/experiment-management/checkpoints):
- Optimizer state is not handled.
The optimizer's `state_dict` should also be considered when loading/saving, not just the model's.
- No learning rate scheduler.
I would dare say that a fixed learning rate will practically never yield a better model than any kind of lr scheduler.
The scheduler state would also need to be handled when loading/saving (which ties into the issues in the next point).
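To make the two points above concrete, here is a minimal, framework-agnostic sketch of what a checkpoint payload would need to carry for a resume to be faithful. The function names are mine, not the repo's; in a real PyTorch script each `*_state` argument would come from the corresponding `.state_dict()` call, with plain dicts standing in here:

```python
# Hypothetical sketch: a "complete" checkpoint bundles everything needed to
# resume training, not just the model weights (which is all the repo saves).
def make_checkpoint(model_state, optimizer_state, scheduler_state, epoch):
    return {
        "model": model_state,
        "optimizer": optimizer_state,   # e.g. momentum buffers -- currently dropped
        "scheduler": scheduler_state,   # e.g. current lr / step count -- currently absent
        "epoch": epoch,                 # how many epochs were completed
    }

def restore_checkpoint(ckpt):
    # The caller would feed each part to the matching load_state_dict().
    return ckpt["model"], ckpt["optimizer"], ckpt["scheduler"], ckpt["epoch"]

ckpt = make_checkpoint({"w": [0.1]}, {"momentum": [0.0]}, {"last_lr": 0.003}, 5)
model, opt, sched, epoch = restore_checkpoint(ckpt)
print(epoch)  # 5
```

Without the optimizer and scheduler entries, a resumed run silently restarts with fresh momentum buffers and whatever lr the code hardcodes, so "resume" is not actually continuing the same training trajectory.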
- Epochs are handled (arguably) incorrectly.
When picking a checkpoint and resuming from it, the `epochs` param is effectively treated as `epochs + epochs_completed_at_checkpoint`, which differs from its meaning when training from scratch, where `epochs` reflects the total number of epochs.
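One way to keep the `epochs` param meaning "total epochs" in both fresh and resumed runs is to derive the remaining epoch count from the checkpoint. A sketch (the function name is hypothetical, not from the repo):

```python
def epochs_to_run(total_epochs, epochs_completed=0):
    """Interpret the `epochs` param as a total, so a run resumed from a
    checkpoint trains only the remaining epochs instead of adding on top."""
    if epochs_completed >= total_epochs:
        return 0  # the requested total was already reached
    return total_epochs - epochs_completed

# Fresh run with epochs=15: train all 15 epochs.
print(epochs_to_run(15))     # 15
# Resumed from a checkpoint taken after epoch 5: train only 10 more.
print(epochs_to_run(15, 5))  # 10
```

With this convention, the value recorded in `params.yaml` describes the final model the same way whether or not the run was resumed.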
- After resuming from a checkpoint, the experiment can't be easily reproduced.
Let's say we have a completed experiment that used checkpoints:
```yaml
# EXP_A
lr: 0.003
weight_decay: 0
epochs: 15
```
If I run:
```console
$ dvc exp apply EXP_A_CHECKPOINT_5
$ dvc exp run -S lr=99 -S weight_decay=0 -S epochs=10
```
It is no longer possible to reproduce the experiment with a single command; we would have to replay the exact combination of `exp apply` and `exp run`.
And if the checkpoints are deleted, it is not possible to reproduce the experiment at all.
- Resumed experiments are indistinguishable after persisting.
Let's say I have another completed experiment that used checkpoints:
```yaml
# EXP_B
lr: 0.1
weight_decay: 10
epochs: 40
```
And I run:
```console
$ dvc exp apply EXP_B_CHECKPOINT_39
$ dvc exp run -S lr=99 -S weight_decay=0 -S epochs=10
```
Persisting this experiment or the one from the previous point results in an equivalent state in the repo as far as params and the `step` metric go, even though the training that produced each model is completely different.