Are we in agreement that we don't support checkpoints anymore? (I'm personally still not convinced, primarily because I'm not sure we have a decent replacement; I think one would be needed before removing this.)
Originally posted by @shcheklein in treeverse/dvc.org#4415 (comment)
I don't really think we need a built-in replacement/solution in DVC to handle the checkpoints use case (which, frankly, I'm still not sure is well defined).
People should handle interruption and resuming through the ML framework, and DVC already provides convenient tools to wrap that (params, `persist`, run-cache).
My main points about dropping checkpoints are:
- The current solution provides little value while carrying a significant code/docs maintenance cost.
- It nudges users into incorrect workflows, and the code changes it requires are not properly explained anywhere.
- It introduces ambiguous/unexpected behavior in more complex, realistic pipelines (e.g. how are downstream stages after a `checkpoint: true` stage supposed to be executed? What about `foreach`, or a `dvc.yaml` with more than one model?)
As an example of the second point, here are the things that are "incorrect" in this repo (the same applies to the example in https://dvc.org/doc/user-guide/experiment-management/checkpoints):
- Optimizer state is not handled.
The optimizer's `state_dict` should also be considered when loading/saving, not just the model's.
- No learning rate scheduler.
I would dare say that a fixed learning rate will practically never yield a better model than any kind of lr scheduler.
The scheduler state would also need to be handled when loading/saving (which ties into the issues in the next point).
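To make the two points above concrete, here is a minimal, framework-agnostic sketch of what a checkpoint payload would need to carry for a resume to be faithful. The function names are mine, not the repo's; in a real PyTorch script each `*_state` argument would come from the corresponding `.state_dict()` call, with plain dicts standing in here:

```python
# Hypothetical sketch: a "complete" checkpoint bundles everything needed to
# resume training, not just the model weights (which is all the repo saves).
def make_checkpoint(model_state, optimizer_state, scheduler_state, epoch):
    return {
        "model": model_state,
        "optimizer": optimizer_state,   # e.g. momentum buffers -- currently dropped
        "scheduler": scheduler_state,   # e.g. current lr / step count -- currently absent
        "epoch": epoch,                 # how many epochs were completed
    }

def restore_checkpoint(ckpt):
    # The caller would feed each part to the matching load_state_dict().
    return ckpt["model"], ckpt["optimizer"], ckpt["scheduler"], ckpt["epoch"]

ckpt = make_checkpoint({"w": [0.1]}, {"momentum": [0.0]}, {"last_lr": 0.003}, 5)
model, opt, sched, epoch = restore_checkpoint(ckpt)
print(epoch)  # 5
```

Without the optimizer and scheduler entries, a resumed run silently restarts with fresh momentum buffers and whatever lr the code hardcodes, so "resume" is not actually continuing the same training trajectory.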
- Epochs are handled (arguably) incorrectly.
When picking a checkpoint and resuming from it, the `epochs` param is effectively treated as `epochs + epochs_completed_at_checkpoint`, which differs from its meaning when training from scratch, where `epochs` reflects the total number of epochs.
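One way to keep the `epochs` param meaning "total epochs" in both fresh and resumed runs is to derive the remaining epoch count from the checkpoint. A sketch (the function name is hypothetical, not from the repo):

```python
def epochs_to_run(total_epochs, epochs_completed=0):
    """Interpret the `epochs` param as a total, so a run resumed from a
    checkpoint trains only the remaining epochs instead of adding on top."""
    if epochs_completed >= total_epochs:
        return 0  # the requested total was already reached
    return total_epochs - epochs_completed

# Fresh run with epochs=15: train all 15 epochs.
print(epochs_to_run(15))     # 15
# Resumed from a checkpoint taken after epoch 5: train only 10 more.
print(epochs_to_run(15, 5))  # 10
```

With this convention, the value recorded in `params.yaml` describes the final model the same way whether or not the run was resumed.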
- After resuming from a checkpoint, the experiment can't be easily reproduced.
Let's say we have a completed experiment that used checkpoints:
```yaml
# EXP_A
lr: 0.003
weight_decay: 0
epochs: 15
```
If I run:
```console
$ dvc exp apply EXP_A_CHECKPOINT_5
$ dvc exp run -S lr=99 -S weight_decay=0 -S epochs=10
```
It is no longer possible to reproduce the experiment with a single command; we would have to replay the exact combination of `exp apply` and `exp run`.
And if the checkpoints are deleted, it is not possible to reproduce the experiment at all.
- Resumed experiments are indistinguishable after persisting.
Let's say I have another completed experiment that used checkpoints:
```yaml
# EXP_B
lr: 0.1
weight_decay: 10
epochs: 40
```
And I run:
```console
$ dvc exp apply EXP_B_CHECKPOINT_39
$ dvc exp run -S lr=99 -S weight_decay=0 -S epochs=10
```
Persisting this experiment or the one from the previous point results in an equivalent state in the repo as far as params and the `step` metric go, even though the training that produced each model is completely different.