Bugfix: pop checkpoint resume from kwargs in experiments #4913
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.
Purpose
I believe #4855 introduced a regression involving experiments, checkpoints, and the run cache. #4911 seems to have tried to address it, but I still get the same error.
Approach
After digging through the code, I found a spot where it seems like the checkpoint_resume parameter should be removed from the set of kwargs before continuing. I don't see checkpoint_resume being used anywhere outside of dvc.repo.experiments.new and dvc.repo.experiments._resume_checkpoint(). The former calls the latter and is the only function to do so, so I think this fix is correct.

I spent a lot of time trying to write a test that captured this bug, but I failed. When I compare what's happening in the stack trace between my repo and the test case, I find a difference here:
https://github.com/iterative/dvc/blob/4cf2f8139bdd2c30d439dea1bb7375ddb63e46d0/dvc/repo/reproduce.py#L169-L172
In my repo kwargs["checkpoint_func"] gets set to None, whereas it's a function in the test case. Then when we get to dvc.stage.run.run_stage(), we hit this condition:
https://github.com/iterative/dvc/blob/4cf2f8139bdd2c30d439dea1bb7375ddb63e46d0/dvc/stage/run.py#L101-L109
Which then causes the error.
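
For reviewers, here is a minimal sketch of the general pattern behind the fix: reading an option with get() leaves it in kwargs so it is forwarded downstream, while pop() consumes it. This does not reproduce the actual DVC failure path. The functions downstream, new_buggy, and new_fixed are hypothetical stand-ins, not DVC code, and the strict downstream signature is an assumption made only for illustration.

```python
def downstream(stage, checkpoint_func=None, **kwargs):
    """Hypothetical callee that does not expect any leftover options."""
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return stage, checkpoint_func


def new_buggy(stage, **kwargs):
    # get() reads the option but leaves it in kwargs,
    # so checkpoint_resume leaks into the downstream call.
    resume_rev = kwargs.get("checkpoint_resume")
    return resume_rev, downstream(stage, **kwargs)


def new_fixed(stage, **kwargs):
    # pop() reads the option and removes it from kwargs
    # before the remaining options are forwarded.
    resume_rev = kwargs.pop("checkpoint_resume", None)
    return resume_rev, downstream(stage, **kwargs)
```

With these stand-ins, new_buggy("train", checkpoint_resume="abc123") raises TypeError, while new_fixed with the same arguments returns the resume revision and forwards only the remaining options.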