Skip to content

Conversation

@sjawhar
Copy link
Contributor

@sjawhar sjawhar commented Nov 19, 2020

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Purpose

I believe #4855 introduced a regression to do with experiments, checkpoints, and run cache. #4911 seems to have tried to address it, but I still get the same error:

2020-11-19 19:10:05,223 ERROR: Failed to reproduce experiment 'a32ea21' - failed to reproduce 'dvc.yaml': restore() got an unexpected keyword argument 'checkpoint_resume'
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/reproduce.py", line 175, in _reproduce_stages
    ret = _reproduce_stage(stage, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/reproduce.py", line 36, in _reproduce_stage
    stage = stage.reproduce(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/lib/python3.8/dist-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/__init__.py", line 339, in reproduce
    self.run(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/lib/python3.8/dist-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/__init__.py", line 465, in run
    run_stage(self, dry, force, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/run.py", line 106, in run_stage
    stage.repo.stage_cache.restore(stage, **kwargs)
TypeError: restore() got an unexpected keyword argument 'checkpoint_resume'

Approach

After digging through the code, I found a spot where it seems like the checkpoint_resume parameter should be removed from the set of kwargs before continuing. I don't see checkpoint_resume being used anywhere outside of dvc.repo.experiments.new and dvc.repo.experiments._resume_checkpoint(), and the former calls the latter and is the only function to do so, so I think this fix is correct.

I spent a lot of time trying to write a test that captured this bug, but I failed. 😒 When I compare what's happening in the stack trace between my repo and the test case, I find a difference here:
https://github.com/iterative/dvc/blob/4cf2f8139bdd2c30d439dea1bb7375ddb63e46d0/dvc/repo/reproduce.py#L169-L172

In my repo kwargs["checkpoint_func"] gets set to None, whereas it's a function in the test case. Then when we get to dvc.stage.run.run_stage(), we hit this condition:
https://github.com/iterative/dvc/blob/4cf2f8139bdd2c30d439dea1bb7375ddb63e46d0/dvc/stage/run.py#L101-L109

Which then causes the error.

@efiop efiop requested a review from pmrowla November 19, 2020 22:14
@sjawhar sjawhar force-pushed the hotfix/experiment-checkpoint-resume branch from 526830a to dab32ea Compare November 19, 2020 23:46
@pmrowla pmrowla added bugfix fixes bug skip-changelog Skips changelog labels Nov 20, 2020
@pmrowla pmrowla merged commit 153c374 into treeverse:master Nov 20, 2020
@pmrowla
Copy link
Contributor

pmrowla commented Nov 20, 2020

Thanks for fixing this! πŸ™

@sjawhar sjawhar deleted the hotfix/experiment-checkpoint-resume branch November 20, 2020 01:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix fixes bug skip-changelog Skips changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants