Skip to content

Conversation

@daavoo
Copy link
Contributor

@daavoo daavoo commented Dec 27, 2022

Closes #4428

@daavoo
Copy link
Contributor Author

daavoo commented Dec 27, 2022

    Thanks @daavoo! It's been a very long time since I looked into this issue, and I'm no longer sure about it, and I'm hesitant to put it in a minor version release.

Let's say I'm using experiments, and my training stage outputs both a model that's cached in dvc and metrics that aren't. With the recent name changes, we made it possible to have duplicate experiments, but it didn't seem so bad because duplicate stages would still be cached and skipped. With this PR, I will end up duplicating all the computation of this training stage.

It also makes dvc much more reliant on dvc.lock. Before this PR, It's possible to have a workflow where you treat dvc.lock as ephemeral and don't commit or share it. As long as you share the run-cache, it's still easy to recover everything. After this PR, there's no way to recover the outputs from a stage with cache: false outputs unless you have dvc.lock.

Should dvc be saving the git hash for objects that have cache: false instead of the dvc md5? In the typical case for cache: false where those files are being tracked by git, it seems like it's still useful to have the run-cache but there's no need to add those files to the dvc cache since they can be found in git objects.

What do you think?

Originally posted by @dberenbaum in #8718 (comment)

@daavoo
Copy link
Contributor Author

daavoo commented Dec 27, 2022

Let's say I'm using experiments, and my training stage outputs both a model that's cached in dvc and metrics that aren't. With the recent name changes, we made it possible to have duplicate experiments, but it didn't seem so bad because duplicate stages would still be cached and skipped. With this PR, I will end up duplicating all the computation of this training stage.

Feels like this goes maybe back to some older comments in #4428 to add an option to explicitly control run-cache?

It also makes dvc much more reliant on dvc.lock. Before this PR, It's possible to have a workflow where you treat dvc.lock as ephemeral and don't commit or share it. As long as you share the run-cache, it's still easy to recover everything. After this PR, there's no way to recover the outputs from a stage with cache: false outputs unless you have dvc.lock.

The whole run-cache behavior is widely undocumented so I am not sure how common this ephemeral dvc.lock workflow would be.

Still feels that the current cache: false + run-cache behavior is mostly unexpected for users in most common workflows, perhaps as mentioned above we could disable when cache: false but introduce an option to explicitly enable it?

@codecov
Copy link

codecov bot commented Dec 27, 2022

Codecov Report

Base: 93.59% // Head: 93.64% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (e433e2f) compared to base (f507e4d).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8738      +/-   ##
==========================================
+ Coverage   93.59%   93.64%   +0.05%     
==========================================
  Files         456      456              
  Lines       36119    36111       -8     
  Branches     5230     5232       +2     
==========================================
+ Hits        33804    33816      +12     
+ Misses       1816     1799      -17     
+ Partials      499      496       -3     
Impacted Files Coverage Δ
dvc/stage/__init__.py 94.00% <100.00%> (+0.01%) ⬆️
tests/func/test_run_cache.py 100.00% <100.00%> (ø)
tests/unit/stage/test_cache.py 100.00% <100.00%> (ø)
tests/conftest.py 83.33% <0.00%> (+1.51%) ⬆️
tests/unit/test_analytics.py 100.00% <0.00%> (+2.85%) ⬆️
dvc/analytics.py 95.52% <0.00%> (+4.47%) ⬆️
dvc/testing/fixtures.py 97.50% <0.00%> (+8.33%) ⬆️
tests/remotes/git_server.py 42.42% <0.00%> (+9.09%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@dberenbaum
Copy link
Contributor

Still feels that the current cache: false + run-cache behavior is mostly unexpected for users in most common workflows

Yup, I agree. The behavior in this PR (which I haven't tested) seems more expected from what's documented. My concerns are because:

  1. It's a complex issue that I hadn't looked at in a long time. Having looked at it again now, I'm less concerned.
  2. It will cause noticeable changes in a minor version. Since we are probably moving closer to documented behavior, it's not so bad, but it still may be surprising.
  3. It still feels like it's not ideal for some of the scenarios mentioned above. We could keep discussing options to improve this.

We could always do it knowing that we may have to revert until 3.0 if users start to complain.

@dberenbaum
Copy link
Contributor

We could always do it knowing that we may have to revert until 3.0 if users start to complain.

@daavoo Let's go ahead with this PR and keep an eye on whether it causes issues for users.

@daavoo
Copy link
Contributor Author

daavoo commented Jan 3, 2023

@daavoo Let's go ahead with this PR and keep an eye on whether it causes issues for users.

I was thinking that we could try to cover the scenario where only metrics or plots are cache: false by only applying the check to actual outs so:

Let's say I'm using experiments, and my training stage outputs both a model that's cached in dvc and metrics that aren't.

Would actually create the run-cache.

WDYT @dberenbaum ?

@dberenbaum
Copy link
Contributor

dberenbaum commented Jan 4, 2023

Yes, I like it! Let's go with that approach 🙏 .

Edit: one thing to consider is this doesn't help for dvclive experiments or anyone using top-level metrics, plots, etc. (listing them as regular outputs with cache false in the stage but listing them separately in other top-level sections), but I guess we can revisit that issue as a follow-up.

@daavoo daavoo force-pushed the run-cache-for-cache-false branch from 57d37f7 to 39d938e Compare January 9, 2023 11:30
@daavoo daavoo changed the title [WIP] stage: Skip run-cache if any out has cache: false. stage: Skip run-cache if any out has cache: false. Jan 9, 2023
@daavoo daavoo marked this pull request as ready for review January 9, 2023 11:30
@daavoo daavoo added A: run-cache Related to the run-cache feature enhancement Enhances DVC labels Jan 9, 2023
@daavoo daavoo force-pushed the run-cache-for-cache-false branch 2 times, most recently from dc5099d to 971717d Compare January 9, 2023 11:39
@daavoo daavoo requested review from a team and dberenbaum January 9, 2023 11:40
@daavoo
Copy link
Contributor Author

daavoo commented Jan 9, 2023

Yes, I like it! Let's go with that approach 🙏 .

Updated the P.R.

Edit: one thing to consider is this doesn't help for dvclive experiments or anyone using top-level metrics, plots, etc. (listing them as regular outputs with cache false in the stage but listing them separately in other top-level sections), but I guess we can revisit that issue as a follow-up.

Need to properly discuss but in that specific scenario(top-level metrics/plots listed as outs with cache: false), is there a real need for those files to be listed in outs?
Given that they will be tracked with Git, I would expect that referencing them only in the top-level metrics/plots would be enough

Copy link
Contributor

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Edit: one thing to consider is this doesn't help for dvclive experiments or anyone using top-level metrics, plots, etc. (listing them as regular outputs with cache false in the stage but listing them separately in other top-level sections), but I guess we can revisit that issue as a follow-up.

I was thinking they wouldn't be collected during metrics/plots/params/exp commands, but that's not impacted here, so I don't think it's worth worrying about it.

Copy link
Contributor

@karajan1001 karajan1001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM,

)
self.save(
allow_missing=allow_missing,
run_cache=not no_commit and not no_cache_outs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

           cache_outs = not (no_commit or any(
                not out.use_cache
                for out in self.outs
                if not (out.is_metric or out.is_plot)
            ))

@daavoo daavoo force-pushed the run-cache-for-cache-false branch 2 times, most recently from 0e45228 to 547c1eb Compare January 12, 2023 12:04
@daavoo daavoo enabled auto-merge (rebase) January 12, 2023 16:55
@daavoo daavoo force-pushed the run-cache-for-cache-false branch from 547c1eb to fc53575 Compare January 12, 2023 16:56
@daavoo daavoo force-pushed the run-cache-for-cache-false branch from fc53575 to e433e2f Compare January 12, 2023 17:33
@daavoo daavoo merged commit 8b075eb into main Jan 12, 2023
@daavoo daavoo deleted the run-cache-for-cache-false branch January 12, 2023 18:10
daavoo added a commit to treeverse/dvc.org that referenced this pull request Feb 6, 2023
daavoo added a commit to treeverse/dvc.org that referenced this pull request Feb 8, 2023
daavoo added a commit to treeverse/dvc.org that referenced this pull request Feb 8, 2023
daavoo added a commit that referenced this pull request Apr 28, 2023
After #8738 , run cache is not saved for a stage if any output has `cache: false`.
daavoo added a commit that referenced this pull request May 2, 2023
After #8738 , run cache is not saved for a stage if any output has `cache: false`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A: run-cache Related to the run-cache feature enhancement Enhances DVC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DVC Repro incorrectly saving directory to cache

3 participants