exp run: `dvc commit` DVC-tracked data deps when stashing an experiment #5859

pmrowla · 2021-04-21T10:03:10Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Will close #5593

Data dependencies are now committed internally (via `dvc commit`) when stashing an experiment, so that any modifications to those data deps are preserved in both workspace and tempdir runs. Previously, the changes were dropped entirely by the `dvc checkout` step of `exp run`.
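
To make the before/after behavior concrete, here is a toy model in plain Python. It is not DVC's implementation: `ToyRepo`, its `cache` dict, and its `metafile_hash` field are hypothetical stand-ins for the DVC cache, the hash recorded in a `.dvc` file, and the workspace copy of a data dep.

```python
import hashlib


def md5(content: str) -> str:
    """Return the hex MD5 digest of a string."""
    return hashlib.md5(content.encode()).hexdigest()


class ToyRepo:
    """Hypothetical stand-in for a repo with one DVC-tracked data file."""

    def __init__(self, content: str) -> None:
        self.workspace = content   # current contents of the data file
        self.cache = {}            # hash -> contents
        self.metafile_hash = ""    # hash recorded in the (imaginary) .dvc file
        self.commit()

    def commit(self) -> None:
        """Roughly like `dvc commit`: cache the data and record its hash."""
        digest = md5(self.workspace)
        self.cache[digest] = self.workspace
        self.metafile_hash = digest

    def checkout(self) -> None:
        """Roughly like `dvc checkout`: make the data match the recorded hash."""
        self.workspace = self.cache[self.metafile_hash]


def run_experiment(repo: ToyRepo, commit_first: bool) -> str:
    """Return the data contents an experiment would actually see."""
    if commit_first:
        repo.commit()    # the behavior described in this PR
    repo.checkout()      # the checkout step mentioned in the description
    return repo.workspace


old = ToyRepo("data")
old.workspace = "modified"
print(run_experiment(old, commit_first=False))  # data

new = ToyRepo("data")
new.workspace = "modified"
print(run_experiment(new, commit_first=True))   # modified
```

Without the commit step, the checkout restores the last recorded version and the modification is lost; with it, the modified data is what the experiment runs against.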

experiments

dberenbaum

Have a couple of test suggestions, but otherwise looks good!

The downside to this behavior will be that it will potentially take a long time to queue an experiment, right? Maybe we need to document that users should do dvc checkout to get back to their original data if they don't want to use the data changes in their workspace. Thoughts @jorgeorpinel?

tests/func/experiments/test_experiments.py

jorgeorpinel · 2021-04-21T17:15:08Z

Thanks @pmrowla 🙏

> Data dependencies are now dvc committed internally when stashing an experiment ... previously the changes were dropped

Q#1. What happens to .dvc and dvc.lock files in the working tree if run fails after the commit? I'm guessing nothing (only changed in the tmp dir)

Q#2. By "stashing" do we literally mean git stash? I didn't realize we still use that internally.

> The downside to this behavior will be that it will potentially take a long time to queue an experiment, right? Maybe we need to document that users should do dvc checkout to get back to their original data

@dberenbaum Idk if commit is usually reported as a slow command but I guess it can be for large datasets?

As for dvc checkout maybe I didn't get it but I don't think that would change anything (assuming .dvc and dvc.lock files aren't modified in the workspace).

pmrowla · 2021-04-22T00:31:25Z

> The downside to this behavior will be that it will potentially take a long time to queue an experiment, right?

@dberenbaum This will depend on the complexity of the user's pipeline and how many data deps they have, but yes it will slow down queueing.
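
As a rough illustration of why this step scales with the amount of data: committing data involves hashing file contents, which means reading every byte. The sketch below uses only the standard library. `file_md5` is a hypothetical helper written for this example, not a DVC function, and DVC's real hashing and caching logic is more involved.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so memory use stays constant."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo_hash(payload: bytes) -> str:
    """Write payload to a temporary file and return its MD5 digest."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data")
        with open(path, "wb") as fobj:
            fobj.write(payload)
        return file_md5(path)
```

The work is linear in file size, so a few large modified data deps can dominate the time it takes to queue an experiment.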

> Q#1. What happens to .dvc and dvc.lock files in the working tree if run fails after the commit? I'm guessing nothing (only changed in the tmp dir)

If exp run fails in a temp dir nothing in the workspace will change. If it fails in the workspace, the workspace will contain any (non-git-committed) changes to .dvc and dvc.lock files that were made prior to the failure. So .dvc files will be modified and contain the hashes from our dvc commit step.

> Q#2. By "stashing" do we literally mean git stash? I didn't realize we still use that internally.

Queued experiments are git stash (merge) commits. We don't use the standard git stash ref directly, but functionality-wise, the experiment queue is just a custom git stash.
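
For readers unfamiliar with how a stash can live outside `refs/stash`, here is a minimal sketch using plain Git commands from Python. It only illustrates the general idea: the ref name `refs/demo/stash` and the helper names are made up for this example and are not claimed to match what DVC does internally. It assumes `git` is on `PATH`.

```python
import os
import subprocess
import tempfile

# Fixed identity so commits work even without a global Git config.
GIT_ENV = {
    **os.environ,
    "GIT_AUTHOR_NAME": "demo",
    "GIT_AUTHOR_EMAIL": "demo@example.com",
    "GIT_COMMITTER_NAME": "demo",
    "GIT_COMMITTER_EMAIL": "demo@example.com",
}


def git(*args: str, cwd: str) -> str:
    """Run a git command in cwd and return its stripped stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, env=GIT_ENV,
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def stash_to_custom_ref(repo: str, ref: str = "refs/demo/stash") -> str:
    """Create a stash commit and point a custom (hypothetical) ref at it."""
    # `git stash create` builds the stash commit without updating any ref.
    sha = git("stash", "create", cwd=repo)
    if not sha:
        raise RuntimeError("no local changes to stash")
    git("update-ref", ref, sha, cwd=repo)
    return sha


def demo() -> tuple:
    """Return (number of parents of the stash commit, whether refs/stash exists)."""
    with tempfile.TemporaryDirectory() as repo:
        params = os.path.join(repo, "params.yaml")
        git("init", "-q", cwd=repo)
        with open(params, "w") as fobj:
            fobj.write("foo: 1\n")
        git("add", "params.yaml", cwd=repo)
        git("commit", "-q", "-m", "init", cwd=repo)
        with open(params, "w") as fobj:
            fobj.write("foo: 2\n")  # uncommitted change to stash

        sha = stash_to_custom_ref(repo)
        # Output format is "<commit> <parent1> <parent2> ...".
        parents = git("rev-list", "--parents", "-n", "1", sha, cwd=repo).split()[1:]
        default_stash = subprocess.run(
            ["git", "rev-parse", "--verify", "-q", "refs/stash"],
            cwd=repo, env=GIT_ENV, capture_output=True,
        )
        return len(parents), default_stash.returncode == 0
```

Calling `demo()` should report a stash commit with two parents (which is why it counts as a merge commit) and no `refs/stash`, since only the custom ref was written.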

@jorgeorpinel

dberenbaum

I tried out the first test manually and it seems like there's maybe another test needed upstream somewhere.

Here's what I did:

mkdir repo
cd repo
git init
dvc init -q
git add .
git commit --quiet -m "init"
echo data > data
dvc add data
echo "import sys; import shutil; shutil.copyfile(sys.argv[1], sys.argv[2])" > copy.py
echo "foo: 1" > params.yaml
dvc run -n copy-file -M metrics.yaml -p foo -d copy.py -d data python copy.py params.yaml metrics.yaml
git add .
git commit -m "run stage"
echo modified > data
dvc exp run

The output from dvc exp run is:

'data.dvc' didn't change, skipping
Stage 'copy-file' didn't change, skipping

No experiment is created, and the contents of `data` are reverted from `modified` back to `data`.

dberenbaum · 2021-04-22T15:39:55Z

@jorgeorpinel I think we may just need to document that large changes to data dependencies in the workspace may slow down experiment queueing.

pmrowla · 2021-04-22T23:34:22Z

> I tried out the first test manually and it seems like there's maybe another test needed upstream somewhere.

@dberenbaum The bug was in calling `exp run` without a target (the original test always called it with a specific target stage). It should be fixed now, and the test has been updated.

jorgeorpinel · 2021-04-25T23:11:40Z

> If it fails in the workspace, the workspace will contain any (non-git-committed) changes to .dvc and dvc.lock files ... contain the hashes from our dvc commit step.

Hm... Is this something we should be worried about? Let's wait and see I guess.
Out of curiosity, how feasible would it be to make it a transaction-type operation so the initial dvc commit gets rolled back (or happens in a tmp dir and gets discarded) if the run fails?

> We don't use the standard git stash ref directly, but functionality-wise, the experiment queue is just a custom git stash.

Could've mentioned it in your blog post @pmrowla !

> I think we may just need to document that large changes to data dependencies in the workspace may slow down experiment queueing.

Agree. Created treeverse/dvc.org#2418

pmrowla · 2021-04-25T23:19:46Z

> how feasible would it be to make it a transaction-type operation so the initial dvc commit gets rolled back (or happens in a tmp dir and gets discarded) if the run fails?

We can do this if we want to, but I don't think this is what users would expect. If dvc repro fails at some random point, we don't roll anything back, and the failed state is left in the repo. exp run should work the same way for experiments, so that the user can debug what happened.

If the user wants to roll back the repo state they can use git reset --hard the same way that they would for a failed repro.
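
As a sketch of that rollback, the following Python script builds a throwaway Git repository, simulates a failed run that left a modified metafile behind, and restores it with `git reset --hard`. The file name `data.dvc` and its contents are placeholders written by the script itself, not output from DVC, and the script assumes `git` is on `PATH`.

```python
import os
import subprocess
import tempfile

# Fixed identity so commits work even without a global Git config.
GIT_ENV = {
    **os.environ,
    "GIT_AUTHOR_NAME": "demo",
    "GIT_AUTHOR_EMAIL": "demo@example.com",
    "GIT_COMMITTER_NAME": "demo",
    "GIT_COMMITTER_EMAIL": "demo@example.com",
}


def git(*args: str, cwd: str) -> str:
    """Run a git command in cwd and return its stripped stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, env=GIT_ENV,
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def write(path: str, text: str) -> None:
    with open(path, "w") as fobj:
        fobj.write(text)


def rollback_demo() -> tuple:
    """Return (was the repo dirty before reset, metafile contents after reset)."""
    with tempfile.TemporaryDirectory() as repo:
        metafile = os.path.join(repo, "data.dvc")  # placeholder metafile
        git("init", "-q", cwd=repo)
        write(metafile, "md5: original-hash\n")
        git("add", "data.dvc", cwd=repo)
        git("commit", "-q", "-m", "track data", cwd=repo)

        # Simulate a failed workspace run that left a new hash behind.
        write(metafile, "md5: hash-from-failed-run\n")
        dirty = bool(git("status", "--porcelain", cwd=repo))

        # Discard all uncommitted changes to Git-tracked files.
        git("reset", "--hard", "-q", cwd=repo)
        with open(metafile) as fobj:
            return dirty, fobj.read()
```

Note that `git reset --hard` only touches Git-tracked files such as `.dvc` and `dvc.lock`. Getting the original data itself back into the workspace would presumably still take a `dvc checkout`, as suggested earlier in the thread.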

jorgeorpinel · 2021-04-26T05:13:53Z

The initial dvc commit is also not obvious (nor documented) so who knows if having changed md5 sums in metafiles will help users debug or just confuse them. But yeah, just wondering. We can prob let it be until (if) we see confusion via support channels.

pmrowla added 3 commits April 21, 2021 18:58

stage,commit: add support for only committing data stages from experiments

7cd8510

exp run: dvc commit data deps when stashing exp changes

692eca7

add test for modified DVC-tracked data when generating experiment

06cb293

pmrowla self-assigned this Apr 21, 2021

pmrowla added the enhancement Enhances DVC label Apr 21, 2021

pmrowla changed the title ~~5593 exp run commit~~ exp run: dvc commit DVC-tracked data deps when stashing an experiment Apr 21, 2021

pmrowla requested a review from dberenbaum April 21, 2021 10:04

dberenbaum reviewed Apr 21, 2021

tests/func/experiments/test_experiments.py (outdated)

test with and without params change

a4439e3

dberenbaum suggested changes Apr 22, 2021

handle run without target

2c63880

dberenbaum approved these changes Apr 23, 2021

pmrowla merged commit e0d90cd into treeverse:master Apr 23, 2021

pmrowla deleted the 5593-exp-run-commit branch April 23, 2021 02:59

jorgeorpinel mentioned this pull request Apr 25, 2021

exp run: internal data committing performance implications treeverse/dvc.org#2418

Closed

2 tasks

jorgeorpinel mentioned this pull request Feb 11, 2022

guide: note that exp run commits data treeverse/dvc.org#3271

Merged
