Skip to content

Conversation

@pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Apr 21, 2021

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Will close #5593

  • Data dependencies are now dvc commited internally when stashing an experiment so that any modifications to that data dep are preserved in both workspace and tempdir runs (previously the changes were dropped entirely by the exp run dvc checkout step).

@pmrowla pmrowla self-assigned this Apr 21, 2021
@pmrowla pmrowla added the enhancement Enhances DVC label Apr 21, 2021
@pmrowla pmrowla changed the title exp run: dvc commit DVC-tracked data deps when stashing an experiment5593 exp run commit exp run: dvc commit DVC-tracked data deps when stashing an experiment Apr 21, 2021
@pmrowla pmrowla requested a review from dberenbaum April 21, 2021 10:04
Copy link
Contributor

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a couple of test suggestions, but otherwise looks good!

The downside to this behavior will be that it will potentially take a long time to queue an experiment, right? Maybe we need to document that users should do dvc checkout to get back to their original data if they don't want to use the data changes in their workspace. Thoughts @jorgeorpinel?

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 21, 2021

Thanks @pmrowla πŸ™

Data dependencies are now dvc commited internally when stashing an experiment ... previously the changes were dropped

Q#1. What happens to .dvc and dvc.lock files in the working tree if run fails after the commit? I'm guessing nothing (only changed in the tmp dir)

Q#2. by "stashing" do we litterally mean git stash? I didn't realize we still use that internally.

The downside to this behavior will be that it will potentially take a long time to queue an experiment, right? Maybe we need to document that users should do dvc checkout to get back to their original data

@dberenbaum Idk if commit is usually reported as a slow command but I guess it can be for large datasets?

As for dvc checkout maybe I didn't get it but I don't think that would change anything (assuming .dvc and dvc.lock files aren't modified in the workspace).

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 22, 2021

The downside to this behavior will be that it will potentially take a long time to queue an experiment, right?

@dberenbaum This will depend on the complexity of the user's pipeline and how many data deps they have, but yes it will slow down queueing.

Q#1. What happens to .dvc and dvc.lock files in the working tree if run fails after the commit? I'm guessing nothing (only changed in the tmp dir)

If exp run fails in a temp dir nothing in the workspace will change. If it fails in the workspace, the workspace will contain any (non-git-committed) changes to .dvc and dvc.lock files that were made prior to the failure. So .dvc files will be modified and contain the hashes from our dvc commit step.

Q#2. by "stashing" do we litterally mean git stash? I didn't realize we still use that internally.

Queued experiments are git stash (merge) commits. We don't use the standard git stash ref directly, but functionality wise, the experiment queue is just a custom git stash.

@jorgeorpinel

Copy link
Contributor

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried out the first test manually and it seems like there's maybe another test needed upstream somewhere.

Here's what I did:

mkdir repo
cd repo
git init
dvc init -q
git add .
git commit --quiet -m "init"
echo data > data
dvc add data
echo "import sys; import shutil; shutil.copyfile(sys.argv[1], sys.argv[2])" > copy.py
echo "foo: 1" > params.yaml
dvc run -n copy-file -M metrics.yaml -p foo -d copy.py -d data python copy.py params.yaml metrics.yaml
git add .
git commit -m "run stage"
echo modified > data
dvc exp run

The output from dvc exp run is:

'data.dvc' didn't change, skipping
Stage 'copy-file' didn't change, skipping

No experiment is created, and data is reverted back from modified to data.

@dberenbaum
Copy link
Contributor

@jorgeorpinel I think we may just need to document that large changes to data dependencies in the workspace may slow down experiment queueing.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 22, 2021

I tried out the first test manually and it seems like there's maybe another test needed upstream somewhere.

@dberenbaum bug was w/calling exp run without a target (the original test always called it with a specific target stage), should be fixed now and the test has been updated

@pmrowla pmrowla merged commit e0d90cd into treeverse:master Apr 23, 2021
@pmrowla pmrowla deleted the 5593-exp-run-commit branch April 23, 2021 02:59
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 25, 2021

If it fails in the workspace, the workspace will contain any (non-git-committed) changes to .dvc and dvc.lock files ... contain the hashes from our dvc commit step.

Hm... Is this something we should be worried about? Let's wait and see I guess.
Out of curiosity, how feasible would it be to make it a transaction-type operation so the initial dvc commit gets rolled back (or happens in a tmp dir and gets discarded) if the run fails?

We don't use the standard git stash ref directly, but functionality wise, the experiment queue is just a custom git stash.

Could've mentioned it in your blog post @pmrowla !

I think we may just need to document that large changes to data dependencies in the workspace may slow down experiment queueing.

Agree. Created treeverse/dvc.org#2418

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 25, 2021

how feasible would it be to make it a transaction-type operation so the initial dvc commit gets rolled back (or happens in a tmp dir and gets discarded) if the run fails?

We can do this if we want to, but I don't think this is what users would expect. If dvc repro fails at some random point, we don't roll anything back, and the failed state is left in the repo. exp run should work the same way for experiments, so that the user can debug what happened.

If the user wants to roll back the repo state they can use git reset --hard the same way that they would for a failed repro.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 26, 2021

The initial dvc commit is also not obvious (nor documented) so who knows if having changed md5 sums in metafiles will help users debug or just confuse them. But yeah, just wondering. We can prob let it be until (if) we see confusion via support channels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Enhances DVC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

exp run: checks out data dependencies (destructive) [QA]

3 participants