Skip to content

Conversation

@skshetry
Copy link
Collaborator

@skshetry skshetry commented Dec 5, 2020

We try to optimize tree.exists calls and probably a few others
in that, they either look directly into the workspace or,
to the cache without running graph checks. It does not seem
to be possible just to run graph checks on find_outs_by_path
due to those optimizations.

So, that's why the collect also does a graph check for this
reason.

Fixes #5027
Fixes #4010

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

We try to optimize `tree.exists` calls and probably few others
in that they either look directly into the workspace or,
to the cache without running graph checks. It does not seem
to be possible just to run graph checks on `find_outs_by_path`
due to those optimizations.

So, that's why, the `collect` also does a graph check for this
reason.

Fixes treeverse#5027
Fixes treeverse#4010
@skshetry skshetry added the bugfix fixes bug label Dec 5, 2020
@skshetry skshetry requested review from efiop, pared and pmrowla December 5, 2020 03:38
@skshetry skshetry self-assigned this Dec 5, 2020
if not outs:
outs = [out for stage in self.stages for out in stage.outs]
# using `outs_graph` to ensure graph checks are run
outs = outs or self.outs_graph
Copy link
Collaborator Author

@skshetry skshetry Dec 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The disadvantage of it might be that, for example, dvc.api.open might start giving these unwanted errors, if their graphs are not correct.

Maybe, instead of giving these errors at all the times, we should only error out if n(outs) > 1 in the RepoTree?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe it's not the tree that needs to worry about this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this change is unnecessary to solve #5027, but for #4010.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are lots of assumptions around the code about our outs/stages being in-check with the proper DAG, so we indeed need to check the graph. There were some discussions around whether or not some of the dag checks are really that necessary (e.g. overlapping outputs might be used to dvc checkout particular versions of datasets on demand), but so far there wasn't a good scenario that people were actively asking for.

@efiop efiop merged commit 189e88f into treeverse:master Dec 10, 2020
@skshetry skshetry deleted the graph-checks branch December 10, 2020 10:23
@pmrowla pmrowla mentioned this pull request Dec 12, 2020
2 tasks
skshetry added a commit to skshetry/dvc that referenced this pull request Dec 15, 2020
RepoDependency for example don't have any path_info

See: treeverse#4938 (comment)
Related: treeverse#5035
skshetry added a commit that referenced this pull request Dec 15, 2020
RepoDependency for example don't have any path_info

See: #4938 (comment)
Related: #5035
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix fixes bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ValueError while reading dvc repo provide helpful message on output duplications

2 participants