Run graph checks on collect/find_outs_by_path
#5035
We try to optimize `tree.exists` calls, and probably a few others, so that they look either directly into the workspace or into the cache without running graph checks. Because of those optimizations, it does not seem possible to run graph checks only in `find_outs_by_path`, so `collect` also runs a graph check. Fixes treeverse#5027 Fixes treeverse#4010
-        if not outs:
-            outs = [out for stage in self.stages for out in stage.outs]
+        # using `outs_graph` to ensure graph checks are run
+        outs = outs or self.outs_graph
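For readers unfamiliar with the pattern, here is a minimal sketch of why routing the lookup through a lazily built graph property forces the checks to run. This is not DVC's implementation: `MiniRepo`, `Out`, and `OutputDuplicationError` are hypothetical stand-ins, and the only check modelled is a duplicate-output check.

```python
from dataclasses import dataclass, field
from functools import cached_property


class OutputDuplicationError(Exception):
    """Hypothetical error: two stages declare the same output path."""


@dataclass(frozen=True)
class Out:
    path: str
    stage: str


@dataclass
class MiniRepo:
    # Hypothetical stand-in for a repo; holds every output of every stage.
    outs: list = field(default_factory=list)

    @cached_property
    def outs_graph(self):
        # Building the "graph" is the point where validation happens,
        # so any caller that touches this property gets the checks.
        seen = {}
        for out in self.outs:
            if out.path in seen:
                raise OutputDuplicationError(
                    f"{out.path!r} is an output of both "
                    f"{seen[out.path].stage!r} and {out.stage!r}"
                )
            seen[out.path] = out
        return list(self.outs)

    def find_outs_by_path(self, path, outs=None):
        # Falling back to `outs_graph` (not the raw list) triggers the checks.
        outs = outs or self.outs_graph
        return [out for out in outs if out.path == path]
```

With a raw list comprehension as the fallback, the duplicate would pass through silently; with the property, the first lookup raises.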
The disadvantage might be that, for example, `dvc.api.open` could start raising these unwanted errors when the graph is not correct.
Maybe, instead of raising these errors every time, we should only error out if n(outs) > 1 in the RepoTree?
But maybe it's not the tree that needs to worry about this.
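The alternative floated above (error only when more than one output matches the requested path) could look roughly like the sketch below, wherever it ends up living. The names `pick_out` and `AmbiguousOutputError` are hypothetical, and outputs are modelled as plain dicts rather than DVC objects.

```python
class AmbiguousOutputError(Exception):
    """Hypothetical error: several outputs match the requested path."""


def pick_out(outs, path):
    # Only the outputs matching this path are inspected, so problems
    # elsewhere in the graph do not affect an unrelated lookup.
    matches = [out for out in outs if out["path"] == path]
    if len(matches) > 1:
        owners = ", ".join(sorted(out["stage"] for out in matches))
        raise AmbiguousOutputError(f"{path!r} is produced by: {owners}")
    return matches[0] if matches else None
```

The trade-off is that an invalid graph goes unnoticed until someone asks for one of the conflicting paths.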
There are lots of assumptions throughout the code that our outs/stages are consistent with a proper DAG, so we do indeed need to check the graph. There have been some discussions about whether some of the DAG checks are really necessary (e.g. overlapping outputs might be used to `dvc checkout` particular versions of datasets on demand), but so far there hasn't been a good scenario that people were actively asking for.
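To make "overlapping outputs" concrete, here is a small sketch of the kind of check being discussed: two output paths overlap when they are equal or one sits inside the other. `overlapping_pairs` is a hypothetical helper written for illustration, not a DVC function, and it assumes POSIX-style relative paths.

```python
from itertools import combinations
from pathlib import PurePosixPath


def overlapping_pairs(paths):
    # Compare every pair once; a pair overlaps if the paths are identical
    # or one is an ancestor directory of the other.
    parsed = [PurePosixPath(p) for p in paths]
    pairs = []
    for a, b in combinations(parsed, 2):
        if a == b or a in b.parents or b in a.parents:
            pairs.append((str(a), str(b)))
    return pairs
```

Comparing path components rather than string prefixes keeps `data` and `database` from being reported as overlapping.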
RepoDependency, for example, doesn't have any path_info. See: treeverse#4938 (comment). Related: treeverse#5035
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏