-
Notifications
You must be signed in to change notification settings - Fork 1.3k
api: refactor to use metadata and RepoTree #4458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dvc/api.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| tree = RepoTree(_repo, fetch=True) | |
| tree = DvcTree(_repo, fetch=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DvcTree is an internal API, and does not support subrepos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think I get why you are suggesting that. As DvcTree does not support subrepos, we need to implement RepoTree.get_dvctree(path) or RepoTree.get_repo(repo_path), as get_hash also hashes git-files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, get_hash(dvc_only=True)? Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suggest it because this is not meant to work with git files. Not a big fan of dvc_only=True though, but we could simply do:
if not tree.isdvc(path):
raise ....
hash = tree.get_hash()
🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically we could've made hash part of the metadata, but that might make it complicated for files inside dirs, as you'll need to parse the dir_cache (as we do in _get_granular_checksum, which is not that bad)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, this will automatically close #3182 for files in the directory 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, did as you recommended, but metadata/hash/RepoTree all gets mixed. Feels like that thing could be a part of RepoTree on itself.
dvc/api.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually no need for this, we could just use tree.get_hash(PathInfo(_repo.root_dir) / path) directly
dvc/api.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like there is a need for this change. Repo.open_by_relpath already does all of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
open_by_relpath will most likely go away, as RepoTree supports subrepos directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skshetry Agreed, it was only used for the API, so we could delete it in that case.
open_by_relpath also has some weird exceptions that need to be double checked. Maybe not worth messing with this right now, your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exceptions do not look good, I do have to tell you that because Repo and ExternalRepo threw different exception before, but now, I am throwing PathMissingError. Unified, but quite verbose.
This does not yet support subrepos as it requires us to work on setting up cache correctly which is pending.
| tmp_dir.scm_gen({"foo": "foo"}, commit="initial") | ||
|
|
||
| with pytest.raises(UrlNotDvcRepoError, match="not a DVC repository"): | ||
| with pytest.raises(OutputNotFoundError, match="with output 'foo'"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is changing the API, we shouldn't do it in this PR if we can avoid it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus OutputNotFoundError is an internal thing, that shouldn't be exposed to api users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have to break things here as it's not precise enough. OutputNotFoundError is more of a correct term here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we already threw OutputNotFoundError. It was just not documented. The following threw OutputNotFoundError.
https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/api.py#L31
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skshetry I think that one was caught by https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L51 and re-raised properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh missed that. So, we could raise the same exc here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skshetry But this particular test doesn't catch that one, but rather UrlNotDvcRepoError and it should stay that way. The issue is that we are using RepoTree instead of DvcTree right now, which works for git-only repos, when it shouldn't. Could do something simple like if not tree.dvc_tree: raise UrlNotDvcRepoError though, if subrepos are a concern.
| single_stage=True, | ||
| metrics_no_cache=[metric_file], | ||
| cmd=(f'python -c "{metric_code}"'), | ||
| cmd=f'python -c "{metric_code}"', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
|
The suggestions cannot be applied without complete refactoring of the |
|
Looks good, though I would also say that |
|
sorry @skshetry, clicked in close and comment |
|
Closing as #4465 has the same changes. |
This does not yet support subrepos as it requires us to work on setting up cache correctly which is pending.
After that, it should just be changing
subrepos=FalsetoTrue.Still unsure about proper exceptions here, but I did try to unify them somehow (still does not look good though).
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Closes #3182