-
Notifications
You must be signed in to change notification settings - Fork 1.3k
api: refactor to use metadata and RepoTree #4458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,16 +2,17 @@ | |
| from contextlib import _GeneratorContextManager as GCM | ||
| from contextlib import contextmanager | ||
|
|
||
| from dvc.exceptions import DvcException, NotDvcRepoError | ||
| from funcy import reraise | ||
|
|
||
| from dvc.exceptions import ( | ||
| NotDvcRepoError, | ||
| OutputNotFoundError, | ||
| PathMissingError, | ||
| ) | ||
| from dvc.external_repo import external_repo | ||
| from dvc.path_info import PathInfo | ||
| from dvc.repo import Repo | ||
|
|
||
|
|
||
| class UrlNotDvcRepoError(DvcException): | ||
| """Thrown if the given URL is not a DVC repository.""" | ||
|
|
||
| def __init__(self, url): | ||
| super().__init__(f"'{url}' is not a DVC repository.") | ||
| from dvc.tree.repo import RepoTree | ||
|
|
||
|
|
||
| def get_url(path, repo=None, rev=None, remote=None): | ||
|
|
@@ -20,17 +21,26 @@ def get_url(path, repo=None, rev=None, remote=None): | |
| in a DVC repo. For Git repos, HEAD is used unless a rev argument is | ||
| supplied. The default remote is tried unless a remote argument is supplied. | ||
|
|
||
| Raises UrlNotDvcRepoError if repo is not a DVC project. | ||
| Raises OutputNotFoundError if the file is not a dvc-tracked file. | ||
|
|
||
| NOTE: This function does not check for the actual existence of the file or | ||
| directory in the remote storage. | ||
| """ | ||
| with _make_repo(repo, rev=rev) as _repo: | ||
| if not isinstance(_repo, Repo): | ||
| raise UrlNotDvcRepoError(_repo.url) # pylint: disable=no-member | ||
| out = _repo.find_out_by_relpath(path) | ||
| remote_obj = _repo.cloud.get_remote(remote) | ||
| return str(remote_obj.tree.hash_to_path_info(out.checksum)) | ||
| tree = RepoTree(_repo, fetch=True) | ||
| path_info = PathInfo(_repo.root_dir) / path | ||
|
|
||
| meta = tree.metadata(path_info) | ||
| exc = OutputNotFoundError(path) | ||
| if not meta.is_dvc: | ||
| raise exc | ||
|
|
||
| with reraise(FileNotFoundError, exc): | ||
| _, hash_ = tree.get_hash(PathInfo(_repo.root_dir) / path) | ||
|
|
||
| assert hash_ and meta.repo | ||
| cloud = meta.repo.cloud | ||
| return cloud.get_url_for(remote, checksum=hash_) | ||
|
|
||
|
|
||
| def open( # noqa, pylint: disable=redefined-builtin | ||
|
|
@@ -74,10 +84,14 @@ def __getattr__(self, name): | |
|
|
||
| def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None): | ||
| with _make_repo(repo, rev=rev) as _repo: | ||
| with _repo.open_by_relpath( | ||
| path, remote=remote, mode=mode, encoding=encoding | ||
| ) as fd: | ||
| yield fd | ||
| tree = RepoTree(_repo, stream=True, fetch=True) | ||
|
|
||
| path = PathInfo(_repo.root_dir) / path | ||
| with reraise(FileNotFoundError, PathMissingError(path, repo)): | ||
| with tree.open( | ||
| path, remote=remote, mode=mode, encoding=encoding | ||
| ) as fd: | ||
| yield fd | ||
|
||
|
|
||
|
|
||
| def read(path, repo=None, rev=None, remote=None, mode="r", encoding=None): | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3,8 +3,7 @@ | |
| import pytest | ||
|
|
||
| from dvc import api | ||
| from dvc.api import UrlNotDvcRepoError | ||
| from dvc.exceptions import FileMissingError | ||
| from dvc.exceptions import OutputNotFoundError, PathMissingError | ||
| from dvc.path_info import URLInfo | ||
| from dvc.utils.fs import remove | ||
|
|
||
|
|
@@ -47,10 +46,10 @@ def test_get_url_external(erepo_dir, cloud): | |
| def test_get_url_requires_dvc(tmp_dir, scm): | ||
| tmp_dir.scm_gen({"foo": "foo"}, commit="initial") | ||
|
|
||
| with pytest.raises(UrlNotDvcRepoError, match="not a DVC repository"): | ||
| with pytest.raises(OutputNotFoundError, match="with output 'foo'"): | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is changing the API, we shouldn't do it in this PR if we can avoid it.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Plus
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We do have to break things here as it's not precise enough. OutputNotFoundError is more of a correct term here.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, we already threw https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/api.py#L31
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @skshetry I think that one was caught by https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L51 and re-raised properly.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ahh missed that. So, we could raise the same exc here?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @skshetry But this particular test doesn't catch that one, but rather UrlNotDvcRepoError and it should stay that way. The issue is that we are using RepoTree instead of DvcTree right now, which works for git-only repos, when it shouldn't. Could do something simple like |
||
| api.get_url("foo", repo=os.fspath(tmp_dir)) | ||
|
|
||
| with pytest.raises(UrlNotDvcRepoError): | ||
| with pytest.raises(OutputNotFoundError): | ||
| api.get_url("foo", repo=f"file://{tmp_dir}") | ||
|
|
||
|
|
||
|
|
@@ -144,7 +143,7 @@ def test_missing(tmp_dir, dvc, remote): | |
|
|
||
| remove("foo") | ||
|
|
||
| with pytest.raises(FileMissingError): | ||
| with pytest.raises(PathMissingError, match="path 'foo' does not exist"): | ||
| api.read("foo") | ||
|
|
||
|
|
||
|
|
@@ -164,12 +163,12 @@ def test_open_not_cached(dvc): | |
| dvc.run( | ||
| single_stage=True, | ||
| metrics_no_cache=[metric_file], | ||
| cmd=(f'python -c "{metric_code}"'), | ||
| cmd=f'python -c "{metric_code}"', | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ? |
||
| ) | ||
|
|
||
| with api.open(metric_file) as fd: | ||
| assert fd.read() == metric_content | ||
|
|
||
| os.remove(metric_file) | ||
| with pytest.raises(FileMissingError): | ||
| with pytest.raises(PathMissingError): | ||
| api.read(metric_file) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DvcTreeis an internal API, and does not support subrepos.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think I get why you are suggesting that. As
DvcTreedoes not support subrepos, we need to implementRepoTree.get_dvctree(path)orRepoTree.get_repo(repo_path), asget_hashalso hashes git-files.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or,
get_hash(dvc_only=True)? Thoughts?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suggest it because this is not meant to work with git files. Not a big fan of
dvc_only=Truethough, but we could simply do:🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically we could've made hash part of the metadata, but that might make it complicated for files inside dirs, as you'll need to parse the dir_cache (as we do in _get_granular_checksum, which is not that bad)
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, this will automatically close #3182 for files in the directory 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, did as you recommended, but
metadata/hash/RepoTreeall gets mixed. Feels like that thing could be a part ofRepoTreeon itself.