Collaborator

@skshetry skshetry commented Aug 24, 2020

This does not yet support subrepos as it requires us to work on setting up cache correctly which is pending.
After that, it should just be changing subrepos=False to True.

Still unsure about proper exceptions here, but I did try to unify them somehow (still does not look good though).

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Closes #3182

@skshetry skshetry requested review from efiop, pared and pmrowla August 24, 2020 13:10
@skshetry skshetry self-assigned this Aug 24, 2020
dvc/api.py Outdated
Contributor

Suggested change
- tree = RepoTree(_repo, fetch=True)
+ tree = DvcTree(_repo, fetch=True)

Collaborator Author

@skshetry skshetry Aug 24, 2020

DvcTree is an internal API, and does not support subrepos.

Collaborator Author

Okay, I think I get why you are suggesting that. As DvcTree does not support subrepos, we need to implement RepoTree.get_dvctree(path) or RepoTree.get_repo(repo_path), as get_hash also hashes git-files.

Collaborator Author

Or, get_hash(dvc_only=True)? Thoughts?

Contributor

Yeah, I suggest it because this is not meant to work with git files. Not a big fan of dvc_only=True though, but we could simply do:

if not tree.isdvc(path):
    raise ....
hash = tree.get_hash() 

🙂
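
The guard suggested above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than DVC's actual code: `FakeTree`, `NotDvcTrackedError`, and `get_dvc_hash` are hypothetical stand-ins, and only the method names `isdvc` and `get_hash` are taken from the discussion.

```python
class NotDvcTrackedError(Exception):
    """Hypothetical error for paths that are not DVC-tracked outputs."""


class FakeTree:
    """Hypothetical stand-in for a tree that sees both DVC and git files."""

    def __init__(self, dvc_hashes, git_files):
        self._dvc_hashes = dict(dvc_hashes)
        self._git_files = set(git_files)

    def isdvc(self, path):
        # True only for paths that DVC tracks.
        return path in self._dvc_hashes

    def get_hash(self, path):
        # Like the tree discussed above, this also "hashes" git-only files.
        if path in self._dvc_hashes:
            return self._dvc_hashes[path]
        if path in self._git_files:
            return "git-" + path
        raise FileNotFoundError(path)


def get_dvc_hash(tree, path):
    # Reject git-only paths before asking for a hash.
    if not tree.isdvc(path):
        raise NotDvcTrackedError(path)
    return tree.get_hash(path)


tree = FakeTree({"data.csv": "abc123"}, {"README.md"})
```

With this guard, a git-only file such as `README.md` is rejected up front instead of being hashed.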

Contributor

Theoretically we could've made hash part of the metadata, but that might make it complicated for files inside dirs, as you'll need to parse the dir_cache (as we do in _get_granular_checksum, which is not that bad)
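
To make the "parse the dir_cache" step concrete, here is a minimal sketch of looking up one file's hash inside a directory listing. It assumes the listing is a list of entries with `relpath` and `md5` keys; `find_entry_hash` is a hypothetical helper and is not the `_get_granular_checksum` function mentioned above.

```python
import posixpath


def find_entry_hash(dir_entries, dir_path, file_path):
    """Return the hash of file_path from its parent directory's listing, or None."""
    # Entries are keyed by their path relative to the tracked directory.
    relpath = posixpath.relpath(file_path, dir_path)
    for entry in dir_entries:
        if entry["relpath"] == relpath:
            return entry["md5"]
    return None


# Assumed shape of a parsed directory listing (illustrative values only).
entries = [
    {"relpath": "a.txt", "md5": "111"},
    {"relpath": "sub/b.txt", "md5": "222"},
]
```

The linear scan keeps the sketch short; a real implementation might index the entries by `relpath` if many lookups are needed.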

Contributor

@efiop efiop Aug 24, 2020

Btw, this will automatically close #3182 for files in the directory 🙂

Collaborator Author

Okay, did as you recommended, but metadata, hash, and RepoTree all get mixed together. It feels like this could be part of RepoTree itself.

dvc/api.py Outdated
Contributor

Actually no need for this, we could just use tree.get_hash(PathInfo(_repo.root_dir) / path) directly

dvc/api.py Outdated
Comment on lines +83 to +94
Contributor

Doesn't look like there is a need for this change. Repo.open_by_relpath already does all of this.

Collaborator Author

open_by_relpath will most likely go away, as RepoTree supports subrepos directly.

Contributor

@efiop efiop Aug 24, 2020

@skshetry Agreed, it was only used for the API, so we could delete it in that case.

open_by_relpath also has some weird exceptions that need to be double checked. Maybe not worth messing with this right now, your call.

Collaborator Author

The exceptions do not look good, I have to admit. Repo and ExternalRepo threw different exceptions before, and now I am throwing PathMissingError for both. Unified, but quite verbose.

This does not yet support subrepos as it requires us to
work on setting up cache correctly which is pending.
tmp_dir.scm_gen({"foo": "foo"}, commit="initial")

- with pytest.raises(UrlNotDvcRepoError, match="not a DVC repository"):
+ with pytest.raises(OutputNotFoundError, match="with output 'foo'"):
Contributor

This is changing the API, we shouldn't do it in this PR if we can avoid it.

Contributor

Plus OutputNotFoundError is an internal thing, that shouldn't be exposed to api users.

Collaborator Author

We do have to break things here, as the current exception is not precise enough. OutputNotFoundError is the more accurate term here.

Collaborator Author

Also, we already threw OutputNotFoundError. It was just not documented. The following threw OutputNotFoundError.

https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/api.py#L31

Contributor

@skshetry I think that one was caught by https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L51 and re-raised properly.
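
The "caught and re-raised" pattern described here can be sketched with a context manager that translates an internal exception into a user-facing one. Everything below is illustrative: both exception classes are local stand-ins defined for the example (not imports from DVC), and `translate_errors` and `PublicOutputError` are hypothetical names.

```python
from contextlib import contextmanager


class OutputNotFoundError(Exception):
    """Local stand-in for an internal error that API users should not see."""


class PublicOutputError(Exception):
    """Hypothetical user-facing error raised in its place."""


@contextmanager
def translate_errors(url):
    # Run the wrapped block and convert internal errors at the API boundary.
    try:
        yield
    except OutputNotFoundError as exc:
        # Chain the original exception so the cause stays visible in tracebacks.
        raise PublicOutputError(f"output '{exc}' not found in '{url}'") from exc
```

A caller would wrap the internal work in `with translate_errors(url):` so that only the public exception type escapes.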

Collaborator Author

Ahh missed that. So, we could raise the same exc here?

Contributor

@skshetry But this particular test doesn't catch that one, but rather UrlNotDvcRepoError and it should stay that way. The issue is that we are using RepoTree instead of DvcTree right now, which works for git-only repos, when it shouldn't. Could do something simple like if not tree.dvc_tree: raise UrlNotDvcRepoError though, if subrepos are a concern.
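
A rough sketch of the simple check proposed above, under stated assumptions: `FakeRepoTree` and `ensure_dvc_repo` are hypothetical, `UrlNotDvcRepoError` is redefined locally as a stand-in, and its message is worded only so that it would match the `"not a DVC repository"` pattern from the test.

```python
class UrlNotDvcRepoError(Exception):
    """Local stand-in for the error named in the review."""

    def __init__(self, url):
        super().__init__(f"'{url}' is not a DVC repository.")


class FakeRepoTree:
    """Hypothetical tree; dvc_tree is None for a git-only repository."""

    def __init__(self, dvc_tree=None):
        self.dvc_tree = dvc_tree


def ensure_dvc_repo(tree, url):
    # Fail early for git-only repos instead of letting git files through.
    if not tree.dvc_tree:
        raise UrlNotDvcRepoError(url)
    return tree
```

This keeps the existing public exception for git-only repositories while still allowing the tree type that understands subrepos.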

single_stage=True,
metrics_no_cache=[metric_file],
- cmd=(f'python -c "{metric_code}"'),
+ cmd=f'python -c "{metric_code}"',
Contributor

?
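
For context on the change being questioned, the parentheses around a single f-string are redundant in Python, so the two spellings produce the same value. A small illustration follows; the `metric_code` value is made up for the example.

```python
metric_code = "print(1)"  # illustrative value, not the one used in the test

with_parens = (f'python -c "{metric_code}"')
without_parens = f'python -c "{metric_code}"'

# Only a trailing comma inside the parentheses would change the type.
as_tuple = (f'python -c "{metric_code}"',)
```

Both `with_parens` and `without_parens` are plain strings, so the edit is purely cosmetic.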

@skshetry
Collaborator Author

The suggestions cannot be applied without a complete refactoring of dvc.external_repo. I was trying to break the cycle here, but it looks like that is not possible. Keeping this as a draft until the refactoring is done.

@skshetry skshetry marked this pull request as draft August 25, 2020 13:48
@pared
Contributor

pared commented Aug 26, 2020

Looks good, though I would also say that DvcTree belongs in get_url. Maybe we will be able to make it support subrepos later.

@pared pared closed this Aug 26, 2020
@pared pared reopened this Aug 26, 2020
@pared
Contributor

pared commented Aug 26, 2020

Sorry @skshetry, I clicked "Close and comment" by mistake.

@skshetry
Collaborator Author

Closing as #4465 has the same changes.

@skshetry skshetry closed this Aug 27, 2020
@skshetry skshetry deleted the dvc-api branch August 27, 2020 16:54
Development

Successfully merging this pull request may close these issues.

api: get_url() returns path to .dir for directory

3 participants