Google Drive support further enhancements #2865

@maxhora

Description

This ticket tracks the further improvements required for the Google Drive remote implementation after #2551 is merged into master:

  • Process auth exceptions and print a more meaningful error message on failure. GDrive remote support #2551 (comment)

  • Implement gc command support for gdrive (by overriding def remove(self, path_info))

  • Validate that the resolved remote file object has the expected title (def resolve_remote_item_from_path(self, path_parts, create):)

  • Simplify a few things using new and shiny @wrap_prop and filter_errors param of @retry from funcy 1.14 :)

  • Make name -> id deterministic by choosing minimum id GDrive remote support #2551 (comment)

  • Reimplement caching to have 2 entry points: .path_info_to_ids() and .id_to_path_info(). The idea is to remove the cache dictionaries completely, but this requires introducing a helper method that accepts parent_id and title and returns the actual id (this method introduces a 1 -> 1 relation between its input and the resulting remote id). GDrive remote support #2551 (comment)

  • Simplify resolve_remote_item_from_path GDrive remote support #2551 (comment)

  • Protect create_remote_dir with lock. GDrive remote support #2551 (comment)

  • Organize methods and params in an ordered and unified way. GDrive remote support #2551 (comment)

  • Improve retrieval of the HTTP error from GDrive API exceptions maxhora@783d8b6#r36184990

  • Create an Iterative Google Drive account and Project to share the client id and secret with final DVC users. The highest possible API usage quotas should be requested. We will probably need a separate Google Project for CI. GDrive remote support #2551 (comment)

  • Stop creating a path to the DVC remote root if it does not exist. Google Drive allows multiple folders with the same name (at least one in My Drive and one in Shared With Me), so when a path like gdrive://root/storage is used, collaborators see an empty storage folder after the first dvc pull attempt. Instead, we should just raise a "path does not exist" error.

  • Hide /Users/ivan/Projects/test-gdrive/.env/lib/python3.7/site-packages/oauth2client/_helpers.py:255: UserWarning: Cannot access /Users/ivan/Projects/test-gdrive/.dvc/tmp/gdrive-user-credentials.json: No such file or directory warnings.warn(_MISSING_FILE_MESSAGE.format(filename)) on the first auth.

  • Reconsider self.no_traverse = False. Now even to pull a single file we run the full traversal. We can at least start listing only prefixes we need (e.g. remote/0c/* if we need to check if file 0c1234...ef exists), we can use a parallel exists if we need to check less than 256 files, etc.

  • Note in the docs that a path like gdrive://root is not accessible to other people; a folder ID must be used to actually share data with other team members.

  • We should store one credentials file per remote

  • Do not allow empty root DVC remote path (except shared?). I think it can prevent a lot of strange issues. More on this here remote add: should gdrive://root be an error? #3586

  • Add _download progress Google Drive support further enhancements #2865 (comment) -> gdrive: download: stream & add progress #3722

  • Support file streaming in the dvc.api.open() function (more details in api: support streaming from Google Drive remotes #3408) and update docs

  • Add import-url support

  • Check that close() is handled in the dvc.api.open() context manager.

  • Support show URL in dvc get for GDrive. We can generate an HTTPS link that can be used in a browser if the user is authenticated, the same link you would get when downloading a file from the UI.

  • Fix credentials management for external repos (when we do dvc get, etc). We should have a way to cache them.

  • Support external dependencies and outputs

  • Check that dvc get-url functions properly. Need to come up with some credentials management.

  • Review and add more tests if needed. Starting from API tests.

  • Add an explanatory comment about how paths vs ids work, including all the non 1-1 stuff. Someone running into GDrive for the first time will appreciate it: GDrive remote support #2551 (review). Explain how we deal with it and how caching is involved too.

  • Notify the user on retries, and keep the message on the screen if they are happening at a certain rate

  • Consider raising an exception if there are multiple remote root directories with the same name

  • Move _gdrive_* helpers into a separate api module to simplify testing and reading the code

  • When it's more or less stable, make it trusted by default

  • Check if there is a way to generate URLs for publicly available GDrive files so they can be downloaded without auth. Examples: https://github.com/NVlabs/stylegan/blob/master/pretrained_example.py and https://github.com/NVlabs/stylegan/blob/master/dnnlib/util.py . Ideally, dvc get should work for public objects without asking for auth.

  • Improve performance, pass fields, consider using batch for exists call - https://developers.google.com/drive/api/v2/performance , see example here docs: How to get specific fields when listing files. iterative/PyDrive2#42
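
To make a few of the items above more concrete (deterministic name -> id, listing only the prefixes we need, and the parallel exists threshold), here is a minimal sketch. All names in it (pick_item_id, group_by_prefix, choose_check_strategy) are hypothetical and are not part of the current remote. It also assumes that "minimum id" means the lexicographically smallest id string and that checksums are grouped by their first two characters, as in remote/0c/*. The 256 threshold is the one mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Threshold taken from the checklist item; the constant name is hypothetical.
PARALLEL_EXISTS_LIMIT = 256


def pick_item_id(candidate_ids: Iterable[str]) -> str:
    """Pick one id among same-named items, independent of listing order.

    Assumption: "minimum id" means the lexicographically smallest id string.
    """
    ids = list(candidate_ids)
    if not ids:
        raise FileNotFoundError("no remote item matches this name")
    return min(ids)


def group_by_prefix(
    checksums: Iterable[str], prefix_len: int = 2
) -> Dict[str, List[str]]:
    """Group checksums by directory prefix so only those prefixes get listed."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for checksum in checksums:
        groups[checksum[:prefix_len]].append(checksum)
    return dict(groups)


def choose_check_strategy(num_files: int) -> str:
    """Use per-file exists checks for small requests, prefix listing otherwise."""
    if num_files < PARALLEL_EXISTS_LIMIT:
        return "parallel_exists"
    return "list_prefixes"
```

With helpers like these, a single-file pull would take the parallel_exists branch and skip the full traversal, while larger requests would list only the prefixes returned by group_by_prefix.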

Labels: enhancement, help wanted, p2-medium
