Dataset storage improvements #1487

@dmpetrov

Description

There have been many requests related to dataset storage that might require a redesign of DVC internals and the CLI API. I'll list the requirements here in the issue description. It would be great to discuss possible solutions in the comments.

  1. A global place for all the datasets. People tend to use a single DVC repo for all their datasets; otherwise, the number of Git repos explodes.
    1.1. Reuse. How can these datasets be reused from different projects and even different repos?
    1.2. List all datasets.
  2. Dataset versioning.
    2.1. Assign a version/tag/label like 1.3 to a specific dataset. A Git tag won't work since we don't need a global tag across all files.
    2.2. See the list of versions/tags/labels for a dataset.
    2.3. How to check out a specific version of a dataset in a convenient way?
    2.4. Ability to get a dataset (at a specified version) without Git. This is the ML model deployment scenario, where Git is not available on production servers.
  3. Storage visibility for non-technical folks like managers.
    3.1. A human-readable cache would be great, so that a manager can browse datasets and models through the S3 web console.
    3.2. If 3.1. is not possible, some UI is needed.
  4. Diffs for dataset versions (see 2.1.): which files were added/deleted/modified.
  5. Dataset synchronization between machines. It looks like DVC solves this. Should we improve this experience?
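To make requirement 4 concrete, the diff could be prototyped by comparing two file manifests. A minimal sketch, assuming each dataset version is described by a `{path: content-hash}` mapping; the function name `diff_manifests` is hypothetical and this is not DVC's actual implementation:

```python
def diff_manifests(old: dict, new: dict) -> dict:
    """Compare two {path: hash} manifests of a dataset and report
    which files were added, deleted, or modified between versions."""
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "deleted": sorted(old_paths - new_paths),
        # Present in both versions but with different content hashes.
        "modified": sorted(p for p in old_paths & new_paths
                           if old[p] != new[p]),
    }

# Example: two versions of a hypothetical dataset.
v1_2 = {"data/a.csv": "h1", "data/b.csv": "h2", "data/c.csv": "h3"}
v1_3 = {"data/a.csv": "h1", "data/b.csv": "h9", "data/d.csv": "h4"}
print(diff_manifests(v1_2, v1_3))
# → {'added': ['data/d.csv'], 'deleted': ['data/c.csv'], 'modified': ['data/b.csv']}
```

Since DVC already stores per-file checksums in its cache, such a comparison would need only the two manifests, not the file contents themselves.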

Bonus question:

  1. Access control. How can I give a particular user access to dataset1 but not to dataset2?

The list can be extended.

UPDATE 1/15/19: Added 2.4.
