Closed
Labels
feature request, question
Description
There have been many requests related to dataset storage, and addressing them might require a redesign of DVC internals and the CLI. I'll list the requirements here in the issue description. It would be great to discuss possible solutions in the comments.
1. A global place for all datasets. People tend to use a single DVC repo for all their datasets; otherwise, the number of Git repos explodes.
1.1. Reuse. How can these datasets be reused from different projects and even different repos?
1.2. List all datasets.
2. Dataset versioning.
2.1. Assign a version/tag/label like 1.3 to a specific dataset. A Git tag won't work, since we don't want a global tag for all files.
2.2. See the list of versions/tags/labels for a dataset.
2.3. How to check out a specific version of a dataset in a convenient way?
2.4. Ability to get a dataset (at a specified version) without Git, for example in an ML model deployment scenario where Git is not available on production servers.
3. Storage visibility for non-technical people such as managers.
3.1. A human-readable cache would be great, so that a manager can see datasets and models through the S3 web UI.
3.2. If 3.1 is not possible, some UI is needed.
4. Diffs for dataset versions (see 2.1): which files were added/deleted/modified.
5. Dataset synchronization between machines. It looks like DVC already solves this. Should we improve this experience?
Bonus question:
- Access control. How can I give a particular user access to dataset1 but not to dataset2?
The list can be extended.
UPDATE 1/15/19: Added 2.4.
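To make the diff requirement (item 4) a bit more concrete, here is a minimal sketch of what a per-dataset diff could compute. It assumes each dataset version can be described by a manifest mapping file paths to checksums. That manifest format, the `DatasetDiff` type, and the `diff_manifests` function are hypothetical names for discussion, not existing DVC internals or API.

```python
from typing import Dict, List, NamedTuple


class DatasetDiff(NamedTuple):
    """Files that differ between two versions of one dataset (hypothetical type)."""
    added: List[str]
    deleted: List[str]
    modified: List[str]


def diff_manifests(old: Dict[str, str], new: Dict[str, str]) -> DatasetDiff:
    """Compare two manifests of the assumed form {relative_path: checksum}."""
    # Present only in the new version.
    added = sorted(path for path in new if path not in old)
    # Present only in the old version.
    deleted = sorted(path for path in old if path not in new)
    # Present in both versions, but with a different checksum.
    modified = sorted(
        path for path in new if path in old and old[path] != new[path]
    )
    return DatasetDiff(added, deleted, modified)


# Made-up manifests for two versions of the same dataset (illustrative data only).
v1_2 = {"images/cat.jpg": "a1", "images/dog.jpg": "b2", "labels.csv": "c3"}
v1_3 = {"images/cat.jpg": "a1", "images/fox.jpg": "d5", "labels.csv": "c4"}

result = diff_manifests(v1_2, v1_3)
print("added:   ", result.added)
print("deleted: ", result.deleted)
print("modified:", result.modified)
```

Comparing by checksum rather than by timestamp means the result does not depend on which machine the files were synced to, which also seems relevant to item 5.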