Dataset storage improvements #1487

@dmpetrov

Description

There have been many requests related to dataset storage that might require a redesign of DVC internals and the CLI API. I'll list the requirements here in the issue description. It would be great to discuss possible solutions in the comments.

  1. A global place for all the datasets. People tend to use a single DVC repo for all their datasets; otherwise, the number of Git repos explodes.
    1.1. Reuse. How can these datasets be reused from different projects and even different repos?
    1.2. List all datasets.
  2. Dataset versioning.
    2.1. Assign a version/tag/label like 1.3 to a specific dataset. A Git tag won't work since we don't need a global tag across all files.
    2.2. See the list of versions/tags/labels for a dataset.
    2.3. How to check out a specific version of a dataset in a convenient way?
    2.4. Ability to get a dataset (at a specified version) without Git. This is the ML model deployment scenario, where Git is not available on production servers.
  3. Storage visibility for non-technical folks like managers.
    3.1. A human-readable cache would be great, so that a manager can browse datasets and models through the S3 web console.
    3.2. If 3.1. is not possible, some UI is needed.
  4. Diffs for dataset versions (see 2.1.): which files were added/deleted/modified.
  5. Dataset synchronization between machines. It looks like DVC solves this. Should we improve this experience?
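To make requirement 4 concrete, the diff could be prototyped by comparing two file manifests. A minimal sketch, assuming each dataset version is described by a `{path: content-hash}` mapping; the function name `diff_manifests` is hypothetical and this is not DVC's actual implementation:

```python
def diff_manifests(old: dict, new: dict) -> dict:
    """Compare two {path: hash} manifests of a dataset and report
    which files were added, deleted, or modified between versions."""
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "deleted": sorted(old_paths - new_paths),
        # Present in both versions but with different content hashes.
        "modified": sorted(p for p in old_paths & new_paths
                           if old[p] != new[p]),
    }

# Example: two versions of a hypothetical dataset.
v1_2 = {"data/a.csv": "h1", "data/b.csv": "h2", "data/c.csv": "h3"}
v1_3 = {"data/a.csv": "h1", "data/b.csv": "h9", "data/d.csv": "h4"}
print(diff_manifests(v1_2, v1_3))
# → {'added': ['data/d.csv'], 'deleted': ['data/c.csv'], 'modified': ['data/b.csv']}
```

Since DVC already stores per-file checksums in its cache, such a comparison would need only the two manifests, not the file contents themselves.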

Bonus question:

  1. Access control. How can I give a particular user access to dataset1 but not to dataset2?

The list can be extended.

UPDATE 1/15/19: Added 2.4.
