Support pulling named subsets of data, or excluding files from pull #2825

@r-zip

I've been working on a large project with multiple datasets. One of these datasets is large (>100 GB). If I simply run dvc pull, it pulls the huge dataset along with everything else, which takes up most of the available disk space on my machine.

The only way around this appears to be passing the name of every data file I want to dvc pull. This is inconvenient, however, because there are many files I do want and only one that I don't.

I see two solutions to this:

  1. Allow named file groups. The user could define groups of files in some sort of config and pull each group by name, e.g., dvc pull mnist. The user could also exclude a group: dvc pull all --exclude mnist.
  2. Allow exclusion of certain files from the command line, e.g., dvc pull --exclude data/mnist.dvc.
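
In the meantime, the second option can be approximated outside of DVC with a small wrapper. The following is only a sketch of the idea, not a DVC feature: the helper names find_dvc_files, build_pull_command, and pull_except are hypothetical, and it assumes every dataset is tracked by its own .dvc file that can be passed to dvc pull as a target.

```python
import fnmatch
import os
import subprocess


def find_dvc_files(root="."):
    """Return sorted, '/'-separated relative paths of all *.dvc files under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into DVC's or Git's internal directories.
        dirnames[:] = [d for d in dirnames if d not in (".dvc", ".git")]
        for name in filenames:
            if name.endswith(".dvc"):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def build_pull_command(targets, exclude=()):
    """Build a `dvc pull` argument list that omits excluded targets.

    `exclude` holds shell-style patterns (fnmatch); note that `*` also
    matches `/` here, so "data/*" excludes everything below data/.
    """
    kept = [
        t for t in targets
        if not any(fnmatch.fnmatch(t, pattern) for pattern in exclude)
    ]
    return ["dvc", "pull", *kept]


def pull_except(exclude, root="."):
    """Run `dvc pull` on every .dvc file under root except the excluded ones."""
    cmd = build_pull_command(find_dvc_files(root), exclude)
    # A bare `dvc pull` would fetch everything, so stop if nothing is left.
    if len(cmd) == 2:
        raise ValueError("all targets were excluded; nothing to pull")
    subprocess.run(cmd, cwd=root, check=True)


# Example (hypothetical path taken from the proposal above):
# pull_except(["data/mnist.dvc"])
```

This only hides the bookkeeping of listing every wanted file by hand; a built-in --exclude flag or named groups would still be the cleaner solution.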

Labels: feature request (Requesting a new feature), p3-nice-to-have (It should be done this or next sprint)