dvc: wildcard outputs? #4254

@jorgeorpinel

It's a well-known limitation of DVC that two stages (or a stage and a .dvc file, etc.) can't have overlapping output paths (maybe also deps in some cases?). This applies to directories too, of course. For example:

$ dvc add data/
$ dvc run -n clean -d data -o data python cleanup.py data
ERROR...

In the case above, the dependency and the output are the same, because maybe there are multiple raw data files in data/ and you don't want to use -d for each one. It may even be impossible if a previous, non-deterministic stage produces a variable number of raw data files.
Similarly, the output may be hundreds of files (or a non-deterministic, variable number), so you just want to specify the whole directory.
For some external reason, maybe you need to avoid splitting the raw and clean data into separate directories. We've had support cases like this, e.g. this one.
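
To make "overlapping paths" in the example above concrete, here is a rough sketch of the kind of check involved. It is only an illustration, not DVC's actual implementation, and `paths_overlap` is a hypothetical helper name. It treats two paths as overlapping when they are equal or when one contains the other.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): succeeds when two paths overlap,
# i.e. they are the same path or one is a parent directory of the other.
paths_overlap() {
  a=${1%/}   # drop a trailing slash so "data/" and "data" compare equal
  b=${2%/}
  case "$a/" in "$b/"*) return 0 ;; esac   # a is b, or a is inside b
  case "$b/" in "$a/"*) return 0 ;; esac   # b is inside a
  return 1
}

# "dvc add data/" tracks data, and the stage also declares "-o data".
if paths_overlap data/ data; then
  echo "data/ and data overlap"
fi

# Separate directories would not collide.
if ! paths_overlap data/raw data/clean; then
  echo "data/raw and data/clean do not overlap"
fi
```

Under this definition, any output declared as a whole directory claims everything below it, which is why the raw and clean files cannot both live in data/ today.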

Solution: Wildcards? E.g.

$ dvc add data/raw*
$ dvc run -n clean -d data/raw* -o data/**/clean* python cleanup.py data
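
As a rough illustration of what the proposed patterns would buy us, the sketch below builds a throwaway data/ directory and uses `find` to approximate the two patterns. The file names are made up, and it assumes `data/raw*` matches only the top level of data/ while `data/**/clean*` matches at any depth (including data/ itself). This is not something DVC does today.

```shell
#!/bin/sh
# Hypothetical layout: raw and cleaned files share the data/ directory.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p data/batch1
touch data/raw_a.csv data/raw_b.csv data/clean_a.csv data/batch1/clean_b.csv

# Files the dependency pattern data/raw* would cover (top level of data/ only).
deps=$(find data -maxdepth 1 -type f -name 'raw*' | sort)

# Files the output pattern data/**/clean* would cover (any depth under data/).
outs=$(find data -type f -name 'clean*' | sort)

# A path listed by both patterns would be an overlap.
overlap=$(printf '%s\n%s\n' "$deps" "$outs" | sort | uniq -d)

echo "deps:"; echo "$deps"
echo "outs:"; echo "$outs"
echo "overlap: ${overlap:-none}"
```

With patterns like these, the tracked raw files and the stage outputs would form disjoint sets even though they share a parent directory.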

Labels: awaiting response, discussion, feature request
