It's a well-known limitation of DVC that two stages (or a stage and a .dvc file, etc.) can't have overlapping output paths (maybe also deps in some cases?). This applies to directories too, of course. For example:
$ dvc add data/
$ dvc run -n clean -d data -o data python cleanup.py data
ERROR...
In the case above, the dependency and the output are the same, perhaps because there are multiple raw data files in data/ and you don't want to use -d for each one. That may even be impossible if a variable number of raw data files comes from a previous, non-deterministic stage.
Similarly, the output may be hundreds of files (or a non-deterministic, variable number of them), so you just want to specify the whole directory.
For some external reason, you may need to avoid splitting the raw and clean data into separate directories. We've had support cases like this, e.g. this one.
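To make the overlap rule concrete, here is a minimal sketch of how two paths could be tested for overlap: they overlap if they are equal or if one is a parent directory of the other. The helper `paths_overlap` is hypothetical and written only for illustration; it is not DVC's actual check.

```python
from pathlib import PurePosixPath


def paths_overlap(a: str, b: str) -> bool:
    """Return True if the paths are equal or one contains the other.

    Hypothetical helper for illustration; not taken from DVC.
    """
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


# The `dvc add data/` output and the `-o data` output are the same path.
print(paths_overlap("data/", "data"))            # True
# A directory also overlaps with anything inside it.
print(paths_overlap("data", "data/raw"))         # True
# Sibling directories do not overlap.
print(paths_overlap("data/raw", "data/clean"))   # False
```

Under this rule, tracking data/ as a whole and also writing outputs anywhere inside data/ always collides, which is why the example above fails.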
Solution: Wildcards? E.g.
$ dvc add data/raw*
$ dvc run -n clean -d data/raw* -o data/**/clean* python cleanup.py data
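As a rough sketch of what wildcard support would need to guarantee, the snippet below resolves the two patterns from the example against a made-up file listing and checks that the matched sets are disjoint. The glob semantics (`*` stays within one path segment, `**/` spans zero or more directories), the helper names, and the file names are all assumptions for illustration; none of this is existing DVC behavior.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple glob into a regex (assumed semantics).

    '**/' matches zero or more directories; '*' matches within one segment.
    A trailing '**' without a slash is not treated specially in this sketch.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def select(paths, pattern):
    """Return the subset of paths that fully match the glob pattern."""
    rx = glob_to_regex(pattern)
    return {p for p in paths if rx.fullmatch(p)}


# Hypothetical contents of data/ after the cleanup stage has run.
paths = [
    "data/raw_1.csv",
    "data/raw_2.csv",
    "data/clean_1.csv",
    "data/2020/clean_2.csv",
]

deps = select(paths, "data/raw*")
outs = select(paths, "data/**/clean*")
overlap = deps & outs

print(sorted(deps))
print(sorted(outs))
print("overlap:", sorted(overlap))
```

With patterns like these, the raw and clean files can share data/ while the dependency and output sets stay disjoint. A real implementation would also have to decide when the patterns are resolved, since the output files do not exist before the stage runs.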