Bug Report
UPDATE: Skip to #4428 (comment)
As the title states. I run dvc repro extract for my extract pipeline stage. This takes two pretty large zip files and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.
My dvc.yaml looks like this. The preprocess stage is only there to show that the uncached folders are meant to be used as dependencies in later stages.
stages:
  extract:
    cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
    deps:
      - data/thingy10k/10k_surface.tar.gz
      - data/thingy10k/10k_tetmesh.tar.gz
    outs:
      - data/thingy10k/10k_surface:
          cache: false
      - data/thingy10k/10k_tetmesh:
          cache: false
  preprocess:
    cmd: some preprocessing command
    deps:
      - data/thingy10k/10k_tetmesh
      - data/thingy10k/10k_surface

After running dvc repro extract, the hashes of all the files are computed and then the files are saved to the cache. This is exactly what I was trying to prevent with the cache: false option.
I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache, which went up by exactly the size of the two folders after running the command.
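For anyone who wants to repeat this check with exact byte counts rather than rounded du output, here is a minimal Python sketch. The helper names dir_size_bytes and cache_growth are made up for illustration and are not part of DVC; the sketch only assumes that the cache lives in .dvc/cache, as in the du command above.

```python
import os
import subprocess


def dir_size_bytes(path):
    """Sum the apparent sizes of all regular files under path (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so a link is not counted as the file it points to.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def cache_growth(command, cache_dir=".dvc/cache"):
    """Run command (a list of arguments) and return how many bytes cache_dir grew by.

    cache_dir defaults to the location used in this report; adjust it if the
    cache has been moved elsewhere.
    """
    before = dir_size_bytes(cache_dir)
    subprocess.run(command, check=True)  # raises CalledProcessError if the command fails
    return dir_size_bytes(cache_dir) - before
```

Run from the repository root, cache_growth(["dvc", "repro", "extract"]) would return the number of bytes added to the cache by the stage. If cache: false worked as I expected, I would expect that number to be zero for this stage.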
Interestingly, running dvc gc -a frees the cache again. Running dvc push (without first running dvc gc -a) to push the cache to my remote storage also reports that everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.
I have reproduced this in a local git and dvc repo by first adding the two archives with dvc add and then running the extract stage shown above. The archives can be downloaded from https://github.com/Yixin-Hu/TetWild#dataset under Output.
Please provide information about your setup
Output of dvc version:
$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git

Additional Information (if any):
If applicable, please also provide a --verbose output of the command, eg: dvc add --verbose.