DVC Repro incorrectly saving directory to cache #4428

@digitalillusions

Bug Report

UPDATE: Skip to #4428 (comment)

As the title states: I run dvc repro extract for my extract pipeline stage. The stage takes two fairly large tar.gz archives and extracts them into the specified folders. These folders should not be added to the DVC cache, since they can easily be reproduced by extracting the archives, but I still declare them in the pipeline (as uncached outputs here, and as dependencies of later stages) so that the DAG looks nicer.

My dvc.yaml looks like this. The preprocess stage is only there to show that the uncached folders are used as dependencies in later stages.

stages:
  extract:
    cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
    deps:
    - data/thingy10k/10k_surface.tar.gz
    - data/thingy10k/10k_tetmesh.tar.gz
    outs:
    - data/thingy10k/10k_surface:
        cache: false
    - data/thingy10k/10k_tetmesh:
        cache: false
  preprocess:
    cmd: some preprocessing command
    deps:
    - data/thingy10k/10k_tetmesh
    - data/thingy10k/10k_surface
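
To double-check which outputs are actually declared with cache: false, the parsed pipeline can be walked with a few lines of Python. This is only a sketch: the function name uncached_outs is hypothetical, and the dictionary below is a hand-written stand-in for what a YAML loader would return for the dvc.yaml above (the cmd strings are omitted because they do not matter here).

```python
from typing import Dict, List


def uncached_outs(dvc_yaml: dict) -> Dict[str, List[str]]:
    """Map each stage name to the outputs declared with cache: false."""
    result: Dict[str, List[str]] = {}
    for name, stage in dvc_yaml.get("stages", {}).items():
        paths: List[str] = []
        for out in stage.get("outs", []):
            # An output is either a plain path string or a one-key mapping
            # of path -> options, as in the extract stage above.
            if isinstance(out, dict):
                for path, opts in out.items():
                    if (opts or {}).get("cache", True) is False:
                        paths.append(path)
        result[name] = paths
    return result


# Hand-written equivalent of the parsed dvc.yaml (assumption: cmd omitted).
pipeline = {
    "stages": {
        "extract": {
            "deps": [
                "data/thingy10k/10k_surface.tar.gz",
                "data/thingy10k/10k_tetmesh.tar.gz",
            ],
            "outs": [
                {"data/thingy10k/10k_surface": {"cache": False}},
                {"data/thingy10k/10k_tetmesh": {"cache": False}},
            ],
        },
        "preprocess": {
            "deps": [
                "data/thingy10k/10k_tetmesh",
                "data/thingy10k/10k_surface",
            ],
        },
    }
}

print(uncached_outs(pipeline))
```

Both extracted folders should show up under the extract stage, which confirms that the configuration itself asks for them not to be cached.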

After running dvc repro extract, the hashes of all the files are computed and the files are then saved to the cache. This is exactly what I was trying to prevent with the cache: false option.
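
One way to see whether individual extracted files really ended up in the cache is to hash them and look for a matching cache entry. The sketch below assumes that the cache stores each file under its MD5 hex digest, split as first two characters / remaining characters, and that the file content is hashed unmodified; both are assumptions about the cache layout rather than something this report verifies. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_copies(out_dir: Path, cache_dir: Path) -> List[Path]:
    """Return the files under out_dir that have a matching entry in cache_dir."""
    found: List[Path] = []
    for path in sorted(out_dir.rglob("*")):
        if path.is_file():
            md5 = md5_of_file(path)
            # Assumed layout: <cache>/<first two hex chars>/<remaining chars>
            if (cache_dir / md5[:2] / md5[2:]).exists():
                found.append(path)
    return found


# Example call (paths taken from the dvc.yaml above):
# cached_copies(Path("data/thingy10k/10k_tetmesh"), Path(".dvc/cache"))
```

If the returned list is non-empty right after dvc repro extract, files from an output marked cache: false have been copied or linked into the cache.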

I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache, which went up by exactly the size of the two folders after running the command.
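
The same before/after comparison can be scripted. The following is a rough Python stand-in for the du check, with a hypothetical helper name. It sums apparent file sizes, so its numbers will not match du exactly (du reports disk usage and counts hard-linked files only once per invocation).

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


# Intended use (not executed here):
#   before = dir_size_bytes(".dvc/cache")
#   ... run `dvc repro extract` ...
#   growth = dir_size_bytes(".dvc/cache") - before
#   expected = (dir_size_bytes("data/thingy10k/10k_surface")
#               + dir_size_bytes("data/thingy10k/10k_tetmesh"))
```

A growth value close to the combined size of the two extracted folders matches the observation described above.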

Interestingly, after running dvc gc -a the cache is freed again. Running dvc push (without first running dvc gc -a) to push the cache to my remote storage also reports that everything is up to date, which leads me to believe that DVC recognizes that these directories should in fact not be cached.

I have reproduced this in a local Git and DVC repo by first adding the two archives using dvc add and then running the above-mentioned extract stage. The archives can be downloaded from https://github.com/Yixin-Hu/TetWild#dataset under Output.

Please provide information about your setup

Output of dvc version:

$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git

Additional Information (if any):

If applicable, please also provide a --verbose output of the command, eg: dvc add --verbose.

Labels: feature request (requesting a new feature), p2-medium (medium priority, should be done, but less important)
