dvc gc does not remove files under dir.unpacked #2946

@tlouismarie

Description

Setup is dvc 0.75.0 installed from the .deb package under Ubuntu 18.04.
Also tested on Windows 10 with the .exe package. I tried different configurations for the cache type (default, copy, and symlink) and observed similar behavior in each case.

I'm testing DVC and trying to understand how it manages datasets in its cache.
I initialize an empty repository with dvc init and add data with dvc add data in a directory that contains two datasets:

  • data/data.json of size 240M
  • data/data_1.json of size 65M

I then run a script that produces an output dataset: dvc run -f prepare_data.dvc -d src/prepare_data.py -d data -o output python src/prepare_data.py data. It creates output/prepared_data.npy of size 360M.
The .dvc/cache directory now contains the following files (names are simplified):

  • 06/file1 of size 240M at inode 60294316
  • 14/file2 of size 360M at inode 60949288
  • 20/file3.dir at inode 60294287
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289
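The two-character subdirectories in the listing above come from content addressing. A minimal sketch of the naming scheme, assuming (based on the listing, not on DVC documentation) that the cache path is the MD5 of the content split after the first two hex characters, with a .dir suffix for directory listings:

```shell
# Sketch (assumption inferred from the cache listing above): DVC hashes
# the file content with MD5, uses the first two hex characters as the
# subdirectory, and the remaining 30 as the file name.
printf 'hello' > /tmp/sample_data
hash=$(md5sum /tmp/sample_data | cut -d' ' -f1)
echo ".dvc/cache/${hash:0:2}/${hash:2}"
# -> .dvc/cache/5d/41402abc4b2a76b9719d911017c592
```

This is why, for example, data/data.json (240M) shows up under a subdirectory like 06/ rather than under its original name.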

I commit to git and tag it as expe1. I now modify my script and run dvc repro prepare_data.dvc. It produces a new version of output/prepared_data.npy, of size 500M. The .dvc/cache directory now contains the following files:

  • 06/file1 of size 240M at inode 60294316
  • 14/file2 of size 360M at inode 60949288
  • 19/file6 of size 500M at inode 60949290
  • 20/file3.dir at inode 60294287
  • 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
  • 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
  • 22/file7.dir at inode 60294291
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289
  • dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288

I commit and tag it as expe2. I now want to clean the cache to remove the previous outputs from expe1, so I run dvc gc. The .dvc/cache directory now contains the following files:

  • 06/file1 of size 240M at inode 60294316
  • empty directory 14
  • 19/file6 of size 500M at inode 60949290
  • 20/file3.dir at inode 60294287
  • 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
  • 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
  • 22/file7.dir at inode 60294291
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289
  • dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288
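As a hypothetical workaround (there is no dedicated dvc flag for this, as far as I can tell), the auxiliary *.dir.unpacked directories can be deleted by hand; they appear to be an internal optimization that DVC recreates when needed. A sketch on a mock cache mimicking the layout above:

```shell
# Mock cache layout copied from the listing above (demo/ is a scratch
# directory used only for this sketch).
mkdir -p demo/.dvc/cache/dd/file5.dir.unpacked
touch demo/.dvc/cache/dd/file5.dir
touch demo/.dvc/cache/dd/file5.dir.unpacked/prepared_data.npy
cd demo
# Remove only the *.dir.unpacked directories; plain cache files stay.
find .dvc/cache -mindepth 2 -maxdepth 2 -type d -name '*.dir.unpacked' \
    -prune -exec rm -rf {} +
ls .dvc/cache/dd   # file5.dir remains, file5.dir.unpacked is gone
```

Whether this is safe across DVC versions is an open question; it is offered only as a sketch of what "cleaning the cache properly" might look like here.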

Therefore, contrary to what I expect, the previous version of the output file is still present in the cache at dd/file5.dir.unpacked/prepared_data.npy. Is this the expected behavior?
How can I properly clean the cache?
This file seems useless: if I try to check out expe1, it raises an error, and I have to reproduce the experiment anyway:

$ git checkout expe1
$ dvc checkout
ERROR: unexpected error - Checkout failed for the following target:
   output
Did you forget to fetch?
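One more observation: in the listings above, each *.dir.unpacked entry shares an inode with its cache file (e.g. 60949288 for both 14/file2 and dd/file5.dir.unpacked/prepared_data.npy), which suggests the unpacked copies are hardlinks and do not consume extra disk space on their own. A sketch of how to verify that (demo2/, cachefile, and unpacked_copy are illustrative names, not real DVC paths):

```shell
# Sketch: confirm two paths are hardlinks by comparing inode numbers,
# as the matching inodes in the cache listings suggest.
mkdir -p demo2
printf 'payload' > demo2/cachefile
ln demo2/cachefile demo2/unpacked_copy     # hardlink: same inode, link count 2
stat -c '%i %h' demo2/cachefile demo2/unpacked_copy
```

If the inodes match, the leftover is "only" a stale directory entry rather than a duplicated 360M file, but it still keeps the gc-collected content reachable.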

Labels: bug (Did we break something?), p1-important (Important, aka current backlog of things to do), research

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions