Description
Setup is dvc 0.75.0 with the .deb package under Ubuntu 18.04.
Also tested on Windows 10 with the .exe package. I tried different configurations for the cache type (default, copy and symlink) and got similar behavior each time.
I'm testing DVC and trying to understand how it manages the datasets in its cache.
I initialize an empty repository with dvc init and add data with dvc add data, where data is a directory that contains two datasets:
- data/data.json of size 240M
- data/data_1.json of size 65M
I then run a script that produces an output dataset: dvc run -f prepare_data.dvc -d src/prepare_data.py -d data -o output python src/prepare_data.py data. It creates output/prepared_data.npy of size 360M.
The .dvc/cache directory now contains the following files (names are simplified):
- 06/file1 of size 240M at inode 60294316
- 14/file2 of size 360M at inode 60949288
- 20/file3.dir at inode 60294287
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
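For anyone who wants to reproduce a listing like the one above, here is a minimal sketch of how it could be generated. The helper name list_cache is hypothetical (it is not part of DVC), and the script assumes the default cache location .dvc/cache.

```python
import os


def list_cache(cache_dir):
    """Return sorted (relative_path, size_bytes, inode) tuples for files under cache_dir."""
    entries = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            # lstat reports on the entry itself and does not follow symlinks
            st = os.lstat(path)
            entries.append((os.path.relpath(path, cache_dir), st.st_size, st.st_ino))
    return sorted(entries)


if __name__ == "__main__":
    # Assumption: the script is run from the repository root
    for rel, size, inode in list_cache(".dvc/cache"):
        print(f"{rel} of size {size / 1e6:.0f}M at inode {inode}")
```

Two paths that report the same inode on the same filesystem are hard links to the same data, so they do not take up additional disk space.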
I commit to git and tag it as expe1. I now modify my script and run dvc repro prepare_data.dvc. It produces a new file: output/prepared_data.npy of size 500M. The .dvc/cache directory now contains the following files:
- 06/file1 of size 240M at inode 60294316
- 14/file2 of size 360M at inode 60949288
- 19/file6 of size 500M at inode 60949290
- 20/file3.dir at inode 60294287
- 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
- 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
- 22/file7.dir at inode 60294291
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
- dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288
I commit and tag it as expe2. I now want to clean the cache to remove previous outputs from expe1 and run dvc gc. The .dvc/cache directory now contains the following files:
- 06/file1 of size 240M at inode 60294316
- 14 - empty directory
- 19/file6 of size 500M at inode 60949290
- 20/file3.dir at inode 60294287
- 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
- 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
- 22/file7.dir at inode 60294291
- 59/file4 of size 65M at inode 60294327
- dd/file5.dir at inode 60949289
- dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288
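In this listing, inode 60949288 (the 360M output from expe1) is now reachable only through a path under a .dir.unpacked directory, whereas inode 60294316 still has both 06/file1 and an unpacked path. A rough sketch of how such leftovers could be spotted automatically follows. The helper names group_by_inode and only_in_unpacked are hypothetical and not part of DVC, and matching on the substring ".dir.unpacked" is an assumption based on the names seen above.

```python
import os
from collections import defaultdict


def group_by_inode(cache_dir):
    """Map each inode to the sorted list of relative paths that point at it."""
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, cache_dir)
            groups[os.lstat(path).st_ino].append(rel)
    return {ino: sorted(paths) for ino, paths in groups.items()}


def only_in_unpacked(groups):
    """Keep inodes whose every path inside the cache lies under a *.dir.unpacked directory."""
    return {
        ino: paths
        for ino, paths in groups.items()
        if all(".dir.unpacked" in p for p in paths)
    }


if __name__ == "__main__":
    # Assumption: default cache location, script run from the repository root
    for ino, paths in only_in_unpacked(group_by_inode(".dvc/cache")).items():
        print(ino, paths)
```

Applied to the cache state above, this would be expected to report only dd/file5.dir.unpacked/prepared_data.npy, the file that dvc gc left behind.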
Therefore, contrary to what I expected, the previous version of the output file is still present in the cache at dd/file5.dir.unpacked/prepared_data.npy. Is this the expected behavior?
How can I properly clean the cache?
This file seems to be useless: if I try to check out expe1, dvc checkout raises an error and I have to reproduce the experiment anyway:
$ git checkout expe1
$ dvc checkout
ERROR: unexpected error - Checkout failed for the following target:
output
Did you forget to fetch ?