Bug Report
Hi, first of all, let me congratulate you all for the good work on dvc -- it's a great tool. Alas, I'm running into some issues since 1.x.
Please provide information about your setup
Output of dvc version:
The issue occurs with 1.11.1, 1.11.10, 1.0.0b4 (I think), and dvc installed from git master on July 20 (1.1.11+b77ce0).
The issue does not occur with 0.94.1, and other earlier versions I had installed.
Other setup info: Ubuntu 18.04, dvc installed in a venv, Python 3.6.9. *Edit: the remote is S3.
Additional Information (if any):
My dvc repo has a layout like this:
root
|    .gitignore
|    .dvcignore
|----subdir1
|    |---- file1.jpg
|    |---- file1.xml
|    |---- file2.jpg
|    \---- file2.xml
|----subdir1.dvc
|----subdir2
|    similar to subdir1
# etc
In other words, each subdirectory has its own subdir.dvc file. Inside each subdirectory, every JPG file has a "sidecar" xml file with annotation metadata. Importantly, I am using DVC only for the JPGs and regular git for the xml files. (Before this was a dvc repo, it was a git-annex repo, which has explicit support for this via largefiles.) I configure this by having *.jpg in .gitignore and *.xml in .dvcignore. Until 1.x, this worked as expected, with each tool minding only its own file types.
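To make the intended split concrete, here is a minimal Python sketch of which tool I expect to handle each file. It only mimics the two ignore patterns with `fnmatch`. It is not how git or DVC actually evaluate their ignore files, and the names `GITIGNORE_PATTERNS`, `DVCIGNORE_PATTERNS`, and `expected_owner` are made up for illustration.

```python
from fnmatch import fnmatch

# Patterns as described above: *.jpg is in .gitignore, *.xml is in .dvcignore.
# (Hypothetical constants for illustration; the real files live in the repo root.)
GITIGNORE_PATTERNS = ["*.jpg"]
DVCIGNORE_PATTERNS = ["*.xml"]


def expected_owner(filename):
    """Return which tool is expected to track `filename` under this layout."""
    ignored_by_git = any(fnmatch(filename, p) for p in GITIGNORE_PATTERNS)
    ignored_by_dvc = any(fnmatch(filename, p) for p in DVCIGNORE_PATTERNS)
    if ignored_by_git and not ignored_by_dvc:
        return "dvc"  # git ignores it, so only DVC should care about it
    if ignored_by_dvc and not ignored_by_git:
        return "git"  # DVC ignores it, so only git should care about it
    return "unclear"  # matched by both or by neither pattern list


if __name__ == "__main__":
    for name in ["file1.jpg", "file1.xml", "file2.jpg", "file2.xml"]:
        print(name, "->", expected_owner(name))
```

Under this model, the xml files should never show up in anything DVC does with subdir1.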
However, in the various 1.x versions I tried (including yesterday's master, in the hope that this was the same as Issue #4197), dvc status reports that the checksum of almost every subdirectory has changed, and whenever I run dvc pull it asks whether I want to delete the xml files (it stops after I deny the request). When I go back to 0.94.1, everything works like it used to.
I did a quick check with dvc pull --verbose on each version. In 0.94.1, the output never mentions the xml files. On the 1.x versions, it does, e.g. Path '/mnt/.../file1.xml' inode '624822, followed by fetched: [], which I guess means it is not really ignoring the xml files. Since this worked before 1.x, it seems like a bug, or at least unexpected behavior, to me.
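For anyone who wants to repeat the check, this is roughly how the verbose output could be scanned for paths that should have been ignored. It is only a sketch: the regular expression is based on the single log line quoted above, so the exact log format is an assumption, and `ignored_paths_mentioned` is a hypothetical helper, not part of DVC.

```python
import re

# Matches a single-quoted path after the word "Path", as in the line quoted above.
# The log format is assumed from that one line and may differ between versions.
PATH_RE = re.compile(r"Path '([^']+)'")


def ignored_paths_mentioned(log_lines, suffix=".xml"):
    """Return every logged path ending in `suffix`, in order of appearance."""
    hits = []
    for line in log_lines:
        for path in PATH_RE.findall(line):
            if path.endswith(suffix):
                hits.append(path)
    return hits


if __name__ == "__main__":
    # Sample lines taken from the verbose output quoted in this report.
    sample = [
        "Path '/mnt/.../file1.xml' inode '624822",
        "fetched: []",
    ]
    print(ignored_paths_mentioned(sample))
```

An empty result is what I would expect if .dvcignore were honored, which matches what I see on 0.94.1.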
(Note: while the mixed git/dvc layout worked fine before 1.x, it seems like DVC does not like this setup, considering the way it automatically creates .gitignore files that ignore the entire contents of any directory added to DVC. I wish that behavior could be disabled.)
(Note 2: I also tried simply using a dvc file for each JPG, and while this works, it makes things very slow - dvc status goes from seconds to minutes. So it has its own issues, but I'm glad DVC can optimize for directories ;)
(Note 3: While I would prefer not to have directories with mixed git/dvc content, other tools we use assume the sidecar layout. Adding the xml files to each subdirectory's .dvc file would have its own issues, since the xmls change frequently. I even tried separate directories and simulating the sidecar layout with symlinks and union mounts, but both have issues on macOS, which we need to work with.)