Pull extremely slow on ~400GB of data with hot DVC cache #3261

@kevin-hanselman

OS: Docker image based on tensorflow/tensorflow:latest-gpu-py3, with Python 3.7.5 as the system Python.

Initial setup:

# python --version
Python 3.7.5
# python -m venv .profiling_venv
# source .profiling_venv/bin/activate
# pip install dvc yappi
...
# dvc --version
0.82.6

DVC cache configuration:

# dvc config cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --local cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --global cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --system cache.dir
/ssd/.../dvc/cache
# cat /etc/xdg/dvc/config
[cache]
dir = /ssd/.../dvc/cache
protected = true
type = "hardlink,symlink"

Please note that the DVC cache is hot. In other words, the cache at /ssd/... already holds up-to-date copies of most, if not all, of the files that dvc fetch needs.

Make a fresh clone and profile dvc pull:

# git clone ssh://... repo
...
# cd repo
# yappi -f callgrind -o dvc_pull.prof -s $(which dvc) pull
...

This dvc pull, uninstrumented, usually takes 40+ minutes with a hot DVC cache.

Count the number of DVC-tracked files (symlinks, see the above config) and the total size of the repo:

# find . -type l | wc -l
29003
# du -shL .
403G    .

Viewing dvc_pull.prof in KCachegrind suggests that the bottleneck is checksum computation. The file_md5 and dos2unix functions in utils/__init__.py appear particularly costly.

Is this a known issue? Would the primary authors of DVC entertain a more performant version of file_md5 (perhaps written in C/C++ and without TQDM integration)?
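To make the suspected cost concrete, here is a minimal sketch of the two kinds of checksum being compared. It is not DVC's actual code. It assumes, based only on the function name dos2unix, that the slow path converts CRLF line endings to LF in each chunk before hashing. The names md5_raw, md5_normalized, and CHUNK_SIZE are hypothetical and chosen for illustration.

```python
import hashlib

# Hypothetical chunk size for illustration; not taken from DVC.
CHUNK_SIZE = 1024 * 1024


def md5_raw(path, chunk_size=CHUNK_SIZE):
    """MD5 of the file's bytes exactly as stored, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_normalized(path, chunk_size=CHUNK_SIZE):
    """MD5 of the file with CRLF converted to LF (assumed dos2unix behavior).

    Each chunk is rewritten in Python before hashing, which adds an extra
    pass over the data compared with md5_raw.
    """
    digest = hashlib.md5()
    pending_cr = b""  # a trailing CR held back in case LF starts the next chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = pending_cr + chunk
            if data.endswith(b"\r"):
                data, pending_cr = data[:-1], b"\r"
            else:
                pending_cr = b""
            digest.update(data.replace(b"\r\n", b"\n"))
    digest.update(pending_cr)  # flush a lone CR at end of file, if any
    return digest.hexdigest()
```

Timing both functions over a sample of the tracked files would show how much of the checksum time comes from the line-ending conversion and how much from the hashing itself.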

dvc_pull_prof.zip

Screenshot_2020-01-30_14-02-20

Labels: bug, p1-important, research
