Description
OS: Docker image based on tensorflow/tensorflow:latest-gpu-py3, with Python 3.7.5 as the system Python.
Initial setup:
# python --version
Python 3.7.5
# python -m venv .profiling_venv
# source .profiling_venv/bin/activate
# pip install dvc yappi
...
# dvc --version
0.82.6
DVC cache configuration:
# dvc config cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --local cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --global cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --system cache.dir
/ssd/.../dvc/cache
# cat /etc/xdg/dvc/config
[cache]
dir = /ssd/.../dvc/cache
protected = true
type = "hardlink,symlink"
Please note that the DVC cache is hot: most, if not all, of the files that dvc fetch would retrieve are already present and up to date at /ssd/....
Make a fresh clone and profile dvc pull:
# git clone ssh://... repo
...
# cd repo
# yappi -f callgrind -o dvc_pull.prof -s $(which dvc) pull
...
Uninstrumented, this dvc pull usually takes 40+ minutes, even with a hot DVC cache.
Count the number of DVC-tracked files (symlinks, see the above config) and the total size of the repo:
# find . -type l | wc -l
29003
# du -shL .
403G .
Inspecting dvc_pull.prof in KCachegrind suggests that the bottleneck is checksum computation. The file_md5 and dos2unix functions in utils/__init__.py appear particularly costly.
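To make the suspected cost concrete, here is a minimal sketch of a chunked MD5 routine with optional line-ending normalization. It is not DVC's actual code: the names, the normalize_newlines parameter, and the 1 MiB chunk size are my own assumptions for illustration. The point is that every byte of every file passes through Python-level chunk handling, and through an extra copy when CRLF is rewritten to LF, which adds up over 29003 files and 403G.

```python
import hashlib

# Hypothetical chunk size for this sketch; not taken from DVC.
CHUNK_SIZE = 1024 * 1024


def file_md5(path, normalize_newlines=False, chunk_size=CHUNK_SIZE):
    """Return the hex MD5 of a file, read in fixed-size chunks.

    If normalize_newlines is True, CRLF sequences are rewritten to LF
    before hashing (a dos2unix-style conversion). This is an assumed
    behaviour for illustration, not a copy of DVC's implementation.
    """
    md5 = hashlib.md5()
    carry = b""  # holds a trailing "\r" that may pair with the next chunk's "\n"
    with open(path, "rb") as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            if normalize_newlines:
                chunk = carry + chunk
                carry = b""
                # A CRLF pair can straddle a chunk boundary, so defer a
                # trailing "\r" until the next chunk is available.
                if chunk.endswith(b"\r"):
                    carry = b"\r"
                    chunk = chunk[:-1]
                # This replace allocates a new bytes object per chunk.
                chunk = chunk.replace(b"\r\n", b"\n")
            md5.update(chunk)
    if carry:
        # A lone "\r" at end of file is not part of a CRLF pair.
        md5.update(carry)
    return md5.hexdigest()
```

Timing this function with and without normalize_newlines on a large file from the repo would give a rough idea of how much of the total is the conversion and how much is the hashing loop itself.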
Is this a known issue? Would the primary authors of DVC entertain a more performant version of file_md5 (perhaps written in C/C++ and without tqdm integration)?
