Pull extremely slow on ~400GB of data with hot DVC cache #3261

OS: Docker image based on `tensorflow/tensorflow:latest-gpu-py3`, with Python 3.7.5 as the system Python.

Initial setup:
```
# python --version
Python 3.7.5
# python -m venv .profiling_venv
# source .profiling_venv/bin/activate
# pip install dvc yappi
...
# dvc --version
0.82.6
```

DVC cache configuration:
```
# dvc config cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --local cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --global cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --system cache.dir
/ssd/.../dvc/cache
# cat /etc/xdg/dvc/config
[cache]
dir = /ssd/.../dvc/cache
protected = true
type = "hardlink,symlink"
```
Please note that the DVC cache is hot. In other words, most if not all files for `dvc fetch` are present and up-to-date at `/ssd/...`.

Make a fresh clone and profile `dvc pull`:
```
# git clone ssh://... repo
...
# cd repo
# yappi -f callgrind -o dvc_pull.prof -s $(which dvc) pull
...
```
**This `dvc pull`, uninstrumented, usually takes 40+ minutes with a hot DVC cache.**

Count the number of DVC-tracked files (symlinks, see the above config) and the total size of the repo:
```
# find . -type l | wc -l
29003
# du -shL .
403G    .
```
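As a rough sanity check on the numbers above, here is a back-of-the-envelope calculation of the throughput implied if every byte were re-read and hashed during the pull. It assumes that `du -h` reports binary units (GiB) and that the run takes exactly 40 minutes; both are simplifications, and the variable names are only illustrative.

```python
# Back-of-the-envelope estimate; the inputs are taken from the report above.
# Assumptions (not measured): 403G means 403 GiB, and the pull takes 40 minutes.
total_bytes = 403 * 1024**3
elapsed_seconds = 40 * 60

# Implied sustained rate if all data were read and checksummed once.
bytes_per_second = total_bytes / elapsed_seconds
mib_per_second = bytes_per_second / 1024**2

print(f"implied throughput: {mib_per_second:.1f} MiB/s")
```

If the resulting rate is close to what a single core can sustain for read-plus-MD5 on this machine, that would be consistent with the whole data set being re-hashed despite the hot cache.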

Inspecting `dvc_pull.prof` in KCachegrind suggests that the bottleneck is checksum computation. The `file_md5` and `dos2unix` functions in `utils/__init__.py` appear particularly costly.
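To make the suspected hot path concrete, here is a minimal sketch of a chunked MD5 with optional CRLF-to-LF normalization. It is not copied from DVC: `file_md5_sketch`, `CHUNK_SIZE`, and the `normalize_newlines` flag are hypothetical names, and the real `file_md5` may differ in how it detects text files, reports progress, and handles chunk boundaries.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size, not DVC's actual value


def dos2unix(data: bytes) -> bytes:
    """Replace CRLF line endings with LF."""
    return data.replace(b"\r\n", b"\n")


def file_md5_sketch(path, normalize_newlines=False, chunk_size=CHUNK_SIZE):
    """Return the hex MD5 of a file, read in fixed-size chunks.

    With normalize_newlines=True, CRLF pairs are converted to LF before
    hashing, including pairs that straddle a chunk boundary.
    """
    digest = hashlib.md5()
    carry = b""  # a trailing "\r" held back from the previous chunk
    with open(path, "rb") as fobj:
        while True:
            data = fobj.read(chunk_size)
            if not data:
                break
            if normalize_newlines:
                data = carry + data
                # Hold back a trailing "\r": the next chunk may start with "\n".
                if data.endswith(b"\r"):
                    data, carry = data[:-1], b"\r"
                else:
                    carry = b""
                data = dos2unix(data)
            digest.update(data)
    # A lone "\r" at the very end of the file is hashed unchanged.
    digest.update(carry)
    return digest.hexdigest()
```

Every byte passes through Python at least once here, and through an extra `bytes.replace` copy when normalization is on, so the cost scales with total data size rather than with the number of files.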

Is this a known issue? Would the primary authors of DVC entertain a more performant version of `file_md5` (perhaps written in C/C++ and without TQDM integration)?

[dvc_pull_prof.zip](https://github.com/iterative/dvc/files/4135825/dvc_pull_prof.zip)

![Screenshot_2020-01-30_14-02-20](https://user-images.githubusercontent.com/1828432/73481187-3401ad00-4393-11ea-8041-cb051c713be9.png)

