Collecting information from remote cache very slow #2373

@JohanMollevik

Please provide information about your setup
DVC version (i.e. dvc --version), platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB (Linux), RPM (Linux))

Running under Windows 10 WSL as well as Debian stretch, with Azure Blob Storage as the remote.

$ dvc --version
0.54.1

Installed from pip (Python 3).


Operations that involve the remote cache are very slow, even when only a few small files are affected, if the DVC repo as a whole contains many files.

The repo has 247 directories and 2444700 files totaling 132 GB.

Running dvc pull on a DVC file that points to a dataset of 4 files, 1 byte each, took 70 minutes.

jmollevi@LNOR070124:~/Projects/dvctest$ time dvc pull data/test.dvc
Preparing to download data from 'azure://jmollevidvctest/data/'
Preparing to collect status from azure://jmollevidvctest/data/
Collecting information from local cache...
[##############################] 100%

Collecting information from remote cache...
[##############################] 100% Analysing status
[##############################] 100% data/test

Preparing to download data from 'azure://jmollevidvctest/data/'

Preparing to collect status from azure://jmollevidvctest/data/

Collecting information from local cache...
[##############################] 100%

Collecting information from remote cache...
[##############################] 100% Analysing status
[##############################] 100% data/test/c
[##############################] 100% Created unpacked dir
[##############################] 100% Checkout finished!

real 70m2.014s
user 19m34.031s
sys 0m58.359s
jmollevi@LNOR070124:~/Projects/dvctest$ grep -e '' data/test/*
data/test/a:a
data/test/b:b
data/test/c:e
data/test/f:d
jmollevi@LNOR070124:~/Projects/dvctest$

The only changed files were in the data/test folder.
The large folder with 247 directories and 2444700 files was not in the local cache.

For nearly all of that time, dvc was on the steps marked
Collecting information from remote cache...
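
For a rough sense of where the 70 minutes went, the figures from the `time` output above can be broken down as follows. This is only a back-of-the-envelope sketch: the helper name `parse_time` is hypothetical, and dividing the wall-clock time by the total file count assumes the cost scales with the size of the whole repo, which the run above suggests but does not prove.

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` value such as '70m2.014s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the `time dvc pull data/test.dvc` run above.
real = parse_time("70m2.014s")
user = parse_time("19m34.031s")
sys_time = parse_time("0m58.359s")

cpu = user + sys_time          # time the process actually spent on the CPU
cpu_fraction = cpu / real      # share of wall-clock time spent on the CPU
total_files = 2_444_700        # files in the whole repo, not in data/test

print(f"wall clock: {real:.0f} s")
print(f"CPU (user + sys): {cpu:.0f} s ({cpu_fraction:.0%} of wall clock)")
print(f"not on CPU: {real - cpu:.0f} s")
# Assumption: the cost is spread evenly over every file in the repo.
print(f"wall clock per file in the repo: {real / total_files * 1000:.2f} ms")
```

Only about 29% of the wall-clock time is CPU time. The rest would be consistent with waiting on the remote, though the timing alone cannot confirm that.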

Other operations are similarly slow.

This slow performance is a blocker for me for an otherwise very suitable tool. (I am creating a training set for machine learning using semi-automated methods to generate all those files.)
