pull: pulling a single file is really slow when there are hundreds of other .dvc files #8768

@Marigold

Bug Report

Description

We have about a thousand small files tracked in DVC. We're using the Python API, though the CLI has the same issue. We often need to add or pull a single new file, so we use something like:

from dvc.repo import Repo
repo = Repo("repo_root")
repo.pull("my_file.csv.dvc")

This takes almost 10 seconds because DVC internally loads all stages before pulling that single file. I'd expect it to be almost instant. Why does it have to go through all the other .dvc files? (My .dvcignore already ignores as many files as possible, but the bottleneck is loading the .dvc files anyway.)
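To check where the time goes, the call can be run under Python's built-in profiler. The following is only a minimal sketch: `profile_call` is a hypothetical helper name, not part of DVC, and it assumes nothing about DVC internals beyond the `repo.pull(...)` call shown above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is a text table of the
    `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Usage against the repo from this report (requires dvc to be installed):
#
#     from dvc.repo import Repo
#     repo = Repo("repo_root")
#     _, report = profile_call(repo.pull, "my_file.csv.dvc")
#     print(report)
```

If stage loading is the bottleneck, the functions that parse .dvc files should dominate the cumulative-time column of the report.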

Thanks!

Environment information

Output of dvc doctor:

DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.14 on macOS-12.5-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.28.4
	dvc_objects = 0.14.0
	dvc_render = 0.0.15
	dvc_task = 0.1.8
	dvclive = 1.2.2
	scmrepo = 0.1.4
Supports:
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3, https, s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Additional Information (if any):

Labels

bug (Did we break something?), performance (improvement over resource / time consuming tasks)
