Description
dvc pull clones the repositories from which files were imported, even though the imported files are cached (they have cache: true, implicitly or explicitly).
Reproduce
1. dvc init
2. dvc import any file from a different git repository
3. dvc push
4. clear the local cache
5. dvc pull
At step 5, the repository the file was imported from is cloned.
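The steps above can be sketched as a shell script. This is a minimal sketch, not taken from the report: the source repository URL, the imported path, the remote name, and the local remote directory are hypothetical placeholders, and it assumes the cache is at the default .dvc/cache location. The report itself uses ssh remotes, so a local directory remote is a simplification. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```sh
#!/bin/sh
# Reproduction sketch. All values below are hypothetical placeholders,
# not values from the report.
SOURCE_REPO_URL="${SOURCE_REPO_URL:-https://example.com/org/source-repo.git}"
SOURCE_PATH="${SOURCE_PATH:-data/file.csv}"
REMOTE_DIR="${REMOTE_DIR:-/tmp/dvc-remote}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

repro() {
    run git init
    run dvc init                                        # step 1
    # Assumption: a default remote is needed so push/pull have a target.
    run dvc remote add -d local_remote "$REMOTE_DIR"
    run dvc import "$SOURCE_REPO_URL" "$SOURCE_PATH"    # step 2
    run dvc push                                        # step 3
    run rm -rf .dvc/cache                               # step 4 (default cache location assumed)
    run dvc pull                                        # step 5: the clone is reported here
}

repro
```

Running it with DRY_RUN=0 in an empty directory, with SOURCE_REPO_URL pointing at a real repository, should show whether the last step contacts the source repository.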
Expected
I expect dvc push to upload the data to the remote and dvc pull to download it from the remote, without accessing the git repository it was imported from (unless dvc update is called), since the data is cached by default.
This is a big problem, since the git repo may not be accessible when dvc pull is called (e.g. when it is called by a CI server). Moreover, it takes a lot of time if data is imported from several repositories and some of them are large.
In my understanding, outputs are synced with the source repository only in dvc update and dvc import, not in dvc pull or dvc repro. Therefore I don't see why the repo would need to be accessible when calling dvc pull.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.58.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.4.0-150-generic-x86_64-with-glibc2.31
Subprojects:
dvc_data = 0.51.0
dvc_objects = 0.23.0
dvc_render = 0.5.3
dvc_task = 0.3.0
scmrepo = 1.0.4
Supports:
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
ssh (sshfs = 2023.4.1)
Config:
Global: /home/tlakota/.config/dvc
System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1
Caches: local
Remotes: ssh, ssh
Workspace directory: ext4 on /dev/sdc
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/9d372b24e0a6ee54ffae81f6983b321a