
pull: clones repositories for imported files #9738

@peper0

Description

dvc pull clones the repositories that files were imported from, even though the imported files are cached (they have cache: true, implicitly or explicitly).

Reproduce

  1. dvc init
  2. dvc import any file from a different git repository
  3. dvc push
  4. clear the local cache
  5. dvc pull

At step 5, the repository the file was imported from is cloned.
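
The steps above can be sketched as a shell function. This is only a sketch under assumptions that are not in the report: the source repository URL and the file path are passed as arguments, a throwaway local directory serves as the default remote (named repro-remote here, a made-up name), and the local cache is assumed to live at the default .dvc/cache location.

```shell
# Sketch of the reproduction steps; nothing runs until the function is called.
# Assumptions (not from the report): the two arguments, the remote name
# "repro-remote", and the temporary directories created with mktemp.
reproduce_pull_clone() (
    set -e                      # stop at the first failing step (subshell only)
    src_repo="$1"               # URL of any other git repository
    src_path="$2"               # path of a file inside that repository

    workdir="$(mktemp -d)"
    remote_dir="$(mktemp -d)"
    cd "$workdir"

    git init -q .
    dvc init -q                                     # step 1
    dvc remote add -d repro-remote "$remote_dir"    # needed so dvc push has a target
    dvc import "$src_repo" "$src_path"              # step 2
    dvc push                                        # step 3
    rm -rf .dvc/cache                               # step 4: clear the local cache
    dvc pull -v                                     # step 5: -v prints debug logs, which should make a clone easier to spot
)
```

The body runs in a subshell, so calling the function with a repository URL and a file path does not change the caller's working directory.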

Expected

I expect dvc push to upload the data to the remote and dvc pull to download it from the remote, since the data is cached by default. Neither command should need to access the git repository the data was imported from (unless dvc update is called).

This is a big problem, since the git repo may not be accessible when dvc pull is called (e.g. when it is called by a CI server). Moreover, cloning takes a lot of time when data is imported from several repositories and some of them are large.

In my understanding, outputs are synced with the source repository only in dvc update and dvc import, not in dvc pull or dvc repro. Therefore I don't see why the repo would need to be accessible when calling dvc pull.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.58.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.4.0-150-generic-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 0.51.0
        dvc_objects = 0.23.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.0.4
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.4.1)
Config:
        Global: /home/tlakota/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1
Caches: local
Remotes: ssh, ssh
Workspace directory: ext4 on /dev/sdc
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/9d372b24e0a6ee54ffae81f6983b321a


Labels

  - A: data-sync (Related to dvc get/fetch/import/pull/push)
  - p2-medium (Medium priority, should be done, but less important)
