remote: efficiently collect directories #2648

@ghost

Description

Version: 0.62.1

Description: The current implementation of _collect_dir is an N+1 operation: it walks the directory to list all the files and then, for each one, computes or requests its checksum (get_file_checksum).

https://github.com/iterative/dvc/blob/4171aac0294fd316d51558d2593d10ff006221c2/dvc/remote/base.py#L195-L231

The state saves us from requesting checksums we already have (the N part of the operation).
However, there are remotes like S3 that have an operation to list the objects with their checksums and other stats (list_objects).

Let's discuss whether it makes sense to take advantage of this operation and replace the N+1 pattern (get_file_checksum(file) for file in walk(dir) if not state.get(file)) with a single call that returns the list of files with their metadata already attached.
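To make the comparison concrete, here is a minimal sketch of the two shapes being discussed. It is not the code in dvc/remote/base.py: the function names collect_dir_n_plus_one and collect_dir_bulk, the "relpath"/"checksum" entry keys, the dict-like state, and the (path, checksum) listing format are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Entry = Dict[str, str]


def collect_dir_n_plus_one(
    paths: Iterable[str],
    get_file_checksum: Callable[[str], str],
    state: Dict[str, str],
) -> List[Entry]:
    """Walk the files, then ask for each checksum that the state does not hold."""
    entries = []
    for path in paths:
        checksum = state.get(path)
        if checksum is None:
            # One request per uncached file: this is the "N" in N+1.
            checksum = get_file_checksum(path)
            state[path] = checksum
        entries.append({"relpath": path, "checksum": checksum})
    return sorted(entries, key=lambda e: e["relpath"])


def collect_dir_bulk(
    listing: Iterable[Tuple[str, str]],
    state: Dict[str, str],
) -> List[Entry]:
    """Consume one listing that already pairs each path with its checksum."""
    entries = []
    for path, checksum in listing:
        # No per-file request: the listing call supplied the checksum.
        state[path] = checksum
        entries.append({"relpath": path, "checksum": checksum})
    return sorted(entries, key=lambda e: e["relpath"])
```

With an empty state, the first function makes one checksum request per file, while the second makes none beyond the listing itself. For S3 specifically, the value a listing returns is the ETag, which is not a plain MD5 for multipart uploads, so whether it can stand in for get_file_checksum is part of what needs discussing.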

Related: #1654

Labels: discussion (requires active participation to reach a conclusion)
