perf: remote indexes #2147

@Suor

Description

As we discussed, making dvc fast should be a high priority, since poor performance can easily drive people away. A big part of today's slowness is working with remotes, which almost always involves collecting file statuses, and that can be slow for bigger remotes. All of this points to some form of index.

However, our remotes don't offer the luxury of atomic group writes, reads, or read-modify-write operations. We can still use the following strategy:

  • make an index dir on the remote, say index,
  • write a list of files (with names, checksums, mtimes and/or file sizes) to an index file in that dir, say 1.idx,
  • later, when some client needs to update it, e.g. after pushing some new files, it:
    • reads the current index,
    • updates it,
    • writes the new list to 2.<uuid>.idx,
    • removes 1.idx.
      This way, in case of a race, we simply end up with several index files.
  • if a client needs to read the index, it downloads all index files and combines them (see the sketch after this list).
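To make the flow concrete, here is a minimal sketch of the read/update cycle, assuming a hypothetical remote object with list/download/upload/remove methods (not the real dvc remote API) and plain-JSON index files for now:

```python
import json
import uuid


def read_index(remote, index_dir="index"):
    """Download every index file and naively combine them into one dict keyed by name."""
    combined = {}
    for name in remote.list(index_dir):
        for entry in json.loads(remote.download(f"{index_dir}/{name}")):
            combined[entry["name"]] = entry  # naive combine; smarter merge below
    return combined


def update_index(remote, new_entries, index_dir="index"):
    """Read-modify-write: write a new uuid-suffixed index file, then drop the old ones."""
    old_names = list(remote.list(index_dir))
    generation = max((int(n.split(".", 1)[0]) for n in old_names), default=0) + 1
    combined = read_index(remote, index_dir)
    for entry in new_entries:
        combined[entry["name"]] = entry
    remote.upload(
        f"{index_dir}/{generation}.{uuid.uuid4().hex}.idx",
        json.dumps(list(combined.values())),
    )
    # Deleting the old files last means a crash or a race only leaves extra
    # index files behind, which readers combine anyway.
    for name in old_names:
        remote.remove(f"{index_dir}/{name}")
```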

Since we will not only have adds, but also deletes, we will need a smart combine procedure, like in CRDTs.
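One possible combine rule, sketched below, is last-writer-wins on mtime with explicit delete markers (tombstones), so a delete recorded in one index file is not resurrected by an older add from another. The field names here (name, mtime, deleted) are assumptions, not a fixed schema:

```python
def combine(index_files):
    """Merge the contents of several index files; each is a list of entry dicts."""
    merged = {}
    for entries in index_files:
        for entry in entries:
            current = merged.get(entry["name"])
            # Last-writer-wins: keep the entry with the newest mtime, whether
            # it is a real file record or a delete tombstone.
            if current is None or entry["mtime"] > current["mtime"]:
                merged[entry["name"]] = entry
    # Tombstones stay in the merged index so they keep propagating; a later
    # compaction pass could drop tombstones older than some horizon.
    return list(merged.values())


def live_files(merged):
    """Files that actually exist on the remote, after applying deletes."""
    return [e for e in merged if not e.get("deleted")]
```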

The file format is still to be discussed; simple JSON or gzipped JSON with a list of files may do the job, though.
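For illustration, a gzipped-JSON index file could be as simple as the following; the record fields are just an example, not a decided format:

```python
import gzip
import json


def dump_idx(path, entries):
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(entries, f)


def load_idx(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


dump_idx("1.idx", [
    # dummy checksum/size/mtime values, purely illustrative
    {"name": "data/file.csv", "checksum": "d41d8cd98f00b204e9800998ecf8427e",
     "size": 1024, "mtime": 1561000000},
])
print(load_idx("1.idx"))
```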

What do you guys think? @shcheklein @dmpetrov @efiop @pared @MrOutis
