Closed
Labels
enhancement (Enhances DVC), feature request (Requesting a new feature), performance (improvement over resource / time consuming tasks), question (I have a question?), research
Description
As we discussed, making DVC fast should be a high priority, since poor performance can easily drive people away. A big part of today's slowness comes from working with remotes, which almost always involves collecting file statuses, and that can be slow for bigger remotes. All this points to some form of index.
However, our remotes don't offer the luxury of atomic group writes, atomic group reads, or read-modify-write operations. We can still use the following strategy:
- make an index dir on remote, say index,
- write a list of files (with names, checksums, mtimes and/or file sizes) to an index file in that dir, say 1.idx,
- later, when some client needs to update that, e.g. after pushing some new files, it:
  - reads current index,
  - updates it,
  - writes new list to 2.<uuid>.idx,
  - removes 1.idx.
This way, in case of a race, we will end up with several index files.
- if a client needs to read an index, it downloads all index files and combines them.
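To make the steps above concrete, here is a minimal sketch of the read/update cycle. It is only an illustration under stated assumptions: the remote is faked with an in-memory dict mapping object names to contents, the helper names (`list_index_files`, `read_index`, `update_index`) are hypothetical and not part of DVC, and combining is a plain union that ignores deletes.

```python
import json
import uuid

INDEX_DIR = "index"  # directory name suggested above; purely illustrative


def list_index_files(remote):
    """Return the names of all index files currently on the (fake) remote."""
    prefix = INDEX_DIR + "/"
    return sorted(k for k in remote if k.startswith(prefix) and k.endswith(".idx"))


def read_index(remote):
    """Download every index file and combine them (plain union, adds only)."""
    names = list_index_files(remote)
    combined = {}
    for name in names:
        combined.update(json.loads(remote[name]))
    return combined, names


def update_index(remote, new_entries):
    """Read the current index, update it, write a new file, remove the old ones."""
    combined, old_names = read_index(remote)
    combined.update(new_entries)
    # The generation number is the leading part of the file name, e.g. "1" in "1.idx".
    generation = 1 + max(
        (int(n.split("/")[1].split(".")[0]) for n in old_names), default=0
    )
    new_name = f"{INDEX_DIR}/{generation}.{uuid.uuid4().hex}.idx"
    remote[new_name] = json.dumps(combined)
    # Only remove the files this client actually read; files written concurrently
    # by other clients stay in place and get picked up by the next reader.
    for name in old_names:
        remote.pop(name, None)
    return new_name
```

If two clients both start from 1.idx and run `update_index` concurrently, each writes its own 2.<uuid>.idx, so no update is lost and a later `read_index` sees both files.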
Since we will have not only adds but also deletes, we will need a smart combine procedure, like in CRDTs.
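One possible shape for such a combine is a last-writer-wins merge with tombstones, sketched below. This is an assumption for discussion, not a settled design: the record fields (`ts`, `deleted`, `checksum`) and the function names are hypothetical, and relying on timestamps presumes client clocks are reasonably in sync.

```python
def _order_key(record):
    """Total order on records: newer wins; on ties, a delete wins over an add."""
    return (record["ts"], record["deleted"], record.get("checksum") or "")


def merge_indexes(*indexes):
    """Combine index dicts; the result does not depend on argument order."""
    merged = {}
    for index in indexes:
        for path, record in index.items():
            current = merged.get(path)
            if current is None or _order_key(record) > _order_key(current):
                merged[path] = record
    return merged


def live_files(index):
    """Drop tombstones and return path -> checksum for files that still exist."""
    return {p: r["checksum"] for p, r in index.items() if not r["deleted"]}


# Client A added two files; client B later deleted one of them.
a = {
    "data/x": {"ts": 1, "deleted": False, "checksum": "aaa"},
    "data/y": {"ts": 1, "deleted": False, "checksum": "bbb"},
}
b = {"data/x": {"ts": 2, "deleted": True, "checksum": None}}
merged = merge_indexes(a, b)
```

Because the merge picks the maximum record per path under a total order, it is commutative, associative, and idempotent, which is what lets clients combine index files in any order.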
The file format is to be discussed, though simple JSON or gzipped JSON with a list of files may do the job.
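As a starting point for that discussion, a gzipped JSON index could look roughly like the sketch below. The layout (a top-level `files` list with `name`, `checksum`, `mtime`, and `size` fields) and the function names are assumptions made for illustration only.

```python
import gzip
import json


def dump_index(entries):
    """Serialize a list of file records to gzipped JSON bytes."""
    payload = json.dumps({"files": entries}, sort_keys=True).encode("utf-8")
    return gzip.compress(payload)


def load_index(blob):
    """Parse gzipped JSON bytes back into the list of file records."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))["files"]


entries = [
    {"name": "data/x", "checksum": "aaa", "mtime": 1570000000, "size": 1024},
    {"name": "data/y", "checksum": "bbb", "mtime": 1570000100, "size": 2048},
]
blob = dump_index(entries)
restored = load_index(blob)
```

Plain JSON keeps the file easy to inspect and debug; gzip mainly helps when the list of files grows large, since paths and checksums repeat common prefixes and compress reasonably well.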
What do you guys think? @shcheklein @dmpetrov @efiop @pared @MrOutis