Update: There are now 4 different dedup methods. See #8 (comment) for initial details.
Looking for ideas about a possible deduplication feature.
-
One simple idea, which I've already tried manually with `find | uniq | ln`, is to find the duplicate hashes in the manifests and link their files together on the destination, saving disk space. This unsophisticated approach has low CPU and I/O overhead, but the prospective space savings are lackluster.
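For illustration, here is a rough Python sketch of that manifest-linking idea. It assumes each manifest line has the form `<hash> <filename>` and that all chunk files live under a single directory; the function name `link_duplicates` and the `.deduptmp` suffix are made up for this example and are not taken from the tool itself.

```python
import os
from collections import defaultdict

def link_duplicates(manifest_lines, root):
    """Hard-link files that share a manifest hash; return how many were linked.

    Assumes (hypothetically) that each manifest line is "<hash> <filename>".
    """
    by_hash = defaultdict(list)
    for line in manifest_lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        by_hash[digest].append(name.strip())

    linked = 0
    for names in by_hash.values():
        keeper = os.path.join(root, names[0])
        for name in names[1:]:
            dup = os.path.join(root, name)
            if os.path.samefile(keeper, dup):
                continue  # already the same inode
            tmp = dup + ".deduptmp"  # hypothetical temp name
            os.link(keeper, tmp)     # create the new link first...
            os.replace(tmp, dup)     # ...then swap it over the duplicate
            linked += 1
    return linked
```

Linking to a temporary name and then renaming means the duplicate is never missing from the archive, even if the process is interrupted partway through.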
-
It is also possible to retain and correlate more detailed metadata from `thin_dump`, which indicates where source volumes share blocks. However, I've seen complaints about the CPU power needed to process these large XML files (which is not to say they couldn't be cleverly pre-processed to reduce overhead).
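As a sketch of how the shared-block information could be pulled out without loading the whole file, the following streams the XML with `iterparse`. It assumes the usual `thin_dump` layout of `device` elements containing `single_mapping` and `range_mapping` children (with `data_block`, `data_begin` and `length` attributes); `shared_blocks` is a name invented for this example.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def shared_blocks(xml_source):
    """Map each pool data block to the set of dev_ids that reference it,
    keeping only blocks referenced by more than one device."""
    owners = defaultdict(set)
    dev_id = None
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start" and elem.tag == "device":
            dev_id = elem.get("dev_id")  # attributes are available at "start"
        elif event == "end":
            if elem.tag == "single_mapping":
                owners[int(elem.get("data_block"))].add(dev_id)
            elif elem.tag == "range_mapping":
                begin = int(elem.get("data_begin"))
                for blk in range(begin, begin + int(elem.get("length"))):
                    owners[blk].add(dev_id)
            elem.clear()  # drop attributes/children we no longer need
    return {blk: devs for blk, devs in owners.items() if len(devs) > 1}
```

Expanding every range into individual blocks is the simple approach, not the cheap one; comparing ranges directly would be one way to pre-process and cut the overhead.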
-
More advanced dedup techniques include sliding hash window comparisons. At this point the dedup is actively doing new forms of compression, and it's not clear this is worth the trade-offs for most users. At the very least, it appears beyond what Python can do efficiently.
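To make the sliding-window idea concrete, here is a minimal sketch of content-defined chunking with a polynomial rolling hash. The constants (`WINDOW`, `BASE`, `MOD`, `MASK`) and function names are arbitrary choices for this example, not anything the tool uses, and the per-byte Python loop shows exactly why efficiency is a concern.

```python
import hashlib

WINDOW = 16              # bytes covered by the rolling hash (illustrative)
BASE = 257
MOD = (1 << 31) - 1
MASK = (1 << 6) - 1      # cut when the low 6 bits are zero (illustrative)

def chunk_boundaries(data, window=WINDOW, mask=MASK):
    """Return cut offsets chosen from the hash of the last `window` bytes."""
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    h = 0
    cuts = []
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # remove the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= window and (h & mask) == 0:
            cuts.append(i + 1)
    return cuts

def chunk_digests(data):
    """Split `data` at the rolling-hash cut points and hash each piece."""
    if not data:
        return []
    cuts = chunk_boundaries(data)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))
    digests, start = [], 0
    for end in cuts:
        digests.append(hashlib.sha256(data[start:end]).hexdigest())
        start = end
    return digests
```

Because each cut depends only on the preceding `WINDOW` bytes, inserting data at the front of a stream shifts the boundaries along with the content, so later chunks still hash the same and can be deduplicated.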
-
Detecting when a bkchunk is updated only at its start or end may save significant space (at least when LVM chunks are small). This would require special routines using extra bandwidth in the merge and receive functions.
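A minimal sketch of the detection step, assuming the old and new versions of a chunk are both available as equal-length byte strings (which is where the extra bandwidth comes in). `changed_span` is a hypothetical helper, not an existing function.

```python
def changed_span(old, new):
    """Return (start, end) of the smallest slice of `new` that differs
    from `old`, or None if the two chunks are identical."""
    if len(old) != len(new):
        raise ValueError("chunks must be the same length")
    n = len(old)
    start = 0
    while start < n and old[start] == new[start]:
        start += 1  # skip the unchanged head
    if start == n:
        return None
    end = n
    while end > start and old[end - 1] == new[end - 1]:
        end -= 1    # skip the unchanged tail
    return start, end
```

For example, `changed_span(bytes(16), b"\x01\x02" + bytes(14))` returns `(0, 2)`, meaning only the first two bytes would need to be stored for that update.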