Explore De-duplication options #8

@tasket

Update: There are now 4 different dedup methods. See #8 (comment) for initial details.


Looking for ideas about a possible deduplication feature.

  1. One simple idea, which I've already tried manually with find | uniq | ln, is to find the duplicate hashes in the manifests and link their files together on the destination, thus saving disk space. This unsophisticated approach has low CPU and I/O overhead, but the prospective space savings are lackluster.

  2. It is also possible to retain and correlate the more detailed metadata from thin_dump, which indicates where source volumes share blocks. However, I've seen complaints about the CPU power needed to process these large XML files (which is not to say they couldn't be cleverly pre-processed to reduce overhead).

  3. More advanced dedup techniques include sliding hash window comparisons. At this point the dedup is effectively performing a new form of compression, and it's not clear this is worth the trade-offs for most users. At the very least, it appears to be beyond what Python can do efficiently.

  4. Detecting when a bkchunk is updated only at its start or end may save significant space (at least when LVM chunks are small). This would require special routines that use extra bandwidth in the merge and receive functions.
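
For discussion, idea 1 could be sketched roughly as below. This is only an illustration and not the existing code: it assumes each manifest line has the form `<hash> <relative chunk path>`, and the names `group_duplicates` and `link_duplicates` are hypothetical. It also assumes all chunk files live on one filesystem (hard links cannot cross filesystems) and that equal hashes mean equal content.

```python
import os
from collections import defaultdict
from pathlib import Path


def group_duplicates(manifest_lines):
    """Map each hash to the chunk paths sharing it; keep only hashes seen more than once.

    Assumed (hypothetical) line format: "<hash> <relative chunk path>".
    """
    by_hash = defaultdict(list)
    for line in manifest_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, relpath = parts
        by_hash[digest].append(relpath)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


def link_duplicates(root, groups):
    """Replace each duplicate with a hard link to the first path in its group.

    Returns the number of files that were re-linked.
    """
    root = Path(root)
    relinked = 0
    for paths in groups.values():
        keeper = root / paths[0]
        for rel in paths[1:]:
            dup = root / rel
            if not keeper.exists() or not dup.exists():
                continue  # manifest entry without a file on disk
            if os.path.samefile(keeper, dup):
                continue  # already linked together
            tmp = dup.with_name(dup.name + ".tmp-link")
            os.link(keeper, tmp)   # create the new link under a temporary name
            os.replace(tmp, dup)   # then atomically swap it into place
            relinked += 1
    return relinked
```

Creating the link under a temporary name and renaming it over the duplicate means a chunk file is never missing if the process is interrupted. The cost is one pass over the manifests plus one link and rename per duplicate, which matches the low CPU and I/O overhead noted in item 1.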

Labels: enhancement, help wanted
