Explore De-duplication options

**Update:** There are now 4 different dedup methods. See https://github.com/tasket/sparsebak/issues/8#issuecomment-489490004 for initial details.

---

Looking for ideas about a possible deduplication feature.

1. One simple idea that I've already tried manually with `find | uniq | ln` is to find the duplicate hashes in the manifests and link their files together on the destination, thus saving disk space. This unsophisticated approach has low CPU & IO overhead, but the prospective space savings are lackluster.

2. It is also possible to retain and correlate more detailed metadata from `thin_dump`, which indicates where source volumes share blocks. However, I've seen complaints about the CPU power needed to process these large XML files (that is not to say they couldn't be cleverly pre-processed to reduce overhead).

3. More advanced dedup techniques include sliding hash window comparisons. At this point the dedup is actively doing new forms of compression, and it's not clear this is worth the trade-offs for most users. At the very least it appears beyond what Python can do efficiently.

4. Detecting when a bkchunk is updated only at its start or end may save significant space (at least when LVM chunks are small). This would require special routines using extra bandwidth in the merge and receive functions.
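
To make idea 1 concrete, here is a rough sketch of the manifest-based approach. It is not the tool's actual code: it assumes each manifest line has the form `<hash> <relative path>`, that the linking is done with hard links, and that the function names `find_duplicates` and `link_duplicates` are made up for illustration.

```python
import os
from collections import defaultdict


def find_duplicates(manifest_lines):
    """Group chunk paths by hash and keep only hashes seen more than once.

    Assumes each line looks like "<hash> <relative path>" (an assumption,
    not a documented manifest format).
    """
    by_hash = defaultdict(list)
    for line in manifest_lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, path = parts[0], parts[1].strip()
        by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


def link_duplicates(dupes, root):
    """Replace each duplicate file with a hard link to the first of its group.

    Returns the number of files that were re-linked.
    """
    linked = 0
    for paths in dupes.values():
        keeper = os.path.join(root, paths[0])
        for other in paths[1:]:
            target = os.path.join(root, other)
            if os.path.samefile(keeper, target):
                continue  # already the same inode
            # Link under a temporary name, then rename over the duplicate,
            # so the target path never goes missing.
            tmp = target + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, target)
            linked += 1
    return linked
```

This only catches chunks that are byte-identical as whole files, which is why the savings are limited. It also trusts the manifest hashes completely; a more careful version would compare file contents before linking.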
