Explore De-duplication options #8
Not sure de-duplication (whatever the method) is worth it: I know the tool is meant for more general usage than Qubes OS, but the area with the most duplicated data would be the root volumes of similar templates. And Qubes OS or not, I'm not sure that identical blocks could be detected as such because of fs/block alignment discrepancies (but I could be wrong).

And even with deduplication, the savings for Qubes users would not really be significant: there's no benefit with the default setup (1 template), and in my case - 3 fedora templates - I've computed I'd be saving 7G max. That's really not much compared to my private volumes (200+ G), which have barely any duplicated data. Moreover, a lot of resources would be lost when the de-duplication algorithm goes over hundreds of gigabytes of non-duplicated data.

That said, deduplication would help with multiple MS Windows VMs (again, if fs alignment with the underlying block device doesn't pose an issue). And last, there's the problem of encryption: deduplication would require cleartext access to all the data, or at least to the manifests. That would be a problem when using different keys.

FWIW I'm using hardlink to deduplicate files from server backups. There might be some interesting implementation ideas that could be reused for hashes, should de-duplication be implemented.
I'm initially inclined to agree about root volumes being the main source of overlap. But I can imagine many situations (a large minority of them) where the user feels it necessary to clone certain appVMs with large initial data sets, then make unique adjustments to each clone... and then include them all in the backups. Luckily, this sort of overlap information should be readily available from lvm.

This corner case also suggests that method 1 (the quick/easy one) is worthwhile. But it should only try to dedup under specific conditions or on user command, to avoid bogging down. Significant alignment will tend to be there when the dups are from cloning; otherwise there will still be some small set of incidental alignment even when the addresses differ. It's even possible to have sparsebak accept a 'cloning clue' from an OS hook and then remember whether 2 or more configured volumes are related by cloning.

Encryption is not an issue for method 1, where all that's needed is sorting the manifests; it might not be for method 2 as well.
Try for an optional type-1 dedup feature in 0.5 when encryption is introduced.
Initial tests with simple hash-comparison and hardlink, when using the new default 64k chunk size, are quite positive.

Observations:

See Testing Notes in the Readme for brief usage instructions.
I've done some tests with
By the way quite a bit of time is spent with Given
Let me know if you'd like me to do more specific tests. (As usual - great work!)
Thanks for the report! This initial attempt at indexing was a bit naive, but even so, deduplicating large sets has a reputation for being a memory hog (many people found this out when trying zfs dedup). I have dom0 capped at 1.5G and saw less swapping with my smaller archive; I don't think it was churning, so you might want to increase your dom0 memory to this amount.

There's an update that reduces index memory use, with ints and object refs instead of strings. The reduction is between 30% and 50%, but CPU use has gone up slightly (Python dicts are known to be optimized for strings). I may get it down by 3/4 if I can somehow use ctypes or other traditional integer types. You can compare the process memory use of the old and new index types with the

Though I haven't thought that deeply about it yet, I believe (maybe wrong) the index size is sub-quadratic because the base of the exponent never reaches 2 under normal operating conditions (entropy is moderate).

Improving the dedup conditions with some type of affinity setting, as you suggest, is certainly possible, and I believe this is another use case for multiple archive sets... having volumes in the same set will indicate dedup affinity. The archive set is all about the characteristics of the (dest) storage, such as where it's located; it can also hold the parameters that affect storage efficiency (you will also be able to set the chunk size, compression params and overall cap on allocated space).
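A minimal sketch of the ints-instead-of-strings idea described above (not the tool's actual code; the `(session, address)` value layout is illustrative): keying the chunk index by an integer built from the raw digest avoids the larger 64-character hex string objects.

```python
import hashlib
import sys

chunk = b"example chunk data" * 1000

hex_key = hashlib.sha256(chunk).hexdigest()                      # 64-char string key
int_key = int.from_bytes(hashlib.sha256(chunk).digest(), "big")  # 256-bit integer key

# On CPython the int object is roughly half the size of the hex string,
# which adds up over millions of index entries.
print(sys.getsizeof(hex_key), sys.getsizeof(int_key))

# Hypothetical index entry: digest -> (session number, chunk address),
# so a duplicate chunk can point back to its first occurrence.
index = {int_key: (3, 0x40000)}
```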
BTW, I don't mind discussion in qubes-users and elsewhere (this is not a Qubes-specific tool :) ); just keep in mind it's at the "leaving experimental" phase and due for a name change.
A 100% dedup benchmark on a 9.8GB (65% full) Windows 7 lv with the revised code:
This tells me the code is running pretty smoothly, and the only two bottlenecks are swapping and compression. Improving these won't be very hard.
New dedup options 3, 4 and 5, which use less than half the memory of 1 & 2 (naive dict indexing), have been added. They show much less swapping and impact on other system processes; however, my initial tests show no great backup speed improvement over the older algorithms, probably because my tests have been either SSD-to-SSD or over a fast internal net, making compression times overwhelm the test results. This suggests the new algos are a pretty good trade-off for users trying to speed up over-Internet backups or stay within limited disk space constraints on the destination - and these are both common conditions that users experience.

Current de-duplication indexing methods and typical memory use:
The test corpus consists of 4 data-only volumes and 2 different Linux distros, totaling about 20GB with daily incremental updates for 9 weeks (so archive space is substantially larger than 20GB). Space savings from dedup are around 13%, but special tests with cloned volumes can go as high as 100%. Rigorous benchmarks to come in the future...

Notes: The indexing model of compile-and-reference-on-the-fly is highly dynamic and demands a lot from the language and runtime environment. After more than 2 weeks honing these methods, I suspect there is hardly any room for improvement with what Python offers natively. Should further improvement become necessary, I'd likely briefly explore numpy arrays (plugged into algo 4 or 5) before trying a ctypes/glibc solution.

The sqlite3 method doesn't perform too badly overall despite the long initialization time. Init can be cut by half if the db is moved to

On the whole, the current best performer seems to be 5, the bytearray tree method, which loosely sorts index entries into many bytearrays (embedded in a dictionary)... this strikes a good balance between memory efficiency and speed. However, it's possible that 4, the array tree, may scale better. Python's bytearrays vs numeric arrays have an odd asymmetry with

In future, it may be advantageous to cap the number of index entries based on available memory and search speed. In this case, the first step in creating the (limited) index would be creating a histogram identifying oft-repeating chunks to be included first.
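A rough sketch of how a "bytearray tree" index could work, based on my reading of the description above (bucket layout, record format and function names are assumptions, not the actual implementation): entries are grouped into buckets keyed by the first digest byte, and each bucket is one bytearray of packed fixed-size records, avoiding per-entry Python object overhead.

```python
import hashlib
import struct

RECORD = 31 + 8                      # 31 remaining digest bytes + packed 64-bit address
buckets = {}                         # first digest byte -> bytearray of packed records

def add(digest: bytes, address: int) -> None:
    bucket = buckets.setdefault(digest[0], bytearray())
    bucket += digest[1:] + struct.pack("<Q", address)

def lookup(digest: bytes):
    bucket = buckets.get(digest[0], b"")
    tail = digest[1:]
    # Linear scan within the (small) bucket, stepping by fixed record size.
    for i in range(0, len(bucket), RECORD):
        if bucket[i:i + 31] == tail:
            return struct.unpack("<Q", bucket[i + 31:i + RECORD])[0]
    return None

d = hashlib.sha256(b"chunk data").digest()
add(d, 0x200000)
print(lookup(d))                     # -> 2097152
```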
Possible new dedup mode: Simple inline comparison

The implementation would be to simply look at hashes in-sequence from the prior snapshot of the volume being backed up; if there is a match, then merely skip to processing the next chunk (no hardlink necessary). This is helpful because deltas are often relative to: A) the same volume, B) the latest prior snapshot, and C) the same location in the volume. This points to a possible default dedup mode that is more lightweight and would probably offer more than half the reduction of the existing dedup modes.

Given the nature of thin LVM and other snapshot storage systems, the delta granularity is not ideal for space-efficient backups. Using the existing dedup options somewhat addresses this because a typical disk LVM with chunk sizes from 256K to 4MB will often trigger dedup where writes to the volume were less than half the LVM chunk size. So a contiguous write of 30KB to a volume will initially look like a 4MB delta (on a system with a 4MB chunk size), but dedup routines can pare that down to 64KB.
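A hedged sketch of the inline-comparison idea (the `prior_manifest` dict and `save_chunk` callback are illustrative names, not the tool's real API): before saving a chunk flagged as changed by the coarse LVM delta map, compare its hash to the hash recorded at the same address in the previous session's manifest; if they match, nothing within this chunk actually changed and it can be skipped without hardlinking or a global index.

```python
import hashlib

def backup_chunk(address: int, data: bytes, prior_manifest: dict, save_chunk):
    digest = hashlib.sha256(data).hexdigest()
    if prior_manifest.get(address) == digest:
        return digest             # same hash at same address: skip, reuse prior chunk
    save_chunk(address, data)     # genuinely new or modified data
    return digest
```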
The inline comparison has been implemented, and dedup in general is now leaving experimental status. |
Update: There are now 4 different dedup methods. See #8 (comment) for initial details.
Looking for ideas about a possible deduplication feature.
One simple idea that I've already tried manually with `find | uniq | ln` is to find the duplicate hashes in the manifests and link their files together on the destination, thus saving disk space. This unsophisticated approach has low CPU & IO overhead, but the prospective space savings are lackluster.

It is also possible to retain and correlate more detailed metadata from `thin_dump`, which indicates where source volumes share blocks. However, I've seen complaints about the CPU power needed to process these large XML files (that is not to say they couldn't be cleverly pre-processed to reduce overhead).

More advanced dedup techniques include sliding hash window comparisons. At this point the dedup is actively doing new forms of compression, and it's not clear this is worth the trade-offs for most users. At the very least, it appears beyond what Python can do efficiently.
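A simplified sketch of the first (manifest-sorting + hardlink) approach above, assuming a flat iterable of `(hash, chunk_path)` pairs gathered from the manifests; the function name and input format are hypothetical:

```python
import os

def dedup_archive(manifest_entries):
    """manifest_entries: iterable of (hash_hex, chunk_path) pairs from the manifests."""
    first_seen = {}
    for digest, path in manifest_entries:
        if digest not in first_seen:
            first_seen[digest] = path            # first copy of this content stays
        else:
            original = first_seen[digest]
            if not os.path.samefile(original, path):
                os.unlink(path)
                os.link(original, path)          # replace duplicate with a hardlink
```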
Detecting when a bkchunk is updated only at its start or end may save significant space (at least when lvm chunks are small). This would require special routines using extra bandwidth in merge and receive functions.
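A hedged illustration of that last idea: when a chunk differs from its prior version only near the start or end, store just the changed middle span plus its offset instead of the whole chunk. (The matching logic that merge and receive would need to reassemble the chunk is omitted here.)

```python
def partial_delta(old: bytes, new: bytes):
    if len(old) != len(new):
        return 0, new                       # size changed: fall back to the full chunk
    start = 0
    while start < len(new) and old[start] == new[start]:
        start += 1
    end = len(new)
    while end > start and old[end - 1] == new[end - 1]:
        end -= 1
    return start, new[start:end]            # offset and modified span only
```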