
Explore De-duplication options #8

Closed
tasket opened this issue Dec 12, 2018 · 11 comments
Labels: enhancement, help wanted

tasket commented Dec 12, 2018

Update: There are now 4 different dedup methods. See #8 (comment) for initial details.


Looking for ideas about a possible deduplication feature.

  1. One simple idea that I've already tried manually with find | uniq | ln is to find duplicate hashes in the manifests and hard-link the corresponding files together on the destination, thus saving disk space. This unsophisticated approach has low CPU & IO overhead, but the prospective space savings are lackluster. (A rough sketch of this idea appears after this list.)

  2. It is also possible to retain and correlate more detailed metadata from thin_dump, which indicates where source volumes share blocks. However, I've seen complaints about the CPU power needed to process these large XML files (which is not to say they couldn't be cleverly pre-processed to reduce overhead).

  3. More advanced dedup techniques include sliding hash window comparisons. At that point the dedup is actively doing new forms of compression, and it's not clear this is worth the trade-offs for most users. At the very least it appears to be beyond what Python can do efficiently.

  4. Detecting when a bkchunk is updated only at its start or end may save significant space (at least when LVM chunks are small). This would require special routines that use extra bandwidth in the merge and receive functions.
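
A rough sketch of option 1 (a hypothetical illustration: the manifest format is assumed here to be one "<hash> <relative-chunk-path>" pair per line, which may not match sparsebak's actual layout):

    import os
    from collections import defaultdict

    def dedup_by_manifest(dest_dir, manifest_paths):
        # Group destination chunk files by their manifest hash.
        by_hash = defaultdict(list)
        for mpath in manifest_paths:
            with open(mpath) as mf:
                for line in mf:
                    digest, relpath = line.split()[:2]
                    by_hash[digest].append(os.path.join(dest_dir, relpath))

        # Replace each duplicate chunk file with a hard link to the first copy.
        saved = 0
        for paths in by_hash.values():
            keep = paths[0]
            for dup in paths[1:]:
                if os.path.samefile(keep, dup):
                    continue  # already hard-linked
                saved += os.path.getsize(dup)
                os.unlink(dup)
                os.link(keep, dup)
        return saved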

tasket added the enhancement and help wanted labels Dec 12, 2018
tasket added this to the v1.0 milestone Dec 12, 2018
ghost commented Dec 12, 2018

Not sure de-duplication (whatever the method) is worth it: I know the tool is meant for more general usage than Qubes OS, but the area with the most duplicated data would be the root volumes of similar templates. And Qubes OS or not, I'm not sure that identical blocks could be detected as such because of fs/block alignment discrepancies (but I could be wrong).

And even with deduplication, the savings for Qubes users would not really be significant: there's no benefit with the default setup (1 template) and in my case - 3 fedora templates - I've computed I'd be saving 7G max. That's really not much compared to my private volumes size (200+ G) that have barely any duplicated data. Moreover a lot of resources would be lost when the de-duplication algorithm goes over hundreds of gigabytes of non-duplicated data.

That said deduplication would help with multiple MS Windows VMs (again, if fs alignment with the underlying block device doesn't pose an issue).

And last, there's the problem of encryption: deduplication would require cleartext access to all the data, or at least to manifests. That would be a problem when using different keys.

FWIW I'm using hardlink to deduplicate files from servers backups. There might be some interesting implementation ideas that could be reused for hashes, should de-duplication be implemented.

tasket commented Dec 12, 2018

I'm initially inclined to agree about root volumes being the main source of overlap. But I can imagine many situations (a large minority of them) where the user feels it necessary to clone certain appVMs with large initial data sets and then make unique adjustments to each clone ...and then include them all in the backups. Luckily, this sort of overlap information should be readily available from lvm.

This corner-case also suggests that method 1 (the quick/easy one) is worthwhile. But it should only try to dedup under specific conditions or user command to avoid bogging down.

Significant alignment will tend to be present when the dups are from cloning; otherwise there will still be some small amount of incidental alignment even when the addresses differ. It's even possible to have sparsebak accept a 'cloning clue' from an OS hook and then remember whether 2 or more configured volumes are related by cloning.

Encryption is not an issue for method 1, where all that's needed is sorting the manifests; it might not be an issue for method 2 either.

tasket modified the milestones: v1.0, v0.5 Dec 20, 2018
tasket commented Dec 20, 2018

Try for an optional type-1 dedup feature in 0.5 when encryption is introduced.

tasket added a commit that referenced this issue Apr 16, 2019
See issue #8 notes.
tasket commented Apr 16, 2019

Initial tests with simple hash-comparison and hardlink, when using the new default 64k chunk size, are quite positive.

Observations

  1. With no cloned volumes/data in a 2-month old daily archive, an overall 13% reduction in disk usage is achieved.
  2. Incremental send operations often see reduction in the 30-40% range.
  3. Cloned volumes are often reduced >50% and can see a 100% reduction if the clone is unmodified.

See Testing Notes in the Readme for brief usage instructions.

tasket modified the milestones: v0.5, v0.3 Apr 16, 2019
ghost commented Apr 17, 2019

I've done some tests with send --testing-dedup on a fedora template that was cloned from a "minimal" template months ago and had quite a few additional packages installed. For info the original "minimal" template data size is ~1.5GB and the cloned one is ~2.2GB.

  • the (real) time to complete a backup without dedup was ~3 minutes and I could work with the laptop while the backup was being completed.
  • the time with dedup was ~7m50s and the laptop was almost unresponsive for the whole backup duration. I noticed approx. +1GB of swap usage, so heavy swap I/O is the likely culprit here (I have restricted dom0's total mem to only 1GB to free up memory for vms - it worked fine until now - so I'll have to reboot with more ram and see how dedup fares).
  • size reduction was ~700MB (32% of total / ~50% of the "shared" content), which is not bad at all given that the template was cloned months ago.

By the way, quite a bit of time is spent with --testing-dedup after "Done." is displayed, even when the volume didn't change since the last backup. send without dedup takes 2 seconds on a test volume (without changes) while it takes more than 3 minutes with --testing-dedup (and CPU/memory usage increases significantly during that step).

Given --testing-dedup's significant use of resources, it seemed to make more sense to use "standard" backups and run sparsebak.py testing-dedup-existing ... unattended every once in a while (BTW that's how I back up all my servers: daily rsync without deduplication and a hardlink run every Sunday).
Half of my backup data is from -private volumes that have hardly any files in common, so there's no point in spending resources trying to deduplicate them (side question: any idea how memory usage grows with the data set size? Quadratic?).
The other half is mainly from cloned MS Windows vms and a few cloned linux templates, so it made sense to restrict testing-dedup-existing to specific subsets of volumes. I did a quick test with my vm-win7-*root volumes backups: it took only 48 seconds - this time without thrashing my laptop - and the size reduction was extremely good considering how/when the vms were cloned and which software/updates were installed:

7.4G  7.3G   (1%)  vm-win7-r4-clean-fullupdate-root
3.7G  688M  (19%) vm-win7-r4-clean-root
7.4G  994M  (13%) vm-win7-r4-work-root
4.7G  4.5G  (1%) vm-win7-template-root

vm-win7-r4-clean is a "clean" (e.g. no updates) win7 VM I created on R4 long ago. I've cloned it into vm-win7-r4-work, where I installed additional software, as well as into vm-win7-r4-clean-fullupdate, which has seen a full Windows Update cycle. vm-win7-template was imported from R3.2 and there wasn't any size reduction for that one (probably because of a different disk alignment; it's interesting that there is a bit of deduplication though - those chunks were probably found at other places on the same disk).

Let me know if you'd like me to do more specific tests.

(As usual - great work!)

tasket commented Apr 18, 2019

Thanks for the report! This initial attempt at indexing was a bit naive, but even so, deduplicating large sets has a reputation for being a memory hog (many people found this out when trying ZFS dedup).

I have dom0 capped at 1.5G, and with my smaller archive saw less swapping; I don't think it was churning so you might want to increase your dom0 memory to this amount.

There's an update that reduces index memory use by storing ints and object refs instead of strings. The reduction is between 30% and 50%, but CPU use has gone up slightly (Python dicts are known to be optimized for strings). I may get it down by 3/4 if I can somehow use ctypes or other traditional integer types. You can compare the process memory use of the old and new index types with the test1 and test2 commands. The send and dedup-existing routines have been moved to the new index, but I'm leaving the old one in place for now for comparison.
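
A small illustration of why the key type matters (hypothetical, just to show relative sizes; the real index structure differs):

    import sys

    digest_hex = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
    as_str = digest_hex            # 64-character hex string key
    as_int = int(digest_hex, 16)   # same value packed into a Python int

    # The int form of the digest is roughly half the size of the hex string,
    # before even counting per-entry dict overhead.
    print(sys.getsizeof(as_str), sys.getsizeof(as_int))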

Though I haven't thought that deeply about it yet, I believe (maybe wrongly) the index size is sub-quadratic because the base of the exponent never reaches 2 under normal operating conditions (entropy is moderate).

Improving the dedup conditions with some type of affinity setting, as you suggest, is certainly possible, and I believe this is another use case for multiple archive sets... having volumes in the same set would indicate dedup affinity. The archive set is all about the characteristics of the (dest) storage, such as where it's located; it can also hold the parameters that affect storage efficiency (you will also be able to set the chunk size, compression params and an overall cap on allocated space).

tasket commented Apr 18, 2019

BTW, I don't mind discussion in qubes-users and elsewhere (this is not a Qubes-specific tool :) ) - just keep in mind it's at the "leaving experimental" phase and due for a name change.

tasket commented Apr 19, 2019

A 100% dedup benchmark on a 9.8GB (65% full) Windows 7 lv with the revised code:

11m40s  sparsebak --testing-dedup send
12m00s  gzip -4 >/dev/null
15m37s  gzip -4 >test.img.z

This tells me the code is running pretty smoothly and the only two bottlenecks are swapping and compression. Improving these won't be very hard.

tasket changed the title from De-duplication concepts to Explore De-duplication options May 6, 2019
tasket added a commit that referenced this issue May 6, 2019
fix issue #32 and issue #8.
tasket commented May 6, 2019

New dedup options 3, 4 and 5, which use less than half the memory of 1 & 2 (naive dict indexing), have been added. They show much less swapping and much less impact on other system processes; however, my initial tests show no great backup speed improvement over the older algorithms, probably because my tests have been either SSD-to-SSD or over a fast internal net, so compression times overwhelm the results. This suggests the new algos are a pretty good trade-off for users trying to speed up over-Internet backups or stay within limited disk space constraints on the destination - both common conditions that users experience.

Current de-duplication indexing methods and typical memory use:

Algo #  Algo name       Max memory  Init time (sec.)
2       naive dict      950MB       3
3       sqlite3         180MB       18
4       array tree      230MB       9
5       bytearray tree  230MB       6

The test corpus consists of 4 data-only volumes and 2 different Linux distros, totaling about 20GB, with daily incremental updates for 9 weeks (so archive space is substantially larger than 20GB). Space savings from dedup are around 13%, but special tests with cloned volumes can go as high as 100%.

Rigorous benchmarks to come in the future...

Notes:

The indexing model of compile-and-reference-on-the-fly is highly dynamic and demands a lot from the language and runtime environment. After more than 2 weeks honing these methods, I suspect there is hardly any room for improvement with what Python offers natively. Should further improvement become necessary, I'd likely briefly explore numpy arrays (plugged into algo 4 or 5) before trying a ctypes/glibc solution.

The sqlite3 method doesn't perform too badly overall, despite the long initialization time. Init can be cut in half if the db is moved to :memory:, but this greatly increases RAM use - to over 520MB in the above example with my test suite. This method uses ctypes to translate signed <-> unsigned ints in order to trick sqlite into storing the equivalent of full 64-bit unsigned ints (storing them as blobs seemed to be a waste).
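
A minimal sketch of that signed/unsigned trick, assuming hashes are handled as 64-bit unsigned ints (the table layout below is hypothetical, not sparsebak's actual schema):

    import ctypes, sqlite3

    def to_signed64(u):
        # SQLite INTEGER is a signed 64-bit type, so reinterpret the unsigned
        # hash value's bits as a signed int before storing it.
        return ctypes.c_int64(u).value

    def to_unsigned64(s):
        # Reverse the reinterpretation when reading the value back.
        return ctypes.c_uint64(s).value

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE chunks (hash INTEGER PRIMARY KEY, path TEXT)")
    h = 0xfedcba9876543210  # example unsigned 64-bit hash value
    con.execute("INSERT INTO chunks VALUES (?, ?)", (to_signed64(h), "some-chunk-path"))
    stored, path = con.execute("SELECT hash, path FROM chunks").fetchone()
    assert to_unsigned64(stored) == h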

On the whole, the current best performer seems to be 5, the bytearray tree method, which loosely sorts index entries into many bytearrays (embedded in a dictionary)... this strikes a good balance between memory efficiency and speed. However, it's possible that 4, array tree, may scale better. Python's bytearrays and numeric arrays have an odd asymmetry between append and search speed (bytearrays are better at search), which suggests that Python's engineers have not taken highly dynamic use cases into account; with moderate optimization they could probably achieve a 10-100X improvement in array search speed.
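
To show the general shape of such an index (a hypothetical simplification, not the actual sparsebak data structure): digests are bucketed by a short prefix, each bucket is one flat bytearray of concatenated fixed-length records, and lookups only accept record-aligned matches.

    RECLEN = 20  # assume fixed-length digests, e.g. SHA-1

    class ByteArrayIndex:
        def __init__(self):
            # 2-byte digest prefix -> bytearray of concatenated digests
            self.buckets = {}

        def add(self, digest):
            self.buckets.setdefault(digest[:2], bytearray()).extend(digest)

        def __contains__(self, digest):
            bucket = self.buckets.get(digest[:2])
            if bucket is None:
                return False
            pos = bucket.find(digest)
            while pos != -1:
                if pos % RECLEN == 0:  # only record-aligned hits count
                    return True
                pos = bucket.find(digest, pos + 1)
            return False

    idx = ByteArrayIndex()
    idx.add(b"\x01" * RECLEN)
    assert (b"\x01" * RECLEN) in idx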

In future, it may be advantageous to cap the number of index entries based on available memory and search speed. In this case the first step to creating the (limited) index would be creating a histogram identifying oft-repeating chunks to be included first.
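
If that capping approach is pursued, the first pass could be as simple as this (a sketch under the assumption that all chunk hashes for the archive can be streamed once):

    from collections import Counter

    def capped_index_hashes(all_hashes, max_entries):
        # Count how often each hash repeats, then keep only the most-repeated
        # hashes (and only those that actually repeat) in the limited index.
        counts = Counter(all_hashes)
        return {h for h, n in counts.most_common(max_entries) if n > 1}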

tasket commented Jan 2, 2020

Possible new dedup mode: Simple inline comparison

Implementation would be to simply look at hashes in-sequence from the prior snapshot of the volume being backed up; if there is a match, then merely skip to processing the next chunk (no hardlink necessary).
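
A minimal sketch of the idea, assuming each session's manifest can be walked as ordered (chunk address, hash) pairs (a hypothetical structure, not the actual manifest code):

    def chunks_to_send(current, prior):
        """current, prior: iterables of (address, hash) in volume order."""
        prior_at = dict(prior)
        for addr, digest in current:
            # Same hash at the same address as in the prior snapshot: the data
            # is unchanged, so simply skip it - no hardlink needed.
            if prior_at.get(addr) == digest:
                continue
            yield addr, digest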

This is helpful because deltas are often relative to: A) the same volume, B) the latest prior snapshot, and C) the same location in the volume. This points to a possible default dedup mode that is more lightweight and would probably offer more than half the reduction of the existing dedup modes.

Given the nature of thin LVM and other snapshot storage systems, the delta granularity is not ideal for space-efficient backups. Using the existing dedup options somewhat addresses this, because a typical disk LVM with chunk sizes from 256K to 4MB will often trigger dedup where writes to the volume were less than half the LVM chunk size. So a contiguous write of 30KB to a volume will initially look like a 4MB delta (on a system with a 4MB chunk size), but the dedup routines can pare that down to 64KB.

tasket commented May 16, 2021

The inline comparison has been implemented, and dedup in general is now leaving experimental status.

tasket closed this as completed May 16, 2021