Performance hits of user choices #81
There is already an issue for generic CPU optimization. Re-compression dominates the CPU cost, and the respective efficiency of both options depends on just how much the archive copy differs from the local volume being over-written (more difference = less efficiency). This type of issue is why backup tools like Wyng try to migrate to efficient compression libraries as soon as they can, bc uncompressed data chunks cannot be safely compared. Hashtype is already the fastest type, sha256. Chunk size IIRC was chosen to reduce the amount of metadata being sent over a slow network and to reduce the metadata that had to be verified in the Heads env. You might try an archive configured with smaller chunk sizes (the default is 65536) to see how that impacts send/receive ops.
Did a comparison to measure the virtualization and additional IO costs.
Windows-standalone-root was chosen because it's the biggest LVM I had on hand, weighing 16GB of backed-up data in the backup storage's compressed size (26465MiB on thin LVM, as reported by the Qubes OS manager). AppVM (Qubes OS mode):
Locally mounted LVM in dom0, same archive, but now without the IO + virtualization overhead:
There is another reason why --sparse can be slower: without sparse, the list of chunks to be sent is pre-fetched by the helper program, but with sparse it must wait for the local system to compress+compare before receiving the next chunk identifier. So that introduces latency. An idea for the future would be to overlap the comparison with the chunk fetching.
From #83 (comment)
Some clarifications. The 100% speed improvement was gained by using --sparse-write over --sparse on a locally mounted archive dir. It made no difference in my current tests over an sshfs-mounted, LUKS-mapped container on a loopback raw file. EDIT: Will verify the results of the SSHFS tuning advice (to see whether that is the 100% improvement expected here).
@tasket some results on a fresh Q4.1 install (and why I posted 3 bug reports). At the time of writing (2022-04-14):
Qubes 4.1 clean-install backup. root-autosnap is created at shutdown by a systemd system-shutdown hook at /usr/lib/systemd/system-shutdown/root-autosnap.shutdown:
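The hook's body did not survive in this thread; a minimal sketch of what such a script could look like, assuming a thin LVM snapshot in the default qubes_dom0 volume group (the VG/LV names are assumptions, not taken from the actual script):

```sh
#!/bin/sh
# Hypothetical reconstruction: refresh the root-autosnap thin snapshot at
# shutdown so Wyng can later back up a quiescent copy of dom0's root LV.
# The names qubes_dom0/root are assumptions.
lvm lvremove -f qubes_dom0/root-autosnap 2>/dev/null
lvm lvcreate -s -n root-autosnap qubes_dom0/root
```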
Interestingly enough, specifying vm-pool at arch-init still permits backing up root-autosnap from wyng. Basically, for the next tests, we vary the arch-init settings:
Most of the CPU operations happen in dom0, where wyng-backup seems to be waiting on IO. Unknowns: the cost of encryption (cannot test --encrypt=off on "Wyng 0.4.0alpha release 20220104"; bugs reported individually). Knowns:
@tasket !!!! Finally found a cheap provider to experiment with: veeble.org, $5 USD a month, 2GB RAM, 20GB SSD and 100TB bandwidth. Was able to duplicate the rsync.net subaccount setup based on basic user rights, and to create an rw account with an ro sub-account (in a subdir) used to specify what OEM image type is there (q41_insurgo here as an example), where the ssh authorized_keys is simply put somewhere else via an sshd_config override on user match:
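The override itself was elided above; a minimal sketch of the idea, with hypothetical account names and paths:

```
# sshd_config fragment (illustrative): keep authorized_keys outside the
# read-only account's home so the account cannot modify its own keys.
Match User q41-ro
    AuthorizedKeysFile /etc/ssh/authorized_keys.d/%u
```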
So safe state restoration as a service is totally feasible on cheap, storage-friendly VPS services (again, no 0.4 encryption testing, but I see no stopper there. Please fix #112 though!)

@tasket: you got an x230 laptop? This might go faster now. I see you are not active on Matrix?

Basically, the RW account is used by the OEM/org to create the archive under the RO account, which is made available for accessing backup archives on the condition of having one's public key in the authorized_keys above. The RO account is used in a qubes-ssh-specified app-qube per dom0 to retrieve trusted state archives, and it works pretty well as opposed to sshfs (now deprecated anyway...).

We will have a problem offering states as a service though. As of now, I see that the wyng helper script and errors are at a shared location; if multiple people were using the service at the same time, those should be isolated under different paths on the ssh server host. Want me to create an individual issue?

Some comparison of the performance differences of the current modes with wyng-backup defaults at arch-init. With/without --dedup and --sparse-write:
With/without --dedup and --sparse:

We see that the bandwidth consumed is strongly reduced, but CPU and processing time increase dramatically. Consequently, I think an emphasis should be made on the difference between sparse and sparse-write in the documentation. This has high impact and I wish I could trace what accounts for the difference a bit more. dom0 was using 10-20% CPU the whole time, so it was not busy enough to account for the difference in processing time. Load average on the server is 0.00 0.01 0.05, so if something could be done by the server to help the client speed things up (through the helper), that might be a nice avenue here. There seems to be no real reason to use all that bandwidth given the past results; something seems to be missing to catch and transmit only the needed changes faster. @tasket: thoughts? My intuition here is that the client could upload a bit more about its mapping to the server (4MB uploaded vs 361MB downloaded here, over an hour, while the previous tests completed in minutes on a 50mbit download link).
Conclusion: --dedup with --sparse vs --sparse-write
So having --sparse is:
@tlaurion I don't have an x230 but I do have a T430s which is internally almost identical. It currently has a basic Qubes 4.1 install and factory firmware. blake2 isn't required for v0.4; you can manually select sha256. zstd might give you a speed boost, but it could also mess things up because the format has been evolving recently, so I doubt the resulting "comparison chunks" will be reproducible (probably an issue bc the Python library and script library won't be identical). So bzip2 is still the safe bet. FWIW, I could now add gzip support to Wyng because the newer Python gzip lib allows overriding the time header info, which is required for consistent hashing.
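For reference, a small sketch of that deterministic-gzip idea in Python (the mtime override needs Python >= 3.8; the chunk data is just an illustration):

```python
import gzip
import hashlib

chunk = b"example chunk data"

# Fixing mtime=0 removes the timestamp from the gzip header, so identical
# plaintext always yields identical compressed bytes, and therefore a
# consistent hash, which is what chunk comparison requires.
blob = gzip.compress(chunk, compresslevel=6, mtime=0)
print(hashlib.sha256(blob).hexdigest())
```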
I think this is due to /tmp dir paths being static. I am already addressing this in v0.4, but if you need it working in v0.3 then open an issue. The benchmark is interesting; I would not have expected the added 80m.
Maybe Python isn't pushing the flush past its various io layers, or it may be an ssh/Internet buffering behavior. But yeah, I interpret this as mostly latency/waiting occurring when it shouldn't. Obviously sparse receive could be very valuable if this were resolved, so I'll definitely try to do so.
Yes, the procedural difference between sparse and non-sparse is that the latter sends an entire file list to the helper script in one batch, while sparse mode compares-then-requests each chunk individually. Doing it the current way actually presents an opportunity for reduced (not enlarged) processing time, but specific i/o behaviors may make it necessary to use asyncio to realize that potential. And yes, comparing all then sending the list to the helper would immediately improve performance, but that seems like the low road to me; we want CPU comparing and net i/o flowing simultaneously if possible.
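A rough sketch of that overlap idea with asyncio (differs_locally and fetch_from_helper are hypothetical placeholders, not Wyng functions):

```python
import asyncio

async def compare_producer(chunk_ids, queue):
    # CPU-bound compress+compare runs in a worker thread, feeding the
    # queue as soon as each differing chunk is identified.
    for cid in chunk_ids:
        if await asyncio.to_thread(differs_locally, cid):
            await queue.put(cid)
    await queue.put(None)  # sentinel: comparison finished

async def fetch_consumer(queue):
    # Network fetches proceed while comparison continues, so CPU work
    # and net i/o overlap instead of alternating.
    while (cid := await queue.get()) is not None:
        await fetch_from_helper(cid)

async def sparse_receive(chunk_ids):
    queue = asyncio.Queue(maxsize=64)  # bounded, to limit read-ahead
    await asyncio.gather(compare_producer(chunk_ids, queue),
                         fetch_consumer(queue))
```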
As per the (imperfect) PR proposed, I was able to integrate blake2 and zstd under Heads, and to remove the thin-provisioning-tools checks. zstd and blake2 are definitely speedier, so a little more detail on zstd not having consistent hashing would be welcome here for the next steps of testing. bzip2 is damn slow!
Yes, my initial tests of zstd files from different sources show they don't match. Under certain conditions they are very close in size, so I will look further with hexdump to see if the difference is just header info. blake2 isn't really faster than sha256, as the latter usually benefits from hw acceleration. However, blake2 is considered more secure as it has good resistance against length extension attacks. bzip2 does compare favorably to zstd speed when higher compression ratios are used. If you're OK with lower compression ratios (say 3.0:1 instead of 3.8:1) and compression speed is more important than net bandwidth, then gzip is a future possibility. Currently Wyng v0.3 cannot do gzip because it's geared to Python 3.5.
BTW, considering you are importing new tools into the Heads environment, the compression issue IIRC is resolved if the env has
BTW2... adding
@tlaurion After doing some manual tests with python-zstd and the 'zstd' command line tool, I have some good news... the output does match under the right conditions (see the zstandard issue linked below). The bad news: this was tested in a dom0 / fc32 system where both the Python library and the CLI tool use libzstd version 1.4.x. Newer Linux releases have a CLI command at version 1.5.x which does not yield matching output with the older library version. So for zstd to work with the Wyng sh script, for the time being you will have to use the older zstd v1.4.x in the Heads environment.
zstandard issue explaining the conditions for reproducibility: |
@tlaurion wyng-extract.sh has been updated in fix03 to make zstd compression reproducible and generally usable in this context. Compression levels 3-10 will give fast results with good size reduction.
@tasket Will look at it, but as stated in PR #104 the script contains bashisms that Heads' busybox (ash-compliant) doesn't like. I tried to remove some of those bashisms but broke the script doing so, leaving a trace of what needed to be removed to become more POSIX-compliant.
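For illustration, these are the kinds of constructs busybox ash rejects and their portable rewrites (generic examples with placeholder names, not lines taken from wyng-extract.sh):

```sh
# bashism: double-bracket test
# [[ $mode == sparse ]] && echo sparse
# POSIX/ash equivalent:
[ "$mode" = sparse ] && echo sparse

# bashism: arrays have no POSIX form
# vols=(root home); echo "${vols[0]}"
# ash workaround: positional parameters
set -- root home; echo "$1"

# bashism: process substitution
# while read -r line; do ...; done < <(some_cmd)
# ash workaround (note: the loop body runs in a subshell)
some_cmd | while read -r line; do echo "$line"; done
```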
From the compilation choices, I understood that blake2b is also hardware accelerated.
I think the priority will be to reduce restoration times, so I guess a combination of a higher compression level (zstd 3 is the default, right? So I should run tests with zstd 10-19?) and blake2b.
Will retest this; I'm not clear on the impact of the facebook/zstd#999 (comment) in our wyng-backup case. @tasket It's also confusing to learn that once dom0 is upgraded, the full backup archive will need to be redone? So basically, what I understand from this is that things will break if hashes are computed on the resulting compressed data and not on its origin blocks? This might be problematic?
@tasket I could also pack pigz instead of zstd and compare results with --sparse restoration. For the sake of state restoration as a service, there will be a choice to be made between archive lifetime and restoration speed over the network, on which as of now I don't have enough experimentation background.

The --sparse restoration results above came from the fix03 branch with wyng-backup default settings, used to back up over a local wyng qube, with the python script used to receive the archive. I only rsync'ed the archive over to the VPS for the network-based restoration tests shown, so any clear recommendations on settings to be tested at arch-init would be welcome to optimize network bandwidth and restoration time :)

Could also switch to testing the 0.4 branch from now on as well. I have not followed the improvements on that branch, but if the integrity contract is now built in (without encryption, or with it if it can be passed as an unattended option), I could start to test this instead, of course provided the wyng-extract script can be used with it going forward. Not to mix performance tests with long-term support concerns, but since states are meant to be selectable, I would definitely prefer directions that would not require recreating the archives too often :) As of now, just getting excited to have the PoC over Heads.
zstd level 10 will give about the same throughput as gzip/zlib level 4 but with noticeably better compression ratios. Feel free to experiment, but I personally wouldn't go above zstd 10; the setting I typically use is either 3 or 7. This benchmark chart gives a general idea of the differences. Keep in mind that for the wyng-extract.sh script in sparse mode, it must also do compression (in addition to decompression) in order to find/fetch only changed chunks.

When dom0 changes to zstd 1.5, some choices will have to be made. With Wyng-only operation, the "breakage" would manifest as dedup and remap becoming temporarily inefficient, but I would expect no data corruption. In particular, a remap op (where a mismatched snapshot is deleted and a new snapshot is paired) would result in a whole additional copy of the volume being added to the archive (although subsequent remaps of the same volume would not suffer this effect). IIRC the borg backup program standardized on zstd early and has issued many advisories to users to ditch and rebuild their archives after upgrading to avoid archives ballooning in size. For the time being, I will look for ways to advise/warn users, but I may put restrictions on which version can be used (already started this in the sh script). OTOH, a careful archive user/curator could discern when zstd has changed to 1.5 and then prune all the older sessions that were done with 1.4. I think for your use case with the sh script, disk space would be saved but bandwidth for dl updates is not saved. OTOH2, Ubuntu LTS already has 1.5 of the python3-zstd library, and that version is already in Debian Testing. Fedora lags badly, however, with no update between fc32 and fc37. Maybe consider backporting the 1.5 library to Fedora ourselves.

Hashing: I would use blake2b because the difference vs sha256 may not even be noticeable, as they are both far faster than most compression options.

Wyng 0.3 vs upgrading to v0.4alpha: The v0.4 format is going to change some more when alpha3 drops, but I don't anticipate any conversion roadblocks bc unencrypted data chunks will remain the same. There is already alpha1->alpha2 conversion that is done automatically, but I don't anticipate v0.3->v0.4 conversion until the end of alpha3. I still prefer to test the extractor sh script on v0.3 and then convert it to v0.4 later, mostly bc some tedious steps will have to be added to support the v0.4 format.

Verification of v0.4 archives: Think of it as mostly the same as v0.3, except you only need to do your own verification on archive.ini if the archive is unencrypted; archive.ini will verify the rest of the metadata and data. If the archive is encrypted then archive.ini is self-verifying.
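If you want to compare the levels discussed (3, 7, 10) on your own data, a trivial benchmark along these lines works (sample.img is a placeholder path; GNU time is assumed):

```sh
for lvl in 3 7 10; do
  # measure wall time and resulting size for each compression level
  /usr/bin/time -f "zstd -$lvl: %e s" zstd -$lvl -c sample.img > /tmp/out.$lvl.zst
  du -h /tmp/out.$lvl.zst
done
```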
@tlaurion Here is my updated survey of the situation, based on feedback from the zstd project and some recent tests I've made...

Assessment

Neither Zlib nor Gzip can match shell command output with Python lib output. This is unfortunate because Zlib output remains very consistent across versions, ranging from Fedora 32 through 36 and Python 3.5 through 3.11. Bzip2 output matches no matter what, across shell, Python and different versions. Zlib, Gzip and Bzip2 are mature, stable code bases.

Zstd can be very consistent between shell and Python output if the versions are similar. It's an encouraging sign, but the Zstd project is extremely noncommittal on the subject of reproducibility; if they so much as tweak a status message or fix a buffer-overflow vuln, we are to assume Zstd output will differ from the past.

Options
Other

Affects issue #54 – SSH/Rsync/remote: The extract shell script operates as a file batch processor, so the addition of remote access transfers ought to be straightforward.

Sparse mode: At this point I would make the script blockdev-only, which gets us past the busybox
Weird issue with ext4 while attempting to cp -alr an archive dir to another one. It seems there is a maximum number of possible references to the same blocks? Maybe the documentation should mention filesystem limits. As of now, we know ext4 might not be a perfect fit in terms of its fixed inode count (the maximum number of small files that can be created on an ext4 filesystem, determined at fs creation time) and this weird limit I encountered trying to archive an archive by doing a directory copy with hardlink tracking.
The hard-link limit for any single file on most Linux filesystems is about 65,000. Having any data that is quite that dedup-prone is a very small corner case. Wyng has its internal workaround, which you helped with via your feedback. But externally, no; nothing in GNU or Linux guards against it or works around it. That Wyng workaround could probably be enhanced so that links are kept to, say, 6,500 per file instead of 65,000. But I very much doubt it's a good idea to implement that before the "Cloud storage API" feature. But note... implementing an internal archive-copying feature could also be the answer.
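The cap is easy to demonstrate directly: on ext4 the per-inode link count maxes out at 65,000, so a loop like this stops with EMLINK (run in a throwaway directory):

```sh
mkdir /tmp/linktest && cd /tmp/linktest && touch f
i=0
while ln f "link$i" 2>/dev/null; do i=$((i+1)); done
echo "stopped after $i extra links"   # ~64999 on ext4 (EMLINK)
```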
Sorry if I am a bit rigid on documentation. I have a hard time wrapping my head around the current behavior: Qubes OS overhead on an AppVM's LVM disk used as backup storage is known to be high, so I thought of using --sparse-write to spread the CPU-pinning costs over 2 CPUs, giving it an edge over --sparse, but I don't see a direct benefit. Let me explain:

Doc today says:

Where the detailed doc says:

My understanding is that the present code is not parallelizing any work (in either mode; I think that's another ticket), so single-core performance would be the limit of the combined dom0 + qube virtualized IO of the backup storage in my use case (which happens over wyng-backups-vm storage).

I would still have expected --sparse-write (50% storage qube, 50% dom0) to speed up the receive operation over --sparse (100% CPU hit for local calculation and less pulling of the qube's stored backup data), but the results seem to be equal. Maybe you could clarify or give a bit more insight? Otherwise I will put timestamps in my scripts.
Pertinent notes on current archive.ini conf:
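The excerpt itself was elided; based on the settings discussed in this thread (bzip2 compression, 65536 chunk size, sha256 hashing) it would look roughly like the following, keeping in mind this is a hypothetical reconstruction whose key names may not match Wyng's actual archive.ini schema:

```ini
; hypothetical reconstruction, not the actual file
compression = bz2
chunksize = 65536
hashtype = sha256
```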
Where bz2 was chosen for Heads' current busybox support, and where I lost track of the chunksize and hashtype costs, so you may shed some light if you will! :)
Also note that https://git.busybox.net/buildroot/commit/?id=6bccac75ea3f8cd66bcde3747067add14b0c4f2c relies on a python script... so not gonna happen soon under Heads.