lib/model, lib/scanner: Efficient inserts/deletes in the middle of the file #3527
Conversation
ec047fb to c8f4d89
So it looks sort of reasonable on the face of it. Why, though? I mean, what's the compelling use case here? And what are the performance tradeoffs? I guess it results in up to blocksize number of rolling hash rounds and then one or more extra SHA256 rounds. Apart from that, what is the
I've filed an extra bug on the compiler issue; in the meantime, this compiles:
The use case is somewhat obvious: better performance for files that get content inserted in the middle. The algorithm is http://tutorials.jenkov.com/rsync/checksums.html (or similar to this: http://blog.liw.fi/posts/rsync-in-python/#comment-fee8d5e07794fdba3fe2d76aa2706a13)
If you are fine with the general idea/approach, I'll add tests and comments, and potentially add a flag to disable this.
Sure. But when is that required? I mean, in what real world situations is this a win? I've heard of Photoshop files, which may be large-ish (I guess on the scale of ~100 MB?) and are binary and rewritten on each save, so they can theoretically benefit from this. But are there any other real-life cases?
I don't see why we'd need to do more than But my real concern is how much this costs in the general case where it is not needed, that is, how much slower it will be to sync a change of some piece of data in the middle of some file. In the end I'm trying to judge whether this makes sense, based on the advantages for some use case I'm not sure of outweighing the increased cost of every sync operation.
Yes, we do blockSize rounds per block, but in total we do length_of_file rounds: we produce a hash for the first block, then add the next byte (which removes the last byte as the window slides), produce a hash again, and so on until we reach the end of the file, so in total it is length_of_file rounds.

Sadly, I don't have a use case. I mean, rsync does it, and that's the de facto best tool around, hence why I am trying to make Syncthing just as good. I know large Word files, large PDFs and large SVGs would benefit from this too.

In terms of cost, I feel that computing the rolling hash as we scan will have pretty much no noticeable impact compared to SHA256. When downloading, we'll have to compute the rollmap for the file, which again I suspect will have no noticeable impact compared to having to access the database, ask for stuff over the network and verify it with SHA256. Yes, there might be a few extra rounds of SHA256 involved in case the weak hash collides, but I suspect it's improbable for the hash to collide within the same file. Otherwise, if we don't collide, we are not going over the network and are already benefiting from savings.

Btw, if we get variable block sizes and micro delta indexes, Syncthing will be better than rsync in terms of features and network bandwidth required by the protocol.
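For reference, a minimal sketch of the rsync-style rolling checksum being described (an Adler/Fletcher-style sum that can be slid one byte at a time, so the whole file takes a single pass). Names like `rollingHash` are illustrative, not Syncthing's actual code:

```go
package main

import "fmt"

// rollingHash is an Adler/Fletcher-style weak checksum over a fixed window.
type rollingHash struct {
	a, b uint32 // running sums
	n    uint32 // window size in bytes
}

func newRollingHash(window []byte) *rollingHash {
	h := &rollingHash{n: uint32(len(window))}
	for i, c := range window {
		h.a += uint32(c)
		h.b += uint32(len(window)-i) * uint32(c)
	}
	return h
}

// Roll slides the window one byte: out leaves on the left, in enters on the right.
func (h *rollingHash) Roll(out, in byte) {
	h.a += uint32(in) - uint32(out)
	h.b += h.a - h.n*uint32(out)
}

func (h *rollingHash) Sum32() uint32 { return (h.b&0xffff)<<16 | h.a&0xffff }

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog, again and again")
	const window = 16

	// Hash the first window, then roll forward to offset 10, one byte per step.
	h := newRollingHash(data[:window])
	for i := window; i < 10+window; i++ {
		h.Roll(data[i-window], data[i])
	}

	// Hashing data[10:10+window] from scratch gives the same sum.
	direct := newRollingHash(data[10 : 10+window])
	fmt.Printf("rolled: %08x  direct: %08x\n", h.Sum32(), direct.Sum32())
}
```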
I mean, perhaps we should ask the community if they would find this useful.
Given that you've done the work, I feel it would be a shame to lose it, if the cost is negligible and it's nice and encapsulated...
... Maybe only enable it on files over a certain size if the cost does become an issue?
Right, I misunderstood how it worked and interpreted one "round" as one pass over the file, when in fact it's a total of one pass over the file and that's it? Given that, I think this is viable.

However, I think we should only trigger this based on some halfway intelligent heuristic, for example by detecting that all or most blocks have changed after a given point in a file. If we're just going to append data, replace the last block, or replace a block in the middle, there is no reason to compute a rolling hash. And it would be good to keep a few metrics on this, some sort of hit ratio, that we could add to the report and make future decisions based on.

(rsync was designed probably 30 years ago, for a different world of computing and other constraints and goals. It may still be correct on this point, but I'm not taking it as a given.)
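A rough sketch (hypothetical, not Syncthing's actual code) of the kind of heuristic suggested above: only attempt the rolling-hash search when the change to the block list looks like an insert/delete, i.e. when most blocks after the first differing one have changed:

```go
package blockheuristic

// BlockInfo stands in for the per-block strong (SHA-256) hash; the real
// structure in lib/protocol carries more fields.
type BlockInfo struct {
	Hash string
}

// WorthRolling reports whether, after the first differing block, most of the
// remaining blocks also differ. Plain appends and single-block rewrites
// return false; the 0.75 threshold is arbitrary, for illustration only.
func WorthRolling(prev, cur []BlockInfo) bool {
	n := len(prev)
	if len(cur) < n {
		n = len(cur)
	}
	first := -1
	for i := 0; i < n; i++ {
		if prev[i].Hash != cur[i].Hash {
			first = i
			break
		}
	}
	if first < 0 || first == n-1 {
		return false // identical, or only the last common block changed
	}
	changed := 0
	for i := first; i < n; i++ {
		if prev[i].Hash != cur[i].Hash {
			changed++
		}
	}
	return float64(changed)/float64(n-first) > 0.75
}
```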
Yes, we compute I just realized that it won't work with big files, as it's 4 bytes per hash, and if we have an 8 GB file that's 32 GB worth of hashes. I'll change the API so that it takes a list of hashes we are interested in, and returns offsets for each hash if it finds it during the roll. Yes, perhaps enabling this only for anything that will potentially have to find x MB from somewhere else would make sense.
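A sketch of that API shape; the names (`weakmatch`, `Find`) are hypothetical and the rolling checksum is the same Adler-style sum sketched earlier. The caller passes only the weak hashes it cares about, so memory stays proportional to the number of wanted blocks rather than to the file size:

```go
package weakmatch

import (
	"bufio"
	"io"
)

type rollingHash struct{ a, b, n uint32 }

func newRollingHash(window []byte) *rollingHash {
	h := &rollingHash{n: uint32(len(window))}
	for i, c := range window {
		h.a += uint32(c)
		h.b += uint32(len(window)-i) * uint32(c)
	}
	return h
}

func (h *rollingHash) roll(out, in byte) {
	h.a += uint32(in) - uint32(out)
	h.b += h.a - h.n*uint32(out)
}

func (h *rollingHash) sum32() uint32 { return (h.b&0xffff)<<16 | h.a&0xffff }

// Find rolls a blockSize window over r and returns, for each wanted weak
// hash, the byte offsets at which that hash occurs.
func Find(r io.Reader, wanted map[uint32]struct{}, blockSize int) (map[uint32][]int64, error) {
	br := bufio.NewReader(r)
	ring := make([]byte, blockSize) // current window contents, as a ring buffer
	if _, err := io.ReadFull(br, ring); err != nil {
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil, nil // file shorter than one block: nothing to match
		}
		return nil, err
	}

	h := newRollingHash(ring)
	found := make(map[uint32][]int64)
	pos, off := 0, int64(0) // pos: index of the oldest byte; off: window start offset
	for {
		sum := h.sum32()
		if _, ok := wanted[sum]; ok {
			found[sum] = append(found[sum], off)
		}
		in, err := br.ReadByte()
		if err == io.EOF {
			return found, nil
		}
		if err != nil {
			return nil, err
		}
		h.roll(ring[pos], in)
		ring[pos] = in
		pos = (pos + 1) % blockSize
		off++
	}
}
```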
Anyway, let's settle whether this is something we want to go for, and if it is, I'll drive it home.
I don't mind this, but I would like to know about at least one real world case where it will make a positive difference. As in an actual user saying "this will help a lot in my situation because x", where x is something more concrete than "I think the algorithm would be cool" :) After that it's "just" technical details and I'm sure we can solve them.
The gzipped-SQL-dumps one sounds vaguely plausible, if that does result in the sort of data shifts we are talking about here. Could be interesting to check.
Do PST files (MS Outlook) fall into this category?
PST files are not likely in this category; they work like a database file, expanding to the desired size by allocating space at the end, with rewrites happening in the middle without shifting allocations in the middle of the file. In essence, a PST file (or a SQLite database) is like a small filesystem which grows in size to fit internal allocations, without moving anything already allocated unless it's told to do so.

The biggest likely beneficiaries are the few odd applications that use fallocate(2) with FALLOC_FL_INSERT_RANGE or FALLOC_FL_COLLAPSE_RANGE (I know such applications exist, but I don't know of any specific examples, and they are essentially non-existent on Windows because it doesn't have a syscall that works that way). In theory, though, anything that shifts data around inside a file could benefit. The most obvious case would be plain text files with extra data inserted in the middle, but it should also benefit some compressed formats such as PSD, XCF, DOCX, ODT, and similar, as well as some archive formats.
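For illustration only, a Linux-specific sketch of the fallocate(2) modes mentioned above, via golang.org/x/sys/unix (it assumes a filesystem such as ext4 or XFS that supports these modes, and offsets/lengths aligned to the filesystem block size):

```go
//go:build linux

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.OpenFile("big.dat", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	const fsBlock = 4096 // must match the filesystem block size

	// Insert a 4 KiB range at offset 64 KiB; everything after it shifts right.
	if err := unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_INSERT_RANGE, 16*fsBlock, fsBlock); err != nil {
		log.Fatal(err)
	}

	// Collapse the same range again; everything after it shifts back left.
	if err := unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_COLLAPSE_RANGE, 16*fsBlock, fsBlock); err != nil {
		log.Fatal(err)
	}
}
```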
Indeed. But all of those file formats are usually small enough that this doesn't matter. If the compressed, gigabyte-sized SQL dump does benefit from this, that would be fair.
The thing is, once you start to see files bigger than a few megabytes, this type of thing does improve network bandwidth utilization measurably. Big files are of course going to benefit the most, but even smaller ones will benefit. Using rsync as an example (because it has the ability to behave like this), take a 256MB file, copy it, and then add one byte at the beginning of the file. Re-syncing this file will require copying the whole 256MB using a traditional copy, which means close to 270MB of data over the network using scp, while using a method like this proposal with rsync will require sending maybe a few KB of data over the network to achieve the same thing.
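A toy illustration of that point, assuming fixed 128 KiB blocks: prepend a single byte to a buffer and count how many per-block SHA-256 hashes still line up. With fixed block boundaries the answer is zero, which is exactly what the weak-hash search is meant to address:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const blockSize = 128 << 10

// blockHashes returns the SHA-256 hash of each fixed-size block of data.
func blockHashes(data []byte) [][32]byte {
	var hashes [][32]byte
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[off:end]))
	}
	return hashes
}

func main() {
	orig := make([]byte, 8*blockSize)
	rand.Read(orig)
	shifted := append([]byte{0x42}, orig...) // "add one byte at the beginning"

	a, b := blockHashes(orig), blockHashes(shifted)
	matches := 0
	for i := range a {
		if i < len(b) && a[i] == b[i] {
			matches++
		}
	}
	fmt.Printf("%d of %d blocks still match after the one-byte prepend\n", matches, len(a))
}
```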
The point is that we are unaware of anything that prepends to or inserts into a file while also being of reasonable size.
Use case: pack files. Media-heavy games preprocess thousands of assets into the in-memory format used at runtime, so instead of thousands of individual file loads plus processing, one file is dumped into RAM.

Use case: MNG (a sequence of PNGs). Artists typically add a new action (fade in, new loop) to the same file and update a text file to reference the new frames with a new action verb. A lot easier for programmers to deal with than a whole new asset.

Rsync is the king for updating iterated game assets. I can't stress that observation enough. Aspera etc. are woeful.
There are all kinds of other things too, though. Zip archives, for example, will benefit from this if new files are added or old ones are removed, or even if the archive is rewritten. ISO images would fall into the same category.

You can't think of this like a filesystem developer would. A file that gets replaced by a new version with the same name is seen by Syncthing as the same file with different data, not as a new, separate file (and that is exactly how it should be in userspace). This means that things which rewrite files using the replace-by-rename method still get a benefit from this, because Syncthing still sees them as the same file. There are very few things that prepend to or insert data into the middle of a file in place, but there are quite a few which rewrite the whole file and still cause existing data to shift around inside it.

Taking the example of a zip archive: create a zip archive with a bunch of files in it. Now create a second one with the same files, except remove one or add one more. Now do a hex dump of the two archives and compare them side by side (for example with vimdiff or winmerge). Most of the data will be the same between the files, just with parts of it shifted. The same applies to ISO 9660 filesystem images and, with slightly less frequency and in more specific conditions, to most other archive formats.
What about VM disk images (e.g. VMware .vmdk) or TrueCrypt containers? ;-)
These I suspect change in place, or are append-only, so they are already handled. We are only interested in prepend or insert cases.
[ CORRECTED ] So the gzipped SQL dumps argument is invalid (there are some similar patterns, but then it looks like it drifts):
Same for zip files.
Same for xz... bzip is different from the very beginning, it seems.
Hmm, OK, I'll investigate.
Bump. Not to be forgotten.
This fails the basic sync test on my computer, which passes on master. Given that it apparently passes for Audrius, I need to investigate. I haven't had time to do so.
@st-jenkins test this please
Yeah, so now that the other urgent things are maybe sorted for the moment, let me take another look.
@st-jenkins test this please, I thought Audrius fixed you
@AudriusButkevicius you have a couple of conflicts to resolve at your leisure, but in the meantime I'll try to test the branch in the state it was in.
OK, I've narrowed it down to us losing events somewhere, which is probably not something you've caused here but is triggered by timing or whatever; looking into that.
Now it works for me. Let's get master merged into this so it's clean, and make some progress.
We lose important events due to the frequency of ItemStarted / ItemFinished events.
@st-jenkins test this please
@st-jenkins test this please
@st-review merge
👌 Merged as 0582836. Thanks, @AudriusButkevicius!
lib/model, lib/scanner: Efficient inserts/deletes in the middle of the file
GitHub-Pull-Request: #3527
Not sure if you were fine to merge this, but given this is "approved" I assume you were.
I was going to review it another time, but yeah, this is fine. However, I still want stats on this. I don't think anyone will mind adding a usage reporting variable on the hit rate for this block finder at the same time the feature is added. So some time during the next two weeks, before release, I would very much like something that reports the block stats from this thing. I can implement it myself as well if you don't have the bandwidth.
* v0.14.16-hotfix:
  gui, man: Update docs & translations
  lib/model: Allow empty subdirs in scan request (fixes #3829)
  lib/model: Don't send symlinks to old devices that can't handle them (fixes #3802)
  lib/model: Accept scan requests of paths ending in slash (fixes #3804)
  gui: Avoid pause between event polls (ref #3527)
Purpose
Review with &w=1
Uses a weak rolling hash to identify shifted blocks.
Potentially a DoS surface, by forcing us to hash the file size_of_file times with SHA256, since the weak hash can collide fairly easily? Yet there are other ways to do a similar thing I guess, such as providing blocks with the incorrect hash to start with, but that is somewhat limited by the network speed.
Also probably needs some tests, as I haven't checked if it actually works.
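A sketch of the verification step implied above, under the assumption (consistent with the description) that a weak-hash match is only a candidate and the block is confirmed with SHA-256 before being reused, so a colliding weak hash costs one extra SHA-256 round rather than producing wrong data. Names here are illustrative, not Syncthing's real API:

```go
package verify

import (
	"bytes"
	"crypto/sha256"
)

// FirstConfirmed returns the offset of the first candidate block whose data
// actually has the wanted strong (SHA-256) hash, or -1 if every weak-hash
// match turned out to be a collision. read is whatever fetches one block of
// local data at a given offset.
func FirstConfirmed(read func(off int64) ([]byte, error), candidates []int64, wantStrong []byte) int64 {
	for _, off := range candidates {
		block, err := read(off)
		if err != nil {
			continue // unreadable candidate, fall through to the next one
		}
		sum := sha256.Sum256(block)
		if bytes.Equal(sum[:], wantStrong) {
			return off // reuse local data instead of pulling it over the network
		}
	}
	return -1
}
```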