
lib/model, lib/scanner: Efficient inserts/deletes in the middle of the file #3527

Closed
wants to merge 13 commits

Conversation

AudriusButkevicius
Member

AudriusButkevicius commented Aug 20, 2016

Purpose

Review with &w=1
Uses a weak rolling hash to identify shifted blocks.
Potentially a DoS surface, by forcing us to hash the file size_of_file times with SHA256 since the weak hash can collide fairly easily?

Yet there are other ways to do a similar thing I guess, such as providing blocks with the incorrect hash to start with, but that is somewhat limited by the network speed.

Also probably needs some tests, as I haven't checked if it actually works.

@AudriusButkevicius
Member Author

LUL:

run compiler with -v for register allocation sites
lib/weakhash/weakhash.go:77: internal compiler error: out of fixed registers

AudriusButkevicius changed the title from "lib/model, lib/scanner: More efficient appends in the middle of the file" to "lib/model, lib/scanner: Efficient inserts/deletes in the middle of the file" Aug 20, 2016
@calmh
Member

calmh commented Aug 20, 2016

So it looks sort of reasonable on the face of it. Why, though? I mean, what's the compelling use case here? And what are the performance tradeoffs? I guess it results in up to blockSize rolling hash rounds and then one or more extra SHA256 rounds?

Apart from that, what is the weakhash algo? There could be a comment about what it is and where it comes from.

@calmh
Member

calmh commented Aug 20, 2016

I've filed an extra bug on the compiler issue, in the meantime this compiles:

func (d *digest) Write(data []byte) (int, error) {
    for _, c := range data {
        // Drop the byte leaving the window and add the incoming byte.
        d.a -= uint16(d.buf[d.j])
        d.a += uint16(c)
        // Remove the departing byte's weighted contribution, then fold the
        // updated running sum into b.
        d.b -= uint16(d.size) * uint16(d.buf[d.j])
        d.b += d.a
        // Remember the incoming byte and advance the circular buffer index.
        d.buf[d.j] = c
        d.j = (d.j + 1) % d.size
    }
    return len(data), nil
}

@AudriusButkevicius
Member Author

The use case is somewhat obvious, better performance for files that get content appended in the middle.
It results in length_of_file rounds of rolling checksum (yet the round itself is 4-5 operations per byte as it's a sliding window hash), and yes, potentially a few more SHA256 rounds.

So the algorithm is http://tutorials.jenkov.com/rsync/checksums.html (or similar to this http://blog.liw.fi/posts/rsync-in-python/#comment-fee8d5e07794fdba3fe2d76aa2706a13),
which I believe is what rsync used before they switched to a bastardized version of adler32.
We could use adler32, but we'd have to make a copy of the stdlib implementation, as it does not support a rolling window, so a lot of meaningless recomputation would be involved.
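For concreteness, a minimal self-contained sketch of that rsync-style weak checksum and its constant-time rolling update (this illustrates the idea only; it is not the PR's lib/weakhash code, and the window size is made up):

package main

import "fmt"

// full computes the two 16-bit sums over a window from scratch: a is the sum
// of the bytes, b weights each byte by its distance from the end of the window.
func full(window []byte) (a, b uint16) {
    n := uint16(len(window))
    for i, c := range window {
        a += uint16(c)
        b += (n - uint16(i)) * uint16(c)
    }
    return a, b
}

// roll slides the window one byte to the right in constant time: drop out, add in.
func roll(a, b uint16, out, in byte, size int) (uint16, uint16) {
    a = a - uint16(out) + uint16(in)
    b = b - uint16(size)*uint16(out) + a
    return a, b
}

func main() {
    data := []byte("the quick brown fox jumps over the lazy dog")
    const size = 16 // illustrative; the PR would use protocol.BlockSize
    a, b := full(data[:size])
    for i := size; i < len(data); i++ {
        a, b = roll(a, b, data[i-size], data[i], size)
        fa, fb := full(data[i-size+1 : i+1])
        fmt.Println(a == fa && b == fb) // prints true at every offset
    }
}

The rolling update is the same arithmetic as the Write method quoted above, which is why a single pass over the file is enough to get a weak hash for every byte offset.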

@AudriusButkevicius
Member Author

If you are fine with the general idea/approach, I'll add tests, add comments, and potentially add a flag to disable this.

@calmh
Member

calmh commented Aug 21, 2016

The use case is somewhat obvious, better performance for files that get content appended in the middle.

Sure. But when is that required? I mean in what real world situations is this a win. I've heard of Photoshop files, which may be large-ish (I guess on the scale of ~100 MB?) and are binary and rewritten on each save so can theoretically benefit from this. But are there any other real life cases?

It results in length_of_file rounds of rolling checksum (yet the round itself is 4-5 operations per byte as it's a sliding window hash), and yes, potentially a few more SHA256 rounds.

I don't see why we'd need to do more than blockSize rounds as doing one with offset blockSize+1 is equivalent to doing one with offset 1, just the resulting list of block hashes being off by one?

But my real concern is how much does this cost in the general case where it is not needed, that is, how much slower will it be to sync a change of some piece of data in the middle of some file.

In the end I'm trying to judge if this makes sense based on the advantages for some use case I'm not sure of outweighing increasing the cost of every sync operation.

@AudriusButkevicius
Member Author

Yes, we do blockSize rounds per block, but in total we do length_of_file rounds: we produce a hash for the first block, then add the next byte (which drops the oldest byte as the window slides) and produce a hash again, and so on until we reach the end of the file.

Sadly, I don't have a use case. I mean, rsync does it, and it's more or less the de facto best tool around, hence why I am trying to make syncthing just as good. I know large Word files, large PDFs and large SVGs would benefit from this too.

In terms of cost, I feel that computing the rolling hash as we scan will have pretty much no noticeable impact, compared to SHA256.

When downloading, we'll have to compute the rollmap for the file, which again, I suspect, will have no noticeable impact compared to having to access the database, ask for stuff over the network and verify it with SHA256. Yes, there might be a few extra rounds of SHA256 involved in case the weak hash collides, but I suspect it's improbable for the hash to collide within the same file. Otherwise, if we don't collide, we are not going over the network and are already benefiting from savings.

Btw, if we get variable block sizes and micro delta indexes, syncthing will be better than rsync in terms of features and network bandwidth required by the protocol.

@AudriusButkevicius
Member Author

I mean, perhaps we should ask the community if they would find this useful.

@canton7
Member

canton7 commented Aug 21, 2016

Given that you've done the work, I feel it would be a shame to lose it, if the cost is negligible and it's nice and encapsulated...

@canton7
Member

canton7 commented Aug 21, 2016

... Maybe only enable on files over a certain size if the cost does become an issue?

@calmh
Member

calmh commented Aug 21, 2016

Right, I misunderstood how it worked and interpreted one "round" as one pass over the file, when in fact it's a total of one pass over the file and that's it?

Given that, I think this is viable.

However, I think we should only trigger this based on some halfway intelligent heuristic, for example by detecting that all or most blocks have changed after a given point in a file. If we're just going to append data, replace the last block, or replace a block in the middle, there is no reason to compute a rolling hash.

And it would be good to keep a few metrics on this, some sort of hit ratio, that we could add to the report and make future decisions based on.

(rsync was designed probably 30 years ago, for a different world of computing and other constraints and goals. It may still be correct on this point but I'm not taking it as a given.)
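One possible shape for such a heuristic, purely as an illustration (the function name and threshold below are made up, not code from this PR): compare the freshly scanned block hashes with the previous version of the file and only pay for the rolling-hash pass when a long run of trailing blocks no longer matches, which is the signature of an insert or delete in the middle.

package heuristic

import "bytes"

// shouldWeakHash reports whether a rolling-hash pass looks worthwhile: it
// counts how many blocks at the end of the overlapping region changed since
// the last scan. A pure append or a single replaced block keeps this small.
func shouldWeakHash(oldHashes, newHashes [][]byte) bool {
    n := len(oldHashes)
    if len(newHashes) < n {
        n = len(newHashes)
    }
    if n == 0 {
        return false
    }
    changed := 0
    for i := n - 1; i >= 0 && !bytes.Equal(oldHashes[i], newHashes[i]); i-- {
        changed++
    }
    // Arbitrary threshold: a shifted quarter of the file justifies a rolling pass.
    return changed > 1 && changed >= n/4
}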

@AudriusButkevicius
Member Author

Right, I misunderstood how it worked and interpreted one "round" as one pass over the file, when in fact it's a total of one pass over the file and that's it?

Yes, we compute number_of_bytes_in_file hashes, each hash covering protocol.BlockSize bytes, and each hash corresponds to a byte offset in the file.

I just realized that it won't work with big files, as it's 4 bytes per hash, and if we have an 8GB file, that's 32GB worth of hashes.

I'll change the API so that it takes a list of hashes we are interested in, and returns offsets for each hash if it finds it during the roll.

Yes, perhaps enabling this for anything that will potentially have to find xMB from somewhere else would make sense.
That somewhere else could be the blockmap though, but perhaps checking if we can get anything by rolling is better than accessing the block map for every hash?
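A rough sketch of that kind of lookup API — one pass with the rolling hash, reporting offsets only for the weak hashes the puller asks about. The package, function and parameter names here are hypothetical, not the PR's actual API, and candidate offsets would still be verified with SHA256 before being reused:

package rollscan

import (
    "bufio"
    "io"
)

// FindOffsets rolls a windowSize-byte window over r and records every offset
// at which one of the wanted weak hashes occurs. Memory use is proportional
// to the number of wanted hashes and matches, not to the file size.
func FindOffsets(r io.Reader, wanted map[uint32]struct{}, windowSize int) (map[uint32][]int64, error) {
    br := bufio.NewReader(r)
    buf := make([]byte, windowSize) // circular buffer holding the current window
    var a, b uint16                 // the two rolling sums
    j := 0                          // next slot in buf to overwrite
    var n int64                     // bytes consumed so far
    found := make(map[uint32][]int64)

    for {
        c, err := br.ReadByte()
        if err == io.EOF {
            return found, nil
        }
        if err != nil {
            return nil, err
        }
        // Same constant-time update as the Write method quoted earlier.
        a = a - uint16(buf[j]) + uint16(c)
        b = b - uint16(windowSize)*uint16(buf[j]) + a
        buf[j] = c
        j = (j + 1) % windowSize
        n++

        if n < int64(windowSize) {
            continue // window not full yet
        }
        h := uint32(a) | uint32(b)<<16
        if _, ok := wanted[h]; ok {
            found[h] = append(found[h], n-int64(windowSize))
        }
    }
}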

@AudriusButkevicius
Member Author

Anyway, let's settle whether this is something we want to go for, and if it is, I'll drive it home.

@calmh
Member

calmh commented Aug 22, 2016

I don't mind this, but I would like to know about at least one real world case where it will make a positive difference. As in an actual user saying "this will help a lot in my situation because x" where x is something more concrete than "I think the algorithm would be cool" :)

After that it's "just" technical details and I'm sure we can solve them.

@AudriusButkevicius
Member Author

Well, I can't give you one sadly. Perhaps ask on the forum, but I found #1284 and #107, as well as the "better delta sync" and "noob question - incremental sync" threads, which talk about this and the interest in having it.

@calmh
Member

calmh commented Aug 22, 2016

The gzipped-sql-dumps one sounds vaguely plausible, if that does result in the sort of data shifts we are talking about here. Could be interesting to check.

@uok
Contributor

uok commented Aug 22, 2016

Do PST-files (MS Outlook) fall into this category?

@Ferroin

Ferroin commented Aug 22, 2016

PST files are not likely in this category, they work like a database file, expanding to the desired size by allocating space at the end, with rewrites happening in the middle without shifting allocations in the middle of the file. In essence, a PST file (or a SQLite database) is like a small filesystem which grows in size to fit internal allocations, without moving anything already allocated unless it's told to do so.

The biggest likely benefactors are the few odd applications that use fallocate(2) with FALLOC_FL_INSERT_RANGE or FALLOC_FL_COLLAPSE_RANGE (I know such applications exist, but I don't know of any specific examples, and they are essentially non-existent on Windows because it doesn't have a syscall that works that way).

In theory though, anything that shifts data around inside a file could benefit. The most obvious case would be plain text files with extra data inserted in the middle, but it should also benefit some compressed formats such as PSD, XCF, DOCX, ODT, and similar, as well as some archive formats.
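For reference, the fallocate(2) insert-range mechanism mentioned above looks roughly like this from Go (a hedged illustration: Linux-only, the file name and sizes are made up, and offset/length must be multiples of the filesystem block size):

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    f, err := os.OpenFile("big.dat", os.O_RDWR, 0)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    const fsBlock = 4096
    // Shift everything from offset fsBlock onwards further into the file,
    // opening a hole to write into: an in-place insert in the middle, which
    // is exactly the data-shifting case this PR tries to detect.
    if err := unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_INSERT_RANGE, fsBlock, fsBlock); err != nil {
        log.Fatal(err)
    }
}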

@calmh
Member

calmh commented Aug 23, 2016

Indeed. But all of those file formats are usually small enough that this doesn't matter. If the compressed gigabyte-sized SQL dump does benefit from this, that would be fair.

@Ferroin

Ferroin commented Aug 23, 2016

The thing is, once you start to see files bigger than a few megabytes, this type of thing does improve network bandwidth utilization measurably. Big files are of course going to benefit the most, but even smaller ones will benefit. Using rsync as an example (because it has the ability to behave like this), take a 256MB file, copy it, and then add one byte at the beginning of the file. Re-syncing this file will require copying the whole 256MB using a traditional copy, which will mean close to 270MB of data over the network using scp, while using a method like this proposal with rsync will require sending maybe a few KB of data over the network to achieve the same thing.

@AudriusButkevicius
Member Author

The point is that we are unaware of anything that prepends or inserts to the file while also being of reasonable size.

@mdevey

mdevey commented Aug 23, 2016

Use case: pack files. Media-heavy games preprocess 1000s of assets into the in-memory format used in RAM, so instead of 1000 individual file loads and processing, one file is dumped into RAM.

Use case: mng (sequence of pngs). Artists typically add a new action (fade in, new loop) to the same file and update a text file to reference the new frames with a new action verb. A lot easier for programmers to deal with than a whole new asset.

Rsync is the king for updating iterated game assets. I can't stress that observation enough. Aspera etc. are woeful.

@Ferroin

Ferroin commented Aug 23, 2016

There are all kinds of other things too though. Zip archives for example will benefit from this if new files are added or old ones are removed, or even if the archive is rewritten. ISO images would fall into the same category.

You can't think of this like a filesystem developer would. A file that gets replaced by a new version with the same name is seen by Syncthing as the same file with different data, not as a new, separate file (and that is exactly how it should be in userspace). This means that things that rewrite files using the replace-by-rename method still get a benefit from this, because Syncthing still sees it as the same file. There are very few things that prepend to or insert data into the middle of a file in-place, but there are quite a few which rewrite the whole file but still cause existing data to shift around inside it.

Taking the example of a zip archive: Create a zip archive with a bunch of files in it. Now create a second with the same files, except remove one or add one more to it. Now do a hex-dump of the files and do a side-by-side comparison (for example with vimdiff or winmerge). Most of the data will be the same between the files, just with parts of it shifted. The same applies to ISO 9660 filesystem images, and, with slightly less frequency and in more specific conditions, to most other archive formats.

@fti7
Contributor

fti7 commented Aug 23, 2016

What about VM Disk Images (e.g. VMware .vmdk) or TrueCrypt Containers? ;-)

@AudriusButkevicius
Member Author

These I suspect change in place, or are append only, so are already handled. We are only interested in prepend or insert cases.

@AudriusButkevicius
Member Author

AudriusButkevicius commented Aug 23, 2016

[ CORRECTED ]

So the gzipped SQL dumps argument is invalid (there are some similar patterns, but then it looks like it drifts):

bash@smash:~$ diff -y <(cat x.sql | gzip -c | xxd) <(cat y.sql | gzip -c | xxd) |
...
0000410: b4f2 8ff7 d12d 23db f4d4 3585 3ee5 af6b  .....-#...5   0000410: b4f2 8ff7 d12d 23db f4d4 3585 3ee5 af6b  .....-#...5
0000420: 9d9f f4da b5de 2bbe 5b2d 1e21 f1eb a509  ......+.[-.   0000420: 9d9f f4da b5de 2bbe 5b2d 1e21 f1eb a509  ......+.[-.
0000430: 97e3 9e92 9e4c 01f3 925c f3e8 b858 b08b  .....L...\.   0000430: 97e3 9e92 9e4c 01f3 925c f3e8 b858 b08b  .....L...\.
0000440: 56e3 892e e5a2 557f b4cb a19b f88a 8efc  V.....U.... | 0000440: 56e3 892e e5a2 55a7 2e5c ba92 6ffe ec1d  V.....U..\.
0000450: 14d9 5974 fb73 b8f0 9cf2 cced bb7d f9e5  ..Yt.s..... | 0000450: 77e8 cebe a277 3fc5 8016 ddfe 1c7e 3de7  w....w?....
0000460: 2e1c 23e9 7f7a 824d 3f9e 8fe6 593a 8e12  ..#..z.M?.. | 0000460: 4173 a36f 5f7e b95f c791 3c60 824d e79e  As.o_~._..<
0000470: 1d8c c947 1723 76f4 ea09 444e ff40 62b4  ...G.#v...D | 0000470: 8fe6 593a 8e12 1da1 c971 1723 76f4 ea09  ..Y:.....q.
0000480: 3134 4bfc 03c9 d7c6 c0f6 9bcf 7c20 71a8  14K........ | 0000480: 8453 ff40 b6b4 3134 4bfc 0319 d9c6 c0f6  .S.@..14K..
0000490: 2f63 52ab 33f4 4641 fafb 6619 4461 8abf  /cR.3.FA..f | 0000490: 9bcf 7c20 11ab 2fa3 57ab 33f4 4641 fafb  ..| ../.W.3
00004a0: 10ed 5eb1 4e78 12f0 a7b9 cd52 2471 9c44  ..^.Nx..... | 00004a0: 6619 4461 8abf 10ed 5eb1 4e78 12f0 a709  f.Da....^.N
00004b0: 8abe 3d9e df59 1bc3 466a 1d6b 80fd 9414  ..=..Y..Fj. | 00004b0: cf52 2471 9c44 8ad3 3d9e f459 1bc3 466a  .R$q.D..=..
00004c0: d359 2c0e 4b23 1776 c813 f598 cf32 c34c  .Y,.K#.v... | 00004c0: 1d6b 28fe 9414 d359 2c0e cb2d 1776 c813  .k(....Y,..
00004d0: c277 3e37 08ad c790 151f da0e 230b e728  .w>7....... | 00004d0: f598 cf32 c34c c277 3e37 32ad 0796 151f  ...2.L.w>72
00004e0: 6c35 6d77 9e8b 77c5 000a dbb2 2f6e dbc9  l5mw..w.... | 00004e0: da8e 2d0b e728 6c35 6d77 9e8b 77c5 000a  ..-..(l5mw.
00004f0: f654 bb66 78a4 fb5c 7685 cd96 a7a9 a7f3  .T.fx..\v.. | 00004f0: dbb2 2f6e dbc9 f654 bb66 78a4 fb5c 7685  ../n...T.fx
0000500: cde2 da1e fab9 84b8 0323 da21 2474 d975  .........#. | 0000500: cd96 a7f9 a8f3 cd82 dd1e 4eba 84b8 03c3  ..........N
0000510: 1ecb 2850 99c5 a3ef b350 b056 3a16 0916  ..(P.....P. | 0000510: dc21 cc74 d975 1ee0 287a 99c5 a3ef b350  .!.t.u..(z.
0000520: 9b46 2122 90cb 9389 27a4 e1f2 6c34 c69e  .F!"....'.. | 0000520: b056 3a16 0916 9b46 2122 90cb 9389 27a4  .V:....F!".
0000530: cffa 0037 fc0d 459a 3354 fc58 66c3 b09c  ...7..E.3T. | 0000530: e1f2 6c34 c69e cffa 0037 fc0d 459a d356  ..l4.....7.
...

Same for zip files

bash@smash:~$ diff -y <(cat x.sql | zip -q | xxd) <(cat y.sql | zip -q | xxd) 
...
0000420: 52f1 29dd f6d3 f9fa 97ee b7ba b0c2 ead5  R.)........   0000420: 52f1 29dd f6d3 f9fa 97ee b7ba b0c2 ead5  R.)........
0000430: b2cb 7209 85e5 e727 7db4 f28f f7d1 2d23  ..r....'}..   0000430: b2cb 7209 85e5 e727 7db4 f28f f7d1 2d23  ..r....'}..
0000440: dbf4 d435 853e e5af 6b9d 9ff4 dab5 de2b  ...5.>..k..   0000440: dbf4 d435 853e e5af 6b9d 9ff4 dab5 de2b  ...5.>..k..
0000450: be5b 2d1e 21f1 eba5 0997 e39e 929e 4c01  .[-.!......   0000450: be5b 2d1e 21f1 eba5 0997 e39e 929e 4c01  .[-.!......
0000460: f392 5cf3 e8b8 58b0 8b56 e389 2ee5 a255  ..\...X..V.   0000460: f392 5cf3 e8b8 58b0 8b56 e389 2ee5 a255  ..\...X..V.
0000470: 7fb4 cba1 9bf8 8a8e fc14 d959 74fb 73b8  ........... | 0000470: a72e 5cba 926f feec 1d77 e8ce bea2 773f  ..\..o...w.
0000480: f09c f2cc edbb 7df9 e52e 1c23 e97f 7a82  ......}.... | 0000480: c580 16dd fe1c 7e3d e741 73a3 6f5f 7eb9  ......~=.As
0000490: 4d3f 9e8f e659 3a8e 121d 8cc9 4717 2376  M?...Y:.... | 0000490: 5fc7 913c 6082 4de7 9e8f e659 3a8e 121d  _..<`.M....
00004a0: f4ea 0944 4eff 4062 b431 344b fc03 c9d7  ...DN.@b.14 | 00004a0: a1c9 7117 2376 f4ea 0984 53ff 40b6 b431  ..q.#v....S
00004b0: c6c0 f69b cf7c 2071 a82f 6352 ab33 f446  .....| q./c | 00004b0: 344b fc03 19d9 c6c0 f69b cf7c 2011 ab2f  4K.........
00004c0: 41fa fb66 1944 618a bf10 ed5e b14e 7812  A..f.Da.... | 00004c0: a357 ab33 f446 41fa fb66 1944 618a bf10  .W.3.FA..f.
00004d0: f0a7 b9cd 5224 719c 448a be3d 9edf 591b  ....R$q.D.. | 00004d0: ed5e b14e 7812 f0a7 09cf 5224 719c 448a  .^.Nx.....R
00004e0: c346 6a1d 6b80 fd94 14d3 592c 0e4b 2317  .Fj.k.....Y | 00004e0: d33d 9ef4 591b c346 6a1d 6b28 fe94 14d3  .=..Y..Fj.k
...

Same for xz...

bzip is different from the very beginning it seems.

@calmh
Member

calmh commented Nov 12, 2016

Hmm OK I'll investigate

@AudriusButkevicius
Member Author

Bump. Not to be forgotten.

@calmh
Member

calmh commented Nov 19, 2016

This fails the basic sync test on my computer, which passes on master. Given that it apparently passes for Audrius I need to investigate. I haven't had time to do so.

@AudriusButkevicius
Member Author

So I'd ask if you had time for this but I guess it's:

@AudriusButkevicius
Member Author

@st-jenkins test this please

@calmh
Member

calmh commented Dec 14, 2016

Yeah so now that the other urgent things are maybe sorted for the moment, let me take another look.

@calmh
Member

calmh commented Dec 14, 2016

@st-jenkins test this please, i thought audrius fixed you

@calmh
Member

calmh commented Dec 14, 2016

@AudriusButkevicius you have a couple of conflicts to resolve at your leisure, but I'll try to test the branch in the state it was in, in the meantime.

@calmh
Member

calmh commented Dec 14, 2016

OK I've narrowed it down to us losing events somewhere, which is probably not something you've caused here but is triggered by timing or whatever, looking into that.

calmh added a commit that referenced this pull request Dec 14, 2016
@calmh
Member

calmh commented Dec 14, 2016

Now it works for me. Let's get master merged into this so it's clean and make some progress.

calmh added a commit that referenced this pull request Dec 14, 2016
calmh added a commit that referenced this pull request Dec 14, 2016
We lose important events due to the frequency of ItemStarted /
ItemFinished events.
@AudriusButkevicius
Member Author

@st-jenkins test this please

@AudriusButkevicius
Member Author

@st-jenkins test this please

@AudriusButkevicius
Member Author

@st-review merge

@st-review

👌 Merged as 0582836. Thanks, @AudriusButkevicius!

st-review closed this Dec 14, 2016
st-review pushed a commit that referenced this pull request Dec 14, 2016
@AudriusButkevicius
Member Author

Not sure if you were fine with merging this, but given this is "approved" I assume you were.

@calmh
Member

calmh commented Dec 15, 2016

I was going to review it another time, but yeah, this is fine. However, I still want stats on this. I don't think anyone will mind adding a usage reporting variable on the hit rate for this block finder, at the same time the feature is added. So some time during the next two weeks, before release, I would very much like something that reports the block stats from this thing. I can implement it myself as well if you don't have the bandwidth.

calmh added a commit that referenced this pull request Dec 17, 2016
We lose important events due to the frequency of ItemStarted /
ItemFinished events.
calmh added a commit that referenced this pull request Dec 17, 2016
We lose important events due to the frequency of ItemStarted /
ItemFinished events.
calmh added a commit that referenced this pull request Dec 24, 2016
* v0.14.16-hotfix:
  gui, man: Update docs & translations
  lib/model: Allow empty subdirs in scan request (fixes #3829)
  lib/model: Don't send symlinks to old devices that can't handle them (fixes #3802)
  lib/model: Accept scan requests of paths ending in slash (fixes #3804)
  gui: Avoid pause between event polls (ref #3527)
st-review added the frozen-due-to-age (Issues closed and untouched for a long time, together with being locked for discussion) label Dec 15, 2017
syncthing locked and limited conversation to collaborators Dec 15, 2017