Pluggable hash algos #2314
Conversation
Do we even want this configurable? If not, things become easier; otherwise we need to handle the config change, rescan, etc.
If we believe it's good enough (and we should make sure it is, as well as evaluate options), we don't need it configurable imho.
m.fmut.RLock()
for _, folder := range cm.Folders {
	var algo scanner.HashAlgorithm
	for _, opt := range folder.Options {
Should add a convenience method (* Options) Get(key string) (string, ok)
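A sketch of the suggested convenience method, assuming `folder.Options` is a slice of key/value pairs (the `Option` struct and its field names here are assumptions for illustration, not the actual protocol types):

```go
package main

import "fmt"

// Option mirrors the key/value pairs carried in folder.Options;
// the field names are assumed for this sketch.
type Option struct {
	Key   string
	Value string
}

type Options []Option

// Get returns the value for key and whether it was present,
// replacing the manual loop at every call site.
func (o Options) Get(key string) (string, bool) {
	for _, opt := range o {
		if opt.Key == key {
			return opt.Value, true
		}
	}
	return "", false
}

func main() {
	opts := Options{{Key: "hashAlgo", Value: "murmur3"}}
	if v, ok := opts.Get("hashAlgo"); ok {
		fmt.Println(v) // murmur3
	}
}
```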
Can't you just use: https://golang.org/pkg/hash/#Hash ?
Well, my worry about murmur is collisions.
Yeah, the database stuff may need some tweaking around the hash lengths. Murmur3 sounds collision resistant from what I can find on the internet, and I'm going to write some empirical tests later on (checking that we don't get collisions on a few hundred gigs of data, with very small block sizes, and so on). As for this actual implementation, I'm more and more convinced that we should not have it configurable at all and just make the right choice for everyone. :)
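The empirical test described above could be sketched roughly like this: hash many distinct small blocks and count duplicate digests. Stdlib FNV-1a stands in for Murmur3 here, since murmur3 is a third-party package; the block size and count are arbitrary for the sketch.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// countCollisions hashes nBlocks distinct small blocks with a
// 64-bit non-cryptographic hash (FNV-1a standing in for Murmur3)
// and returns how many digests collided.
func countCollisions(nBlocks int) int {
	seen := make(map[uint64]bool, nBlocks)
	collisions := 0
	block := make([]byte, 128) // deliberately tiny "block size"
	for i := 0; i < nBlocks; i++ {
		// Make each block distinct by varying its first 8 bytes.
		binary.LittleEndian.PutUint64(block, uint64(i))
		h := fnv.New64a()
		h.Write(block)
		if sum := h.Sum64(); seen[sum] {
			collisions++
		} else {
			seen[sum] = true
		}
	}
	return collisions
}

func main() {
	// With a 64-bit digest, the birthday bound makes collisions
	// astronomically unlikely at this scale.
	fmt.Println(countCollisions(1000000))
}
```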
I think having it configurable is nice, but the implementation becomes much harder.
Also: cityhash https://github.com/surge/cityhash
I suggested murmur3 based on http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html We should focus our benchmarks on the slow platforms imo. My pi is ready for some action :)
Hi, it's me again )
Please note the SMHasher test suite "which evaluates collision, dispersion and randomness qualities of hash functions." Some results from xxHash home page:
This should be pretty much mergeable, barring a few more tests maybe. The short version of what this does, in its latest iteration:
The last point means that we don't need to worry about hash collisions between random blocks in general; we only need to worry about a given block ending up with the same hash as the previous version of that exact same block, which is astronomically unlikely with at least Murmur3. At some point - but maybe not immediately? - we need to handle how to change the hash algorithm. It needs to clear out the folder and reindex from the start, possibly requiring a restart. So the existing hash algorithm per folder should be stashed in the database, and if it doesn't match we drop the folder and rescan.
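The stash-and-compare scheme described above could look roughly like this; a plain map stands in for the real database, and the function names are invented for the sketch:

```go
package main

import "fmt"

// db is a stand-in for the real database: it remembers the hash
// algorithm each folder was last scanned with.
type db struct {
	folderHashAlgo map[string]string
}

// checkHashAlgo reports whether the folder must be dropped and
// rescanned because its configured algorithm differs from the one
// recorded at the last scan, and records the configured algorithm.
func (d *db) checkHashAlgo(folder, configured string) (rescan bool) {
	if stored, ok := d.folderHashAlgo[folder]; ok && stored != configured {
		delete(d.folderHashAlgo, folder) // drop the folder's index data
		rescan = true
	}
	d.folderHashAlgo[folder] = configured
	return rescan
}

func main() {
	d := &db{folderHashAlgo: map[string]string{"default": "sha256"}}
	fmt.Println(d.checkHashAlgo("default", "murmur3")) // algo changed: rescan
	fmt.Println(d.checkHashAlgo("default", "murmur3")) // matches now: no rescan
}
```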
So two things:
For changing algorithm, I think the easiest way to handle it is the same way we handle changing the folder path. That is, we don't allow it at all after the folder has been created. To change it, remove the folder (which we should handle better, causing it to actually be removed immediately, from the config and database) and then re-add it. This is of course intrusive and annoying, but it should be - because it also requires doing the same on other devices and so on, so just updating it and keeping it shared with the devices it was shared with before doesn't make sense.
So for #2, I'd just error out the folder when my algo doesn't match some other peer's algo.
Just not allowing a change of algorithm in the GUI will lead to problems: people will change it in the config.xml and restart syncthing like they do for folder paths (at least I did that several times after moving the actual folder manually).
Yeah
This is starting to look about done. I think this is a v0.13 change as well. It's backwards compatible with v0.12 as long as SHA256 is used...
So I'd say fuck it, let's cut the mark, and start working on v0.13, which means screw the backwards compatibility, add HashAlgo as a normal field, etc. I'll go through the changes now.
	flags |= protocol.FolderHashMurmur3 << protocol.FolderHashShiftBits
default:
	panic("unknown hash algorithm")
}
Let's have it like:
flags |= hashAlgo.Flags() or Bits()
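The suggested helper would fold the switch/panic into a method on the algorithm type. A minimal sketch, with assumed constant values (the real `protocol` constants may differ):

```go
package main

import "fmt"

// Flag constants mirroring the protocol bits used in the diff above;
// the concrete values are assumptions for this sketch.
const (
	FolderHashSHA256    = 0
	FolderHashMurmur3   = 1
	FolderHashShiftBits = 6
)

type HashAlgorithm int

const (
	SHA256 HashAlgorithm = iota
	Murmur3
)

// Flags returns the folder-flag bits for the algorithm, so call sites
// can write `flags |= hashAlgo.Flags()` instead of repeating the switch.
func (h HashAlgorithm) Flags() uint32 {
	switch h {
	case SHA256:
		return FolderHashSHA256 << FolderHashShiftBits
	case Murmur3:
		return FolderHashMurmur3 << FolderHashShiftBits
	default:
		panic("unknown hash algorithm")
	}
}

func main() {
	var flags uint32
	flags |= Murmur3.Flags()
	fmt.Printf("%b\n", flags) // 1000000
}
```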
flag.Parse()

var algo scanner.HashAlgorithm
if err := algo.UnmarshalText([]byte(*hashAlgo)); err != nil {
Let's have scanner.HashAlgorithmFromString()
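The suggested constructor would replace the UnmarshalText-on-a-zero-value dance in the snippet above. A sketch (the accepted string values are assumed):

```go
package main

import "fmt"

type HashAlgorithm int

const (
	SHA256 HashAlgorithm = iota
	Murmur3
)

// HashAlgorithmFromString maps a configuration string to an algorithm
// value, returning an error for unknown names.
func HashAlgorithmFromString(s string) (HashAlgorithm, error) {
	switch s {
	case "sha256":
		return SHA256, nil
	case "murmur3":
		return Murmur3, nil
	default:
		return 0, fmt.Errorf("unknown hash algorithm %q", s)
	}
}

func main() {
	algo, err := HashAlgorithmFromString("murmur3")
	fmt.Println(algo, err)
}
```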
Agreed. v0.12 can live for a while in maintenance anyway. And I'll probably view 0.13 as 1.0beta...
Strongly disagree here though, as I think this is exactly what the flag field is for. I'm even thinking of moving the compression field into here in the same manner. Fully agree on the
So I'd say that hash algorithm is quite an integral part.
Yes on the options field, but no on the flags field. The flags field already covers the introducer, read only (master) and so on bits, which are exactly like compression algorithm. The fact that it is integral means that it must be a part of the protocol and have a certain set of defined values and so on.
Yes, makes sense, I still thought it was part of the options since the first iteration of this.
Ah 👍
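The receiving side of the flags-field scheme would presumably mirror the encode in the diff above: shift down and mask. A minimal sketch, with assumed constant values and mask width:

```go
package main

import "fmt"

// Constants assumed for this sketch; the real protocol values
// and mask width may differ.
const (
	FolderHashShiftBits = 6
	FolderHashMask      = 0x7 // room for up to 8 algorithms
)

// hashAlgoFromFlags extracts the hash-algorithm bits from the
// folder flags field, undoing the encode-side shift.
func hashAlgoFromFlags(flags uint32) uint32 {
	return (flags >> FolderHashShiftBits) & FolderHashMask
}

func main() {
	flags := uint32(1 << FolderHashShiftBits) // Murmur3 encoded as 1
	fmt.Println(hashAlgoFromFlags(flags))     // 1
}
```

Comparing the decoded value against the local folder's configured algorithm is then what lets a device detect a mismatch and stop the folder instead of producing hash errors on every block.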
Should add the hash algos in use to the usage report. My feeling here is that this is not a new usage reporting version and doesn't require revoking the usage reporting permission. My reasoning is that the user accepted to report features in use, the preview contained a number of per-folder features, and the hash algorithm is just one more feature in that export. Am I wrong here? |
Agree, and it does not reveal anything private.
Apart from the commented out benchmark block, and not being mergeable, I am happy to merge.
Mismatches are handled pseudo-gracefully. In the new code, it detects the mismatch and stops the folder: The error message is kind of ugly as it's very long and gets cut off. I'm not sure how to improve that. The old code doesn't understand that there is a mismatch, as it doesn't look at the flags. It instead gets hash mismatches for everything:
So, do we still want this at all - does it have tangible benefits somewhere, on mobile? If we do, do we want it in a patch release or in v0.13? We could make it v0.13 just to make the change visible, even if we're not "hard" incompatible with previous versions.
I think we should get more stuff into 0.13, get it merged there, and class it as incompatible. Also, we might want to add something to 0.12 to make the incompatibility more obvious (such as checking ClusterConfig, at least while we are not at 1.0) so that there would be less support. I think this is cool as it's modular, and it's a nice addition.
Detect nonstandard hash algo and stop folder (ref #2314)
Let's open for v0.13 additions, whatever they may be, yes.
Interesting! Indeed, even with 1.5.2;
Absolutely no way whatsoever to test it, but maybe someone has boxes like that out there.
Finally had the time and motivation to do a test on my "calculator" (Raspberry Pi 1) where this could be useful:
My test was adding a 200 MiB file on a fast device (so the CPU of the pi is the limitation) and sending it to the pi, measuring the time for the transfer to be completed (pi was at idle at start of transfer). The SHA256-noBlockMap is a self compiled version that has Secure=false also for SHA256, to compare how that affects the performance. Sending from pi to the fast device is always ~3min30sec (tested before thinking about it... there is no hashing in that direction.) Not a huge increase in performance, but still nice to have. I suggest adding the possibility of deactivating the blockmap for SHA256 to get some additional performance on slow devices without losing the benefit of efficient copies on faster devices, because this option does not need to be set on all devices. edit: scanning the file itself:
Here the difference is a lot more obvious, but at least for me this is the least important part, since it's really rare that I actually add a file to a synced folder on the pi itself.
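For anyone wanting to reproduce a rough version of this comparison on their own hardware, here is a small timing sketch. FNV-1a from the stdlib stands in for Murmur3 (which is a third-party package), and the block size and round count are arbitrary; absolute numbers depend entirely on the CPU.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash/fnv"
	"time"
)

// timeHashes hashes `rounds` blocks of 128 KiB each with SHA-256 and
// with FNV-1a (standing in for Murmur3) and returns the elapsed time
// for each, giving a rough feel for the relative cost per block.
func timeHashes(rounds int) (shaDur, fnvDur time.Duration) {
	block := make([]byte, 128<<10)

	start := time.Now()
	for i := 0; i < rounds; i++ {
		sha256.Sum256(block)
	}
	shaDur = time.Since(start)

	start = time.Now()
	for i := 0; i < rounds; i++ {
		h := fnv.New64a()
		h.Write(block)
		h.Sum64()
	}
	fnvDur = time.Since(start)
	return shaDur, fnvDur
}

func main() {
	sha, f := timeHashes(512) // 64 MiB hashed per algorithm
	fmt.Printf("sha256: %v, fnv64a: %v\n", sha, f)
}
```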
@alex2108 ❤️ Thanks!
I think providing configurable hash algos opens up a new complication that will require maintenance and support. I predict I'll be annoyed seeing those messages in the support forum, but I can keep that to myself. If the benefit of configuration outweighs that indefinite time sink, then let's do it! If it is merged, I recommend modifying the UI message. How does the user know what action to take when a "Hash algorithm mismatch ..." message is displayed? I'm not opposed to only changing the hashing algorithm. I haven't been following the discussion closely enough to see if there is an algo that performs fast enough, including on our "calculator" platforms, and satisfies our risk of collisions.
I think it's such an advanced feature that only people who know how to fix/use it will be the ones that fiddle with it, keeping the support overhead low.
Here's how I see it. There are advantages and disadvantages to this:

👍 A lower overhead algorithm improves performance on low powered devices. However, I'm not sure by how much. Probably quite a lot when scanning large files; but when syncing small or medium size files I suspect our database shuffling and stuff dominates. The test by @alex2108 above is probably the best case (maximum performance difference due to syncing a single large file), and there the difference is one between 4:17 and 3:31 (a ~21% performance increase).

👎 Code complexity increases. We need to handle the mismatches, do conversions or rescans, and take into account the differences between hash algos in lots of places, from here on and forever into the future. Once we add this there is probably no going back.

👎 Usage complexity increases. It's one more thing to choose from, and that can go wrong. This doesn't necessarily need to be a big deal, but handling it super smoothly in all cases (more than just displaying a cryptic error message and failing to do what the user wanted) will take some engineering effort.

All in all, this feature is currently on the minus side in my mind. It's a 20% performance improvement, but a negative in all other aspects. On faster hardware it's a net zero - we used slightly fewer CPU cycles doing the same work, but we weren't running short of CPU cycles to start with. It's an interesting experiment, but to me it feels like it costs more than it's worth, in maintenance and complexity. Taking the long view, I still think that crypto primitives like SHA256 are going to get faster, both due to hardware implementation in otherwise low power devices and potentially due to optimizations in the standard library etc. I propose we shelve this and refocus on other stuff that makes a bigger difference, like temp indexes and variable block size.
I guess I am sort of with you here.
How about just having the option to disable the block map?
Possibly yes. TBD after the new puller.
This is on ice. The code is out there, for interested parties.
Remaining issues: