Evaluate and merge yves' stats improvements #31

rurban · 2017-03-25T17:00:07Z

See https://github.com/demerphq/smhasher/

That version:

Adds BeagleHash tests, but doesn't have it included.
Use better stats internally (g-test)
Better test how the hash functions are seeded
Support longer than 32bit seeds
Separate timing of hashing from that of initialization
Adds graphs for comparisons
Adds lot of documentation per test

paulie-g · 2019-06-30T09:29:34Z

The really valuable things in Yves' fork are:

Configurable seed, state and hash sizes up to 128 bits (and beyond, presumably)
(Optional) state-generation stage for hash funcs with expensive seed->state
Improved tests with g-test

I can't emphasize 3 enough. There are tons of hash functions which pass your fork and which you recommend, but which fail spectacularly with the stricter tests in Yves'. The avalanche test in particular is a massive improvement. This is not magic or the test being overzealous - I can usually find and fix the problem in a very predictable way (most commonly a weak final mixing stage). Going the other direction, I can easily take a hash function which just passes Yves', remove a bunch of things (again, usually in the final mixing stage where operations are by necessity linearized) and have it pass on yours for a significant speed increase.

I don't recall every single failure, but I recently added xxh3 to my private fork of Yves' and it fails avalanche on 0-3 byte keys. Other failures have been even more spectacular.

Yves hasn't been actively working on his fork in some time. You're actively maintaining yours. It would be ideal if we could have the improvements from Yves' fork in an actively maintained fork. It would also be hugely beneficial if your fork, which is viewed as the canonical one right now, applied the most rigourous standards possible to ensure people used quality hash functions, which is not currently the case.

rurban · 2019-06-30T14:11:07Z

I'd love to accept a PR for this.

paulie-g · 2019-06-30T14:31:29Z

Massive amount of work, that, and probably best done by someone familiar with the internals. Much as I'd like to say otherwise, my comment was a response to the 'evaluate' portion of the issue, rather than the 'merge' part ;)

rurban · 2019-11-11T09:49:20Z

I started working in that now.

Ad state-generation stage for hash funcs with expensive seed->state:
I don't see a real benefit in the extra function for this. It's only needed for siphash and clhash with much larger state than seed, and the initialization in SpookyTest.cpp is even much worse than in mine. This optional init can easily be checked manually, as now with clhash or siphash.
The seed can be 32bit for all, and the ones needing a 64bit seed can stay truncated as is. (as e.g wyhash). The longer seed is only needed for enhanced security, which we don't measure here. MomentChi2 also runs only for 32bit seeds.
We already do the (Optional) state-generation stage for hash funcs with expensive seed->state, just with if hash == ... then hash_init().

I've missed a few bits:

more tests:
The fork has much more tests also: Effs (i.e. Zeroes for 0xFF), Words, and known collisions for Crc, City, Murmur2, Murmur3, but is missing the important HashMap test.
--confidence=float_percentage
With sigmasToProb(confidence)
--sigma=float
With sigma asis
--stream, --stream-key-len=, --rng-seed=

paulie-g · 2019-11-11T11:25:14Z

It makes sense conceptually to have the init state generation and the actual run be separate. When people contribute their hashes, they follow the structure that SMHasher requires of them. If SMHasher doesn't appear to require or offer a separate init stage, they'll stick it in the body and be slower. I've seen this with out-of-tree hashes. I appreciate switching to this has got large edit distance.

Not that my opinion matters, as the one not doing the work, but I do not at all agree about limiting bit lengths. It breaks at least one promising out-of-tree hash I've seen. Folding n>32 bits into 32 is also essentially a free mixing stage, which would cover up seed generation issues. I don't recall the exact examples of this, but I can almost guarantee that doing this would make things pass on your fork that would fail on Yves'. It was my first experience of his fork in fact - I ran some stuff I was working on using your fork on his, it failed, and I sat there whimpering 'b..b..but it passes on rurban..'

Fundamentally, your fork is the canonical one and people code to it. Specifically, its speed results are quoted when people come up with 'the fastest hash in the world' (tm). There are lots of issues with that that I won't go into in this ticket, but the one that specifically relates to this discussion is that any function that has large internal state and wants to initialize it with non-trivially small initial state has to move that into the measured body because of the size limitation.

Where Yves' fork makes changes, it is strictly a superset of yours. If we're going to unify them, those changes need to be back-ported whole imho. Unless and until they are, for the things I do, it'd still make way more sense for me personally to port new hashes from your fork to his and run them there.

I realize this reads as heckling from the peanut gallery, and I do most sincerely apologize for that ;)

gzm55 · 2021-03-20T17:04:46Z

@rurban in the issue tinypeng/pengyhash#1 the Avalanche and Seed tests from demerphq/smhasher fail for pengyhash, while the hash passed all tests here.

rurban mentioned this issue Jul 2, 2017

Are you actively accepting patches? #32

Closed

rurban added the enhancement label Jul 10, 2017

dumblob mentioned this issue Jun 30, 2019

Use Yves tests as they're stricter/better than smhasher wangyi-fudan/wyhash#24

Closed

dumblob mentioned this issue Jul 6, 2019

Add smhasher and Yves's smhasher results to asses hash quality ktprime/ktprime#3

Closed

rurban mentioned this issue May 24, 2020

Hash functions with expensive seeding are not handled optimally #128

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluate and merge yves' stats improvements #31

Evaluate and merge yves' stats improvements #31

rurban commented Mar 25, 2017 •

edited

Loading

paulie-g commented Jun 30, 2019 •

edited

Loading

rurban commented Jun 30, 2019

paulie-g commented Jun 30, 2019

rurban commented Nov 11, 2019 •

edited

Loading

paulie-g commented Nov 11, 2019 •

edited

Loading

gzm55 commented Mar 20, 2021

Evaluate and merge yves' stats improvements #31

Evaluate and merge yves' stats improvements #31

Comments

rurban commented Mar 25, 2017 • edited Loading

paulie-g commented Jun 30, 2019 • edited Loading

rurban commented Jun 30, 2019

paulie-g commented Jun 30, 2019

rurban commented Nov 11, 2019 • edited Loading

paulie-g commented Nov 11, 2019 • edited Loading

gzm55 commented Mar 20, 2021

rurban commented Mar 25, 2017 •

edited

Loading

paulie-g commented Jun 30, 2019 •

edited

Loading

rurban commented Nov 11, 2019 •

edited

Loading

paulie-g commented Nov 11, 2019 •

edited

Loading