Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluate and merge yves' stats improvements #31

Open
rurban opened this issue Mar 25, 2017 · 6 comments
Open

Evaluate and merge yves' stats improvements #31

rurban opened this issue Mar 25, 2017 · 6 comments

Comments

@rurban
Copy link
Owner

rurban commented Mar 25, 2017

See https://github.com/demerphq/smhasher/

That version:

  • Adds BeagleHash tests, but doesn't have it included.
  • Use better stats internally (g-test)
  • Better test how the hash functions are seeded
  • Support longer than 32bit seeds
  • Separate timing of hashing from that of initialization
  • Adds graphs for comparisons
  • Adds lot of documentation per test
@paulie-g
Copy link
Contributor

paulie-g commented Jun 30, 2019

The really valuable things in Yves' fork are:

  1. Configurable seed, state and hash sizes up to 128 bits (and beyond, presumably)
  2. (Optional) state-generation stage for hash funcs with expensive seed->state
  3. Improved tests with g-test

I can't emphasize 3 enough. There are tons of hash functions which pass your fork and which you recommend, but which fail spectacularly with the stricter tests in Yves'. The avalanche test in particular is a massive improvement. This is not magic or the test being overzealous - I can usually find and fix the problem in a very predictable way (most commonly a weak final mixing stage). Going the other direction, I can easily take a hash function which just passes Yves', remove a bunch of things (again, usually in the final mixing stage where operations are by necessity linearized) and have it pass on yours for a significant speed increase.

I don't recall every single failure, but I recently added xxh3 to my private fork of Yves' and it fails avalanche on 0-3 byte keys. Other failures have been even more spectacular.

Yves hasn't been actively working on his fork in some time. You're actively maintaining yours. It would be ideal if we could have the improvements from Yves' fork in an actively maintained fork. It would also be hugely beneficial if your fork, which is viewed as the canonical one right now, applied the most rigourous standards possible to ensure people used quality hash functions, which is not currently the case.

@rurban
Copy link
Owner Author

rurban commented Jun 30, 2019

I'd love to accept a PR for this.

@paulie-g
Copy link
Contributor

Massive amount of work, that, and probably best done by someone familiar with the internals. Much as I'd like to say otherwise, my comment was a response to the 'evaluate' portion of the issue, rather than the 'merge' part ;)

@rurban
Copy link
Owner Author

rurban commented Nov 11, 2019

I started working in that now.

  • Ad state-generation stage for hash funcs with expensive seed->state:
    I don't see a real benefit in the extra function for this. It's only needed for siphash and clhash with much larger state than seed, and the initialization in SpookyTest.cpp is even much worse than in mine. This optional init can easily be checked manually, as now with clhash or siphash.
    The seed can be 32bit for all, and the ones needing a 64bit seed can stay truncated as is. (as e.g wyhash). The longer seed is only needed for enhanced security, which we don't measure here. MomentChi2 also runs only for 32bit seeds.
    We already do the (Optional) state-generation stage for hash funcs with expensive seed->state, just with if hash == ... then hash_init().

I've missed a few bits:

  • more tests:
    The fork has much more tests also: Effs (i.e. Zeroes for 0xFF), Words, and known collisions for Crc, City, Murmur2, Murmur3, but is missing the important HashMap test.

  • --confidence=float_percentage
    With sigmasToProb(confidence)

  • --sigma=float
    With sigma asis

  • --stream, --stream-key-len=, --rng-seed=

@paulie-g
Copy link
Contributor

paulie-g commented Nov 11, 2019

It makes sense conceptually to have the init state generation and the actual run be separate. When people contribute their hashes, they follow the structure that SMHasher requires of them. If SMHasher doesn't appear to require or offer a separate init stage, they'll stick it in the body and be slower. I've seen this with out-of-tree hashes. I appreciate switching to this has got large edit distance.

Not that my opinion matters, as the one not doing the work, but I do not at all agree about limiting bit lengths. It breaks at least one promising out-of-tree hash I've seen. Folding n>32 bits into 32 is also essentially a free mixing stage, which would cover up seed generation issues. I don't recall the exact examples of this, but I can almost guarantee that doing this would make things pass on your fork that would fail on Yves'. It was my first experience of his fork in fact - I ran some stuff I was working on using your fork on his, it failed, and I sat there whimpering 'b..b..but it passes on rurban..'

Fundamentally, your fork is the canonical one and people code to it. Specifically, its speed results are quoted when people come up with 'the fastest hash in the world' (tm). There are lots of issues with that that I won't go into in this ticket, but the one that specifically relates to this discussion is that any function that has large internal state and wants to initialize it with non-trivially small initial state has to move that into the measured body because of the size limitation.

Where Yves' fork makes changes, it is strictly a superset of yours. If we're going to unify them, those changes need to be back-ported whole imho. Unless and until they are, for the things I do, it'd still make way more sense for me personally to port new hashes from your fork to his and run them there.

I realize this reads as heckling from the peanut gallery, and I do most sincerely apologize for that ;)

@gzm55
Copy link

gzm55 commented Mar 20, 2021

@rurban in the issue tinypeng/pengyhash#1 the Avalanche and Seed tests from demerphq/smhasher fail for pengyhash, while the hash passed all tests here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants