Types should not use zero or other small integers as hash results #30

Open
bos opened this Issue Sep 22, 2012 · 9 comments

Projects

None yet

3 participants

@bos
Collaborator
bos commented Sep 22, 2012

There are some bad instances of Hashable that use small integers as hash results.

This is a bad idea, as it makes it trivially easy to exploit arithmetic properties to cause collisions.

Witness the definition of combine:

combine h1 h2 = (h1 * 16777619) `xor` h2

It's common for people combining hashes to write definitions like this. In this case, it is obvious that if you can fix the first hash to be zero, the result will be exactly the second hash.

Worse, if someone chains together a series of multiplications in an instance of their own, they can accidentally end up with hashes always being zero.

I propose two ways to address this problem:

  1. Document it. In the Hashable class, tell people to use large random numbers, and why,
  2. Fix the current vulnerable instances by replacing small random numbers with large.
@tibbe
Owner
tibbe commented Sep 24, 2012

We used to have a combine which didn't have this property:

combine h1 h2 = (h1 + h1 `shiftL` 5) `xor` h2

It got changed when I merged the FNV-1 hash patches to match the combine function used in string hashing. I probably should have kept the old one.

I'm going to be on vacation for a few days. If you feel like creating a pull request with the changes you think ought to be made I can try to review it in the next few days.

There's another related item on my TODO list; integers hash to themselves. This behavior exists in Python, Java, and STL, but I'm not sure if it's a good one always. We might instead want to multiply integers by a large prime.

@tibbe
Owner
tibbe commented Dec 12, 2012

Should be fixed with the switch to SipHash etc.

@tibbe tibbe closed this Dec 12, 2012
@jberryman

Can this be re-opened and discussed since the behavior was reverted?

@tibbe tibbe reopened this Mar 3, 2015
@jberryman

Thanks, Johan. In my data structure I need b hash bits (depending on some parameters), to be divided and used in a few different phases of the algorithm. With a "good" hashing algorithm I should be able to take a few bits here and there from anywhere in the hash value to use for these various phases without bias. The numeric instances are clearly problematic for that kind of usage.

More broadly it makes me nervous that hashable makes no guarantees about "goodness" of these hashes, and that quality of hashes seems to have fluctuated wildly in the past. Either consistently using a documented algorithm (which you've said you don't want to do), or even just having a test suite that runs some avalanche/distribution tests (see below) that we promise to pass would be acceptable for me.

https://code.google.com/p/smhasher/
http://floodyberry.com/noncryptohashzoo/
http://research.neustar.biz/2012/02/02/choosing-a-good-hash-function-part-3/

I can open a new ticket for the latter idea, if you want.

Thanks

@tibbe
Owner
tibbe commented Mar 3, 2015

hashable is designed for a particular use case, supporting hashing-based data structures, rather than for generic hashing (e.g. don't use it to compute checksums for your files). As such it makes some engineering trade-offs, the same as are done in many other languages, that work well in practice for this use case. It turns out in practice that having ints hash to themselves provides good performance (due to simplicity and locality of hash functions).

If your data structures have different needs you can employ the common technique of post-mixing, here you mix the hash a bit more before using it. This is commonly used in e.g. base-2 hash tables for pointer values, as pointer values would have the low bits be always zero.

@jberryman

My application is a bloom filter variant. There are probably many other hash-based data structures for which the current hashable makes too many assumptions to be usable (or is already silently introducing bias, or breaking the time complexity of peoples' data structures without them realizing). As you say I guess I can implement a better, more mixy hash function based on hashable and make sure it passes my own tests.

FWIW this combined with #73 (users want to serialize their bloomfilter) and #74 (instances for algebraic prelude types seem unprincipled, and make me nervous), makes it unlikely I'll be able to use this for anything more than to bootstrap a v0.0 for this particular use case.

@jberryman

(on re-reading, I think my last comment might across as disparaging or argumentative which isn't my intention! I understand your trade-offs argument, I just wanted to voice that they don't seem to fit my needs very well even though I think my use-case is one hashable claims to be targeting. Thanks for your work on the library!)

@tibbe
Owner
tibbe commented Mar 5, 2015

@jberryman we're basically at an uneasy stand still. We don't have an alternative that works better in practice. This seems to be the same conclusion that other communities have come to, so here we are.

@jberryman

If anyone's interested I've released a rewrite of hashable called hashabler which concerns itself with this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment