New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Types should not use zero or other small integers as hash results #30

Open
bos opened this Issue Sep 22, 2012 · 9 comments

Comments

Projects
None yet
3 participants
@bos
Collaborator

bos commented Sep 22, 2012

There are some bad instances of Hashable that use small integers as hash results.

This is a bad idea, as it makes it trivially easy to exploit arithmetic properties to cause collisions.

Witness the definition of combine:

combine h1 h2 = (h1 * 16777619) `xor` h2

It's common for people combining hashes to write definitions like this. In this case, it is obvious that if you can fix the first hash to be zero, the result will be exactly the second hash.

Worse, if someone chains together a series of multiplications in an instance of their own, they can accidentally end up with hashes always being zero.

I propose two ways to address this problem:

  1. Document it. In the Hashable class, tell people to use large random numbers, and why,
  2. Fix the current vulnerable instances by replacing small random numbers with large.
@tibbe

This comment has been minimized.

Show comment
Hide comment
@tibbe

tibbe Sep 24, 2012

Owner

We used to have a combine which didn't have this property:

combine h1 h2 = (h1 + h1 `shiftL` 5) `xor` h2

It got changed when I merged the FNV-1 hash patches to match the combine function used in string hashing. I probably should have kept the old one.

I'm going to be on vacation for a few days. If you feel like creating a pull request with the changes you think ought to be made I can try to review it in the next few days.

There's another related item on my TODO list; integers hash to themselves. This behavior exists in Python, Java, and STL, but I'm not sure if it's a good one always. We might instead want to multiply integers by a large prime.

Owner

tibbe commented Sep 24, 2012

We used to have a combine which didn't have this property:

combine h1 h2 = (h1 + h1 `shiftL` 5) `xor` h2

It got changed when I merged the FNV-1 hash patches to match the combine function used in string hashing. I probably should have kept the old one.

I'm going to be on vacation for a few days. If you feel like creating a pull request with the changes you think ought to be made I can try to review it in the next few days.

There's another related item on my TODO list; integers hash to themselves. This behavior exists in Python, Java, and STL, but I'm not sure if it's a good one always. We might instead want to multiply integers by a large prime.

@tibbe

This comment has been minimized.

Show comment
Hide comment
@tibbe

tibbe Dec 12, 2012

Owner

Should be fixed with the switch to SipHash etc.

Owner

tibbe commented Dec 12, 2012

Should be fixed with the switch to SipHash etc.

@jberryman

This comment has been minimized.

Show comment
Hide comment
@jberryman

jberryman Mar 3, 2015

Can this be re-opened and discussed since the behavior was reverted?

jberryman commented Mar 3, 2015

Can this be re-opened and discussed since the behavior was reverted?

@tibbe tibbe reopened this Mar 3, 2015

@jberryman

This comment has been minimized.

Show comment
Hide comment
@jberryman

jberryman Mar 3, 2015

Thanks, Johan. In my data structure I need b hash bits (depending on some parameters), to be divided and used in a few different phases of the algorithm. With a "good" hashing algorithm I should be able to take a few bits here and there from anywhere in the hash value to use for these various phases without bias. The numeric instances are clearly problematic for that kind of usage.

More broadly it makes me nervous that hashable makes no guarantees about "goodness" of these hashes, and that quality of hashes seems to have fluctuated wildly in the past. Either consistently using a documented algorithm (which you've said you don't want to do), or even just having a test suite that runs some avalanche/distribution tests (see below) that we promise to pass would be acceptable for me.

https://code.google.com/p/smhasher/
http://floodyberry.com/noncryptohashzoo/
http://research.neustar.biz/2012/02/02/choosing-a-good-hash-function-part-3/

I can open a new ticket for the latter idea, if you want.

Thanks

jberryman commented Mar 3, 2015

Thanks, Johan. In my data structure I need b hash bits (depending on some parameters), to be divided and used in a few different phases of the algorithm. With a "good" hashing algorithm I should be able to take a few bits here and there from anywhere in the hash value to use for these various phases without bias. The numeric instances are clearly problematic for that kind of usage.

More broadly it makes me nervous that hashable makes no guarantees about "goodness" of these hashes, and that quality of hashes seems to have fluctuated wildly in the past. Either consistently using a documented algorithm (which you've said you don't want to do), or even just having a test suite that runs some avalanche/distribution tests (see below) that we promise to pass would be acceptable for me.

https://code.google.com/p/smhasher/
http://floodyberry.com/noncryptohashzoo/
http://research.neustar.biz/2012/02/02/choosing-a-good-hash-function-part-3/

I can open a new ticket for the latter idea, if you want.

Thanks

@tibbe

This comment has been minimized.

Show comment
Hide comment
@tibbe

tibbe Mar 3, 2015

Owner

hashable is designed for a particular use case, supporting hashing-based data structures, rather than for generic hashing (e.g. don't use it to compute checksums for your files). As such it makes some engineering trade-offs, the same as are done in many other languages, that work well in practice for this use case. It turns out in practice that having ints hash to themselves provides good performance (due to simplicity and locality of hash functions).

If your data structures have different needs you can employ the common technique of post-mixing, here you mix the hash a bit more before using it. This is commonly used in e.g. base-2 hash tables for pointer values, as pointer values would have the low bits be always zero.

Owner

tibbe commented Mar 3, 2015

hashable is designed for a particular use case, supporting hashing-based data structures, rather than for generic hashing (e.g. don't use it to compute checksums for your files). As such it makes some engineering trade-offs, the same as are done in many other languages, that work well in practice for this use case. It turns out in practice that having ints hash to themselves provides good performance (due to simplicity and locality of hash functions).

If your data structures have different needs you can employ the common technique of post-mixing, here you mix the hash a bit more before using it. This is commonly used in e.g. base-2 hash tables for pointer values, as pointer values would have the low bits be always zero.

@jberryman

This comment has been minimized.

Show comment
Hide comment
@jberryman

jberryman Mar 3, 2015

My application is a bloom filter variant. There are probably many other hash-based data structures for which the current hashable makes too many assumptions to be usable (or is already silently introducing bias, or breaking the time complexity of peoples' data structures without them realizing). As you say I guess I can implement a better, more mixy hash function based on hashable and make sure it passes my own tests.

FWIW this combined with #73 (users want to serialize their bloomfilter) and #74 (instances for algebraic prelude types seem unprincipled, and make me nervous), makes it unlikely I'll be able to use this for anything more than to bootstrap a v0.0 for this particular use case.

jberryman commented Mar 3, 2015

My application is a bloom filter variant. There are probably many other hash-based data structures for which the current hashable makes too many assumptions to be usable (or is already silently introducing bias, or breaking the time complexity of peoples' data structures without them realizing). As you say I guess I can implement a better, more mixy hash function based on hashable and make sure it passes my own tests.

FWIW this combined with #73 (users want to serialize their bloomfilter) and #74 (instances for algebraic prelude types seem unprincipled, and make me nervous), makes it unlikely I'll be able to use this for anything more than to bootstrap a v0.0 for this particular use case.

@jberryman

This comment has been minimized.

Show comment
Hide comment
@jberryman

jberryman Mar 3, 2015

(on re-reading, I think my last comment might across as disparaging or argumentative which isn't my intention! I understand your trade-offs argument, I just wanted to voice that they don't seem to fit my needs very well even though I think my use-case is one hashable claims to be targeting. Thanks for your work on the library!)

jberryman commented Mar 3, 2015

(on re-reading, I think my last comment might across as disparaging or argumentative which isn't my intention! I understand your trade-offs argument, I just wanted to voice that they don't seem to fit my needs very well even though I think my use-case is one hashable claims to be targeting. Thanks for your work on the library!)

@tibbe

This comment has been minimized.

Show comment
Hide comment
@tibbe

tibbe Mar 5, 2015

Owner

@jberryman we're basically at an uneasy stand still. We don't have an alternative that works better in practice. This seems to be the same conclusion that other communities have come to, so here we are.

Owner

tibbe commented Mar 5, 2015

@jberryman we're basically at an uneasy stand still. We don't have an alternative that works better in practice. This seems to be the same conclusion that other communities have come to, so here we are.

@jberryman

This comment has been minimized.

Show comment
Hide comment
@jberryman

jberryman Aug 17, 2015

If anyone's interested I've released a rewrite of hashable called hashabler which concerns itself with this issue.

jberryman commented Aug 17, 2015

If anyone's interested I've released a rewrite of hashable called hashabler which concerns itself with this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment