Use a MurmurHash instead of MD5 for HyperLogLog #8

juanplopes opened this Issue Aug 9, 2012 · 9 comments


None yet

7 participants


Cryptographic hashes are often slower than non-cryptographic. And MD5 suffers from known collision problems.

I've been using MurmurHash in my hyperloglog impl with good results. Scala is not my language of choice, but I'd provide a patch if you guys wanted.


Sure. That would be great. We're mostly on hadoop where the cost is due to IO and this is not a bottleneck, but elsewhere it may be.

I'd love a patch for 128 bit murmur (hopefully without pulling in a giant set of dependencies).


Yeah, I'm not taking a Guava as dependency here. If someone wants to extract the Murmur128 implementation, that's fine. That said, I've heard that md5 is only like 2-3x slower, and given that I doubt the actual hashing is the bottleneck, I'd like to make sure we don't increase build complexity for a hypothetical gain.

Twitter, Inc. member
Twitter, Inc. member

It's also not self contained... e.g., extends org.apache.hadoop.util.hash.Hash


There is an old MurmurHash implementation for 32 bits in the scala library, and it looks like they're writing a newer one for 2.10 called MurmurHash3. Not sure how much hacking it would require, but at the very least you would need to change their magic numbers, which are all 4 bytes.


Done with: #69

@johnynek johnynek closed this Dec 13, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment