Cryptographic hashes are often slower than non-cryptographic. And MD5 suffers from known collision problems.
I've been using MurmurHash in my hyperloglog impl with good results. Scala is not my language of choice, but I'd provide a patch if you guys wanted.
Sure. That would be great. We're mostly on hadoop where the cost is due to IO and this is not a bottleneck, but elsewhere it may be.
I'd love a patch for 128 bit murmur (hopefully without pulling in a giant set of dependencies).
Yeah, I'm not taking a Guava as dependency here. If someone wants to extract the Murmur128 implementation, that's fine. That said, I've heard that md5 is only like 2-3x slower, and given that I doubt the actual hashing is the bottleneck, I'd like to make sure we don't increase build complexity for a hypothetical gain.
You could just consume one from any of the ASF projects...
Actually Hadoop has it:
Why don't we use that instead?
It's also not self contained... e.g., extends org.apache.hadoop.util.hash.Hash
There is an old MurmurHash implementation for 32 bits in the scala library, and it looks like they're writing a newer one for 2.10 called MurmurHash3. Not sure how much hacking it would require, but at the very least you would need to change their magic numbers, which are all 4 bytes.
Done with: #69