
Use better char data for benchmarking #8

Open
jmaessen opened this Issue Mar 1, 2011 · 2 comments


@jmaessen
jmaessen commented Mar 1, 2011

Lowercase alphabetic strings happen not to induce any hash collisions with the current choice of hash function. Sprinkling in digits, so that characters are drawn from ['0'..'9']++['a'..'z'], sends the collision rate through the roof, to something like 80%. Really, any alphabet of more than 33 characters seems to work reasonably well; you could, for example, try mixed upper and lower case. Using more interesting string data would permit realistic benchmarking (and would encourage the use of a less problematic hash function, I bet).
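A minimal sketch of how this collision rate could be measured, using a toy Haskell reimplementation of a DJB-style multiply-by-33 hash rather than the library's actual C code; `djb`, `keysOver`, and `collisionRate` are made-up names for illustration:

```haskell
import Control.Monad (replicateM)
import Data.Char (ord)
import Data.Word (Word32)
import qualified Data.Set as Set

-- Toy DJB-style hash (h = h * 33 + c), standing in for the hash
-- under discussion; the library actually calls a C implementation.
djb :: String -> Word32
djb = foldl (\h c -> h * 33 + fromIntegral (ord c)) 5381

-- All keys of length n over the given alphabet.
keysOver :: String -> Int -> [String]
keysOver alpha n = replicateM n alpha

-- Fraction of keys whose hash duplicates some earlier key's hash.
collisionRate :: [String] -> Double
collisionRate ks =
  let n = length ks
      d = Set.size (Set.fromList (map djb ks))
  in  fromIntegral (n - d) / fromIntegral n

main :: IO ()
main = do
  -- With only 26 distinct characters (fewer than the multiplier 33),
  -- fixed-length keys act like base-33 numerals and cannot collide;
  -- widen the alphabet past 33 characters and collisions appear.
  print (collisionRate (keysOver ['a'..'z'] 3))
  print (collisionRate (keysOver (['0'..'9'] ++ ['a'..'z']) 3))
```

The first rate comes out as exactly zero, which is the "small enough range of characters" effect described above.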

@tibbe
Owner
tibbe commented Mar 1, 2011

We should switch to MurmurHash sooner rather than later. The reason I haven't done so yet is that

  • I was waiting for MurmurHash3 to mature, and
  • the benchmark I ran showed that using MurmurHash2 was slower than DJB.

Most likely the benchmarks were flawed. I ran two kinds of benchmarks: one measured hashing throughput in MB/s and the other was the unordered-containers benchmark suite.

Perhaps an alternate strategy is in order:

  1. Alter the unordered-containers benchmarks so that they show the true cost of having a poor hash function. Perhaps we should run one benchmark with ASCII strings and one with UTF-8 encoded Unicode strings containing e.g. Japanese.
  2. Implement MurmurHash2. We can directly call the C++ implementation (just like we do now for DJB), so this shouldn't be too much work. We'll need some CPP to pick the right version depending on platform.
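For reference, MurmurHash2 is small enough that a pure-Haskell transcription fits in a couple dozen lines. This is only a sketch of the algorithm (assuming ASCII input and reading 4-byte chunks little-endian, as the reference C++ code does on x86), not a substitute for binding the C++ implementation as step 2 proposes:

```haskell
import Data.Bits (xor, shiftL, shiftR, (.|.))
import Data.Char (ord)
import Data.Word (Word32)

-- Pure-Haskell transcription of 32-bit MurmurHash2, for reference only.
murmur2 :: Word32 -> String -> Word32
murmur2 seed s =
    finish (go (seed `xor` fromIntegral (length s))
               (map (fromIntegral . ord) s))
  where
    m = 0x5bd1e995 :: Word32

    -- Mix one little-endian 4-byte chunk at a time.
    go h (a:b:c:d:rest) =
      let k0 = a .|. (b `shiftL` 8) .|. (c `shiftL` 16) .|. (d `shiftL` 24)
          k1 = k0 * m
          k2 = k1 `xor` (k1 `shiftR` 24)
          k3 = k2 * m
      in  go ((h * m) `xor` k3) rest
    go h tl = mixTail h tl

    -- Fold in the remaining 0-3 bytes.
    mixTail h []      = h
    mixTail h [a]     = (h `xor` a) * m
    mixTail h [a,b]   = (h `xor` (b `shiftL` 8) `xor` a) * m
    mixTail h [a,b,c] = (h `xor` (c `shiftL` 16) `xor` (b `shiftL` 8) `xor` a) * m
    mixTail h _       = h  -- unreachable

    -- Final mix: spread entropy across all 32 bits.
    finish h0 =
      let h1 = h0 `xor` (h0 `shiftR` 13)
          h2 = h1 * m
      in  h2 `xor` (h2 `shiftR` 15)

main :: IO ()
main = mapM_ (print . murmur2 0) ["", "foo", "foobar"]
```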
@jmaessen
jmaessen commented Mar 2, 2011

It might be worth changing the test data as described and then repeating the benchmark. As I noted, you happened to choose a small enough range of characters that collisions don't really happen. Crank the range up a tiny bit, and I expect MurmurHash will suddenly look much better. :-)

Note that MurmurHash has a final mix step that can be skipped in the particular case of HashMap. This step just moves data around within the hash so that using (e.g.) only the low-order or high-order bits will yield low collision rates. The final mix can only increase collisions if you're using all the bits of the hash. HashMap is weird in this respect.
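For concreteness, the final mix in MurmurHash2 is two xor-shifts around an odd multiply. Each step is invertible on a 32-bit word, so it never merges distinct full hashes; it only improves how individual bit slices behave. A sketch of the step in isolation:

```haskell
import Data.Bits (xor, shiftR)
import Data.Word (Word32)

-- The MurmurHash2 "final mix": xor-shift, odd multiply, xor-shift.
-- Every step is a bijection on Word32, so a consumer that uses all
-- the bits of the hash (as HashMap does) gains nothing from it and
-- could skip it.
finalMix :: Word32 -> Word32
finalMix h0 =
  let m  = 0x5bd1e995 :: Word32
      h1 = h0 `xor` (h0 `shiftR` 13)
      h2 = h1 * m
  in  h2 `xor` (h2 `shiftR` 15)

main :: IO ()
main = mapM_ (print . finalMix) [0, 1, 2]
```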
