Improve partitioning hash function #815

StephanEwen · 2014-05-14T19:40:36Z

Right now, the partitioner (OutputEmitter) directly uses the hash code produced by the partitioning elements. Types like Integer have very weak hash functions, so hash partitioning is very susceptible to skew there.
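
To make the skew concrete, here is a minimal sketch. It assumes the channel is picked as the hash code modulo the number of channels; that rule and the names `IdentityHashSkewDemo`, `channelFor` and `countPerChannel` are illustrative assumptions, not taken from OutputEmitter. Since `Integer.hashCode()` returns the value itself, keys that share a common factor with the channel count pile up on a few channels.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the code base discussed in this issue.
class IdentityHashSkewDemo {

    // Assumed partitioning rule: hash code modulo the number of channels.
    static int channelFor(Integer key, int numChannels) {
        // Integer.hashCode() is the int value itself, so no bits get mixed.
        return Math.abs(key.hashCode() % numChannels);
    }

    // Count how many of the keys 0, stride, 2*stride, ... land on each channel.
    static int[] countPerChannel(int numKeys, int stride, int numChannels) {
        int[] counts = new int[numChannels];
        for (int i = 0; i < numKeys; i++) {
            counts[channelFor(i * stride, numChannels)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Keys are multiples of 10, partitioned over 10 channels:
        // every key ends up on channel 0.
        System.out.println(Arrays.toString(countPerChannel(1000, 10, 10)));
    }
}
```

With keys that are all multiples of 10 and 10 channels, every record goes to channel 0 and the other nine channels stay empty.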

vasia · 2014-05-16T14:22:47Z

I'll start working on this.
The idea so far is to use something similar to MurmurHash. I also found that Guava has a MurmurHash implementation.
Let me know if you have other ideas/suggestions.

StephanEwen · 2014-05-16T14:30:22Z

The hash implementations all expect a byte buffer and hash over the
bytes. Can we use an implementation tailored towards an int (4 bytes)?

Alternatives could be:

We took the HashTable hash function from the Jenkins suite. We need to make
sure that we use a different one here.

vasia · 2014-05-16T16:00:46Z

Murmur loops over 4-byte chunks of the input, so I guess we can use it by performing just one iteration on the int value. Otherwise, I see you have used the "4-byte integer hash, full avalanche" from the website you gave for the HashTable. This seems to be the best among the ones described there; then there is the one that uses 7 shifts (it doesn't really have a name).
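
As a rough sketch of the "one iteration on the int value" idea, the following applies the 32-bit MurmurHash3 steps (one block mix plus the finalizer) to a single int. The class name `IntMurmurSketch`, the method `murmurInt` and the seed parameter are hypothetical names for illustration; this is not necessarily what ends up in the partitioner.

```java
// Hypothetical sketch: 32-bit MurmurHash3 applied to exactly one 4-byte block.
class IntMurmurSketch {

    static int murmurInt(int key, int seed) {
        // Mix the single 4-byte block.
        int k = key * 0xcc9e2d51;
        k = Integer.rotateLeft(k, 15);
        k *= 0x1b873593;

        // Fold the block into the running hash state.
        int h = seed ^ k;
        h = Integer.rotateLeft(h, 13);
        h = h * 5 + 0xe6546b64;

        // Finalization: xor in the input length (4 bytes), then avalanche.
        h ^= 4;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        // Print the hashes of a few neighboring keys.
        for (int key = 0; key < 4; key++) {
            System.out.println(key + " -> " + murmurInt(key, 0));
        }
    }
}
```

Every step is invertible on 32-bit values (multiplication by an odd constant, rotation, xor with a shifted copy), so for a fixed seed distinct int keys never collide before the modulo is applied.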

I can try both, but I'm not really sure how to test which is better for what we want. Any hints on that?

vasia · 2014-05-19T12:38:28Z

Hey,
here's my first take on this.
I did some very basic tests based on the OutputEmitterTest to check the behavior of the hash function.
I'm attaching 3 diagrams: one for integer records, one for strings, and one for records with an integer, a string and a double field. Each diagram shows boxplots of the distribution of values to channels. The values are generated as in OutputEmitterTest. I'm using 100000 records and varying the number of channels from 10 to 100 in steps of 10.
Let me know what you think!
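
A measurement along these lines could be sketched as follows. The key generator (sequential integers), the modulo channel selection and the names `ChannelSpreadSketch`, `distribute` and `spread` are assumptions made for illustration; the actual tests generate values as in OutputEmitterTest, which is not reproduced here.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical harness: counts records per channel for a pluggable int hash.
class ChannelSpreadSketch {

    // Assumed setup: keys are 0..numRecords-1, channel = hash mod numChannels.
    static int[] distribute(int numRecords, int numChannels, IntUnaryOperator hash) {
        int[] counts = new int[numChannels];
        for (int key = 0; key < numRecords; key++) {
            int h = hash.applyAsInt(key);
            // floorMod keeps the channel index non-negative for negative hashes.
            counts[Math.floorMod(h, numChannels)]++;
        }
        return counts;
    }

    // Difference between the fullest and the emptiest channel.
    static int spread(int[] counts) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int c : counts) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        return max - min;
    }

    public static void main(String[] args) {
        IntUnaryOperator identity = k -> k; // stands in for Integer.hashCode()
        for (int channels = 10; channels <= 100; channels += 10) {
            int[] counts = distribute(100_000, channels, identity);
            System.out.println(channels + " channels: spread = " + spread(counts));
        }
    }
}
```

Note that sequential keys look perfectly balanced even under the identity hash, so a comparison between hash functions should also include patterned keys (for example multiples of the channel count) before drawing conclusions.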

StephanEwen added the runtime label May 14, 2014

StephanEwen modified the milestones: Release 0.5, Release 0.5.1 May 14, 2014

vasia mentioned this issue May 26, 2014

Improved partitioning hash #869

Closed

rmetzger modified the milestones: Release 0.6 (unplanned), Release 0.5.1 Jun 1, 2014
