Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve partitioning hash function #815

Open
StephanEwen opened this issue May 14, 2014 · 4 comments
Open

Improve partitioning hash function #815

StephanEwen opened this issue May 14, 2014 · 4 comments

Comments

@StephanEwen
Copy link
Contributor

Right now, the partitioner (OutputEmitter) used directly the hash code produced by the partitioning elements. Types like Integer have very weak hash functions, so the hash partitioning is very susceptible to skew there.

@vasia
Copy link
Contributor

vasia commented May 16, 2014

I'll start working on this.
The idea so far is to use something similar to MurmurHash. I also found that there exists a Guava implementation for the MurmurHash.
Let me know if you have other ideas/suggestions.

@StephanEwen
Copy link
Contributor Author

The hash implementations are all expecting a byte buffer and hash over the
bytes. Can we use an implementation tailored towards an int (4 bytes) ?

Alternatives could be:

We took the HashTable hash function from the Jenkins suite. We need to make
sure that we use a different here.

@vasia
Copy link
Contributor

vasia commented May 16, 2014

Murmur loops over 4-byte chunks of the input, so I guess we can use it performing just one loop on the int value. Otherwise, I see you have used the "4-byte integer hash, full avalanche" from the website you gave, for the HashTable. This seems to be the best among the ones described there and then there is the one that uses 7 shifts (doesn't really have a name).

I can try both, but I'm not really sure how to test which is better for what we want. Any hints on that?

@vasia
Copy link
Contributor

vasia commented May 19, 2014

Hey,
here's my first take on this.
I did some very basic tests based on the OutputEmitterTest to check the behavior of the hash function.
I'm attaching 3 diagrams, one for integer records, one for strings and one for records with an integer, a string and a double field. Each diagram shows boxplots of the distribution of values to channels. The values are generated as in OutputEmitterTest. I'm using 100000 records and varying the number of channels from 10-100, with a step of 10.
Let me know what you think!
mixed
strings
integers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants