Improve partitioning hash function #815
Comments
I'll start working on this.
The hash implementations are all expecting a byte buffer and hash over the Alternatives could be: We took the HashTable hash function from the Jenkins suite. We need to make
Murmur loops over 4-byte chunks of the input, so I guess we can use it by performing just one loop iteration on the int value. Otherwise, I see you have used the "4-byte integer hash, full avalanche" from the website you gave for the HashTable. This seems to be the best among the ones described there; there is also the one that uses 7 shifts (it doesn't really have a name). I can try both, but I'm not really sure how to test which is better for what we want. Any hints on that?
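As a rough sketch of the two candidates mentioned above, the following shows MurmurHash3 (32-bit) reduced to a single 4-byte chunk, and Bob Jenkins' "4-byte integer hash, full avalanche". The class name `IntHashSketch`, the seed parameter, and the `maxLoad` helper are hypothetical and not part of the project. One possible way to compare the functions for partitioning is to feed them structured keys (here: multiples of the channel count) and look at the size of the fullest channel; a perfectly even spread would give `numKeys / numChannels`.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch; not the project's actual code. */
final class IntHashSketch {

    /** MurmurHash3 (32-bit) on one 4-byte chunk: a single loop body plus the finalizer. */
    static int murmur3Int(int value, int seed) {
        int k = value * 0xcc9e2d51;
        k = Integer.rotateLeft(k, 15);
        k *= 0x1b873593;

        int h = seed ^ k;
        h = Integer.rotateLeft(h, 13);
        h = h * 5 + 0xe6546b64;

        h ^= 4; // input length in bytes
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Bob Jenkins' "4-byte integer hash, full avalanche". */
    static int jenkinsInt(int a) {
        a = (a + 0x7ed55d16) + (a << 12);
        a = (a ^ 0xc761c23c) ^ (a >>> 19);
        a = (a + 0x165667b1) + (a << 5);
        a = (a + 0xd3a2646c) ^ (a << 9);
        a = (a + 0xfd7046c5) + (a << 3);
        a = (a ^ 0xb55a4f09) ^ (a >>> 16);
        return a;
    }

    /** Hashes keys 0, stride, 2*stride, ... and returns the size of the fullest channel. */
    static int maxLoad(IntUnaryOperator hash, int numChannels, int numKeys, int stride) {
        int[] counts = new int[numChannels];
        int max = 0;
        for (int i = 0; i < numKeys; i++) {
            // floorMod keeps the channel index non-negative for negative hash values.
            int channel = Math.floorMod(hash.applyAsInt(i * stride), numChannels);
            max = Math.max(max, ++counts[channel]);
        }
        return max;
    }

    public static void main(String[] args) {
        int channels = 8, keys = 10_000, stride = 8;
        System.out.println("identity: " + maxLoad(x -> x, channels, keys, stride));
        System.out.println("murmur3:  " + maxLoad(x -> murmur3Int(x, 0), channels, keys, stride));
        System.out.println("jenkins:  " + maxLoad(IntHashSketch::jenkinsInt, channels, keys, stride));
    }
}
```

With the identity hash every key lands in the same channel, so its maximum load equals the number of keys. The closer a candidate gets to `keys / channels`, the less skew it produces for this key pattern. Other patterns (sequential keys, keys differing only in high bits) would be worth trying as well.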
Hey,
Right now, the partitioner (`OutputEmitter`) directly uses the hash code produced by the partitioning elements. Types like `Integer` have very weak hash functions, so the hash partitioning is very susceptible to skew there.
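To illustrate the skew, here is a minimal sketch of partitioning straight from `hashCode()`. It is not the actual `OutputEmitter` code; the class `NaivePartitionSketch` and its methods are hypothetical. Since `Integer.hashCode()` returns the int value itself, any regular pattern in the keys carries over directly to the channel assignment.

```java
/** Hypothetical sketch of naive hash partitioning; not the actual OutputEmitter code. */
final class NaivePartitionSketch {

    /** Picks a channel straight from hashCode(), with no extra bit mixing. */
    static int selectChannel(Object key, int numChannels) {
        return Math.floorMod(key.hashCode(), numChannels);
    }

    /** Counts how many of the keys 0, stride, 2*stride, ... land in each channel. */
    static int[] channelCounts(int numChannels, int numKeys, int stride) {
        int[] counts = new int[numChannels];
        for (int i = 0; i < numKeys; i++) {
            counts[selectChannel(Integer.valueOf(i * stride), numChannels)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the int value itself.
        System.out.println(Integer.valueOf(12345).hashCode()); // prints 12345
        // Keys that are multiples of the channel count all map to channel 0.
        System.out.println(java.util.Arrays.toString(channelCounts(4, 1000, 4))); // prints [1000, 0, 0, 0]
    }
}
```

Running the hash code through an additional mixing function before taking the modulus is one way to break up such patterns.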