Support different hash functions in HashingVectorizer #10

Closed
rth opened this issue Dec 14, 2018 · 2 comments

rth commented Dec 14, 2018

Currently, we use the MurmurHash3 hash function from the rust-fasthash crate (to stay close to the scikit-learn implementation). That crate also supports a number of other hash functions:

City Hash
Farm Hash
Metro Hash
Mum Hash
Sea Hash
Spooky Hash
T1 Hash
xx Hash

I'm not convinced hashing is currently the performance bottleneck, but in any case using a faster hash function such as xxhash would not hurt.

This would involve updating the text-vectorize crate and adding a hasher parameter to the HashingVectorizer Python estimator.
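
To make the idea of a hasher parameter concrete, here is a minimal pure-Python sketch. It is not the text-vectorize or HashingVectorizer implementation: the function name `hashing_vectorize`, the `HASHERS` registry, and the use of `zlib.crc32` / `zlib.adler32` as stand-ins for MurmurHash3, xxHash and the others are all assumptions made for illustration.

```python
import zlib
from collections import Counter

# Hypothetical registry: stdlib checksums stand in for MurmurHash3, xxHash, etc.
HASHERS = {
    "crc32": zlib.crc32,
    "adler32": zlib.adler32,
}


def hashing_vectorize(tokens, n_features=1024, hasher="crc32"):
    """Map tokens to a sparse {column index: count} dict using the named hash."""
    try:
        hash_fn = HASHERS[hasher]
    except KeyError:
        raise ValueError(f"unknown hasher: {hasher!r}") from None
    counts = Counter()
    for token in tokens:
        # Hash the UTF-8 bytes and fold the result into the feature space.
        counts[hash_fn(token.encode("utf-8")) % n_features] += 1
    return dict(counts)
```

Calling `hashing_vectorize(["a", "b", "a"], hasher="adler32")` would select the second hash; the only change visible to the caller is the extra keyword argument.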

Another use case could be to use different hash functions to reduce the effect of collisions (Svenstrup et al. 2017), as discussed e.g. in https://stackoverflow.com/q/53767469/1791279
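
As a rough illustration of the collision argument (not the method from the paper), the sketch below assigns each token one column index per hash function, so two tokens that collide under one hash are unlikely to collide under all of them. The helper name `multi_hash_indices` and the choice of salted `hashlib.blake2b` as the family of hash functions are assumptions for this example.

```python
import hashlib


def multi_hash_indices(token, n_features=1024, n_hashes=2):
    """Return one column index per salted hash function for a token."""
    data = token.encode("utf-8")
    indices = []
    for seed in range(n_hashes):
        # A distinct salt gives an effectively independent hash function.
        salt = seed.to_bytes(2, "little")
        digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        indices.append(int.from_bytes(digest, "little") % n_features)
    return tuple(indices)
```

A downstream model could then combine the values found at those indices, which is where the actual collision mitigation would happen.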


rth commented Dec 18, 2018

Just to confirm: the choice of hash function has almost no impact on performance, as hashing is not the bottleneck.


rth commented Mar 2, 2019

Just to confirm: the choice of hash function has almost no impact on performance, as hashing is not the bottleneck.

Closing.

@rth rth closed this as completed Mar 2, 2019