Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark different filters and hash functions to use for previously visited nodes #2

Closed
vadixidav opened this issue Jul 28, 2019 · 2 comments

Comments

@vadixidav
Copy link
Member

Currently a HashSet<u32> is being used to check if a node is in the set of seen nodes. The u32 in the set is the index in the zero layer which is unique for all elements. The u32 is very small and we need to be able to perform a hash on it very quickly. Additionally, the data structure for this filter should likely be changed for performance and memory consumption reasons.

Since HNSW performs an ANN search, we can use things like bloom filters which produce some false positives, so long as there are no false negatives (could introduce infinite cycles in the graph).

I would like to investigate and benchmark the following filters and hash algorithms (more will be added to this list as needed):

Filters:

  • Bloom filter
  • Cuckoo filter

Hash functions:

  • SipHash
  • Anything in this crate
  • aHash or other AES-based keyed-hashes

All of these should be benchmarked in every combination with each other specifically for the purposes of the HNSW nearest-neighbor search. Whatever gives the best performance with negligable impact on recall will be chosen.

@vadixidav
Copy link
Member Author

vadixidav commented Aug 18, 2019

Quick note. I have benchmarked adding Bloom filters and it gave poor results. rustc-hash gave me the best results out of everything tried. Looking things up in the hashtable to see if they have been visited is currently the most expensive thing that HNSW does. A better idea might be to just keep a bitmask, although it isn't clear if making such a bitmask would actually speed anything up since its memory needs to be cleared, which could be expensive. It should still be attempted and benchmarked.

@vadixidav
Copy link
Member Author

Currently, rustc-hash::FxHasher is being used in conjunction with the HashSet. It was seen to have the best performance. A different issue should be opened if FxHasher has issues in certain circumstances non-malicious circumstances. It can be DoS attacked easily if that is ever an issue since it is easy to produce a collision by hand. If this is an issue, please open an issue on this repository to explain the use-case. This crate can be adapted to support other hashers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant