Benchmark different filters and hash functions to use for previously visited nodes #2

vadixidav · 2019-07-28T00:25:17Z

Currently a HashSet<u32> is being used to check if a node is in the set of seen nodes. The u32 in the set is the index in the zero layer which is unique for all elements. The u32 is very small and we need to be able to perform a hash on it very quickly. Additionally, the data structure for this filter should likely be changed for performance and memory consumption reasons.

Since HNSW performs an ANN search, we can use things like bloom filters which produce some false positives, so long as there are no false negatives (could introduce infinite cycles in the graph).

I would like to investigate and benchmark the following filters and hash algorithms (more will be added to this list as needed):

Filters:

Bloom filter
Cuckoo filter

Hash functions:

SipHash
Anything in this crate
aHash or other AES-based keyed-hashes

All of these should be benchmarked in every combination with each other specifically for the purposes of the HNSW nearest-neighbor search. Whatever gives the best performance with negligable impact on recall will be chosen.

The text was updated successfully, but these errors were encountered:

vadixidav · 2019-08-18T02:46:48Z

Quick note. I have benchmarked adding Bloom filters and it gave poor results. rustc-hash gave me the best results out of everything tried. Looking things up in the hashtable to see if they have been visited is currently the most expensive thing that HNSW does. A better idea might be to just keep a bitmask, although it isn't clear if making such a bitmask would actually speed anything up since its memory needs to be cleared, which could be expensive. It should still be attempted and benchmarked.

vadixidav · 2020-01-08T02:40:19Z

Currently, rustc-hash::FxHasher is being used in conjunction with the HashSet. It was seen to have the best performance. A different issue should be opened if FxHasher has issues in certain circumstances non-malicious circumstances. It can be DoS attacked easily if that is ever an issue since it is easy to produce a collision by hand. If this is an issue, please open an issue on this repository to explain the use-case. This crate can be adapted to support other hashers.

vadixidav closed this as completed Jan 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark different filters and hash functions to use for previously visited nodes #2

Benchmark different filters and hash functions to use for previously visited nodes #2

vadixidav commented Jul 28, 2019

vadixidav commented Aug 18, 2019 •

edited

vadixidav commented Jan 8, 2020

Benchmark different filters and hash functions to use for previously visited nodes #2

Benchmark different filters and hash functions to use for previously visited nodes #2

Comments

vadixidav commented Jul 28, 2019

vadixidav commented Aug 18, 2019 • edited

vadixidav commented Jan 8, 2020

vadixidav commented Aug 18, 2019 •

edited