Currently, a HashSet<u32> is used to check whether a node is in the set of seen nodes. Each u32 in the set is the node's index in the zero layer, which is unique across all elements. Because the u32 is so small, we need to be able to hash it very quickly. Additionally, the data structure for this filter should likely be changed for performance and memory consumption reasons.
Since HNSW performs an approximate nearest-neighbor (ANN) search, we can use structures such as Bloom filters, which produce some false positives, so long as there are no false negatives. A false positive only causes an unvisited node to be skipped, whereas a false negative would let a node be visited again and could introduce infinite cycles in the graph traversal.
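To make the false-positive versus false-negative distinction concrete, here is a minimal sketch of a Bloom-style visited filter over u32 node indices. The name `BloomVisited`, the choice of two hash positions, and the mixing constants are all assumptions made for illustration; none of them come from this crate.

```rust
/// Illustrative Bloom filter over `u32` node indices (hypothetical type, not part of the crate).
pub struct BloomVisited {
    bits: Vec<u64>,
    shift: u32, // right shift that maps a 32-bit mix onto a bit position
}

impl BloomVisited {
    /// `log2_bits` is the base-2 logarithm of the filter size in bits (6..=31).
    pub fn new(log2_bits: u32) -> Self {
        assert!((6..=31).contains(&log2_bits));
        let num_bits = 1usize << log2_bits;
        Self {
            bits: vec![0; num_bits / 64],
            shift: 32 - log2_bits,
        }
    }

    /// Two cheap multiplicative mixes; the odd constants are arbitrary choices.
    fn positions(&self, node: u32) -> [usize; 2] {
        let a = node.wrapping_mul(0x9E37_79B1) >> self.shift;
        let b = (node ^ 0xDEAD_BEEF).wrapping_mul(0x85EB_CA6B) >> self.shift;
        [a as usize, b as usize]
    }

    /// Marks `node` as seen. Returns `true` if at least one bit was newly set,
    /// which means the node was definitely not seen before.
    pub fn insert(&mut self, node: u32) -> bool {
        let mut newly_set = false;
        for pos in self.positions(node) {
            let bit = 1u64 << (pos % 64);
            newly_set |= (self.bits[pos / 64] & bit) == 0;
            self.bits[pos / 64] |= bit;
        }
        newly_set
    }

    /// May return `true` for a node that was never inserted (false positive),
    /// but never returns `false` for a node that was inserted.
    pub fn contains(&self, node: u32) -> bool {
        self.positions(node)
            .iter()
            .all(|&pos| (self.bits[pos / 64] & (1u64 << (pos % 64))) != 0)
    }
}
```

Because bits are only ever set and never cleared during a search, an inserted node always tests as present. The cost of a false positive is a skipped neighbor, which is where the impact on recall would come from.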
I would like to investigate and benchmark the following filters and hash algorithms (more will be added to this list as needed):
All of these should be benchmarked in every combination with each other, specifically for the purposes of the HNSW nearest-neighbor search. Whatever gives the best performance with negligible impact on recall will be chosen.
Quick note: I benchmarked Bloom filters and they gave poor results. rustc-hash gave me the best results out of everything tried. Looking nodes up in the hash table to see whether they have been visited is currently the most expensive thing that HNSW does. A better idea might be to just keep a bitmask, although it isn't clear that a bitmask would actually speed anything up, since its memory needs to be cleared between searches, which could be expensive. It should still be attempted and benchmarked.
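For reference, the bitmask idea could look roughly like the sketch below: one bit per zero-layer index, with a `clear` that has to touch every word. `BitVisited` and its methods are hypothetical names chosen for this example, and `insert` mirrors the return convention of `HashSet::insert` so it could be swapped in for comparison.

```rust
/// Illustrative bitmask of visited nodes, one bit per zero-layer index
/// (hypothetical type, not part of the crate).
pub struct BitVisited {
    words: Vec<u64>,
}

impl BitVisited {
    /// Allocates enough 64-bit words to cover `num_nodes` indices.
    pub fn new(num_nodes: usize) -> Self {
        Self {
            words: vec![0; (num_nodes + 63) / 64],
        }
    }

    /// Marks `node` as visited. Returns `true` if it was not visited before,
    /// matching the convention of `HashSet::insert`.
    pub fn insert(&mut self, node: u32) -> bool {
        let idx = node as usize;
        let bit = 1u64 << (idx % 64);
        let word = &mut self.words[idx / 64];
        let was_new = (*word & bit) == 0;
        *word |= bit;
        was_new
    }

    /// Exact membership test: no false positives and no false negatives.
    pub fn contains(&self, node: u32) -> bool {
        let idx = node as usize;
        (self.words[idx / 64] & (1u64 << (idx % 64))) != 0
    }

    /// Resets the mask for the next search. This writes every word,
    /// so its cost grows with the number of nodes, not with the number visited.
    pub fn clear(&mut self) {
        self.words.fill(0);
    }
}
```

The `clear` call is the part whose cost is in question: it is proportional to the size of the index rather than to the number of nodes a single search actually visits, so only a benchmark can say whether it beats the hash set.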
Currently, FxHasher from the rustc-hash crate is being used in conjunction with the HashSet. It was seen to have the best performance. A different issue should be opened if FxHasher performs poorly in certain non-malicious circumstances. It can be DoS attacked easily, since it is easy to produce a collision by hand. If this is a problem for you, please open an issue on this repository to explain the use case. This crate can be adapted to support other hashers.
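As a rough illustration of how a HashSet can be paired with a cheap, non-cryptographic hasher, the sketch below plugs a hand-written hasher into the standard library's `HashSet` through `BuildHasherDefault`. `MulHasher` is a stand-in written for this example using only the standard library; it is not FxHasher itself, and the constant is an arbitrary assumption. With the rustc-hash crate as a dependency, `FxHasher` would take the place of `MulHasher` (the crate also exports an `FxHashSet` alias).

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Arbitrary odd multiplier chosen for this sketch.
const SEED: u64 = 0x9E37_79B9_7F4A_7C15;

/// Stand-in for a fast non-cryptographic hasher (hypothetical, not FxHasher).
#[derive(Default)]
pub struct MulHasher(u64);

impl MulHasher {
    /// Rotate-xor-multiply mixing step applied to each input word.
    fn mix(&mut self, word: u64) {
        self.0 = (self.0.rotate_left(5) ^ word).wrapping_mul(SEED);
    }
}

impl Hasher for MulHasher {
    /// Generic fallback: mix the input one byte at a time.
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.mix(u64::from(b));
        }
    }

    /// Fast path used when hashing a `u32` key: a single mixing step.
    fn write_u32(&mut self, n: u32) {
        self.mix(u64::from(n));
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Visited set keyed by zero-layer index, using the custom hasher.
pub type VisitedSet = HashSet<u32, BuildHasherDefault<MulHasher>>;

/// Returns `true` if `node` had not been seen before this call.
pub fn mark_visited(seen: &mut VisitedSet, node: u32) -> bool {
    seen.insert(node)
}
```

A hasher like this has no random seed, so anyone who knows the constant can construct colliding keys by hand, which is the DoS concern mentioned above.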