Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nearest neighbors and approximate clustering #15

Closed
vadixidav opened this issue Dec 26, 2019 · 6 comments
Closed

Nearest neighbors and approximate clustering #15

vadixidav opened this issue Dec 26, 2019 · 6 comments

Comments

@vadixidav
Copy link

Feel free to use the HNSW crate I created which ports HNSW to Rust: https://github.com/rust-photogrammetry/hnsw. If you are interested I can add it as an ANN algorithm. I am not as familiar with clustering, but you should be able to build a very fast approximate clustering algorithm utiliIng HNSW.

Let me know if this would be a good addition and what to add to linfa (and where). HNSW should perform quite well from a benchmark perspective, but I haven't ran it neck-to-neck with existing implementations using the exact same data and bench harness yet, so I am interested to try that if python bindings are added.

@LukeMathWalker
Copy link
Contributor

I would definitely be interested in adding it, first of all, as a variant of the nearest neighbors algorithm (similarly to LSHForest in scikit-learn).

A couple of questions though concerning SIMD: how much of it leaks to the public interface? In other words, do I - as a user of the algorithm - have to be aware of SIMD-specific datatype when using the algorithm? Can that implementation complexity be hidden away from me?

@vadixidav
Copy link
Author

@LukeMathWalker The reason that SIMD types are used in the public API as vectors for nearest neighbor is mostly due to alignment issues. By using the SIMD types I can ensure that Rust will always put them in vectors, structs, and the stack at the correct alignment. This could be hidden from the user, but I would need to specially take care of alignment issues manually in the code. I could create a wrapper type that just took in an array that creates a SIMD wrapper internally as necessary.

@LukeMathWalker
Copy link
Contributor

That would be ideal @vadixidav - I'd like to avoid each model forcing the user to learn a different input type. It would significantly complicate the usage of the overall crate.

@vadixidav
Copy link
Author

Alright, I am a bit focused on some other open source stuff currently, but I will come back to this and simplify that API. Technically it could be used now easily given a speed penalty for not using SIMD, but I can make wrappers so we can get speed without exposing the SIMD types.

@vadixidav
Copy link
Author

vadixidav commented Jan 8, 2020

Alright, I have done my best to make it as easy as possible to interact with arrays from HNSW. It now has extensive impls of From and Into, including From to convert from slices of u8 and f32. While you do have to worry about how many bytes or floats you have, you now no longer need to interact with SIMD types beyond adding the packed_simd crate so you can write HNSW<Euclidean<[u8x64; 2]>> and typing out array.into() to convert your array/slice types of f32 or u8 into Euclidean or Hamming types respectively.

https://docs.rs/hnsw/0.4.0/hnsw/struct.Hamming.html

If desired, I can even make type aliases to avoid needing the packed_simd crate for specifying which HNSW you want. That would not be a breaking change.

Let me know if you need anything else to be done.

@YuhanLiin
Copy link
Collaborator

Pretty sure this issue can be closed, since we already have #119 for nearest-neighbours.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants