Vectors & Cosine Similarity #207

Closed
mickdelaney opened this issue Jan 30, 2021 · 11 comments

@mickdelaney

Hi,

This is a feature request/roadmap question.

Maybe this is the wrong place ?

I was wondering if any thought has gone into supporting a numeric vector with cosine similarity indexing?
Modern NLP leverages these vectors as inputs and outputs, e.g. Word2Vec, and a common deployment story is to encode the documents as vectors at index time, encode the text query as a vector at query time, and then leverage cosine similarity between the doc & query vectors.
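
For readers new to the idea, here is a minimal sketch of ranking documents by cosine similarity against a query vector. The 3-dimensional vectors and document names are made up for illustration; a real model such as Word2Vec would produce far more dimensions, and a real index would not scan every document.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings, encoded at index time.
doc_vectors = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.0, 1.0, 0.0],
    "doc3": [1.0, 1.0, 0.0],
}

# Hypothetical query embedding, produced by the same model at query time.
query_vector = [1.0, 0.0, 0.0]

# Brute-force ranking: most similar documents first.
ranked = sorted(
    doc_vectors,
    key=lambda doc_id: cosine_similarity(query_vector, doc_vectors[doc_id]),
    reverse=True,
)
print(ranked)
```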

we've done this in elasticsearch, and more recently in the vector database milvus. but having a combined search & vector index allows you to combine NLP/Machine Learning & Information Retrieval techniques.

Anyways, just an idea.

Regards

@jasonbosco
Member

Hi @mickdelaney, we've been talking about doing something similar as part of #130. Could you expand on how you're doing this with ES currently? Happy to jump on a quick call to discuss this if that's quicker.

@mickdelaney
Author

well actually we've moved out of ES due to issues with maintaining our plugin.
ES themselves now support it natively, albeit with perf issues.
https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch

we've also used postgres in the past, for many years, to achieve the same goal of query + vector math,
specifically using https://madlib.apache.org, storing 100 dim vectors & just using its SQL.
right now we're using https://milvus.io. whilst this gives us a very powerful vector story, it means it's a 2 phase process: grabbing keys from the vector search in milvus, then passing them to the search engine.

basically you have some model that produces vectors (it doesn't really matter how, it could be proprietary, or vanilla word2vec etc), and you need to index those vectors, and also use the same model to create a search vector from the query. then you end up with vector to vector operations, cosine similarity for example.
https://aws.amazon.com/about-aws/whats-new/2020/07/cosine-similarity-support-in-amazon-elasticsearch-service/

does that explain it ?

@kishorenc
Member

@mickdelaney Thank you for the details. I've already done some work on this front and your suggestion is something we're interested to pursue. I've one further question:

allows you to combine NLP/Machine Learning & Information Retrieval techniques

It is straightforward to fetch the nearest K results for a given search query vector, but that would just be a search in the vector space. Are you also looking to mix these vector search results with keyword-based search results? One approach I've seen is to use the vector dimensions as weights (like weighted TF-IDF) so you get both keyword matching and semantic relevancy.

@mickdelaney
Author

Yeah that sounds interesting. I think if you can support that & top N + query filters you're good. The issue is you often want to reduce the vector space as much as possible, e.g. filter by region, then by X or Y property, then use the vector space.
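
A rough sketch of that "filter first, then rank by vector" flow, in plain Python. The `filtered_top_n` helper, the `region` field, and the 2-dimensional vectors are all hypothetical; this only illustrates the order of operations, not how any particular engine implements it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)

def filtered_top_n(docs, query_vector, n, filters):
    """Apply exact-match filters first, then rank the survivors by similarity."""
    # Step 1: shrink the candidate set with cheap attribute filters.
    candidates = [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in filters.items())
    ]
    # Step 2: run the vector comparison only on what is left.
    candidates.sort(
        key=lambda doc: cosine_similarity(query_vector, doc["vec"]),
        reverse=True,
    )
    return candidates[:n]

# Hypothetical documents with a filterable attribute and an embedding.
docs = [
    {"id": "a", "region": "eu", "vec": [1.0, 0.0]},
    {"id": "b", "region": "us", "vec": [1.0, 0.0]},
    {"id": "c", "region": "eu", "vec": [0.0, 1.0]},
    {"id": "d", "region": "eu", "vec": [0.8, 0.6]},
]

top = filtered_top_n(docs, [1.0, 0.0], n=2, filters={"region": "eu"})
print([doc["id"] for doc in top])
```

Document "b" matches the query vector exactly but is dropped by the region filter before any vector math runs.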

@mape

mape commented Apr 20, 2021

As I was asked to add my use case here:

It would be useful to be able to filter/sort results based on Euclidean distance between vectors.
For example, saving colors in LAB color space and using that to find products of similar color.
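
A small sketch of that use case, assuming each product stores its color as an (L, a, b) triple. The product names, color values, and the distance threshold are invented for illustration. Plain Euclidean distance in LAB corresponds to the CIE76 color difference.

```python
import math

# Hypothetical products with colors stored as (L, a, b) triples.
products = {
    "red shirt": (53.2, 80.1, 67.2),
    "dark red mug": (40.0, 60.0, 45.0),
    "blue cap": (32.3, 79.2, -107.9),
}

def similar_colors(target_lab, catalog, max_distance=None):
    """Sort products by Euclidean distance to target_lab, optionally filtering."""
    scored = [
        (name, math.dist(target_lab, lab))  # math.dist requires Python 3.8+
        for name, lab in catalog.items()
    ]
    if max_distance is not None:
        scored = [(name, d) for name, d in scored if d <= max_distance]
    return sorted(scored, key=lambda pair: pair[1])

# Sort every product by closeness to a reddish target color.
by_distance = similar_colors((50.0, 75.0, 60.0), products)
print([name for name, _ in by_distance])

# Keep only products within an (arbitrary) distance of 25.
close_enough = similar_colors((50.0, 75.0, 60.0), products, max_distance=25.0)
print([name for name, _ in close_enough])
```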

@janaka

janaka commented Apr 9, 2022

@kishorenc any updates on this? I was trying to find my thread on this topic. I think it was on Slack and is now beyond the retention period. IIRC my thought was that it would be a great start if Typesense exposed the index and vector data via APIs. I feel like there's an opportunity for an extension system that enables different engines to be plugged in for semantic search type use cases.

@janaka

janaka commented Apr 9, 2022

Found the Slack thread which explains my use case.

https://typesense-community.slack.com/archives/C01P749MET0/p1632955847305200?thread_ts=1632955847.305200&cid=C01P749MET0

This other thread is also highly related.

FYI I'm not an expert by any means on this topic. :)

@kishorenc
Member

@janaka This is still something that we wish to tackle some time, but it's not on our immediate roadmap.

@jasonbosco
Member

jasonbosco commented Oct 19, 2022

We've now added support for vector search in 0.24.0.rcn34.

Here are instructions on how to use the feature: https://gist.github.com/kishorenc/f008c3a60ee58cb084b0c33c0dbce148

@jasonbosco
Member

See practical usage example here: #130 (comment)

@jasonbosco
Member

v0.24 is now available with this feature: https://typesense.org/docs/0.24.0/api/vector-search.html
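
For orientation, a hedged sketch of what the request payloads might look like. It only builds and prints JSON, with no network calls. The shapes here (a `float[]` field with `num_dim`, and a `vector_query` string of the form `field:([...], k:N)`) are assumptions based on a reading of the v0.24 docs and should be checked against the linked page. The collection name, field names, and vector values are made up.

```python
import json

def build_vector_query(field, vector, k):
    """Format a vector_query string; the syntax is assumed from the v0.24 docs."""
    values = ", ".join(str(v) for v in vector)
    return f"{field}:([{values}], k:{k})"

# Hypothetical collection schema with a 4-dimensional vector field.
schema = {
    "name": "docs",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "vec", "type": "float[]", "num_dim": 4},
    ],
}

# Hypothetical multi_search body asking for the 100 nearest neighbours.
search_body = {
    "searches": [
        {
            "collection": "docs",
            "q": "*",
            "vector_query": build_vector_query(
                "vec", [0.96826, 0.94, 0.39557, 0.306488], 100
            ),
        }
    ]
}

print(json.dumps(schema, indent=2))
print(json.dumps(search_body, indent=2))
```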


5 participants