Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzy string matching indexes #3526

Open
igrekun opened this issue Dec 21, 2021 · 6 comments
Open

Fuzzy string matching indexes #3526

igrekun opened this issue Dec 21, 2021 · 6 comments
Labels
community Source: who proposed the issue type/feature req Type: feature request

Comments

@igrekun
Copy link
Contributor

igrekun commented Dec 21, 2021

Is your feature request related to a problem? Please describe.
Running full fledged Elasticsearch cluster to search short strings seems like an overkill.

Describe the solution you'd like
Basic tri-gram or tf-idf / cosine distance for simple fuzzy string matching.

Describe alternatives you've considered
Elasticsearch.

Additional context
I will more than likely work on implementing those indexes, but Jamie Liu told me it's better to get alignment with the team by filing an issue first.

What are your thoughts on a simple but native text index for Nebula?
It can't replace Elasticsearch but should suffice for matching names and similar use cases.

The options under consideration are

  1. GIN / GiST indexes pretty much like Postres implements them
  2. Approximate KNN over TF-IDF embeddings like NMSLIB / FAISS does it.
@igrekun igrekun added the type/feature req Type: feature request label Dec 21, 2021
@Sophie-Xie
Copy link
Contributor

@cangfengzhs @critical27 What do you think about this issue?

@jamieliu1023
Copy link
Contributor

@igrekun Thanks for bringing this up for discussion! I do think it is a recommended practice to first file an issue to get alignment between the two sides before you actually start working on it, which can avoid waste in time and energy just in case the solution you provide does not match the design of the core team. Hope that makes sense to you.

Please @critical27 @cangfengzhs provide your insight on this issue.

@critical27
Copy link
Contributor

Thx for contacting us. So you are going to add tri-gram index in nebula instead of es, right?

@cangfengzhs
Copy link
Contributor

These are indeed some very useful features. I have learned some knowledge of NLP, but I don't fully grasp it. And I have the following questions:

  1. If I am a user, and I use string index based on cos distance. But how do I do word2vec? Does nebula have a built-in vocabulary or do I upload a vocabulary myself? Or even further, is it possible to provide fasttext directly as a plug-in?
  2. How to calculate IDF in a constantly changing data set?
  3. I have not learned about GIN / GiST. But in my impression, both nmslib and faiss need a lot of memory, right? Is there a way to solve this problem? The CPU may have the same problem.
  4. How to make persistence? For example, when nmslib performs persistence, it serializes all data together into a binary file. So we need to reduce unnecessary write overhead in the form of LOG and checkpoint. This is just a preliminary idea, and a detailed design is needed.

I am looking forward to the realization of native fuzzy string index. Maybe you can make a simple design first, and then we will discuss its possible problems and try to solve them. Moreover, I am also very willing to participate in the development process of this feature.

@cangfengzhs
Copy link
Contributor

And, I have a bolder idea. The key point of this feature is the vector ANN search. We may be able to support a vector data type and expose it to users. We are only responsible for the search of vectors. After all, the performance of textCNN/ELMo/Bert will be better than tri-gram.

@igrekun
Copy link
Contributor Author

igrekun commented Dec 24, 2021

@cangfengzhs I very much like the bold idea (:
Generic vector type that supports cosine / L2 distance to better handle user supplied vectors should cover it.

Fuzzy search is then done by treating each trigram as a word and searching closest by cosine. Anything more fancy should then be "bring your own vectors" not to clutter the core codebase.

Personally I fancy the idea of generic ANN more than GIN / GiST since it is more general. Given the vector type, graph and basic math one could run ML algorithms without reaching for spark.

If we go with ANNs then I almost have a design to build upon, just let me know where to move further technical discussion if we decide to proceed on this!

@Sophie-Xie Sophie-Xie assigned igrekun and unassigned igrekun Jan 7, 2022
@Sophie-Xie Sophie-Xie added the community Source: who proposed the issue label Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Source: who proposed the issue type/feature req Type: feature request
Projects
None yet
Development

No branches or pull requests

5 participants