Fuzzy string matching indexes #3526

igrekun · 2021-12-21T13:33:53Z

Is your feature request related to a problem? Please describe.
Running full fledged Elasticsearch cluster to search short strings seems like an overkill.

Describe the solution you'd like
Basic tri-gram or tf-idf / cosine distance for simple fuzzy string matching.

Describe alternatives you've considered
Elasticsearch.

Additional context
I will more than likely work on implementing those indexes, but Jamie Liu told me it's better to get alignment with the team by filing an issue first.

What are your thoughts on a simple but native text index for Nebula?
It can't replace Elasticsearch but should suffice for matching names and similar use cases.

The options under consideration are

GIN / GiST indexes pretty much like Postres implements them
Approximate KNN over TF-IDF embeddings like NMSLIB / FAISS does it.

Sophie-Xie · 2021-12-22T01:11:33Z

@cangfengzhs @critical27 What do you think about this issue？

jamieliu1023 · 2021-12-23T06:07:38Z

@igrekun Thanks for bringing this up for discussion! I do think it is a recommended practice to first file an issue to get alignment between the two sides before you actually start working on it, which can avoid waste in time and energy just in case the solution you provide does not match the design of the core team. Hope that makes sense to you.

Please @critical27 @cangfengzhs provide your insight on this issue.

critical27 · 2021-12-24T04:49:19Z

Thx for contacting us. So you are going to add tri-gram index in nebula instead of es, right?

cangfengzhs · 2021-12-24T05:26:56Z

These are indeed some very useful features. I have learned some knowledge of NLP, but I don't fully grasp it. And I have the following questions:

If I am a user, and I use string index based on cos distance. But how do I do word2vec? Does nebula have a built-in vocabulary or do I upload a vocabulary myself? Or even further, is it possible to provide fasttext directly as a plug-in?
How to calculate IDF in a constantly changing data set?
I have not learned about GIN / GiST. But in my impression, both nmslib and faiss need a lot of memory, right? Is there a way to solve this problem? The CPU may have the same problem.
How to make persistence? For example, when nmslib performs persistence, it serializes all data together into a binary file. So we need to reduce unnecessary write overhead in the form of LOG and checkpoint. This is just a preliminary idea, and a detailed design is needed.

I am looking forward to the realization of native fuzzy string index. Maybe you can make a simple design first, and then we will discuss its possible problems and try to solve them. Moreover, I am also very willing to participate in the development process of this feature.

cangfengzhs · 2021-12-24T05:36:16Z

And, I have a bolder idea. The key point of this feature is the vector ANN search. We may be able to support a vector data type and expose it to users. We are only responsible for the search of vectors. After all, the performance of textCNN/ELMo/Bert will be better than tri-gram.

igrekun · 2021-12-24T08:56:04Z

@cangfengzhs I very much like the bold idea (:
Generic vector type that supports cosine / L2 distance to better handle user supplied vectors should cover it.

Fuzzy search is then done by treating each trigram as a word and searching closest by cosine. Anything more fancy should then be "bring your own vectors" not to clutter the core codebase.

Personally I fancy the idea of generic ANN more than GIN / GiST since it is more general. Given the vector type, graph and basic math one could run ML algorithms without reaching for spark.

If we go with ANNs then I almost have a design to build upon, just let me know where to move further technical discussion if we decide to proceed on this!

igrekun added the type/feature req Type: feature request label Dec 21, 2021

jamieliu1023 mentioned this issue Dec 25, 2021

Weekly Report 2021-12-24 vesoft-inc/nebula-community#80

Closed

Sophie-Xie assigned igrekun and unassigned igrekun Jan 7, 2022

Sophie-Xie added the community Source: who proposed the issue label Mar 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fuzzy string matching indexes #3526

Fuzzy string matching indexes #3526

igrekun commented Dec 21, 2021 •

edited

Sophie-Xie commented Dec 22, 2021

jamieliu1023 commented Dec 23, 2021

critical27 commented Dec 24, 2021

cangfengzhs commented Dec 24, 2021

cangfengzhs commented Dec 24, 2021

igrekun commented Dec 24, 2021

Fuzzy string matching indexes #3526

Fuzzy string matching indexes #3526

Comments

igrekun commented Dec 21, 2021 • edited

Sophie-Xie commented Dec 22, 2021

jamieliu1023 commented Dec 23, 2021

critical27 commented Dec 24, 2021

cangfengzhs commented Dec 24, 2021

cangfengzhs commented Dec 24, 2021

igrekun commented Dec 24, 2021

igrekun commented Dec 21, 2021 •

edited