
Fuzzy matching? #12

Open

tmpfs opened this issue Aug 11, 2022 · 16 comments

Comments

@tmpfs
Collaborator

tmpfs commented Aug 11, 2022

I have a use case where I expect to have a small number of documents (several hundred to several thousand) and I would want the search to do fuzzy matching on sub-strings within terms so the search would be as greedy as possible.

This would affect performance so should be optional for this sort of "autocomplete" use case. Is this something that interests you?

Happy to do the legwork under your guidance to land this 👍

@marcus-pousette
Collaborator

Sounds great! I have also had the idea of adding this at some point, but never had the time to dig into the details of common/efficient ways of implementing it.

Yes, I would assume there is some associated cost, depending on the implementation. It sounds like the solution you have in mind does not affect the index, but just does some "path finding" in the index during the query, where you try different paths and deduce a score depending on how far from the search query we end up (in a true greedy fashion)?

Another note: I have never implemented fuzzy matching before, but I recall from experience with other search engines that it might be worth learning about N-gram tokenisers in order to create good search behaviours.

Would be super nice if you want to try to solve this problem. I will be very happy to give guidance to the best of my knowledge and time.

@tmpfs
Collaborator Author

tmpfs commented Aug 12, 2022

Great! Glad this interests you; I don't have much experience in this domain so I will do some research and see what I turn up.

I am not in a rush to land this but I think between us we can discuss and work out the best approach 👍 In the meantime I will try to land a few small PRs to get familiar with the codebase.

@tmpfs
Collaborator Author

tmpfs commented Aug 18, 2022

I just noticed that the querying is case-sensitive so I will bump this up my TODO list as it feels very counter-intuitive for my use case.

BTW, the library integrated nicely with my existing project compiling to webassembly, thanks 🙏

@marcus-pousette
Collaborator

You can make the index case-insensitive by using a "to lower case" filter that you pass both when adding documents and when querying.
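A minimal sketch of such a filter, assuming a plain `&str -> Cow<str>` shape (the exact signature the library expects may differ, and `lower_case_filter` is a hypothetical name):

```rust
use std::borrow::Cow;

// Hypothetical "to lower case" filter: borrows when the term is already
// lower-case, and allocates only when a change is actually needed.
fn lower_case_filter(term: &str) -> Cow<'_, str> {
    if term.chars().any(|c| c.is_uppercase()) {
        Cow::Owned(term.to_lowercase())
    } else {
        Cow::Borrowed(term)
    }
}

fn main() {
    assert_eq!(lower_case_filter("Hello"), "hello");
    assert_eq!(lower_case_filter("world"), "world");
}
```

The key point is that the same filter must be applied on both the indexing path and the query path, otherwise terms will never line up.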

Great to hear that! 👍

@tmpfs
Collaborator Author

tmpfs commented Aug 19, 2022

Thanks @marcus-pousette, that makes sense, I will give it a try.

I think for this fuzzy matching feature the implementation should follow the same pattern and supply an impl Matcher argument to the query so we can support different implementations. This library could provide a basic matcher with the existing "exact match" behavior, and then we can look into other implementations. I imagine a simple fuzzy matcher would split the query into words and then do Levenshtein distance matching on each index entry.

Should we do the refactor to a Matcher trait and then take it from there?
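A rough sketch of what that trait could look like (hypothetical names, not an existing API in this library; the current exact-match behaviour becomes just one implementation):

```rust
// Hypothetical Matcher abstraction: given a query term and an index
// entry, return a match score in 0.0..=1.0, where 0.0 means "no match".
trait Matcher {
    fn score(&self, query_term: &str, term: &str) -> f64;
}

// The existing behaviour expressed as a Matcher: exact equality only.
struct ExactMatcher;

impl Matcher for ExactMatcher {
    fn score(&self, query_term: &str, term: &str) -> f64 {
        if query_term == term { 1.0 } else { 0.0 }
    }
}

fn main() {
    let m = ExactMatcher;
    assert_eq!(m.score("hello", "hello"), 1.0);
    assert_eq!(m.score("hello", "helo"), 0.0);
}
```

A fuzzy implementation would return a graded score (e.g. derived from edit distance) instead of the 0/1 of `ExactMatcher`.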

@marcus-pousette
Collaborator

I like that idea. We want the API to be very generic so that users can repurpose it however they want.

The only thing I am wondering is whether it is easy, or even possible, to make this abstraction fit the implementation you want to achieve, since Levenshtein distance matching (which would mean an index scan, with potentially some smart caching) is fundamentally different from the current matching algorithm (a trie walk).

But I would suggest you try it out; if it turns out to be hard, then it is a good learning opportunity on the way to finding a better solution.

@marcus-pousette
Collaborator

marcus-pousette commented Aug 19, 2022

Another note: I have already made the "score calculation" into a trait; see bm25 and zero_to_one for implementations of it. Maybe this could come in handy, maybe this trait is all you need? It is something you already pass with the query.

The purpose of this trait is to make an abstraction over potential scoring calculations. You might want to extend this trait to solve the problem you are facing.

@tmpfs
Collaborator Author

tmpfs commented Aug 19, 2022

Thanks for the pointer @marcus-pousette, I will take a closer look at the code flow for scoring, but I would be wary of conflating scoring with matching; intuitively they feel like different operations. Will take a look and report back...

@tmpfs
Collaborator Author

tmpfs commented Aug 20, 2022

I think we should try to leverage the strsim crate for this feature. It would bump this library's dependencies up to two, but strsim itself has zero dependencies, so we don't inherit any transitive dependencies. What do you think?
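For reference, strsim's `levenshtein` computes the classic edit distance: the minimum number of single-character insertions, deletions, and substitutions turning one string into the other. A dependency-free sketch of the same computation:

```rust
// Two-row dynamic-programming Levenshtein distance (what
// strsim::levenshtein computes), kept here as a dependency-free sketch.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b_chars.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push(
                (prev[j] + cost)        // substitution (or match)
                    .min(prev[j + 1] + 1) // deletion
                    .min(cur[j] + 1),     // insertion
            );
        }
        prev = cur;
    }
    *prev.last().unwrap()
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    assert_eq!(levenshtein("hello", "helo"), 1);
}
```

A fuzzy matcher could accept index terms whose distance to the query term is below some threshold, or convert the distance into a 0..1 score.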

@marcus-pousette
Collaborator

marcus-pousette commented Aug 20, 2022

It might be necessary yes, try it out!

One thing to test as well (as I previously mentioned) is an N-gram tokenizer. This should be very simple to do, and hence worth exploring before deep diving into fuzzy matching algorithms that change the query algorithm at a deeper level. It is just a Tokenizer implementation that you pass during indexing and also (?) at query time.

This tokenizer makes the following transformation:
"hello world" -> ["hel", "ell", "llo", "wor", "orl", "rld"]

This means that when you search for "helo word" you will still match against "hello world", even though you did not type it correctly.
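A sketch of what such a tokenizer could look like in Rust, as a standalone function (lower-casing and wiring it into the library's Tokenizer type are left out):

```rust
use std::borrow::Cow;

// N-gram tokenizer sketch: split on whitespace, then emit every window
// of `n` consecutive characters from each word. Words shorter than `n`
// are emitted whole so they are not lost from the index.
fn ngram_tokenizer(s: &str, n: usize) -> Vec<Cow<'_, str>> {
    let mut tokens = Vec::new();
    for word in s.split_whitespace() {
        let chars: Vec<char> = word.chars().collect();
        if chars.len() < n {
            tokens.push(Cow::Borrowed(word));
            continue;
        }
        for window in chars.windows(n) {
            tokens.push(Cow::Owned(window.iter().collect::<String>()));
        }
    }
    tokens
}

fn main() {
    let grams = ngram_tokenizer("hello world", 3);
    assert_eq!(grams, vec!["hel", "ell", "llo", "wor", "orl", "rld"]);
}
```

With this, "helo" produces ["hel", "elo"], which overlaps with the indexed trigrams of "hello" on "hel", so the misspelled query still finds the document.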

@tmpfs
Collaborator Author

tmpfs commented Aug 20, 2022

That makes a lot of sense @marcus-pousette. I guess with the N-gram tokenizer we are trading memory for speed, while a distance matching algorithm keeps the index more compact but pays in CPU cycles at query time.

I think it makes sense to explore the N-gram tokenizer first, good call 🙏

@tmpfs
Collaborator Author

tmpfs commented Aug 26, 2022

@marcus-pousette, I started to play around with n-gram indexing and it appears to work well. I added these tokenizers to my project:

// Index tokenizer: index both whole words and 2-grams, lower-cased.
fn tokenizer(s: &str) -> Vec<Cow<'_, str>> {
    // n_slice is from the external n-gram crate referenced below.
    let ngrams: HashSet<&str> = n_slice(2).featurize(s);

    let words = s.split(' ').collect::<HashSet<_>>();

    let mut tokens: Vec<Cow<str>> = Vec::new();
    for token in words.union(&ngrams) {
        tokens.push(Cow::Owned(token.to_lowercase()))
    }
    tokens
}

// Query tokenizer: whole words only, lower-cased.
fn query_tokenizer(s: &str) -> Vec<Cow<'_, str>> {
    s.split(' ')
        .map(|s| s.to_lowercase())
        .map(Cow::Owned)
        .collect::<Vec<_>>()
}

And the search results use fuzzy matching and are correct as far as my intuition for what should be returned goes!

I used this ngram implementation as it yields references rather than owned data, which would be faster if I didn't want the case-insensitive matching.

At the moment I am testing with very small amounts of data, I wonder what a benchmark of ngram indexing with a real-world corpus of data would yield.

@tmpfs
Collaborator Author

tmpfs commented Aug 26, 2022

I would prefer to avoid the additional dependency if possible, is it worth us supplying our own n_slice function behind an ngram feature flag?

Edit: if we build our n-gram iterator on top of unicode-segmentation, it would iterate over grapheme clusters as opposed to chars, which handles combining characters correctly and would work better for multi-lingual support.
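A small illustration of why grapheme clusters matter here, using only std (swapping `word.chars()` for unicode-segmentation's `graphemes(word, true)` would be the actual fix):

```rust
// With combining characters, char-based windows can split a
// user-perceived character in half, producing n-grams that never line
// up with their precomposed equivalents.
fn char_trigrams(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

fn main() {
    // "café" with the accent written as 'e' + U+0301 (combining acute):
    // five chars, so we get three trigrams, the last of which ends with
    // a bare combining accent detached from its base letter.
    let grams = char_trigrams("cafe\u{301}");
    assert_eq!(grams.len(), 3);
    // A grapheme iterator would keep "e\u{301}" together as one unit,
    // yielding the same n-grams as for the precomposed "café".
}
```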

@tmpfs
Collaborator Author

tmpfs commented Aug 26, 2022

Found this article interesting.

@marcus-pousette
Collaborator

marcus-pousette commented Aug 26, 2022

Great work. Looks like a very clean solution. It sounds very convenient if it is behind a feature flag!

Yes, I see! I was playing around with segmentation a few days ago and I think this makes sense. Maybe it is possible to make it a generic type parameter of the Index?

Currently we use char and invoke term.chars() in various places to get the segmentation; we could instead have a segmentation trait over some segment type S that the index implements, so you can do segment: S and self.segments(term) -> Iter<S>. This way we could have CharIndex, UnicodeGraphemeIndex, UnicodeWordIndex, etc.

@marcus-pousette
Collaborator

Another cool thing you can do with segmentation (if the above is implemented) is represent words as vectors in an N-dimensional space (word2vec). I mentioned this vaguely before.

Instead of "cat" -> ["c", "a", "t"] (char segmentation)

you do, with the help of word2vec (N = 3):

"cat" -> [0, 0.33, 0.33]
"lion" -> [0, 0.34, 0.34]
"dog" -> [0, 0.55, 0.55]

Now if you search for "lion" you could get a match on "cat" even though they have no characters in common, if you define "equality" to be true when the words are sufficiently close in space. word2vec is an NLP model trained on large corpora to learn which words are related.
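To make "sufficiently close in space" concrete, here is a sketch using Euclidean distance over the toy vectors above (real word2vec embeddings are much higher-dimensional, and cosine similarity is the more common choice in practice):

```rust
// Euclidean distance between two embedding vectors of equal length.
fn dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

fn main() {
    // Toy 3-dimensional embeddings from the example above.
    let cat = [0.0, 0.33, 0.33];
    let lion = [0.0, 0.34, 0.34];
    let dog = [0.0, 0.55, 0.55];

    // "lion" sits much closer to "cat" than "dog" does, so a distance
    // threshold (say, 0.05) would treat lion/cat as "equal enough"
    // while keeping dog out.
    assert!(dist(&cat, &lion) < dist(&cat, &dog));
    assert!(dist(&cat, &lion) < 0.05);
    assert!(dist(&cat, &dog) > 0.05);
}
```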
