Fuzzy matching #84
Hello! I tested your code and it works very well for me. One problem is the script throws an
|
@spooknik I will try to update it ASAP.
def _levenshtein_rec(self, char, node, word, rows, max_cost, depth=0):
    n_columns = len(word) + 1
    new_rows = [rows[0] + 1]
    cost = 0
This is a mistake on my side: it's usually bad practice to instantiate a variable in a scope (a for loop in this case) and access it from outside that scope ... |
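For readers unfamiliar with the `rows` / `new_rows` bookkeeping in the snippet above, here is a minimal standalone sketch of a row-based Levenshtein search over a trie. It is not the code from this PR: the nested-dict trie, the `END` marker, and the names `build_trie`, `fuzzy_search` and `_search_rec` are all assumptions made for illustration.

```python
END = "_end_"  # hypothetical marker key for "a word ends at this node"


def build_trie(words):
    """Build a nested-dict trie; node[END] stores the complete word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return root


def fuzzy_search(trie, word, max_cost):
    """Return sorted (trie_word, cost) pairs with edit distance <= max_cost."""
    # Row for the empty prefix: cost of inserting each prefix of `word`.
    first_row = list(range(len(word) + 1))
    results = []
    for ch, child in trie.items():
        if ch != END:
            _search_rec(ch, child, word, first_row, max_cost, results)
    return sorted(results)


def _search_rec(char, node, word, rows, max_cost, results):
    n_columns = len(word) + 1
    new_rows = [rows[0] + 1]
    for col in range(1, n_columns):
        insert_cost = new_rows[col - 1] + 1
        delete_cost = rows[col] + 1
        replace_cost = rows[col - 1] + (word[col - 1] != char)
        new_rows.append(min(insert_cost, delete_cost, replace_cost))
    # The last cell is the distance between the trie prefix and the full word.
    if END in node and new_rows[-1] <= max_cost:
        results.append((node[END], new_rows[-1]))
    # Only descend while some cell is still within budget.
    if min(new_rows) <= max_cost:
        for ch, child in node.items():
            if ch != END:  # the END entry holds a string, not a child node
                _search_rec(ch, child, word, new_rows, max_cost, results)


trie = build_trie(["colors", "collars"])
print(fuzzy_search(trie, "colours", 1))
```

Each recursion level computes one new row from the parent's row, so words sharing a prefix in the trie share that part of the work.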
Thanks for the reply and the update.
|
@spooknik do you have a test case to provide ? |
Sure thing. Using your version of keywords.py:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=60)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 654, in replace_keywords
current_dict_continued, cost, _ = next(
File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 786, in levensthein
yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 801, in _levenshtein_rec
stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'
It works if:
keyword_processor.replace_keywords('No. of colours', max_cost=60)
'Número de colores' |
@spooknik first things first: in practice max_cost should not be that large. We usually allow around 1 to 3 edits per string if we want to stay robust. Are you sure it makes sense for you to keep a max_cost of 60? |
I think I misunderstood the value of this variable. 1-3 seems like a more sane value. I just want to catch some of the different spelling variations in English (between UK and US for example). Thanks for taking a look 👍 |
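To see why a max_cost of 1 to 3 is usually enough for UK/US spelling variants, here is a small standalone sketch of the plain Levenshtein distance between two strings. The function name `levenshtein` and the word pairs are illustrative assumptions, not part of flashtext.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


for us, uk in [("colors", "colours"), ("analyze", "analyse"), ("center", "centre")]:
    print(us, uk, levenshtein(us, uk))
```

These three pairs come out at 1, 1 and 2 edits respectively, so a budget of 60 mostly just widens the set of unrelated strings that can match.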
Hello again, small update to the bug report. Looking back into this, it seems the problem is related to the max_cost variable. For example:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=1)
Número de colores: 10
But if I change to max_cost=2 (or higher), it crashes. |
@spooknik I added a test case for this, it should be OK now ;) |
Did some quick tests with my data and it works brilliantly now. Thank you so much 👍 |
Hi, a very good implementation of fuzzy match! A small proposal, would it be better if the |
I'll chime in with the solution I found. I wrote a simple conditional to check the length of the keyword and then use the relevant max_cost setting.
if len(term) <= 3:
    keyword_processor.replace_keywords(term)
elif len(term) <= 8:
    keyword_processor.replace_keywords(term, max_cost=1)
else:
    keyword_processor.replace_keywords(term, max_cost=2)
I agree, however, that max_ratio would perhaps be more direct :) |
Good suggestion. I don't see exactly what you mean though. Would the ratio be relative to the length of the current word, or the length of the keyword? |
Didn't think of that... Nice example! Based on the example you mentioned, it would make more sense if the max_ratio is relative to the length of the current word, which allows more matches to be returned. If the result is not ideal, we can always adjust the ratio to achieve higher accuracy. |
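As a rough illustration of the idea discussed here, a ratio relative to the length of the current word could be turned into a per-word max_cost like this. Both `max_ratio` and the helper `cost_from_ratio` are hypothetical; neither exists in flashtext or in this PR.

```python
import math


def cost_from_ratio(word: str, max_ratio: float) -> int:
    """Translate a relative edit budget into an absolute max_cost.

    Hypothetical helper: the budget is proportional to the length of the
    word currently being matched, rounded down.
    """
    if not 0.0 <= max_ratio <= 1.0:
        raise ValueError("max_ratio must be between 0 and 1")
    return math.floor(len(word) * max_ratio)


for word in ["of", "colours", "No. of colours"]:
    print(repr(word), cost_from_ratio(word, 0.25))
```

Rounding down keeps very short words on exact matching, which is the same effect as the length-based conditional posted above.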
OK I see. @zheyaf do you have a few examples to provide, so I can make sure this leads to smarter fuzzy matching ?? |
Hi! Is it OK if I send some examples to your email? |
@zheyaf my point was to use some of them as unittests potentially, can you use github instead ? so we get all the tracking of the PR in the same interface |
@remiadon Hi, sorry, I'll put examples here.
The ideal match is exact match for |
@zheyaf please provide a minimal example (eg. strip the useless parts in the strings), including |
was this PR merged? |
@nkrot unfortunately no |
Hi, sorry for so much delay on this, was busy with other stuff going on in life right now. Will take out time this weekend and get this done before Monday IST. Thanks for your patience. |
merged to master. |
Thanks everyone for all the help with this, sorry for all the delay. Was inactive on Github for a long time cos of some personal reasons and life 😄 |
@spooknik just for curiosity what was your exact example ? |
Hello @vi3k6i5, would it be possible to build and push this commit to PyPI too? This is a very useful feature and I'm sure many developers would be more than excited to see this pushed to PyPI, me included 😃 |
@lingvisa you can have a look in this PR "files changed" section, especially Aside from this I think we still have to make some efforts to make fuzzy matching really usable. |
@remiadon Have you also tested the performance of fuzzy match compared to exact match? Supposedly, will it be several times slower? |
@lingvisa I believe so, yes |
This PR tries to introduce fuzzy matching in flashtext, as mentioned in these issues:
Guidelines
Features included
- max_cost, which is the maximum Levenshtein distance accepted to perform fuzzy matching on a single keyword
- levensthein function, which tries to find a match with a given word, with respect to a provided max_cost, and returns a node in the trie, which will be used to continue the search
- get_next_word, which just retrieves the next word in the sequence. Tests are included in test/test_kp_next_word.py
Optimizations to keep focus on performance
- current cost (initialized to max_cost) every time we trigger fuzzy matching (on a word), so if all the cost from max_cost has already been "consumed" by other words in the current keyword, we do not trigger fuzzy matching. E.g.: when trying to extract the keyword "here you are" from "heere you are" with a max_cost of 1, the current cost is entirely consumed after the first word ("heere"), so no fuzzy matching will be performed on the other words ("you" and "are")
Limitations