
Fuzzy matching #84

Merged: 17 commits merged into vi3k6i5:master from Quantmetry:fuzzy_matching on May 3, 2020

Conversation

remiadon
Contributor

@remiadon remiadon commented Jun 5, 2019

This PR tries to introduce fuzzy matching in flashtext, as mentioned in these issues:


Guidelines

  • rely as much as possible on the existing algorithm: only trigger fuzzy matching when we hit a mismatch, so we keep the focus on performance
  • when adding a new parameter, the function should keep exactly the same behaviour when that parameter is left at its default value, so we don't conflict with the existing tests
  • modify as little code as possible

Features included

  1. KeywordProcessor.extract_keywords and KeywordProcessor.replace_keywords both accept a new optional parameter, max_cost, the maximum Levenshtein distance allowed when fuzzy matching a single keyword (see the usage sketch after this list)
  2. KeywordProcessor implements a levensthein method, which tries to find a match for a given word within the provided max_cost and returns a node in the trie from which the search can continue
  3. a new function, get_next_word, has been added; it simply retrieves the next word in the sequence. Tests are included in test/test_kp_next_word.py
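
A minimal usage sketch of the new parameter (the keyword and sample sentence below are illustrative, not taken from the test suite):

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('machine learning')

# allow one edit across the keyword; "machin" is one deletion away from "machine"
keyword_processor.extract_keywords('I studied machin learning', max_cost=1)
# expected: ['machine learning']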

Optimizations to keep focus on performance

  • We set the current_dict to the first node yielded by the levenshtein function, so we get back to exact matching as soon as possible
  • We decrement the current cost (initialized to max_cost) every time we trigger fuzzy matching on a word, so if the whole budget from max_cost has already been "consumed" by other words in the current keyword, we do not trigger fuzzy matching again. E.g. when trying to extract the keyword "here you are" from "heere you are" with a max_cost of 1, the current cost gets fully consumed after the first word ("heere"), so no fuzzy matching will be performed on the other words ("you" and "are"); see the sketch after this list
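
A small sketch of the behaviour described above; the expected result reflects the intent of the PR rather than a case from the test suite:

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('here you are')

# the single allowed edit is spent on "heere"; "you" and "are" must then match exactly
keyword_processor.extract_keywords('heere you are', max_cost=1)
# expected: ['here you are']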

Limitations

  • As this is a pure Python implementation, don't expect it to be blazingly fast; even though we cut the recursion inside the Levenshtein inner function as early as possible, fuzzy matching still requires far more operations than exact matching

@coveralls

coveralls commented Jun 5, 2019

Coverage Status

Coverage increased (+0.1%) to 99.43% when pulling 9cc6d0b on Quantmetry:fuzzy_matching into 50c45f1 on vi3k6i5:master.

@remiadon remiadon changed the title Fuzzy matching [WIP] Fuzzy matching Jun 6, 2019
@remiadon remiadon changed the title [WIP] Fuzzy matching Fuzzy matching Jun 13, 2019
@spooknik

spooknik commented Dec 17, 2019

Hello! I tested your code and it works very well for me. One problem: the script throws an UnboundLocalError: local variable 'cost' referenced before assignment if the input text ends with a symbol like a period or colon, or if the text contains such a symbol somewhere.

File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 656, in replace_keywords
    ({}, 0, 0)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  [Previous line repeated 6 more times]
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 802, in _levenshtein_rec
    yield node, cost, depth
UnboundLocalError: local variable 'cost' referenced before assignment

@remiadon
Contributor Author

remiadon commented Dec 17, 2019

@spooknik I will try to update it ASAP.
For the moment I would say initialize cost at the beginning of the function (lines 789-791), like:

def _levenshtein_rec(self, char, node, word, rows, max_cost, depth=0):
    n_columns = len(word) + 1
    new_rows = [rows[0] + 1]
    cost = 0  # bind cost up front so it is always defined, even if the loop below never runs

This is a mistake on my side; it's usually bad practice to bind a variable inside a scope (a for loop in this case) and then access it outside that scope...
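
A minimal illustration of the pattern behind the error (the function and names here are made up for the example, not taken from flashtext):

def total_cost(words):
    for word in words:
        cost = len(word)  # cost is only ever bound inside the loop
    return cost           # fails if words is empty and the loop never runs

total_cost([])  # raises UnboundLocalError: local variable 'cost' referenced before assignment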

@spooknik

Thanks for the reply and the update.
I've added cost = 0 at line 792, but now there is another error:

  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 654, in replace_keywords
    current_dict_continued, cost, _ = next(
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 801, in _levenshtein_rec
    stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'

@remiadon
Contributor Author

@spooknik do you have a test case to provide?

@spooknik

spooknik commented Dec 17, 2019

Sure thing. Using your version of keyword.py:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=60)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 654, in replace_keywords
    current_dict_continued, cost, _ = next(
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 801, in _levenshtein_rec
    stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'

It works if:

keyword_processor.replace_keywords('No. of colours', max_cost=60)
'Número de colores'

@remiadon
Contributor Author

@spooknik first things first: in practice max_cost should not be that big; we usually allow around 1 to 3 edits on a string if we want to stay robust. Are you sure it makes sense for you to keep a max_cost of 60?
Anyway, I'll try to fix it this week.

@spooknik

I think I misunderstood the meaning of this variable; 1-3 seems like a more sane value. I just want to catch some of the different spelling variations in English (between UK and US, for example).

Thanks for taking a look 👍

@spooknik

spooknik commented Jan 2, 2020

Hello again, a small update to the bug report. Looking back into this, it seems the problem is related to the max_cost variable. For example:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=1)
Número de colores: 10

but if I change max_cost=2 (or higher) then it crashes.

@remiadon
Contributor Author

remiadon commented Jan 7, 2020

@spooknik I added a test case for this, it should be OK now ;)

@spooknik

spooknik commented Jan 7, 2020

Did some quick tests with my data and it works brilliantly now. Thank you so much 👍

@zheyaf

zheyaf commented Jan 25, 2020

Hi, a very good implementation of fuzzy matching! A small proposal: would it be better if max_cost were a ratio rather than a number of characters? A max_cost below 3 works fine for relatively long keywords (more than 8 characters), but when the keyword only contains 3 characters it can basically match anything. It would be great if max_cost could become a max_ratio!

@spooknik

spooknik commented Jan 26, 2020

I'll chime in with the solution I found.

I wrote a simple conditional that checks the length of the keyword and then uses the relevant max_cost setting:

if len(term) <= 3:
    # very short keywords: exact matching only
    keyword_processor.replace_keywords(term)
elif len(term) <= 8:
    keyword_processor.replace_keywords(term, max_cost=1)
else:
    keyword_processor.replace_keywords(term, max_cost=2)

I agree, however, that max_ratio would perhaps be more direct :)

@remiadon
Contributor Author

(quoting @zheyaf's max_ratio proposal above)

Good suggestion, though I don't see exactly what you mean. Would the ratio be relative to the length of the current word, or to the length of the keyword?
If I set max_ratio to 0.3 and try to extract "I do" from "I doo what I love", in the first case it would not match, because the distance between "do" and "doo" is 1, and dividing by 3 (the length of "doo") gives 33%.
In the second case it would match, because a distance of 1 relative to a length of 5 (the length of "I doo") is 20%, which is below the max_ratio of 30%.
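
A small sketch of the two interpretations; max_ratio is only a proposal at this point, and the edit_distance helper below is a stand-in for whatever Levenshtein implementation would actually be used:

def edit_distance(a, b):
    # plain dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

max_ratio = 0.3
# ratio relative to the current word: 1 / 3 = 0.33 > 0.3, so no match
print(edit_distance('do', 'doo') / len('doo'))
# ratio relative to the matched span: 1 / 5 = 0.20 <= 0.3, so it matches
print(edit_distance('I do', 'I doo') / len('I doo'))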

@zheyaf

zheyaf commented Jan 26, 2020

Didn't think of that... Nice example! Based on the example you mentioned, it would make more sense if the max_ratio is relative to the length of the current word, which allows more matches to be returned. If the result is not ideal, we can always adjust the ratio to achieve higher accuracy.

@remiadon
Contributor Author

OK, I see. @zheyaf do you have a few examples to provide, so I can make sure this leads to smarter fuzzy matching?
As for the implementation, I should be able to get something working without much effort.

@zheyaf

zheyaf commented Jan 27, 2020

Hi! Is it OK if I send some examples to your email?

@remiadon
Contributor Author

@zheyaf my point was to potentially use some of them as unit tests; can you post them on GitHub instead, so we keep all the tracking of the PR in the same interface?

@zheyaf

zheyaf commented Jan 28, 2020

@remiadon Hi, sorry, I'll put examples here.

s1 = 'XXX Beheer BV Omschrijving: Managemen t F ee april 2018 IBAN: NL12KNAB0222222222 Valutadatum: 24-04-2018'
s2 = 'XXX VAN DORT Omschrijving: VOORSCHOT SALARIS IBAN: NL34RABO888888888 Valutadatum: 04-02-2019'
keywords = ['sal', 'VPB', 'BTW', 'management fee']

The ideal result is an exact match for 'sal', 'VPB', and 'BTW', and a fuzzy match for 'management fee' in s1. Thanks!

@remiadon
Contributor Author

remiadon commented Feb 7, 2020

@zheyaf please provide a minimal example (e.g. strip the useless parts of the strings), including the function calls and parameters needed for the match to occur.

@nkrot

nkrot commented Apr 30, 2020

was this PR merged?

@remiadon
Contributor Author

@nkrot unfortunately no.
I have had no answer from the maintainer, even though some people seem to be interested in this feature.

@vi3k6i5
Owner

vi3k6i5 commented Apr 30, 2020

Hi,

sorry for so much delay on this, was busy with other stuff going on in life right now. Will take out time this weekend and get this done before Monday IST.

Thanks for your patience.

@vi3k6i5 vi3k6i5 merged commit b316c7e into vi3k6i5:master May 3, 2020
@vi3k6i5
Owner

vi3k6i5 commented May 3, 2020

merged to master.

@vi3k6i5
Owner

vi3k6i5 commented May 3, 2020

Thanks everyone for all the help with this, and sorry for all the delay. I was inactive on GitHub for a long time because of some personal reasons and life 😄

@remiadon
Contributor Author

remiadon commented May 9, 2020

(quoting @spooknik's earlier comment with the length-based max_cost workaround)

@spooknik just out of curiosity, what was your exact example?

@bennyhawk

Hello @vi3k6i5, would it be possible to build and push this commit to PyPI too? This is a very useful feature, and I'm sure many developers would be more than excited to see it on PyPI, me included 😃

@lingvisa

@remiadon @vi3k6i5 Can you provide an example of how to enable the fuzzy matching feature in the API?

@remiadon
Contributor Author

@lingvisa you can have a look at the "files changed" section of this PR, especially test/test_extract_fuzzy.py.

Aside from this, I think we still have to make some effort to make fuzzy matching really usable.
The main thing to work on is allowing a max_ratio argument to be passed instead of the current max_cost, which for small keywords matches almost everything.
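
For reference, a minimal sketch based on spooknik's example earlier in this thread; the expected output reflects the intended behaviour, and test/test_extract_fuzzy.py remains the authoritative set of examples:

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')

# allow up to 2 edits per keyword ("colours" -> "colors" is one deletion)
keyword_processor.replace_keywords('No. of colours: 10', max_cost=2)
# expected: 'Número de colores: 10'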

@lingvisa

@remiadon Have you also tested the performance of fuzzy matching compared to exact matching? Presumably it will be several times slower?

@remiadon
Contributor Author

@lingvisa I believe so, yes
