
Fuzzy matching #84

Merged: 17 commits merged into vi3k6i5:master from Quantmetry:fuzzy_matching on May 3, 2020

Conversation

remiadon
Contributor

@remiadon remiadon commented Jun 5, 2019

This PR tries to introduce fuzzy matching in flashtext, as mentioned in these issues:


Guidelines

  • rely as much as possible on the existing algorithm: only trigger fuzzy matching when we hit a mismatch, so we keep the focus on performance
  • when adding a new parameter, the function should keep exactly the same behaviour when that parameter is left at its default value, so we don't conflict with the existing tests
  • modify as little code as possible

Features included

  1. KeywordProcessor.extract_keywords and KeywordProcessor.replace_keywords both accept a new optional parameter, max_cost, the maximum Levenshtein distance allowed when fuzzy matching a single keyword (see the usage sketch after this list)
  2. KeywordProcessor implements a levensthein method, which tries to find a match for a given word within the provided max_cost and returns a node in the trie from which the search can continue
  3. a new function, get_next_word, has been added; it simply retrieves the next word in the sequence. Tests are included in test/test_kp_next_word.py
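
A minimal usage sketch of the new parameter (the keyword and sample sentence below are illustrative, not taken from the test suite):

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('machine learning')

# allow one edit across the keyword; "machin" is one deletion away from "machine"
keyword_processor.extract_keywords('I studied machin learning', max_cost=1)
# expected: ['machine learning']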

Optimizations to keep focus on performance

  • We set the current_dict to the first node yielded by the levenshtein function, so we get back to exact matching as soon as possible
  • We decrement the current cost (initialized to max_cost) every time we trigger fuzzy matching on a word, so if the whole budget from max_cost has already been "consumed" by other words in the current keyword, we do not trigger fuzzy matching again. E.g. when trying to extract the keyword "here you are" from "heere you are" with a max_cost of 1, the current cost gets fully consumed after the first word ("heere"), so no fuzzy matching will be performed on the other words ("you" and "are"); see the sketch after this list
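
A small sketch of the behaviour described above; the expected result reflects the intent of the PR rather than a case from the test suite:

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('here you are')

# the single allowed edit is spent on "heere"; "you" and "are" must then match exactly
keyword_processor.extract_keywords('heere you are', max_cost=1)
# expected: ['here you are']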

Limitations

  • As this is a pure Python implementation, don't expect it to be blazingly fast; even though we cut the recursion inside the Levenshtein inner function as early as possible, fuzzy matching still requires far more operations than exact matching

@coveralls

coveralls commented Jun 5, 2019

Coverage Status

Coverage increased (+0.1%) to 99.43% when pulling 9cc6d0b on Quantmetry:fuzzy_matching into 50c45f1 on vi3k6i5:master.

@remiadon remiadon changed the title Fuzzy matching [WIP] Fuzzy matching Jun 6, 2019
@remiadon remiadon changed the title [WIP] Fuzzy matching Fuzzy matching Jun 13, 2019
@spooknik

spooknik commented Dec 17, 2019

Hello! I tested your code and it works very well for me. One problem: the script throws an UnboundLocalError: local variable 'cost' referenced before assignment if the input text ends with a symbol like a period or colon, or if the text contains such a symbol somewhere.

File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 656, in replace_keywords
    ({}, 0, 0)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  [Previous line repeated 6 more times]
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 802, in _levenshtein_rec
    yield node, cost, depth
UnboundLocalError: local variable 'cost' referenced before assignment

@remiadon
Contributor Author

remiadon commented Dec 17, 2019

@spooknik I will try to update it ASAP.
For the moment I would say initialize cost at the beginning of the function (lines 789-791), like:

def _levenshtein_rec(self, char, node, word, rows, max_cost, depth=0):
    n_columns = len(word) + 1
    new_rows = [rows[0] + 1]
    cost = 0  # bind cost up front so it is always defined, even if the loop below never runs

This is a mistake on my side; it's usually bad practice to bind a variable inside a scope (a for loop in this case) and then access it outside that scope...
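
A minimal illustration of the pattern behind the error (the function and names here are made up for the example, not taken from flashtext):

def total_cost(words):
    for word in words:
        cost = len(word)  # cost is only ever bound inside the loop
    return cost           # fails if words is empty and the loop never runs

total_cost([])  # raises UnboundLocalError: local variable 'cost' referenced before assignment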

@spooknik

Thanks for the reply and the update.
I've added cost = 0 at line 792, but now there is another error:

  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 654, in replace_keywords
    current_dict_continued, cost, _ = next(
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 801, in _levenshtein_rec
    stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'

@remiadon
Contributor Author

@spooknik do you have a test case to provide?

@spooknik

spooknik commented Dec 17, 2019

Sure thing. Using your version of keyword.py:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=60)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 654, in replace_keywords
    current_dict_continued, cost, _ = next(
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 801, in _levenshtein_rec
    stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'

It works if:

keyword_processor.replace_keywords('No. of colours', max_cost=60)
'Número de colores'

@remiadon
Contributor Author

@spooknik first things first: in practice max_cost should not be that big; we usually allow around 1 to 3 edits on a string if we want to stay robust. Are you sure it makes sense for you to keep a max_cost of 60?
Anyway, I'll try to fix it this week.

@spooknik

I think I misunderstood the meaning of this variable; 1-3 seems like a more sane value. I just want to catch some of the different spelling variations in English (between UK and US, for example).

Thanks for taking a look 👍

@spooknik

spooknik commented Jan 2, 2020

Hello again, a small update to the bug report. Looking back into this, it seems the problem is related to the max_cost variable. For example:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=1)
Número de colores: 10

but if I change max_cost=2 (or higher) then it crashes.

@remiadon
Contributor Author

remiadon commented Jan 7, 2020

@spooknik I added a test case for this, it should be OK now ;)

@spooknik

spooknik commented Jan 7, 2020

Did some quick tests with my data and it works brilliantly now. Thank you so much 👍

@zheyaf

zheyaf commented Jan 25, 2020

Hi, a very good implementation of fuzzy matching! A small proposal: would it be better if max_cost were a ratio rather than a number of characters? A max_cost below 3 works fine for relatively long keywords (more than 8 characters), but when the keyword only contains 3 characters it can basically match anything. It would be great if max_cost could become a max_ratio!

@spooknik

spooknik commented Jan 26, 2020

I'll chime in with the solution I found.

I wrote a simple conditional that checks the length of the keyword and then uses the relevant max_cost setting:

if len(term) <= 3:
    # very short keywords: exact matching only
    keyword_processor.replace_keywords(term)
elif len(term) <= 8:
    keyword_processor.replace_keywords(term, max_cost=1)
else:
    keyword_processor.replace_keywords(term, max_cost=2)

I agree, however, that max_ratio would perhaps be more direct :)

@remiadon
Contributor Author

(quoting @zheyaf's max_ratio proposal above)

Good suggestion, though I don't see exactly what you mean. Would the ratio be relative to the length of the current word, or to the length of the keyword?
If I set max_ratio to 0.3 and try to extract "I do" from "I doo what I love", in the first case it would not match, because the distance between "do" and "doo" is 1, and dividing by 3 (the length of "doo") gives 33%.
In the second case it would match, because a distance of 1 relative to a length of 5 (the length of "I doo") is 20%, which is below the max_ratio of 30%.
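
A small sketch of the two interpretations; max_ratio is only a proposal at this point, and the edit_distance helper below is a stand-in for whatever Levenshtein implementation would actually be used:

def edit_distance(a, b):
    # plain dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

max_ratio = 0.3
# ratio relative to the current word: 1 / 3 = 0.33 > 0.3, so no match
print(edit_distance('do', 'doo') / len('doo'))
# ratio relative to the matched span: 1 / 5 = 0.20 <= 0.3, so it matches
print(edit_distance('I do', 'I doo') / len('I doo'))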

@zheyaf

zheyaf commented Jan 26, 2020

Didn't think of that... Nice example! Based on the example you mentioned, it would make more sense if the max_ratio is relative to the length of the current word, which allows more matches to be returned. If the result is not ideal, we can always adjust the ratio to achieve higher accuracy.

@remiadon
Contributor Author

OK, I see. @zheyaf do you have a few examples to provide, so I can make sure this leads to smarter fuzzy matching?
As for the implementation, I should be able to get something working without much effort.

@zheyaf

zheyaf commented Jan 27, 2020

Hi! Is it OK if I send some examples to your email?

@remiadon
Contributor Author

@zheyaf my point was to potentially use some of them as unit tests; can you post them on GitHub instead, so we keep all the tracking of the PR in the same interface?

@zheyaf

zheyaf commented Jan 28, 2020

@remiadon Hi, sorry, I'll put examples here.

s1 = 'XXX Beheer BV Omschrijving: Managemen t F ee april 2018 IBAN: NL12KNAB0222222222 Valutadatum: 24-04-2018'
s2 = 'XXX VAN DORT Omschrijving: VOORSCHOT SALARIS IBAN: NL34RABO888888888 Valutadatum: 04-02-2019'
keywords = ['sal', 'VPB', 'BTW', 'management fee']

The ideal result is an exact match for 'sal', 'VPB', and 'BTW', and a fuzzy match for 'management fee' in s1. Thanks!

@remiadon
Contributor Author

remiadon commented Feb 7, 2020

@zheyaf please provide a minimal example (e.g. strip the useless parts of the strings), including the function calls and parameters needed for the match to occur.

@nkrot

nkrot commented Apr 30, 2020

was this PR merged?

@remiadon
Contributor Author

@nkrot unfortunately no.
I have had no answer from the maintainer, even though some people seem to be interested in this feature.

@vi3k6i5
Owner

vi3k6i5 commented Apr 30, 2020

Hi,

sorry for so much delay on this, was busy with other stuff going on in life right now. Will take out time this weekend and get this done before Monday IST.

Thanks for your patience.

@vi3k6i5 vi3k6i5 merged commit b316c7e into vi3k6i5:master May 3, 2020
@vi3k6i5
Owner

vi3k6i5 commented May 3, 2020

merged to master.

@vi3k6i5
Owner

vi3k6i5 commented May 3, 2020

Thanks everyone for all the help with this, and sorry for all the delay. I was inactive on GitHub for a long time because of some personal reasons and life 😄

@remiadon
Contributor Author

remiadon commented May 9, 2020

(quoting @spooknik's earlier comment with the length-based max_cost workaround)

@spooknik just out of curiosity, what was your exact example?

@bennyhawk

Hello @vi3k6i5, would it be possible to build and push this commit to PyPI too? This is a very useful feature, and I'm sure many developers would be more than excited to see it on PyPI, me included 😃

@lingvisa

@remiadon @vi3k6i5 Can you provide an example of how to enable the fuzzy matching feature in the API?

@remiadon
Contributor Author

@lingvisa you can have a look at the "files changed" section of this PR, especially test/test_extract_fuzzy.py.

Aside from this, I think we still have to make some effort to make fuzzy matching really usable.
The main thing to work on is allowing a max_ratio argument to be passed instead of the current max_cost, which for small keywords matches almost everything.
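
For reference, a minimal sketch based on spooknik's example earlier in this thread; the expected output reflects the intended behaviour, and test/test_extract_fuzzy.py remains the authoritative set of examples:

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')

# allow up to 2 edits per keyword ("colours" -> "colors" is one deletion)
keyword_processor.replace_keywords('No. of colours: 10', max_cost=2)
# expected: 'Número de colores: 10'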

@lingvisa

@remiadon Have you also tested the performance of fuzzy matching compared to exact matching? Presumably it will be several times slower?

@remiadon
Contributor Author

@lingvisa I believe so, yes
