KeywordProcessor returns wrong span for text containing non-ASCII characters when case_sensitive=False #119

Open
MauroLuzzatto opened this issue Nov 15, 2020 · 1 comment · Fixed by openfoodfacts/robotoff#1108

MauroLuzzatto commented Nov 15, 2020

Hi all, first thanks a lot for the great library you created, I really appreciate it!

When working with non-ASCII characters, I found a case where the span returned by the KeywordProcessor is wrong when case_sensitive=False.

Please find a sample below that reproduces the error:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keyword('Bay Area')

text = 'İ I love big Apple and Bay Area.'  # added the "İ" non-ascii character 

keywords_found = keyword_processor.extract_keywords(text, span_info=True)

for match in keywords_found:
    print(match)
    print(text[match[1]:match[2]])

Output:

('Bay Area', 24, 32)
ay Area. # the span is shifted by one

Looking into the error, I found that the length of “İ” changes from 1 (uppercase) to 2 (lowercase), which I believe causes the span shift, since the span is only wrong when case_sensitive=False.

len("İ")
Out[39]: 1

len("İ".lower())
Out[40]: 2
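
To make the length change concrete, here is a small sketch using only the standard library (no flashtext involved). It prints the code points that `"İ".lower()` produces and compares where the keyword starts in the original text versus the lowercased text. The variable names are just for illustration.

```python
import unicodedata

text = "İ I love big Apple and Bay Area."
lowered = text.lower()

# "İ" (U+0130) lowercases to two code points: "i" plus a combining dot above.
lowered_names = [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in "İ".lower()]
for name in lowered_names:
    print(name)
# U+0069 LATIN SMALL LETTER I
# U+0307 COMBINING DOT ABOVE

# The keyword starts one position later in the lowercased text,
# so offsets computed on `lowered` do not line up with `text`.
start_original = text.find("Bay Area")
start_lowered = lowered.find("bay area")
print(start_original, start_lowered)  # 23 24
```

Any span computed against the lowercased string and then applied to the original string would be off by one from that point onwards, which matches the output above.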

Could any of the authors comment on the issue and say whether they intend to do something about it, or whether it is out of scope?

Thanks a lot!

@NLPShenanigans

Hey Mauro, it doesn't look like the repo is being actively maintained these days. As a pet project, I have been going through the codebase to give it a revamp. Given that this issue is not especially common, for non-ASCII characters or otherwise, what I've done to address it amounts to the following:

  1. inserting some thoughtfully placed if statements to catch cases where the length of the text changes on lowercasing, and raising a ValueError in such cases.
  2. ensuring appropriate text normalisation before the text is passed to functions that rely on lowercasing.

In such cases the onus is usually on the user to make sure the text is normalised. This is fundamentally a text cleanliness issue rather than an issue with how the spans are calculated, which so far looks to be behaving as it should. If the length of the string changes partway through processing, I would consider raising an error sensible, since it stops an incorrect span from being returned.
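
For illustration only, a guard along the lines of point 1 might look like the sketch below. This is not the actual code from the revamp and not part of flashtext: the function name `lower_or_raise` is hypothetical, and the check assumes that a character is problematic whenever its lowercase form is not exactly one code point long.

```python
def lower_or_raise(text: str) -> str:
    """Lowercase `text`, refusing input whose length would change.

    Hypothetical helper, not part of flashtext.
    """
    # Collect every character whose lowercase form is not a single code point.
    offenders = [(i, ch) for i, ch in enumerate(text) if len(ch.lower()) != 1]
    if offenders:
        raise ValueError(
            "lowercasing changes the length of the text at "
            f"(index, char): {offenders}"
        )
    return text.lower()


print(lower_or_raise("I love big Apple and Bay Area."))

try:
    lower_or_raise("İ I love big Apple and Bay Area.")
except ValueError as err:
    print(err)
```

Reporting the offending positions lets the caller decide how to clean the text (for example by replacing those characters) before extracting keywords, rather than silently receiving shifted spans.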
