unintuitively high score for single letter choices #12

Closed
Midnighter opened this Issue Aug 21, 2012 · 3 comments


I have not looked into the fuzzy matching algorithms yet, but I came across the following example and the result struck me as odd from a human point of view. Any suggestions, workarounds, or explanations?

In [57]: process.extract("pheL", ["h", "phe-L", "trnaphe", "p", "e", "L"], limit=None)

Out[57]: [('h', 90), ('p', 90), ('e', 90), ('L', 90), ('phe-L', 89), ('trnaphe', 76)]

acslater00 (Owner) commented Aug 21, 2012

I agree these results don't pass the sniff test. The reason is that process.extract uses the WRatio scorer by default, which gives poor results for single-letter strings.

For this application, you'd probably want to use the more basic fuzz.ratio() method as the scorer, which will not allow substrings to get high match scores.

In [6]: process.extract("pheL", ["h", "phe-L", "trnaphe", "p", "e", "L"], scorer=fuzz.ratio, limit=None)

Out[6]: [('phe-L', 88), ('trnaphe', 54), ('h', 40), ('p', 40), ('e', 40), ('L', 40)]

This looks more reasonable to me.
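
To make the difference concrete, here is a minimal standard-library sketch of the two scoring styles. It is not fuzzywuzzy's implementation: `simple_ratio`, `best_window_ratio`, and `scaled_partial` are hypothetical helper names, the 0.9 scale factor is an assumption about how a weighted scorer might discount partial matches, and the strings are lowercased by hand on the assumption that the default preprocessing does the same. It only shows why a single letter contained in the query can score very high under substring-style matching yet low under a whole-string ratio.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (stand-in for a basic ratio scorer)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_window_ratio(query: str, choice: str) -> int:
    """Score the shorter string against every equal-length window of the longer one."""
    shorter, longer = sorted((query, choice), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    windows = (longer[i:i + n] for i in range(len(longer) - n + 1))
    return max(simple_ratio(shorter, w) for w in windows)


def scaled_partial(query: str, choice: str, scale: float = 0.9) -> int:
    """Substring-style score, discounted by an assumed scale factor."""
    return int(round(scale * best_window_ratio(query, choice)))


if __name__ == "__main__":
    query = "phel"  # "pheL", lowercased by hand
    for choice in ["h", "p", "e", "l"]:
        # Each single letter occurs in the query, so its best window is an exact match.
        print(choice, simple_ratio(query, choice), scaled_partial(query, choice))
    # Every line prints 40 for the whole-string ratio and 90 for the scaled partial score.
```

Under these assumptions, a one-character choice matches a one-character window of the query perfectly, so only the discount keeps it below 100, while the whole-string ratio penalizes the three unmatched characters.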

That looks much better, and thank you for the prompt reply. With the simple ratio I get much better results overall for my strings, not just for this example.

@Midnighter Midnighter closed this Aug 21, 2012

@acslater00 thanks, this solved my problem too.
