
Feature to robustly handle token orderings #272

Open
shbunder opened this issue Apr 2, 2020 · 0 comments
shbunder commented Apr 2, 2020

Hi,

I use fuzzywuzzy to match full names extracted from documents against names in a database. Discarding token order is important for this matching goal. Typically I use fuzz.token_sort_ratio to obtain:

fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy")
> 100

As the name suggests, this function sorts the individual tokens; however, in multiple instances this gave undesirable results, e.g.

fuzz.token_sort_ratio("willy` wonka", "willy zonka")
> 91
fuzz.token_sort_ratio("willy` wonka", "willy vonka")
> 45

To cope with this I would propose a robust token_sim_ratio function that reorders the second list of tokens according to their similarity to the tokens in the first list. I have currently implemented a lightweight solution based on n-gram matching that is robust to mistakes in the first letter of tokens.

My question: is there general appetite for such functionality, and if so, should I proceed with making a PR for this feature?
