
Feature to robustly handle token orderings #272

Open
shbunder opened this issue Apr 2, 2020 · 0 comments
shbunder commented Apr 2, 2020

Hi,

I use fuzzywuzzy to match full names extracted from documents against names in a database. Discarding token order is important for this matching goal. Typically I use fuzz.token_sort_ratio to obtain:

fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy")
> 100

As the name suggests, this function sorts the individual tokens; however, in multiple instances this gave undesirable results, e.g.

fuzz.token_sort_ratio("willy` wonka", "willy zonka")
> 91
fuzz.token_sort_ratio("willy` wonka", "willy vonka")
> 45

To cope with this I would propose a robust token_sim_ratio function that reorders the second list of tokens according to their similarity to the tokens in the first list. I have currently implemented a lightweight solution based on n-gram matching that is robust to mistakes in the first letter of tokens.

My question: is there general appetite for such functionality, and if so, should I proceed with making a PR for this feature?
