Skip to content
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Implemented sort order matches by common letter count largest to smallest #295

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ZihangH
Copy link

@ZihangH ZihangH commented Dec 17, 2020

This pull request addresses the problem in this issue: #280

All the code changes + unit tests are in process.py and test_fuzzywuzzy.py.

All the old and new test cases in test_fuzzywuzzy.py are passed.

sorted(sl, key=lambda i: i[1], reverse=True)
else:
best_list = sorted(sl, key=lambda i: i[1], reverse=True)
return sortByCommonLetter(sl, query)[0: min(limit, len(sl))] if limit is not None else \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from a performance standpoint this is a bad idea. When the user is only interested in 5 elements (the default), but has e.g. 1 million choices, this will sort 1 million choices (with a slow algorithm, since it counts all the letters) and then only takes the best 5 elements.

Comment on lines +242 to +264
def sortByCommonLetter(sl, query):
"""This function further sorts the strings with the same scores by common letter count to the query."""
current_score, last_index = -1, -1
# Iterate over list and look for words with the same scores
for i in range(0, len(sl)):
# Identify the indexes of the strings with the same scores
if sl[i][1] != current_score or i == len(sl) - 1:
current_score = sl[i][1]
# First iteration, there are no previous words so we do not have to do anything
if last_index == -1:
last_index = i
continue
# Found a group of words with the same scores! Now sort them
if i - last_index > 1:
count_list = []
for j in range(last_index, i):
count_list.append((sl[j][0], calculateCommonLetter(query, sl[j][0])))
count_list = sorted(count_list, key=lambda k: k[1], reverse=True)
# Copy the sorted portion
for j in range(0, len(count_list)):
sl[last_index + j] = (count_list[j][0], current_score)
last_index = i
return sl

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This algorithm is broken for quite a few cases. A couple quick examples:

  • When a processor is used the results might not be correct:
query_1 = 'COMPANY'
choices_1 = ['Company 1', 'company', 'company', 'Company 2', 'Awesome Company']
process.extractOne(query_1, choices_1, scorer=fuzz.partial_token_set_ratio, char_sort=True)
-> ('Company 1', 100)
  • it matters wether a element is query or choice
query_1 = 'Company'
choices_1 = ['Company 1', 'Company', 'Company 2', 'Awesome Company']
process.extractOne(query_1, choices_1, scorer=fuzz.partial_token_set_ratio, char_sort=True)
-> ('Company 1', 100)

@maxbachmann
Copy link

Overall I am personally not really convinced, this should be added at all for two reasons

  1. it adds more arguments, which makes it increasingly hard to use the function in the correct way
  2. In my opinion this does not belong into process.*. When using token_set_ratio these results are indeed all as similar. So taking the first is a well defined behaviour. When the user wants to prefer matches that have e.g. many characters in common he should use a different scorer, that combines the result of multiple string metrics. A good example for this is fuzz.WRatio, that is implemented as a separate scorer, that combines multiple metrics.

Added penalty for character mismatches in sortByCommonLetter
Added call to full_process in sortByCommonLetter
Fixed incorrect variable names (weird!)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants