Implemented sort order matches by common letter count largest to smallest #295

ZihangH · 2020-12-17T02:28:12Z

This pull request addresses the problem in this issue: #280

All the code changes + unit tests are in process.py and test_fuzzywuzzy.py.

All the old and new test cases in test_fuzzywuzzy.py are passed.

All the test cases are passed

maxbachmann · 2020-12-17T12:33:29Z

fuzzywuzzy/process.py

+            sorted(sl, key=lambda i: i[1], reverse=True)
+    else:
+        best_list = sorted(sl, key=lambda i: i[1], reverse=True)
+        return sortByCommonLetter(sl, query)[0: min(limit, len(sl))] if limit is not None else \


from a performance standpoint this is a bad idea. When the user is only interested in 5 elements (the default), but has e.g. 1 million choices, this will sort 1 million choices (with a slow algorithm, since it counts all the letters) and then only takes the best 5 elements.

maxbachmann · 2020-12-17T12:42:07Z

fuzzywuzzy/process.py

+def sortByCommonLetter(sl, query):
+    """This function further sorts the strings with the same scores by common letter count to the query."""
+    current_score, last_index = -1, -1
+    # Iterate over list and look for words with the same scores
+    for i in range(0, len(sl)):
+        # Identify the indexes of the strings with the same scores
+        if sl[i][1] != current_score or i == len(sl) - 1:
+            current_score = sl[i][1]
+            # First iteration, there are no previous words so we do not have to do anything
+            if last_index == -1:
+                last_index = i
+                continue
+            # Found a group of words with the same scores! Now sort them
+            if i - last_index > 1:
+                count_list = []
+                for j in range(last_index, i):
+                    count_list.append((sl[j][0], calculateCommonLetter(query, sl[j][0])))
+                count_list = sorted(count_list, key=lambda k: k[1], reverse=True)
+                # Copy the sorted portion
+                for j in range(0, len(count_list)):
+                    sl[last_index + j] = (count_list[j][0], current_score)
+            last_index = i
+    return sl


This algorithm is broken for quite a few cases. A couple quick examples:

When a processor is used the results might not be correct:

query_1 = 'COMPANY' choices_1 = ['Company 1', 'company', 'company', 'Company 2', 'Awesome Company'] process.extractOne(query_1, choices_1, scorer=fuzz.partial_token_set_ratio, char_sort=True) -> ('Company 1', 100)

it matters wether a element is query or choice

query_1 = 'Company' choices_1 = ['Company 1', 'Company', 'Company 2', 'Awesome Company'] process.extractOne(query_1, choices_1, scorer=fuzz.partial_token_set_ratio, char_sort=True) -> ('Company 1', 100)

maxbachmann · 2020-12-17T12:50:49Z

Overall I am personally not really convinced, this should be added at all for two reasons

it adds more arguments, which makes it increasingly hard to use the function in the correct way
In my opinion this does not belong into process.*. When using token_set_ratio these results are indeed all as similar. So taking the first is a well defined behaviour. When the user wants to prefer matches that have e.g. many characters in common he should use a different scorer, that combines the result of multiple string metrics. A good example for this is fuzz.WRatio, that is implemented as a separate scorer, that combines multiple metrics.

Added penalty for character mismatches in sortByCommonLetter Added call to full_process in sortByCommonLetter Fixed incorrect variable names (weird!)

Not-Forrest and others added 3 commits December 12, 2020 00:55

Implemented character count ordering for strings with the same score

134074b

Fixed coding style errors

ccb4773

Added test cases for sort_by_common_letter_count feature.

1406166

All the test cases are passed

maxbachmann reviewed Dec 17, 2020

View reviewed changes

Fixed issue with sortByCommonLetter, fixed incorrect variable names.

38e3419

Added penalty for character mismatches in sortByCommonLetter Added call to full_process in sortByCommonLetter Fixed incorrect variable names (weird!)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implemented sort order matches by common letter count largest to smallest #295

Implemented sort order matches by common letter count largest to smallest #295

ZihangH commented Dec 17, 2020

maxbachmann Dec 17, 2020

maxbachmann Dec 17, 2020

maxbachmann commented Dec 17, 2020

Implemented sort order matches by common letter count largest to smallest #295

Are you sure you want to change the base?

Implemented sort order matches by common letter count largest to smallest #295

Conversation

ZihangH commented Dec 17, 2020

maxbachmann Dec 17, 2020

Choose a reason for hiding this comment

maxbachmann Dec 17, 2020

Choose a reason for hiding this comment

maxbachmann commented Dec 17, 2020