
_set_token_ratio now keeps tokenization. #300

Open

MWLever wants to merge 2 commits into master
Conversation

@MWLever MWLever commented Feb 21, 2021

Previously, _set_token_ratio, through a mixture of join, split, and strip, concatenated all tokens together with no whitespace. This allowed partial matches across token boundaries. Such matches can occur in practice when a human enters a search query, but they are rare.
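
To illustrate the boundary issue (a minimal sketch with made-up strings, using only fuzzywuzzy's public partial_ratio), fusing the tokens lets a query match across what used to be a word boundary:

```python
# Hypothetical illustration of the cross-boundary issue; the example strings
# are invented, and fuzz.partial_ratio is fuzzywuzzy's public partial matcher.
from fuzzywuzzy import fuzz

tokens = ["data", "base"]

fused = "".join(tokens)    # "database"  -- the word boundary is lost
spaced = " ".join(tokens)  # "data base" -- the word boundary is kept

# "tabas" is a literal substring of "database", so it matches across the
# old "data|base" boundary; against "data base" the space breaks the run.
print(fuzz.partial_ratio("tabas", fused))   # 100 (exact substring match)
print(fuzz.partial_ratio("tabas", spaced))  # noticeably lower
```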

Change: Implement Levenshtein's setratio() scoring and preserve tokenization in fuzz._set_token_ratio

Two tests now fail due to score changes, which is expected:

testTokenSetRatio: score improves
testWithCutOff: score improves to above 50

Previous issue: partial_token_set_ratio matched strings across token boundaries.

Fix: Preserve tokenization of the comparison sets and use Levenshtein's setratio/seqratio over ratio.
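
For reference, a rough comparison of the two scoring styles (the token lists below are invented for illustration; Levenshtein.setratio and Levenshtein.ratio are the python-Levenshtein functions referenced above):

```python
# Rough comparison: setratio on token lists versus ratio on fused strings
# (the token lists here are made up for illustration).
import Levenshtein

a = ["fuzzy", "wuzzy", "was", "a", "bear"]
b = ["wuzzy", "fuzzy", "was", "a", "bear"]   # same tokens, different order

# setratio treats its arguments as sets of strings, so word order and word
# boundaries do not matter: identical bags of tokens score 1.0.
print(Levenshtein.setratio(a, b))                 # 1.0

# ratio on the fused strings penalizes the reordering, because the tokens
# have been collapsed into one undifferentiated character sequence.
print(Levenshtein.ratio("".join(a), "".join(b)))  # < 1.0
```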

Detail:
Previously, token_set_ratio stripped whitespace during processing. Since the whitespace between tokens was removed, the set comparisons were no longer tokenized, so partial_token_set_ratio could match strings across word boundaries. This is generally unexpected behavior. This change moves the comparison toward a bag-of-words model, as sketched below.
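
As a rough sketch of the intended bag-of-words behavior (the function name and processing steps below are hypothetical, not the actual patch):

```python
# Hypothetical sketch of a bag-of-words token-set score: keep the comparison
# sets tokenized and score them with Levenshtein.setratio instead of joining
# them into one string. This is not the exact code from the patch.
import Levenshtein


def token_set_score(s1, s2):
    # Tokenize on whitespace and deduplicate; boundaries are never collapsed.
    tokens1 = sorted(set(s1.lower().split()))
    tokens2 = sorted(set(s2.lower().split()))
    if not tokens1 or not tokens2:
        return 0
    # setratio compares the two token lists as sets of whole words.
    return int(round(100 * Levenshtein.setratio(tokens1, tokens2)))


print(token_set_score("new york mets", "the new york mets"))  # high: large whole-token overlap
print(token_set_score("new york", "newyork"))                 # lower: the fused string is a different token
```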