Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix comparing strings with multibyte chars #3

Merged
merged 1 commit into from
Apr 18, 2016

Conversation

WanzenBug
Copy link
Contributor

Using x.len() on a str returns the number of bytes, which is not
always the same as the number of characters. For example, the characters
'ö' or '香' are representing using more than one byte. This leads
to strange behaviour (e.g. levenshtein("a", "ä") would return 0).
Replace calls to str.len() with str.chars().count(), which returns
the correct number of characters in a string.

Using `x.len()` on a str returns the number of bytes, which is not
always the same as the number of characters. For example, the characters
'ö' or '香' are representing using more than one byte. This leads
to strange behaviour (e.g. `levenshtein("a", "ä")` would return 0).
Replace calls to `str.len()` with `str.chars().count()`, which returns
the correct number of characters in a string.
@dguo dguo added the bug label Apr 18, 2016
@dguo dguo merged commit 14cf91e into rapidfuzz:master Apr 18, 2016
@dguo
Copy link
Member

dguo commented Apr 18, 2016

Thanks for this! I just published it to Cargo as v0.4.1.

@WanzenBug
Copy link
Contributor Author

Thanks for the quick response :)

jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018
The new 'normalised' form does char counting on top of the standard
algorithm it calls. This change avoids unnecessary recalculation by
moving the main algorithm to a private function which takes `Option`
wrapped counts, allowing the normalised form to pass in the values it
calculates, thus letting this be done only once.
jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018
as per levenshtein optimisation rapidfuzz#2
jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018
avoid clone() on a usize, it's copy-able so let's just copy it
jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018
more simple, and construction with vec! here should be more optimal over
a push() loop (see the implementation which uses the private extend_with()
function)
jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018
use vec! for `curr_distances` construction, as with jaro optimisation rapidfuzz#3
jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018
same as with levenstein optimisation rapidfuzz#3 - avoid unnecessary repeated char
counting with 'normalised' form.
jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 8, 2018
Taking the j-w optimisations further, this makes use of the prefix
splitting helper within the inner Jaro algorithm.

The function has been modified such that instead of taking a char-count of
the size of the common prefix removed from the pair of strings, it now
optionally takes a pointer to return the count, obtaining it within the
function through use of the helper internally.

Using the prefix splitting helper within the function means that we avoid
doing a `.chars().count()` iteration over the prefix twice, once going
over `a` and once going over `b`. It also then allows the main part of the
algorithm to completely avoid processing the common prefix portion of the
strings.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants