Fix comparing strings with multibyte chars #3

WanzenBug · 2016-04-17T20:34:29Z

Calling `x.len()` on a `str` returns the number of bytes, which is not always the same as the number of characters. For example, the characters 'ö' and '香' are each represented using more than one byte. This leads to strange behaviour (e.g. `levenshtein("a", "ä")` would return 0). Replace calls to `str.len()` with `str.chars().count()`, which returns the correct number of characters in a string.
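To make the byte/character difference concrete, here is a minimal sketch. The helper names `byte_len`, `char_len` and `demo` are made up for illustration and are not part of the crate. Note that `chars()` counts Unicode scalar values, so a character written as a base letter plus a combining mark would still count as two.

```rust
/// Hypothetical helper: length of the UTF-8 encoding of `s`, in bytes.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Hypothetical helper: number of `char`s (Unicode scalar values) in `s`.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

/// Prints both lengths for a few sample strings.
fn demo() {
    // "\u{e4}" is 'ä' (precomposed), "\u{9999}" is '香'.
    for s in ["a", "\u{e4}", "\u{9999}"].iter() {
        println!("{:?}: {} byte(s), {} char(s)", s, byte_len(s), char_len(s));
    }
}
```

Any code that sizes a buffer or loop bound with the byte length but then walks the string with `chars()` is working with two different lengths for the same string, which is the mismatch this PR removes.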

dguo · 2016-04-18T13:02:42Z

Thanks for this! I just published it to Cargo as v0.4.1.

WanzenBug · 2016-04-18T13:49:14Z

Thanks for the quick response :)

The new 'normalised' form does char counting on top of the standard algorithm it calls, which counts the chars again. This change avoids the unnecessary recalculation by moving the main algorithm to a private function which takes `Option` wrapped counts, allowing the normalised form to pass in the values it has already calculated, so that the counting is done only once.
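A rough sketch of the shape described above, not the actual commit: the private function name `levenshtein_inner` and its exact signature are assumptions, and the distance loop is a textbook single-row implementation rather than the crate's code.

```rust
/// Hypothetical private core. Callers may pass char counts they already
/// know; `None` means "count it here". The counts are trusted as given.
fn levenshtein_inner(a: &str, b: &str, a_len: Option<usize>, b_len: Option<usize>) -> usize {
    let a_len = a_len.unwrap_or_else(|| a.chars().count());
    let b_len = b_len.unwrap_or_else(|| b.chars().count());
    if a_len == 0 {
        return b_len;
    }
    if b_len == 0 {
        return a_len;
    }

    // cache[j] holds the distance between the part of `a` processed so far
    // and the first j + 1 chars of `b`.
    let mut cache: Vec<usize> = (1..=b_len).collect();
    let mut result = 0;
    for (i, a_ch) in a.chars().enumerate() {
        result = i + 1; // distance from i + 1 chars of `a` to the empty string
        let mut diag = i; // previous row, previous column
        for (j, b_ch) in b.chars().enumerate() {
            let cost = if a_ch == b_ch { 0 } else { 1 };
            let substitution = diag + cost;
            diag = cache[j];
            result = (result + 1).min(substitution).min(diag + 1);
            cache[j] = result;
        }
    }
    result
}

/// Plain form: has no counts to hand, so the core counts for itself.
pub fn levenshtein(a: &str, b: &str) -> usize {
    levenshtein_inner(a, b, None, None)
}

/// Normalised form: needs the counts anyway, so it passes them down
/// instead of letting the core count a second time.
pub fn normalized_levenshtein(a: &str, b: &str) -> f64 {
    let a_len = a.chars().count();
    let b_len = b.chars().count();
    if a_len == 0 && b_len == 0 {
        return 1.0;
    }
    let dist = levenshtein_inner(a, b, Some(a_len), Some(b_len));
    1.0 - (dist as f64) / (a_len.max(b_len) as f64)
}
```

Because the core trusts the counts it is given, keeping it private limits the callers that could pass a wrong value.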

Taking the Jaro-Winkler optimisations further, this makes use of the prefix splitting helper within the inner Jaro algorithm. Instead of taking a char count of the common prefix already removed from the pair of strings, the function now optionally takes a pointer through which to return that count, and obtains the count itself by calling the helper internally. Using the helper inside the function avoids running a `.chars().count()` iteration over the prefix twice, once over `a` and once over `b`. It also lets the main part of the algorithm skip the common prefix portion of the strings entirely.
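The idea can be sketched roughly as follows. Both function names and signatures here are hypothetical and only illustrate the calling convention described; the real helper and the Jaro internals may look quite different.

```rust
/// Hypothetical helper: splits off the common prefix of `a` and `b`.
/// Returns `(prefix, rest_of_a, rest_of_b)`.
fn split_on_common_prefix<'a, 'b>(a: &'a str, b: &'b str) -> (&'a str, &'a str, &'b str) {
    let mut byte_idx = 0;
    for (x, y) in a.chars().zip(b.chars()) {
        if x != y {
            break;
        }
        // Equal chars have equal encodings, so this offset is a valid
        // char boundary in both strings.
        byte_idx += x.len_utf8();
    }
    let (prefix, a_rest) = a.split_at(byte_idx);
    (prefix, a_rest, &b[byte_idx..])
}

/// Hypothetical stand-in for the start of the inner algorithm: strips the
/// shared prefix and, if the caller asked for it, reports the prefix's
/// char count through `prefix_chars`.
fn strip_prefix_and_report<'a, 'b>(
    a: &'a str,
    b: &'b str,
    prefix_chars: Option<&mut usize>,
) -> (&'a str, &'b str) {
    let (prefix, a_rest, b_rest) = split_on_common_prefix(a, b);
    if let Some(out) = prefix_chars {
        // One count over the shared prefix, rather than one over `a`
        // and another over `b`.
        *out = prefix.chars().count();
    }
    // Only the differing tails need to go through the main algorithm.
    (a_rest, b_rest)
}
```

A caller that needs the prefix length (as a Jaro-Winkler style weighting would) passes `Some(&mut n)`, while a caller that does not can pass `None` and skip the count altogether.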

dguo added the bug label Apr 18, 2016

dguo merged commit 14cf91e into rapidfuzz:master Apr 18, 2016

dguo mentioned this pull request Aug 12, 2018

Add normalized Levenshtein and Damerau-Levenstein #20

Merged

jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018

osa optimisation rapidfuzz#3

ce21126

as per levenshtein optimisation rapidfuzz#2

jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018

d-l optimisation rapidfuzz#3

afc7524

avoid clone() on a usize, it's copy-able so let's just copy it

jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018

jaro optimisation rapidfuzz#3

960ff55

Simpler, and constructing with `vec!` here should be more efficient than a `push()` loop (see the implementation, which uses the private `extend_with()` function)
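For illustration, the two construction styles being compared look roughly like this. The function names are made up, and the claim that `vec!` is faster is the commit author's; no benchmark is included here.

```rust
/// Builds `n` flags with an explicit loop: one `push()` call per element.
fn flags_with_push(n: usize) -> Vec<bool> {
    let mut flags = Vec::with_capacity(n);
    for _ in 0..n {
        flags.push(false);
    }
    flags
}

/// Builds the same vector with the `vec!` macro, which fills it in one step.
fn flags_with_macro(n: usize) -> Vec<bool> {
    vec![false; n]
}
```

Both produce identical vectors, so the change is purely about brevity and construction cost.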

jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018

osa optimisation rapidfuzz#6

56a2286

use vec! for `curr_distances` construction, as with jaro optimisation rapidfuzz#3

jnqnfe added a commit to jnqnfe/strsim-rs that referenced this pull request Nov 4, 2018

d-l optimisation rapidfuzz#5

501a5a4

same as with levenshtein optimisation rapidfuzz#3 - avoid unnecessary repeated char counting with 'normalised' form.
