
Jaro's implementation doesn't match with reference implementations #7

Closed · Dynom opened this issue Jul 3, 2020 · 15 comments

Dynom commented Jul 3, 2020

Hi,

Thanks for your work on these implementations!

Recently I've been tinkering with several string-distance algorithms, but I'm getting inconsistent results from xrash/smetrics compared to other Jaro implementations.

I've compared the results with several other Jaro implementations, and on at least the following inputs the scores differ:

  • "DIXON" vs. "DICKSONX" (smetrics.Jaro scores lower with 0.683333, reference implementations all score: 0.766667)
  • "gmilcon" vs. "gmilno" (smetrics.Jaro scores lower with 0.849206, reference implementations all score: 0.896825)

Will you accept a PR? I'm not sure I'll find the time to figure out why the implementation is off, but I thought I'd reach out first and get a sign of life :-)


xrash commented Jul 3, 2020

Interesting! I'd gladly accept a PR.


xrash commented Jul 10, 2020

@Dynom I've checked it out and it seems my implementation is wrong. Maybe I got the wrong idea when I wrote it? This code is at least 7 years old (probably more) and I'm sure I didn't have any reference implementation back then; I wrote it while reading an English description of the algorithm somewhere. If you haven't already started working on it, please don't bother; I'll fix it soon. (I'm assuming the common implementations are correct.)


Dynom commented Jul 15, 2020

@xrash No, I haven't worked on it yet. For now I've just used a different implementation. But comparing those implementations against the reference material from the original publication isn't correct for some combinations either. However, those differences are acceptable for my situation, since having some score seems better to me than a 0 score.

I've pushed some tests that might save you some work: Dynom/TySug@50e356e
If you run the tests with the two flags (-with-jaro-reference, optionally combined with -fail-on-reference), they'll include very verbose output against the reference list and a list of homophones.
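For illustration, the invocation would look something like this (the flags are from the commit above; the finder package path is assumed from the test links later in this thread):

    go test -v ./finder -with-jaro-reference -fail-on-reference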


xrash commented Jul 18, 2020

@Dynom I've pushed a fixed version to the develop branch. You can point your go.mod at @develop and check it out.
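For example, using standard Go modules syntax (illustrative; not a command quoted from the thread):

    go get github.com/xrash/smetrics@develop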

I've also found another article by Winkler with a larger table of test values, apparently closer to our results. The article is in the docs/ dir on the develop branch, if you wanna check it out.

Later I'll opt in to Go modules (I think I'll start a v2; that seems to be the recommended approach).
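For reference, starting a v2 means adding the major-version suffix to the module path; a hypothetical go.mod would look like this (the go directive version is illustrative):

    module github.com/xrash/smetrics/v2

    go 1.14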


xrash commented Jul 24, 2020

The code is @ master now. If it's alright, I'll close this issue.


Dynom commented Jul 26, 2020

I'll take it for a spin on Monday!


Dynom commented Jul 27, 2020

I'm testing with f06e43cca1ab now and it seems to cover many more cases. The only mismatch I'm finding right now is:

    algorithm_test.go:513: 0.777778 != 0.750000 testing i 1 (a: "disburse", b: "disperse")
    algorithm_test.go:515: input: "disburse"/"disperse"
    algorithm_test.go:516: smetrics.Jaro        0.777778
    algorithm_test.go:517: RosettaJaroV0        0.750000
    algorithm_test.go:518: RosettaJaroV1        0.750000
    algorithm_test.go:519: JaroDistanceMasatana 0.750000

Test: https://github.com/Dynom/TySug/blob/master/finder/algorithm_test.go#L484
Test data: https://github.com/Dynom/TySug/blob/master/finder/algorithm_test.go#L44

However, I'm not sure which implementation is more correct.

Less to the point, but in case you're interested, these are the numbers comparing the implementations in terms of speed:

    BenchmarkJaroImplementations/"gmilcon"_"gmilno"/JaroDistanceMasatana-8    5099064    223 ns/op     136 B/op    4 allocs/op
    BenchmarkJaroImplementations/"gmilcon"_"gmilno"/RosettaJaro_V0-8         14578916    79.1 ns/op     16 B/op    2 allocs/op
    BenchmarkJaroImplementations/"gmilcon"_"gmilno"/RosettaJaro_V1-8         17326620    70.3 ns/op     16 B/op    1 allocs/op
    BenchmarkJaroImplementations/"gmilcon"_"gmilno"/smetrics.Jaro-8          10355952    113 ns/op      16 B/op    2 allocs/op

Benchmark: https://github.com/Dynom/TySug/blob/master/finder/algorithm_test.go#L642
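To reproduce numbers like these, the invocation would be along these lines (-benchmem is implied by the B/op and allocs/op columns; the package path is assumed from the links above):

    go test -run=NONE -bench=BenchmarkJaroImplementations -benchmem ./finder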


xrash commented Jul 28, 2020

I have a few points to make regarding your tests:

  1. The "ITMAN/SMITH" and "HARDIN/MARTINEZ" cases are expected to have Jaro == 0 in the article you've quoted, as follows:

[table from the quoted article, listing an expected Jaro of 0 for these pairs, omitted]

However, in another article I've found, Winkler, W. E. (1994), "Advanced Methods for Record Linkage", their expected result is different, as follows:

[table from Winkler (1994) with the differing expected values omitted]

That second table matches our results. I haven't found an explanation for that difference yet, so for now I'd just keep that in mind. The second table might be the correct one.

  2. In that same article, Winkler mentions something that might explain why the "JON/JAN" case is expected to have Jaro == 0: "The return value of zero is justified because if each of the strings has three or less characters, then they necessarily disagree on at least one." I'm not sure I understand what he means, and I'm not sure if I'll hardcode this behavior... also, I haven't seen anyone implement that.

  2. The "disburse/disperse" case differs because of a specific part of the algorithm, where the number of transpositions is obtained. It is obtained by dividing the number of unaligned matches by 2. Then the following question arises: should the result of this division be rounded or not? The rosetta code for Go doesn't round it; my code does. That is the difference. I've found some codes rounding it, and other not rounding it. This page http://www.alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html says it should be rounded down; Wikipedia doesn't mention rounding at all. I couldn't find an official source for it.

Please tell me if this helps clarify anything.

Regarding the performance numbers, I'm curious to check them out; I'll do it when I find the time.


Dynom commented Jul 28, 2020

Excellent reply, I enjoyed reading it!

I don't know enough about the algorithm to weigh in on the correctness of the rounding. For now I've changed my improved Rosetta version to include the rounding, which aligns the implementations to an equal score. I'll be using that for now, since speed is a significant factor for my application.

Results so far for the specific disburse/disperse pair:

    algorithm_test.go:513: 0.777778 != 0.750000 testing i 1 (a: "disburse", b: "disperse")
    algorithm_test.go:515: input: "disburse"/"disperse"
    algorithm_test.go:516: smetrics.Jaro        0.777778
    algorithm_test.go:517: RosettaJaroV0        0.750000
    algorithm_test.go:518: RosettaJaroV1        0.777778
    algorithm_test.go:519: JaroDistanceMasatana 0.750000

Thanks so much for your time @xrash, it's greatly appreciated!

adamdecaf commented

Is the same string supposed to match at 0.000? I'm seeing that occur with jaro("w", "w") == 0.00 after upgrading.


xrash commented Jul 30, 2020

@adamdecaf It seems to be returning 0 in the specific case where both strings have len == 1. The Rosetta code has the same behavior. I think I already know what's happening... I'll explain it thoroughly and send a fix in a few hours.


xrash commented Jul 30, 2020

@adamdecaf Sorry for the delay. Here is my take on it.

Both Wikipedia and Rosetta Code describe the matching range as this:

    matching range = floor(max(|s1|, |s2|) / 2) - 1

When both |s1| and |s2| are 1, the matching range is -1 and the algorithm doesn't work as expected. My old code had this edge case covered; that's why it didn't fail. Try the new version @ master; I've also added more test cases. Please tell me if it worked for you. Thank you.
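To see the failure mode concretely, a tiny self-contained check (illustrative; it just evaluates the formula above):

    package main

    import "fmt"

    func main() {
        s1, s2 := "w", "w"
        longest := len(s1)
        if len(s2) > longest {
            longest = len(s2)
        }
        // floor(1/2) - 1 == -1: an empty window, so no characters can
        // ever match and the similarity collapses to 0 unless the
        // window is clamped to at least 0.
        fmt.Println(longest/2 - 1) // -1
    }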

adamdecaf commented

Awesome and thanks @xrash!

I updated from master and removed my patch. Watching CI over at moov-io/watchman#282


xrash commented Aug 6, 2020

Hey guys, if it's alright I'll close this issue.

adamdecaf commented

Sounds good! Thanks for the fix.

Dynom closed this as completed Aug 11, 2020