
Jaro's implementation doesn't match with reference implementations #7

Closed · Dynom opened this issue Jul 3, 2020 · 15 comments

Dynom commented Jul 3, 2020

Hi,

Thanks for your work on these implementations!

Recently I've been tinkering with several string-distance algorithms, but I'm getting inconsistent results from xrash/smetrics compared to other Jaro implementations.

I've compared the results with several other Jaro implementations, and on at least the following inputs the scores differ:

  • "DIXON" vs. "DICKSONX" (smetrics.Jaro scores lower with 0.683333, reference implementations all score: 0.766667)
  • "gmilcon" vs. "gmilno" (smetrics.Jaro scores lower with 0.849206, reference implementations all score: 0.896825)

Will you accept a PR? I'm not sure I'll find the time to figure out why the implementation is off, but I thought I'd reach out first and get a sign of life :-)


xrash commented Jul 3, 2020

Interesting! I'd gladly accept a PR.


xrash commented Jul 10, 2020

@Dynom I've checked it out and it seems my implementation is wrong. Maybe I got the wrong idea when I wrote it? This code is at least 7 years old (probably more) and I'm sure I didn't have any reference implementation back then; I wrote it while reading an English description of the algorithm somewhere. If you haven't already started working on it, please don't bother; I'll fix it soon. (I'm assuming the common implementations are correct.)


Dynom commented Jul 15, 2020

@xrash No, I haven't worked on it yet. For now I've just used a different implementation. But comparing those implementations against the reference material from the original publication isn't correct for some combinations either. However, those differences are acceptable for my situation, since having some score seems better to me than a 0 score.

I've pushed some tests that might save you some work: Dynom/TySug@50e356e
If you run the tests with the two flags (-with-jaro-reference, optionally combined with -fail-on-reference), they'll include very verbose output against the reference list and a list of homophones.
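For illustration, the invocation would look something like this (the flags are from the commit above; the finder package path is assumed from the test links later in this thread):

    go test -v ./finder -with-jaro-reference -fail-on-reference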


xrash commented Jul 18, 2020

@Dynom I've pushed a fixed version to the develop branch. You can point your go.mod at @develop and check it out.
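For example, using standard Go modules syntax (illustrative; not a command quoted from the thread):

    go get github.com/xrash/smetrics@develop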

I've also found another article by Winkler with a larger table of test values, apparently closer to our results. The article is in the docs/ dir on the develop branch, if you wanna check it out.

Later I'll opt in to Go modules (I think I'll start a v2; that seems to be the recommended approach).
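For reference, starting a v2 means adding the major-version suffix to the module path; a hypothetical go.mod would look like this (the go directive version is illustrative):

    module github.com/xrash/smetrics/v2

    go 1.14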


xrash commented Jul 24, 2020

The code is @ master now. If it's alright, I'll close this issue.


Dynom commented Jul 26, 2020

I'll take it for a spin on Monday!


Dynom commented Jul 27, 2020

I'm testing with f06e43cca1ab now and it seems to cover many more cases. The only mismatch I'm finding right now is:

    algorithm_test.go:513: 0.777778 != 0.750000 testing i 1 (a: "disburse", b: "disperse")
    algorithm_test.go:515: input: "disburse"/"disperse"
    algorithm_test.go:516: smetrics.Jaro        0.777778
    algorithm_test.go:517: RosettaJaroV0        0.750000
    algorithm_test.go:518: RosettaJaroV1        0.750000
    algorithm_test.go:519: JaroDistanceMasatana 0.750000

Test: https://github.com/Dynom/TySug/blob/master/finder/algorithm_test.go#L484
Test data: https://github.com/Dynom/TySug/blob/master/finder/algorithm_test.go#L44

However, I'm not sure which implementation is more correct.

Less to the point, but in case you're interested, these are the numbers comparing the implementations in terms of speed:

    BenchmarkJaroImplementations/"gmilcon"_"gmilno"/JaroDistanceMasatana-8    5099064    223 ns/op     136 B/op    4 allocs/op
    BenchmarkJaroImplementations/"gmilcon"_"gmilno"/RosettaJaro_V0-8         14578916    79.1 ns/op     16 B/op    2 allocs/op
    BenchmarkJaroImplementations/"gmilcon"_"gmilno"/RosettaJaro_V1-8         17326620    70.3 ns/op     16 B/op    1 allocs/op
    BenchmarkJaroImplementations/"gmilcon"_"gmilno"/smetrics.Jaro-8          10355952    113 ns/op      16 B/op    2 allocs/op

Benchmark: https://github.com/Dynom/TySug/blob/master/finder/algorithm_test.go#L642
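To reproduce numbers like these, the invocation would be along these lines (-benchmem is implied by the B/op and allocs/op columns; the package path is assumed from the links above):

    go test -run=NONE -bench=BenchmarkJaroImplementations -benchmem ./finder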


xrash commented Jul 28, 2020

I have a few points to make regarding your tests:

  1. The "ITMAN/SMITH" and "HARDIN/MARTINEZ" cases are expected to have Jaro == 0 in the article you've quoted, as follows:

[table from the quoted article, listing an expected Jaro of 0 for these pairs, omitted]

However, in another article I've found, Winkler, W. E. (1994), "Advanced Methods for Record Linkage", their expected result is different, as follows:

[table from Winkler (1994) with the differing expected values omitted]

That second table matches our results. I haven't found an explanation for that difference yet, so for now I'd just keep that in mind. The second table might be the correct one.

  2. In that same article, Winkler mentions something that might explain why the "JON/JAN" case is expected to have Jaro == 0: "The return value of zero is justified because if each of the strings has three or less characters, then they necessarily disagree on at least one." I'm not sure I understand what he means, and I'm not sure if I'll hardcode this behavior... also, I haven't seen anyone implement that.

  2. The "disburse/disperse" case differs because of a specific part of the algorithm, where the number of transpositions is obtained. It is obtained by dividing the number of unaligned matches by 2. Then the following question arises: should the result of this division be rounded or not? The rosetta code for Go doesn't round it; my code does. That is the difference. I've found some codes rounding it, and other not rounding it. This page http://www.alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html says it should be rounded down; Wikipedia doesn't mention rounding at all. I couldn't find an official source for it.

Please tell me if this helps clarify anything.

Regarding the performance numbers, I'm curious to check them out; I'll do it when I find the time.


Dynom commented Jul 28, 2020

Excellent reply, I enjoyed reading it!

I don't know enough about the algorithm to weigh in on the correctness of the rounding. For now I've changed my improved Rosetta version to include the rounding, which aligns the implementations to an equal score. I'll be using that for now, since speed is a significant factor for my application.

Results so far for the specific disburse/disperse pair:

    algorithm_test.go:513: 0.777778 != 0.750000 testing i 1 (a: "disburse", b: "disperse")
    algorithm_test.go:515: input: "disburse"/"disperse"
    algorithm_test.go:516: smetrics.Jaro        0.777778
    algorithm_test.go:517: RosettaJaroV0        0.750000
    algorithm_test.go:518: RosettaJaroV1        0.777778
    algorithm_test.go:519: JaroDistanceMasatana 0.750000

Thanks so much for your time @xrash, it's greatly appreciated!

adamdecaf commented

Is the same string supposed to match at 0.000? I'm seeing that occur with jaro("w", "w") == 0.00 after upgrading.


xrash commented Jul 30, 2020

@adamdecaf It seems to be returning 0 in the specific case where both strings have len == 1. The Rosetta code has the same behavior. I think I already know what's happening... I'll explain it thoroughly and send a fix in a few hours.


xrash commented Jul 30, 2020

@adamdecaf Sorry for the delay. Here is my take on it.

Both Wikipedia and Rosetta Code describe the matching range as this:

    matching range = floor(max(|s1|, |s2|) / 2) - 1

When both |s1| and |s2| are 1, the matching range is -1 and the algorithm doesn't work as expected. My old code had this edge case covered; that's why it didn't fail. Try the new version @ master; I've also added more test cases. Please tell me if it worked for you. Thank you.
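To see the failure mode concretely, a tiny self-contained check (illustrative; it just evaluates the formula above):

    package main

    import "fmt"

    func main() {
        s1, s2 := "w", "w"
        longest := len(s1)
        if len(s2) > longest {
            longest = len(s2)
        }
        // floor(1/2) - 1 == -1: an empty window, so no characters can
        // ever match and the similarity collapses to 0 unless the
        // window is clamped to at least 0.
        fmt.Println(longest/2 - 1) // -1
    }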

adamdecaf commented

Awesome and thanks @xrash!

I updated from master and removed my patch. Watching CI over at moov-io/watchman#282


xrash commented Aug 6, 2020

Hey guys, if it's alright I'll close this issue.

adamdecaf commented

Sounds good! Thanks for the fix.

Dynom closed this as completed Aug 11, 2020