Jaro's implementation doesn't match with reference implementations #7
Comments
Interesting! I'd gladly accept a PR. |
@Dynom I've checked it out and it seems my implementation is wrong. Maybe I got the wrong idea when I wrote it? This code is at least 7 years old (probably more) and I'm sure I didn't have any reference implementation back then. I wrote it while reading the algorithm description in English somewhere. If you didn't work on it already, please don't bother doing it, I'll fix it soon. (I'm assuming the common implementations are correct.) |
@xrash I haven't worked on it yet, no. For now I've just used a different implementation. Those implementations don't match the reference material from the original publication for some combinations either. However, those differences are acceptable for my situation, since getting a score seems better to me than getting a 0 score. I've pushed some tests that might save you some work: Dynom/TySug@50e356e |
@Dynom I've pushed a fixed version to the I've also found another article by Winkler with a larger table of tests, and apparently closer to our results. The article is inside the Later I'll opt-in to go mod (I think I'll start a v2, it seems like the recommended approach). |
The code is @ master now. If it's alright, I'll close this issue. |
I'll take it for a spin on Monday! |
I'm testing with f06e43cca1ab now and it seems to cover many more cases. The only mismatch I'm finding right now is:
Test: https://github.com/Dynom/TySug/blob/master/finder/algorithm_test.go#L484 However, I'm not sure which implementation is more correct. Less to the point, but in case you're interested, these are the numbers when comparing them in terms of speed:
Benchmark: https://github.com/Dynom/TySug/blob/master/finder/algorithm_test.go#L642 |
I have a few points to make regarding your tests:
However, in another article I've found, Winkler, W. E. (1994), "Advanced Methods for Record Linkage", their expected result is different, as follows: That second table matches our results. I didn't find an explanation for that difference yet, so for now I'd just keep that in mind. The second table might be the correct one.
Please tell me if this helps clarify anything. Regarding the performance, I'm curious to check it out and will do so when I find the time. |
Excellent reply, I enjoyed reading it! I don't know enough about the algorithm to weigh in on the correctness of the rounding. For now I've changed my improved Rosetta version to include the rounding, which aligns the implementations so that they produce equal scores. I'll be using that for now, since speed is a significant factor for my application. Results so far for the specific
Thanks so much for your time @xrash, it's greatly appreciated! |
Is the same string supposed to match at |
@adamdecaf It seems to be returning 0 for the specific case when both strings are len == 1. The Rosetta code has the same behavior. I think I already know what's happening... I'll explain it thoroughly and send a fix in a few hours. |
@adamdecaf Sorry for the delay. Here is my take on it. Both Wikipedia and Rosetta Code describe the matching range as this: When both |s1| and |s2| are 1, the matching range is -1, and the algorithm doesn't work as expected. My old code had this edge case covered, that's why it didn't fail. Try the new version @ master. I've also added more test cases. Please, tell me if it worked for you. Thank you. |
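For readers following along, here is a minimal Go sketch of the edge case described above. It is not the code from xrash/smetrics or from Rosetta Code; the names `matchWindow` and `jaro` are made up for illustration. It assumes the commonly cited matching range, floor(max(|s1|, |s2|) / 2) - 1, and clamps it at zero so that two one-character strings can still match. Returning 1 for two empty strings is also an assumption of this sketch, since implementations differ on that point.

```go
package main

import "fmt"

// matchWindow returns the commonly cited Jaro matching range,
// floor(max(len1, len2) / 2) - 1, clamped at zero (hypothetical helper).
// Without the clamp, two one-character strings give a range of -1.
func matchWindow(len1, len2 int) int {
	longest := len1
	if len2 > longest {
		longest = len2
	}
	w := longest/2 - 1
	if w < 0 {
		w = 0
	}
	return w
}

// jaro is an illustrative Jaro similarity, not the smetrics implementation.
func jaro(a, b string) float64 {
	r1, r2 := []rune(a), []rune(b)
	if len(r1) == 0 && len(r2) == 0 {
		return 1 // assumption: two empty strings count as identical
	}
	if len(r1) == 0 || len(r2) == 0 {
		return 0
	}

	w := matchWindow(len(r1), len(r2))
	matched1 := make([]bool, len(r1))
	matched2 := make([]bool, len(r2))
	matches := 0

	// Find matching characters within the window around each position.
	for i := range r1 {
		lo := i - w
		if lo < 0 {
			lo = 0
		}
		hi := i + w + 1
		if hi > len(r2) {
			hi = len(r2)
		}
		for j := lo; j < hi; j++ {
			if !matched2[j] && r1[i] == r2[j] {
				matched1[i], matched2[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}

	// Count matched characters that appear in a different order.
	outOfOrder := 0
	j := 0
	for i := range r1 {
		if !matched1[i] {
			continue
		}
		for !matched2[j] {
			j++
		}
		if r1[i] != r2[j] {
			outOfOrder++
		}
		j++
	}

	m := float64(matches)
	t := float64(outOfOrder / 2) // transpositions are half the out-of-order count
	return (m/float64(len(r1)) + m/float64(len(r2)) + (m-t)/m) / 3
}

func main() {
	fmt.Printf("%.6f\n", jaro("a", "a"))
	fmt.Printf("%.6f\n", jaro("DIXON", "DICKSONX"))
	fmt.Printf("%.6f\n", jaro("MARTHA", "MARHTA"))
}
```

With the clamp in place, `jaro("a", "a")` scores 1. Removing the clamp leaves the search window empty for that input, so no characters match and the score drops to 0, which is the behavior reported above.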
Awesome and thanks @xrash! I updated from master and removed my patch. Watching CI over at moov-io/watchman#282 |
Hey guys, if it's alright, I'll close this issue. |
Sounds good! Thanks for the fix. |
Hi,
Thanks for your work on these implementations!
Recently I've been tinkering with several string distance algorithms, but I get inconsistent results when using xrash/smetrics compared to other Jaro implementations.
I've compared the result with:
And on at least the following values it differs:
- 0.683333 (reference implementations all score: 0.766667)
- 0.849206 (reference implementations all score: 0.896825)

Will you accept a PR? I'm not sure I'll find the time to figure out why the implementation is off, but I thought I'd reach out first and get a sign of life :-)