Match short strings more strictly when missing context #1
Conversation
In a scenario where we have a prefix and a suffix but fail to find them, we should be skeptical that the text is actually going to be in the document. In such a scenario, short selections are problematic, because they are likely to match against the document purely by chance. This commit:
- Adds support for finding the suffix and adjusting the search location accordingly.
- Drops the diff-match-patch Match_Threshold (i.e. the laxness) to 0.25 if the exact string is less than 32 characters long and the prefix or suffix has failed to match.
This absolutely doesn't cover all cases -- very short strings ("a", "the") will continue to match with high probability even if the prefix and suffix fail -- but it does fix one common problem case.
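A minimal sketch of that heuristic, assuming the diff-match-patch npm package; `findWithContext` and the selector shape are illustrative stand-ins, not this library's actual API:

```js
const DiffMatchPatch = require('diff-match-patch');

const SHORT_STRING = 32;      // the "short" cutoff from the commit message
const STRICT_THRESHOLD = 0.25;
const DEFAULT_THRESHOLD = 0.5;

function findWithContext(text, selector, expectedLoc) {
  const {exact, prefix = '', suffix = ''} = selector;
  const dmp = new DiffMatchPatch();

  // Try to anchor on the prefix, then the suffix, adjusting the expected
  // location of `exact` from whichever piece of context we find. (Stock DMP
  // also caps pattern length at Match_MaxBits code units; not handled here.)
  let contextFound = false;
  if (prefix) {
    const start = dmp.match_main(text, prefix, expectedLoc);
    if (start !== -1) {
      expectedLoc = start + prefix.length;
      contextFound = true;
    }
  }
  if (!contextFound && suffix) {
    const end = dmp.match_main(text, suffix, expectedLoc + exact.length);
    if (end !== -1) {
      expectedLoc = end - exact.length;
      contextFound = true;
    }
  }

  // Context was provided but could not be found: short strings are the ones
  // likely to match somewhere purely by chance, so tighten the threshold.
  const hadContext = Boolean(prefix || suffix);
  dmp.Match_Threshold =
    hadContext && !contextFound && exact.length < SHORT_STRING
      ? STRICT_THRESHOLD
      : DEFAULT_THRESHOLD;

  return dmp.match_main(text, exact, expectedLoc);
}
```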
Won't this also introduce new cases where a word that is very unambiguously right, but not in the right position, will no longer match?
I guess that depends on your interpretation. I'm struggling to think of a probable scenario in which both prefix and suffix fail to match but it's somehow possible to determine what the unambiguous "right match" would be. In the event that there's no prefix or suffix provided, the behaviour is unchanged.
This seems sensible to me as a heuristic. If there are cases where we think it will cause new problems, we should build regression tests against a text corpus that includes selections and their {pre|suf}fixes of various lengths, with permutations of location and content shift; this will reveal how severe any of these changes would be, and whether they are acceptable in terms of user expectation. Robust anchoring is going to be an art, even if it's replicable. Different languages are likely to need different treatments, for example. But we won't discover those patterns if we don't tweak the sliders and record the results via a progressively grown test suite.
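A corpus-driven harness along those lines might look like this sketch; the corpus shape and the `anchor` entry point are hypothetical stand-ins for whatever the library ends up exposing:

```js
const assert = require('assert');

// Each case pairs a selector with a mutation of the source text and the
// offset (or -1 for "should not match") a human would consider correct.
const corpus = [
  {
    text: 'it was the best of times, it was the worst of times',
    selector: {exact: 'best', prefix: 'it was the ', suffix: ' of times'},
    mutate: (t) => t.replace('best', 'bestest'), // content shift
    expected: 11,
  },
  // ...permutations of {pre,suf}fix length, location shift, content shift
];

for (const {text, selector, mutate, expected} of corpus) {
  // anchor() is a hypothetical entry point returning a match offset or -1.
  assert.strictEqual(anchor(mutate(text), selector), expected);
}
```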
I can imagine something as simple as two words that got transposed when transcribed, and then fixed. Neither the prefix nor the suffix would be a perfect match, and possibly not a match at all, but to a human it'd be obvious. I do really like that you've incorporated the suffix again, because I had left it out. I suggest that we take a first, smaller incremental step. Change
👍 @shepazu. For instance, I'm not super happy merging code where the comment rationale is based on average English word length. This library has done a great job of extracting what we were already doing in Hypothesis into a library with a simpler API that doesn't rely on a fork of DiffMatchPatch, but we really need lower-level tools, and to separate the match distance from the edit distance.
I completely agree, but the distance metrics used by DMP are already based on that assumption. (A single character difference makes a much bigger difference in an ideographic language, but AFAIK DMP makes no affordances for that).
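For anyone following along, the underlying issue in miniature: JavaScript string lengths, and therefore DMP's character-based edit distance, count UTF-16 code units rather than code points, so a one-character edit in ideographic text can cost more "distance" than it should:

```js
'a'.length;            // 1 -- one code unit, one code point
'\u6C49'.length;       // 1 -- 汉: a BMP ideograph fits in one code unit
'\uD842\uDFB7'.length; // 2 -- 𠮷: an astral code point is a surrogate pair
'\u{20BB7}'.length;    // 2 -- the same character via a code point escape
```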
That assumption might better be described as "latin-1 strings" than "English language", but I get your point: the edit distance is measured in character code units, not code points. Sadly, JavaScript. To the topic at hand, though, let's try this piecemeal. I am 100% in favor of the use of suffix. I had, at one point, required a smaller match distance between chunks than I did for finding the prefix. I think that makes sense: if the context (either the prefix or the suffix) is found, one should expect every chunk to be found close to the last one. Here's my proposed set of changes we should consider, in turn:
1. Use more context.
2. Use the context more strictly in general.
3. Lower the tolerance when we don't need to tolerate such large distances.
4. Tweak the threshold further based on the length of the exact string.
I am very bullish on all but the last one. The last feels, to me, like it's reaching for an easy way around the fact that DMP combines distance and edit distance into one score. Let's start by 1) using more context, 2) using the context more strictly in general, and 3) lowering the tolerance when we don't need to tolerate such large distances. If Hypothesis needs to tweak this further based on the string length, I would rather add something as a key to the options hash than bake that into the implementation.
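A sketch of that options-hash shape, with hypothetical key names, so a caller like Hypothesis can tune the laxness itself rather than the library hard-coding string-length rules:

```js
const DiffMatchPatch = require('diff-match-patch');

// matchThreshold and matchDistance are hypothetical option names;
// the defaults mirror stock DMP.
function search(text, exact, expectedLoc, options) {
  options = options || {};
  const dmp = new DiffMatchPatch();
  dmp.Match_Threshold =
    options.matchThreshold !== undefined ? options.matchThreshold : 0.5;
  dmp.Match_Distance =
    options.matchDistance !== undefined ? options.matchDistance : 1000;
  return dmp.match_main(text, exact, expectedLoc);
}

// A caller that wants stricter matching for short strings decides for itself:
// search(text, 'linus', 8192, {matchThreshold: 0.25, matchDistance: 64});
```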
For some additional context: the first 4 of 6 annotations on this page are incorrectly anchored; one of them is another instance of the "linus" annotation struggling to anchor.
I am very hesitant to make a change for short strings, because short does not imply less distinct (a long word may appear many times, or with many small variations, while some short words may appear only once). The particular issue from hypothesis/h#2634 that spawned this thread, which @jeremydean has just copied here, is also not clearly a symptom of the existing algorithm failing to do the right thing because, as noted on that issue, the algorithm is being applied to the wrong text. I've laid out three things that I think are likely to be objective improvements to the algorithm as designed, and would like to see those implemented. Beyond that, I recommend that Hypothesis take steps, outside the scope of this library, to avoid searching for selections where they should not be found, e.g. trying to match selections from one chapter of a paper in a different chapter. I don't have strong recommendations about how to do that, but given that I know @nickstenning has discussed doing all the metadata extraction used for document equivalence server-side, perhaps some TF/IDF fingerprint or such could be used to help decide whether URLs are, in fact, similar enough to be considered equivalent.
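As one rough illustration of the fingerprinting idea (plain term-frequency cosine similarity standing in for real TF/IDF), entirely hypothetical and outside this library's scope:

```js
// Count term occurrences in a document's extracted text.
function termCounts(text) {
  const counts = new Map();
  for (const term of text.toLowerCase().match(/\w+/g) || []) {
    counts.set(term, (counts.get(term) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two term-count vectors: 1 for identical
// distributions, near 0 for unrelated documents.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, count] of a) {
    normA += count * count;
    if (b.has(term)) dot += count * b.get(term);
  }
  for (const count of b.values()) normB += count * count;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Treat two URLs as equivalent only above some empirically chosen cutoff:
// cosineSimilarity(termCounts(docA), termCounts(docB)) > 0.9
```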
I should have also added that DMP already considers the length of the pattern when calculating the score. That is part of why I hesitate to handle small strings differently. All strings are already handled with consideration paid to their length.
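Paraphrasing DMP's internal bitap scoring (from the reference JavaScript implementation, with names changed) to show where pattern length already enters:

```js
// errors: edit distance of a candidate match. Dividing by pattern length
// means longer patterns tolerate proportionally more errors at the same
// Match_Threshold; proximity to the expected location is blended in too,
// which is the "one combined score" objection raised above.
function bitapScore(errors, candidateLoc, expectedLoc, pattern, matchDistance) {
  const accuracy = errors / pattern.length;
  const proximity = Math.abs(expectedLoc - candidateLoc);
  if (!matchDistance) {
    return proximity ? 1.0 : accuracy; // distance 0: require the exact location
  }
  // A candidate is accepted while this score stays <= Match_Threshold.
  return accuracy + proximity / matchDistance;
}
```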
@tilgovi I am working with a partner who has found what may be a bug. I can report it, but I'm not finding an Issues tab here.
@judell fixed