
Match short strings more strictly when missing context #1

Closed

Conversation

nickstenning
Contributor

In a scenario where we have a prefix and a suffix but fail to find them, we should be skeptical that the text is actually going to be in the document. In such a scenario, short selections are problematic, because they are likely to match against the document purely by chance.

This commit:

  • Adds support for finding the suffix and adjusting the search location accordingly.
  • Drops the diff-match-patch Match_Threshold (i.e. the laxness) to 0.25 if the exact string is less than 32 characters long and the prefix or suffix has failed to match.

This absolutely doesn't cover all cases -- very short strings ("a", "the") will continue to match with high probability even if the prefix and suffix fail -- but it does fix one common problem case.

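The threshold rule described above can be sketched as a small helper. This is a minimal sketch, not the library's code: the function and parameter names are hypothetical, while the 0.5 default laxness comes from diff-match-patch and the 0.25 value and 32-character cutoff are the ones named in this PR.

```javascript
// Hypothetical helper: pick the Match_Threshold to hand to
// diff-match-patch, per the rule proposed in this PR.
function matchThresholdFor(quoteLength, prefixFound, suffixFound) {
  const DEFAULT_THRESHOLD = 0.5;  // diff-match-patch default laxness
  const STRICT_THRESHOLD = 0.25;  // stricter value proposed here
  const SHORT_QUOTE = 32;         // cutoff named in this PR

  // Short quotes whose context failed to anchor are likely to match
  // the document purely by chance, so be stricter with them.
  if (quoteLength < SHORT_QUOTE && !(prefixFound && suffixFound)) {
    return STRICT_THRESHOLD;
  }
  return DEFAULT_THRESHOLD;
}
```

The returned value would be assigned to `Match_Threshold` on a diff-match-patch instance before calling `match_main`.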
@tilgovi
Owner

tilgovi commented Oct 20, 2015

Won't this also introduce new cases where a word that is very unambiguously right, but not in the right position, will no longer match?

@nickstenning
Contributor Author

I guess that depends on your interpretation. I'm struggling to think of a probable scenario in which both prefix and suffix fail to match but it's somehow possible to determine what the unambiguous "right match" would be.

In the event that there's no prefix or suffix provided, the behaviour is unchanged.

@shepazu

shepazu commented Oct 20, 2015

This seems sensible to me as a heuristic. If there are cases where we think it will cause new problems, we should build regression tests against a text corpus that includes selections and their {pre|suf}fixes of various lengths, with permutations of location and content shift. This will reveal how severe any of these changes would be, and whether they are acceptable in terms of user expectation.

Robust anchoring is going to be an art, even if it's replicable. Different languages are likely to need different treatments, for example. But we won't discover those patterns if we don't tweak the sliders and record the results via a progressively grown test suite.

@tilgovi
Owner

tilgovi commented Oct 20, 2015

I can imagine something as simple as two words that got transposed when transcribed, and then fixed. Neither the prefix nor suffix would be a perfect match, and possibly not a match at all, but to a human it'd be obvious.

I do really like that you've incorporated the suffix again, because I had left it out.

I suggest that we take a first, smaller incremental step.

Change contextMissing to contextFound and, if it's true after looking for the prefix and suffix, reduce the match distance, because we have a better expectation of where the match should be.
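The incremental step suggested here might look roughly like the following. This is a self-contained sketch, not the library's implementation: an exact `indexOf` stands in for the fuzzy diff-match-patch search, and the names and distance values are illustrative.

```javascript
// Sketch: look for the prefix and suffix, record whether either was
// found (contextFound), adjust the expected search location, and
// tighten the match distance when we have a positional expectation.
function planSearch(text, prefix, suffix, approxLoc) {
  let loc = approxLoc;
  let contextFound = false;

  if (prefix) {
    const i = text.indexOf(prefix);  // stand-in for fuzzy search
    if (i !== -1) {
      loc = i + prefix.length;       // quote should start right here
      contextFound = true;
    }
  }
  if (!contextFound && suffix) {
    const i = text.indexOf(suffix);
    if (i !== -1) {
      loc = i;                       // quote should end right here
      contextFound = true;
    }
  }
  // With context found, expect the quote near `loc`; search narrowly.
  return { loc, matchDistance: contextFound ? 100 : 1000 };
}
```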

@tilgovi
Owner

tilgovi commented Oct 20, 2015

👍 @shepazu. For instance, I'm not super happy merging code where the comment rationale is based around English language average word length.

This library has done a great job of extracting what we were already doing in Hypothesis into a library with a simpler API and no reliance on a fork of DiffMatchPatch, but we really need lower-level tools, and to separate the distance from the edit distance.

@nickstenning
Contributor Author

I'm not super happy merging code where the comment rationale is based around English language average word length.

I completely agree, but the distance metrics used by DMP are already based on that assumption. (A single character difference makes a much bigger difference in an ideographic language, but AFAIK DMP makes no affordances for that).

@tilgovi
Owner

tilgovi commented Oct 20, 2015

That assumption might better be described as "latin-1 strings" than "English language", but I get your point: the edit distance is in terms of character code units, not code points. Sadly, JavaScript.

To the topic at hand, though, let's try this piecemeal. I am 100% in favor of the use of suffix. I had, at one point, required a smaller match distance between chunks than I did for finding the prefix. I think that makes sense.

If the context is found, either the prefix or suffix, one should expect every chunk to be found close to the last one.

Here's my proposed set of changes we should consider, in turn:

  • Use the suffix
  • Narrow the tolerance for chunk succession
  • Consider narrowing the tolerance if we have a position hint
  • Consider narrowing the tolerance for short strings

I am very bullish on all but the last one. The last feels, to me, like it's reaching for an easy way around the fact that DMP combines distance and edit distance into one score.

Let's start by 1) using more context, 2) using the context more strictly in general, and 3) lowering the tolerance when we don't need to tolerate such large distances.

If Hypothesis needs to tweak this further, based on the string length, I would rather add something as a key to the options hash than bake that in to the implementation.

@jeremydean

For some additional context, the first 4 of 6 annotations on this page are incorrectly anchored:

http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:6/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1

One of them is another instance of the "linus" annotation struggling to anchor.

@tilgovi
Owner

tilgovi commented Oct 20, 2015

I am very hesitant to make a change for short strings because short does not imply less distinct (a large word may appear many times, or with many small variations, while some small words may appear only once).

The particular issue from hypothesis/h#2634 that spawned this thread, which @jeremydean has just copied here, is also not clearly a symptom of the existing algorithm failing to do the right thing because, as noted on that issue, the algorithm is being applied to the wrong text.

I've laid out three things that I think are likely to be objective improvements to the algorithm as designed, and would like to see these implemented.

Beyond that, I recommend that Hypothesis take steps, outside the scope of this library, to avoid searching for selections where they should not be found, e.g. trying to match selections from one chapter of a paper in a different chapter.

I don't have strong recommendations about how to do that, but given that I know @nickstenning has discussed doing all the metadata extraction used for document equivalence server-side, perhaps some TF/IDF or such could be used as a fingerprint to help decide whether URLs are, in fact, similar enough to be considered equivalent.
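A very rough sketch of that fingerprinting idea: compare two documents by cosine similarity over term-frequency vectors. (A full TF-IDF weighting would also down-weight terms common across the corpus.) Everything here is illustrative and not part of this library.

```javascript
// Build a term-frequency map for a document.
function termFreq(text) {
  const tf = new Map();
  for (const w of text.toLowerCase().match(/\w+/g) || []) {
    tf.set(w, (tf.get(w) || 0) + 1);
  }
  return tf;
}

// Cosine similarity between two term-frequency maps: 1 for identical
// term distributions, 0 for documents with no terms in common.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [w, f] of a) {
    na += f * f;
    if (b.has(w)) dot += f * b.get(w);
  }
  for (const f of b.values()) nb += f * f;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}
```

Two URLs could then be treated as equivalent only when the similarity of their extracted text clears some threshold.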

@tilgovi
Owner

tilgovi commented Oct 20, 2015

I should have also added that DMP already considers the length of the pattern when calculating the score. That is part of why I hesitate to handle small strings differently. All strings are already handled with consideration paid to their length.
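For reference, the score diff-match-patch computes internally (paraphrased here from the library's bitap scoring; the standalone function form and names are for illustration) normalizes the error count by pattern length, which is why pattern length is already accounted for:

```javascript
// Paraphrase of diff-match-patch's internal bitap score: edit errors
// are taken relative to pattern length, then combined with a distance
// penalty. Lower scores are better matches.
function bitapScore(errors, matchLoc, expectedLoc, patternLength, matchDistance) {
  const accuracy = errors / patternLength;         // relative edit distance
  const proximity = Math.abs(expectedLoc - matchLoc);
  if (!matchDistance) {
    // Distance penalty disabled: any displacement disqualifies.
    return proximity ? 1.0 : accuracy;
  }
  return accuracy + proximity / matchDistance;     // combined score
}
```

This is also the crux of the earlier complaint: distance and edit distance are folded into one number, so the two tolerances can't be tuned independently.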

@judell

judell commented Oct 26, 2015

@tilgovi I am working with a partner who's found what may be a bug. I can report it, but I'm not finding an Issues tab here.

@tilgovi
Owner

tilgovi commented Oct 26, 2015

@judell fixed
