-
-
Notifications
You must be signed in to change notification settings - Fork 30.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
difflib.SequenceMatcher.get_matching_blocks omits common strings #79260
Comments
Hello, I'm using difflib's SequenceMatcher to locate common substrings. It seems like the matcher is missing a common substrings. I'm guessing this is a rather low-level issue in difflib. The autojunk parameter has no effect for obvious reasons. Alternate pairwise comparisons between the following 3 strings omit the 2-character match 'AC' GATTACA The following Github gist captures the issue, which I'll repeat here for redundancy https://gist.github.com/MatthewRalston/b0ab6ac1dbe322cb12063310ccdbb786
|
Sorry, I find this too hard to follow. At the end:
the comment claims it's missing 'CA', while the output plainly shows it's _not_ missing 'CA'. If your complaint is that's missing 'AC', yes, it is. It's not intended to find _overlapping_ matches at all (read the docs). The longest matching blocks in TAGACCA are in fact TA, CA, and AC, but after the leftmost-longest TA is matched first, AC no longer exists _to_ match in the second string. Only CA does. |
We can assume that "substring 'CA'" was meant to be "substring 'AC'", but as explained, missing 'AC' is not a bug. (Tim wrote the module.) I read the doc, and 'non-overlapping' is implied in the SequenceMatcher entry at the top of the file. "The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; ... The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence." However, a user of SequenceMatcher could easily miss that, and its implication, as Springem did. For clarity, I think we should add 'non-overlapping to the first line of the .get_matching_blocks entry, which is in the middle of the page. "Return list of triples describing non-overlapping matching subsequences." I also think "i+n != i' or j+n != j'" should be changed to "i+n < i' or j+n < j'" as '>' would mean overlapping. So != must mean <. I will prepare a doc PR later. |
I don't object to spelling it out more, and your (Terry's) suggestions are fine. On the other hand, this module has been around for a loooong time now, and this is the first instance I've seen of someone surprised that it doesn't produce overlapping matches - it's obvious from "The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence" that a matching subsequence is wholly eliminated from further consideration. At some point of ever-more tedious elaboration, the docs risk missing the forest for the trees. I don't think these docs are quite there yet - although the docs for Which can be horridly inefficient; e.g., find all overlapping matches between 'A' * 100 and 'A' * 150 |
Tim, I share your concern about bloating docs, but think this one word worthwhile. I suspect that people are conditioned to accept 'non-overlapping' because that is ofter (usually?) the default for linear searches and regex matching. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: