difflib.SequenceMatcher.get_matching_blocks omits common strings #79260
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
assignee = 'https://github.com/terryjreedy' closed_at = <Date 2018-10-27.03:23:10.962> created_at = <Date 2018-10-26.19:00:41.504> labels = ['3.7', '3.8', 'type-bug', 'library', 'docs'] title = 'difflib.SequenceMatcher.get_matching_blocks omits common strings' updated_at = <Date 2018-10-27.03:23:10.961> user = 'https://bugs.python.org/SpringemSpringsbee'
activity = <Date 2018-10-27.03:23:10.961> actor = 'terry.reedy' assignee = 'terry.reedy' closed = True closed_date = <Date 2018-10-27.03:23:10.962> closer = 'terry.reedy' components = ['Documentation', 'Library (Lib)'] creation = <Date 2018-10-26.19:00:41.504> creator = 'Springem Springsbee' dependencies =  files =  hgrepos =  issue_num = 35079 keywords = ['patch'] message_count = 9.0 messages = ['328593', '328595', '328602', '328606', '328627', '328628', '328629', '328630', '328631'] nosy_count = 5.0 nosy_names = ['tim.peters', 'terry.reedy', 'miss-islington', 'xtreak', 'Springem Springsbee'] pr_nums = ['10144', '10145', '10146', '10147'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue35079' versions = ['Python 2.7', 'Python 3.6', 'Python 3.7', 'Python 3.8']
The text was updated successfully, but these errors were encountered:
Hello, I'm using difflib's SequenceMatcher to locate common substrings. It seems like the matcher is missing a common substrings. I'm guessing this is a rather low-level issue in difflib. The autojunk parameter has no effect for obvious reasons. Alternate pairwise comparisons between the following 3 strings omit the 2-character match 'AC'
The following Github gist captures the issue, which I'll repeat here for redundancy https://gist.github.com/MatthewRalston/b0ab6ac1dbe322cb12063310ccdbb786
Sorry, I find this too hard to follow. At the end:
the comment claims it's missing 'CA', while the output plainly shows it's _not_ missing 'CA'.
If your complaint is that's missing 'AC', yes, it is. It's not intended to find _overlapping_ matches at all (read the docs). The longest matching blocks in
are in fact TA, CA, and AC, but after the leftmost-longest TA is matched first, AC no longer exists _to_ match in the second string. Only CA does.
We can assume that "substring 'CA'" was meant to be "substring 'AC'", but as explained, missing 'AC' is not a bug. (Tim wrote the module.)
I read the doc, and 'non-overlapping' is implied in the SequenceMatcher entry at the top of the file.
"The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; ... The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence."
However, a user of SequenceMatcher could easily miss that, and its implication, as Springem did. For clarity, I think we should add 'non-overlapping to the first line of the .get_matching_blocks entry, which is in the middle of the page. "Return list of triples describing non-overlapping matching subsequences."
I also think "i+n != i' or j+n != j'" should be changed to "i+n < i' or j+n < j'" as '>' would mean overlapping. So != must mean <.
I will prepare a doc PR later.
I don't object to spelling it out more, and your (Terry's) suggestions are fine. On the other hand, this module has been around for a loooong time now, and this is the first instance I've seen of someone surprised that it doesn't produce overlapping matches - it's obvious from "The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence" that a matching subsequence is wholly eliminated from further consideration.
At some point of ever-more tedious elaboration, the docs risk missing the forest for the trees. I don't think these docs are quite there yet - although the docs for
Which can be horridly inefficient; e.g., find all overlapping matches between
'A' * 100
'A' * 150