difflib.SequenceMatcher not matching long sequences #47235

hagna · 2008-05-27T20:29:56Z

BPO	2986
Nosy	@tim-one, @warsaw, @birkenfeld, @terryjreedy, @jcea, @pitrou, @bitdancer
PRs	bpo-2986: Allow disabling difflib's automatic junk heuristic #17082
Superseder	bpo-10534: difflib.SequenceMatcher: expose junk sets, deprecate undocumented isb... functions.
Files	difflib_test_inq.py: test file for difflib.SequenceMatcher comparing strings with minimal differences issue2986.docs26.1.patch issue2986.fix27.4.patch issue2986.docs31.1.patch issue2986.fix27.5.patch issue2986.fix32.5.patch: Version of issue2986.fix27.5.patch that applies and passes tests in Python 3.2a.

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/terryjreedy'
closed_at = <Date 2010-11-25.20:23:36.839>
created_at = <Date 2008-05-27.20:29:56.025>
labels = ['type-bug', 'library', 'docs']
title = 'difflib.SequenceMatcher not matching long sequences'
updated_at = <Date 2019-11-07.16:24:52.380>
user = 'https://bugs.python.org/hagna'

bugs.python.org fields:

activity = <Date 2019-11-07.16:24:52.380>
actor = 'python-dev'
assignee = 'terry.reedy'
closed = True
closed_date = <Date 2010-11-25.20:23:36.839>
closer = 'terry.reedy'
components = ['Documentation', 'Library (Lib)']
creation = <Date 2008-05-27.20:29:56.025>
creator = 'hagna'
dependencies = []
files = ['17001', '18171', '18719', '19538', '19569', '19675']
hgrepos = []
issue_num = 2986
keywords = ['patch']
message_count = 37.0
messages = ['67428', '84387', '84446', '84449', '93438', '103660', '108636', '108856', '109090', '109442', '109507', '109636', '109639', '109654', '110251', '110261', '111372', '111425', '112116', '112120', '112490', '115335', '115419', '115787', '120713', '120927', '120939', '120992', '121079', '121596', '121662', '121697', '121902', '122335', '122337', '122338', '122401']
nosy_count = 19.0
nosy_names = ['tim.peters', 'barry', 'georg.brandl', 'terry.reedy', 'jcea', 'jimjjewett', 'sjmachin', 'gjb1002', 'ggenellina', 'pitrou', 'rtvd', 'vbr', 'LambertDW', 'hodgestar', 'hagna', 'r.david.murray', 'eli.bendersky', 'janpf', 'mrotondo']
pr_nums = ['17082']
priority = 'high'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = '10534'
type = 'behavior'
url = 'https://bugs.python.org/issue2986'
versions = ['Python 2.6', 'Python 3.1', 'Python 2.7', 'Python 3.2']

hagna · 2008-05-27T20:29:55Z

The following code shows no matches though the strings clearly match.

from difflib import SequenceMatcher

a = \
'''3904320338155955662857322172779218727992471109386112515279452352973279311752006856588512503244702012502812653160306927721351031250270279878152125021081471125246894603319162986283456469448293252335442814953964029718671705515246437056879456095915444174665464026255415736754542680178373675412998898571410483714801783736754144828361714801783736754133068408714801783736754140859665714801783736754153851004471480178373675415715864371410690714801783736754147488890714801783736205957668017837367545448801783104170539154677705102536314736754477780178373675415217103227148017837367541737811137714801783736754172791151671480178373675417692995271480178373675417575983571480178373675417398965871480178310417055026467770551235573705687945609591544562532964082675415736300610425832914520311514810301595721999571547897879113780178373618951021983280377781981989237498913678981414213198924949892679989164882577810944751102884217048258978791137801783104170511836542073627327981801279360326159714801783736171798080178310415420736447510213871790638471586131412631592131012571210126718031314200414571314893700123874777987006697747115770067074789312578013869801783104120529166337056879456095918495136604565251349544838956219513495753741344870733943253617458316356794745831634651172458316348316144586052838244151360641656349118903581890331689038658903263218549028909605134957536316060'''
b = \
'''4634320338155955662857322172779218727992471109386112515279452352973279311752006856588512503244702012502812653160306927721351031250270279878152125021081471125246894603319162986283456469448293252335442814953964029718671705515246437056879456095915444174665464026255415736754542680178373675412998898571410483714801783736754144828361714801783736754133068408714801783736754140859665714801783736754153851004471480178373675415715864371410690714801783736754147488890714801783736205957668017837367545448801783104170539154677705102536314736754477780178373675413182108117148017837367541737811137714801783736754172791151671480178373675417692995271480178373675417575983571480178373675417398965871480178310417055026467770551235573705687945609591544562532964082675415736300610425832914520311514810301595721999571547897879113780178373618951021983280377781981989237498913678981414213198924949892679989164882577810944751102884217048258978791137801783104170511836542073627327981801279360326159714801783736171798080178310415420736447510213871790638471412131420041457131485122165131466702097131466731723131466741536131466751581131466771649131466761975131467212090131467261974131467231858131467201556131467212538131467221553131467221943131467231748131466711452131467271787131412578013869801783104154307361718482280178373638585436251621338931320893185072980138084820801545115716861861152948618615002682261422349251058108327767521397977810837298017831041205291663370568794560959184951366045652513495448389562195134957537413448707339432536174583163'''
lst = [(a, b)]
for a, b in lst:
    print("---------------------------")
    s = SequenceMatcher(None, a, b)
    print("length of a is %d" % len(a))
    print("length of b is %d" % len(b))
    # Longest single matching block over the full ranges of a and b.
    print(s.find_longest_match(0, len(a), 0, len(b)))
    print(s.ratio())
    for block in s.get_matching_blocks():
        m = a[block[0]:block[0]+block[2]]
        print("a[%d] and b[%d] match for %d elements and it is \"%s\"" %
              (block[0], block[1], block[2], m))

birkenfeld · 2009-03-29T12:59:19Z

Tim, I think you've had some enlightening comments about difflib issues
in the past.

mrotondo · 2009-03-30T00:40:14Z

From the source, it seems that there is undocumented behavior in
SequenceMatcher which is causing this error. If b is longer than 200
characters, it will consider any element x in b that takes up more than
1% of its contents as "popular", and thus junk.

So, in this case, difflib is treating each individual digit as an
element of your sequences, and each one takes up more than 1% of the
complete sequence b. Therefore, each one is "popular", and therefore
ignored.

A snippet which demonstrates this:

from difflib import SequenceMatcher
for i in range(1, 202)[::10]:
  a = "a" * i
  b = "b" + "a" * i
  s = SequenceMatcher(None, a, b)
  print(s.find_longest_match(0, len(a), 0, len(b)))

Up to i=200, the strings match, but afterwards they do not because "a"
is "popular".

Strangely, if you get rid of the "b" at the beginning of b, they
continue to match at lengths greater than 200. This may be a bug, I'll
keep looking into it but someone who knows more should probably take a
look too.

The comments from difflib.py say some interesting things:
# b2j also does not contain entries for "popular" elements, meaning
# elements that account for more than 1% of the total elements, and
# when the sequence is reasonably large (>= 200 elements); this can
# be viewed as an adaptive notion of semi-junk, and yields an enormous
# speedup when, e.g., comparing program files with hundreds of
# instances of "return NULL;"

This seems to mean that you won't actually get an accurate diff in
certain cases, which seems odd. At the very least, this behavior should
probably be documented. Do people think it should be changed to get rid
of the "popularity" heuristic?
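To see which elements of a given b are likely to be dropped, one can count frequencies directly. The following is a minimal sketch that follows the rule as described in the quoted comments (at least 200 elements, more than 1% of the total); the helper name `likely_popular` and its parameters are hypothetical, and the exact cutoff used inside difflib may differ slightly.

```python
from collections import Counter

def likely_popular(b, min_len=200, percent=1):
    """Return the elements of b that exceed roughly `percent`% of its length.

    Approximates the rule described in the difflib comments quoted above;
    this is not the code difflib itself runs.
    """
    n = len(b)
    if n < min_len:
        return set()
    counts = Counter(b)
    return {elt for elt, c in counts.items() if c * 100 > n * percent}

digits = "0123456789" * 30               # 300 characters, each digit is 10% of them
print(sorted(likely_popular(digits)))    # every digit qualifies
print(likely_popular("0123456789" * 5))  # 50 characters: below the length cutoff
```

With a ten-symbol alphabet every symbol is far above 1%, which is why the digit strings in the original report lose all of their content to the heuristic.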

bitdancer · 2009-03-30T01:33:30Z

On Mon, 30 Mar 2009 at 00:40, Mike Rotondo wrote:

> This seems to mean that you won't actually get an accurate diff in
> certain cases, which seems odd. At the very least, this behavior should
> probably be documented. Do people think it should be changed to get rid
> of the "popularity" heuristic?

A better way, I think, would be to provide a way to turn
it off (and then document it, of course).

pitrou · 2009-10-02T10:58:47Z

The popularity heuristic could be tuned to depend on the number N of
distinct elements in the sequence, and kick in if an element appears say
more than 1/(N**0.5) of the time.
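A rough sketch of what such an adaptive rule could look like, purely as an illustration: the function name `popular_by_sqrt_rule` is hypothetical and nothing like it exists in difflib.

```python
from collections import Counter
import math

def popular_by_sqrt_rule(b):
    """Hypothetical variant: flag elements whose share of b exceeds
    1/sqrt(N), where N is the number of distinct elements in b."""
    counts = Counter(b)
    if not counts:
        return set()
    cutoff = 1 / math.sqrt(len(counts))
    n = len(b)
    return {elt for elt, c in counts.items() if c / n > cutoff}

# Small alphabet: N = 4, cutoff = 0.5, each base has a share of 0.25.
print(popular_by_sqrt_rule("ACGT" * 100))

# Code-like input: one line repeated 50 times among 150 unique lines.
lines = ["return NULL;"] * 50 + ["line %d" % i for i in range(150)]
print(popular_by_sqrt_rule(lines))
```

Under this rule a four-letter alphabet is left alone, while a line that makes up a quarter of a 200-line file is still flagged.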

vbr · 2010-04-19T23:25:05Z

I just stumbled on some seemingly different unexpected behaviour of
difflib.SequenceMatcher, but it turns out it may have the same cause, i.e. the "popular" heuristic.
I hopefully managed to replicate it on an illustrative sample text, included in the attached file. (I also mentioned this issue on the python-list
http://mail.python.org/pipermail/python-list/2010-April/1241951.html but as I found no replies, this might be a more appropriate place.)
Both strings differ in a minimal way, each having one extra character
in a "strategic" position, which probably meets some pathological case
for difflib.
Instead of just reporting the insertion and deletion of these single
characters (which works well for most cases - with most other
positions of the differing characters), the output of the
SequenceMatcher decides to delete a large part of the string in
between the differences and to insert the almost same text after that.
The attached code simply prints the results of the comparison with the
respective tags, and substrings. No junk function is used.
I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on Windows XP SP3.
I didn't find any plausible mentions of such cases in the documentation, but after some searching I found several reports in the bug tracker mentioning the erroneous output of SequenceMatcher on longer repetitive sequences.

besides this
http://bugs.python.org/issue2986
e.g.
http://bugs.python.org/issue1711800
http://bugs.python.org/issue4622
http://bugs.python.org/issue1528074

In my case, disabling the "popular" heuristics as mentioned by John Machin in
http://bugs.python.org/issue1528074#msg29269

seems to have solved the problem; with a modified version of difflib containing:

                if 0:   # disable popular heuristics
                    if n >= 200 and len(indices) * 100 > n:
                        populardict[elt] = 1
                        del indices[:]

the comparison catches the differences in the test strings as expected, i.e. one character addition and deletion only. It is likely that some other use cases for difflib rely on the "popular" heuristic, but it also seems useful to have some control over this behaviour, which might not be appropriate in all cases.
(The issue seems to be the same in python 2.5, 2.6 and 3.1.)

regards,
vbr

terryjreedy · 2010-06-25T21:56:19Z

This appears to be one of at least three duplicate issues: bpo-1528074, bpo-2986, and bpo-4622. I am closing two, leaving 2986 open, and merging the nearly disjoint nosy lists. (If no longer interested, you can delete yourself from 2986.) bpo-1711800 appears to be slightly different (if not, it could be closed also.)

Whether or not a new feature is ever added (earliest, now, 3.2), it appears that the docs need improvement to at least explain the current behavior. If someone who understands the issue could open a separate doc issue (for 2.6/7/3.1/2) with a suggested addition, that would be great.

terryjreedy · 2010-06-28T19:35:37Z

The discussion on bpo-1528074 references two other closed tracker issues:
bpo-1678339 Test case that currently fails
bpo-1678345 Patch to change behavior - rejected because crippled behavior is supposedly intentional and removing the change would slow things down.

The patch simply removes the internal heuristic. I think a better patch would be to make it optional, with a tunable popularity threshold.

I say 'supposedly intentional' because the code comments only justify the popularity hack for code line comparison and give no indication of awareness that it disables SequenceMatcher for general purpose use, and in particular, for non-toy finite character set comparisons of the type (ascii) used in all the examples.

elibendersky · 2010-07-02T07:16:08Z

The new "junk heuristic" was added to difflib.py in SVN revision 26661 in 2002 (which is, incidentally, the last revision to modify difflib.py). Its commit log says:

---------------------------------------------
Mostly in SequenceMatcher.{__chain_b, find_longest_match}:
This now does a dynamic analysis of which elements are so frequently
repeated as to constitute noise. The primary benefit is an enormous
speedup in find_longest_match, as the innermost loop can have factors
of 100s less potential matches to worry about, in cases where the
sequences have many duplicate elements. In effect, this zooms in on
sequences of non-ubiquitous elements now.

While I like what I've seen of the effects so far, I still consider
this experimental. Please give it a try!
---------------------------------------------

terryjreedy · 2010-07-06T23:18:21Z

[Also posted to pydev for additional input, with Subject line
bpo-2986: difflib.SequenceMatcher is partly broken
Developed with input from Eli Bendersky, who will write patchfile(s) for whichever change option is chosen.]

Summary: difflib.SequenceMatcher was developed, documented, and originally operated as "a flexible class for comparing pairs of sequences of any [hashable] type". An "experimental" heuristic was added in 2.3a1 to speed up its application to sequences of code lines, which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet. The regression is easy to fix. The docs were never changed to reflect the effect of the heuristic, but should be, with whatever additional change is made.

In the commit message for revision 26661, which added the heuristic, Tim Peters wrote "While I like what I've seen of the effects so far, I still consider this experimental. Please give it a try!" Several people who have tried it discovered the problem with small alphabets and posted to the tracker. Issues bpo-1528074, bpo-1678339, bpo-1678345, and bpo-4622 are now-closed duplicates of bpo-2986. The heuristic needs revision.

Open questions (discussed after the examples): what exactly to do, which versions to do it to, and who will do it.

---
Some minimal difference examples:

from difflib import SequenceMatcher as SM

# base example
print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
# should be and is 0.9975 (rounded)

# make 'y' junk
print(SM(lambda c:c=='y', 'x' + 'y'*199, 'y'*199).ratio())
# should be and is 0.0

# Increment b by 1 char
print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
# should be .995, but now is 0.0 because y is treated as junk

# Reverse a and b, which increments b
print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
# should be .9975, as before, but now is 0.0 because y is junked

The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.

A more realistic example than the above is comparing DNA gene sequences. Without the heuristic, SequenceMatcher.get_opcodes() reports an appropriate sequence of matches and edits, and .ratio works as documented and expected. For 1000/2000/6000 bases, the times on an old Athlon 2800 machine are <1/2/12 seconds. Since 6000 is longer than most genes, this is a realistic and practical use.

With the heuristic, everything is junk and there is only one match, ''=='' augmented by the initial prefix of matching bases. This is followed by one edit: replace the rest of the first sequence with the rest of the second sequence. A much faster way to find the first mismatch would be
    i = 0
    while first[i] == second[i]:
        i += 1
The match ratio, based on the initial matching prefix only, is spuriously low.
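The effect described above can be reproduced with a small experiment. This sketch assumes a Python version that has the `autojunk` parameter introduced later in this issue (2.7.1+ or 3.2+); the sequence is random rather than a real gene, and the names `first`, `second`, and `pos` are chosen here for illustration.

```python
import random
from difflib import SequenceMatcher

random.seed(2986)  # fixed seed so the run is repeatable
first = "".join(random.choice("ACGT") for _ in range(1000))

# Substitute one base in the middle with a different one.
pos = 500
replacement = "A" if first[pos] != "A" else "C"
second = first[:pos] + replacement + first[pos + 1:]

# Default: the heuristic is active and every base counts as popular.
with_heuristic = SequenceMatcher(None, first, second).ratio()
# Heuristic disabled: both halves around the substitution can be matched.
without_heuristic = SequenceMatcher(None, first, second, autojunk=False).ratio()

print("autojunk=True :", with_heuristic)
print("autojunk=False:", without_heuristic)
```

With the heuristic active only the common prefix is matched, so the ratio reflects roughly half of the sequence even though a single base differs.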

---
Questions:

1: What change should be made?

Proposed fix: Disentangle the heuristic from the calculation of the internal b2j dict that maps items to indexes in the second sequence b. Only apply the heuristic (or not) afterward.

Version A: Modify the heuristic to only eliminate common items when there are more than, say, 100 items (when len(b2j)> 100 where b2j is first calculated without popularity deletions).

This would leave DNA, protein, and printable ascii+[\n\r\t] sequences alone. On the other hand, realistic sequences of more than 200 code lines should have at least 100 different lines, and so the heuristic should continue to be applied when it (mostly?) 'should' be. This change leaves the API unchanged and does not require a user decision.
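As an illustration only, Version A might look something like the sketch below. The function name `popular_items_version_a`, the `min_distinct` parameter, and the exact frequency cutoff are assumptions made for this example; the proposal was not adopted in this form.

```python
def popular_items_version_a(b, min_distinct=100):
    """Hypothetical 'Version A' rule: only look for popular items when b
    contains more than `min_distinct` distinct items."""
    # Build the full item -> positions map first, with no deletions.
    b2j = {}
    for i, elt in enumerate(b):
        b2j.setdefault(elt, []).append(i)

    n = len(b)
    if n < 200 or len(b2j) <= min_distinct:
        return set()  # short input or small alphabet: keep everything

    # Assumed cutoff approximating the 1% rule.
    ntest = n // 100 + 1
    return {elt for elt, idxs in b2j.items() if len(idxs) > ntest}

dna = "ACGT" * 250
code = ["return NULL;"] * 50 + ["line %d" % i for i in range(150)]
print(popular_items_version_a(dna))   # 4 distinct items: heuristic skipped
print(popular_items_version_a(code))  # 151 distinct items: heuristic applies
```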

Version B: add a parameter to .__init__ to make the heuristic optional. If the default were True ('use it'), then the code would run the same as now (even when bad). With the heuristic turned off, users would be able to get the .ratio they may expect and need. On the other hand, users would have to understand the heuristic to know when and when not to use it.

Version C: A more radical alternative would be to make one or more of the tuning parameters user settable, with one setting turning it off.

2: What type of issue is this, and which versions get changed?

I see the proposal as partial reversion of a change that sometimes causes a regression, in order to fix the regression. Such would usually be called a bugfix. Other tracker reviewers claim this issue is a feature request, not a bugfix. Either way, 3.2 gets the fix. The practical issue is whether at least 2.7(.1) should get the fix, or whether the bug should forever continue in 2.x.

3: Who will make the change?

Eli will write a patch and I will check it. However, Georg Brandl assigned the issue to Tim Peters, with a request for comment, but Tim never responded. Is there an active committer who will grab the issue and do a commit review when a patch is ready?

vbr · 2010-07-07T23:17:00Z

I guess I am not supposed to post to python-dev, not being a Python developer; hopefully it is appropriate to add a comment here, based only on my current usage of (a modified) difflib.SequenceMatcher.
It seems that the mentions of text comparison in that thread, e.g.
http://mail.python.org/pipermail/python-dev/2010-July/101515.html
etc., rather imply line-by-line comparison, and possibly character comparison of matched lines.
For me, direct character-wise comparison is more useful in most cases.
With the popular heuristic disabled, the results look pretty good.
(The script only involves changing the background colour of the compared texts, based on SequenceMatcher.get_opcodes().)
For now, I only need to disable the popular check; currently I use a monkey-patched subclass of SequenceMatcher with an extended signature and a modified __chain_b function,
cf. http://mail.python.org/pipermail/python-list/2010-June/1247907.html

I would vote for extending the SequenceMatcher API to enable adjustments (leaving the default values as the current ones) - enable/disable popular check, set the thresholds for string length and "popular" frequency (and eventually other parameters, which might be added).

Are there some restrictions on API changes in a library due to a moratorium - even if the default behaviour remains unchanged?
Otherwise, what might be the disadvantages of this approach?
If the current behaviour is considered appropriate for the original usecases, other uses would be also made possible/easier - only at the cost of learning the meaning of the added parameters - from the enhanced docs, of course.

vbr

terryjreedy · 2010-07-08T22:47:55Z

Anyone can post on Python-dev, but non-developers should do so judiciously and with respect for the purpose of the list. It is also polite to introduce oneself with the first post. In any case, Tim Peters has approved making some change. The remaining question is exactly what.

There is no problem with extending the API in 3.2. The debate there is over 2.7.

My fourth proposal, detailed on pydev, is to introduce a fourth parameter, 'common', to set the frequency threshold to None or int 1-99.

pitrou · 2010-07-08T22:52:46Z

> There is no problem with extending the API in 3.2. The debate there is
> over 2.7.

We could extend the API as long as it stays backwards-compatible (that
is, the default value for the new argument produces the same behaviour
as before).

terryjreedy · 2010-07-09T01:12:03Z

My proposal F, to expose the common frequency threshold as a fourth positional parameter with default 1, would do that: repeat current behavior. We should, and Eli and I would, add some of the anomalous cases to the test suite and verify that the default is to reproduce the current anomalies, and that passing None changes the result.

Any opinions, anyone, on 'common', 'thresh', 'threshold', or anything else as the new parameter name?

We will have to explain in the doc patch that the parameter is new in 2.7.1 to fix a partial bug and that giving any explicit value will make code not run with 2.7 (.0).

Exposing the set of common values as an instance attribute, as I proposed on pydev, would be a new feature not needed to fix the bug. So it should be limited to 3.2.

terryjreedy · 2010-07-14T01:45:21Z

[copied from pydev post]

Summary: adding an autojunk heuristic to difflib without also adding a way to turn it off was a bug because it disabled running code.

2.6 and 3.1 most likely have one final version each. Don't fix these, but add something to the docs explaining the problem and the future fix.

2.7 will have several more versions over several years and will be used by newcomers who might encounter the problem but not know to diagnose it and patch a private copy of the module. So it should have a fix. Solutions thought of so far.

1. Modify the heuristic to somewhat fix the problem. Bad (unacceptable) because this would silently change behavior and could break tests.
2. Add a parameter that defaults to using the heuristic but allows turning it off. Perhaps better, but code that used the new API would crash if run on 2.7.0.

Tim Peters:

> Think the most pressing thing is to give people a way to turn the damn
> thing off. An ugly way would be to trigger on an unlikely
> input-output behavior of the existing isjunk argument. For example,
> if
>  isjunk("what's the airspeed velocity of an unladen swallow?")
> returned
>  "don't use auto junk!"
> and 2.7.1 recognized that as meaning "don't use auto junk", code could
> be written under 2.7.1 that didn't blow up under 2.7. It could
> _behave_ differently, although that's true of any way of disabling the
> auto-junk heuristics.

Ugly, but perhaps crazy brilliant. Use of such a hack would obviously be temporary. Perhaps its use could be made to issue a -3 warning if such were enabled.

I would simplify the suggestion to something like
isjunk("disable!heuristic") == True
so one could pass
lambda s:s=="disable!heuristic"
It should be something easy to document and write. This issue is the only place such a string should appear, so it should be safe.
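For concreteness, the sentinel idea could be detected along the following lines. This is only a sketch of a suggestion made in this thread, not of anything in difflib; the helper `autojunk_requested` and the constant `SENTINEL` are hypothetical names.

```python
SENTINEL = "disable!heuristic"  # the string suggested in the comment above

def autojunk_requested(isjunk):
    """Return False when isjunk signals, via the sentinel string, that the
    automatic junk heuristic should be skipped (hypothetical helper)."""
    if isjunk is None:
        return True
    try:
        return not isjunk(SENTINEL)
    except Exception:
        # An isjunk written for non-string items may reject the sentinel.
        return True

print(autojunk_requested(None))                                # heuristic stays on
print(autojunk_requested(lambda s: s == "disable!heuristic"))  # heuristic off
print(autojunk_requested(lambda s: s == " "))                  # ordinary isjunk
```

One drawback visible even in the sketch is that the probe calls user code with an argument it may never have been designed to handle.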

Tim and Antoine: if you two can agree on what to do for 2.7, Eli and I will code it.

This suggestion amounts to a suggestion that the fix for 2.7 be decoupled from a better fix for 3.2. I agree. The latter can be discussed once 2.7 is settled.

pitrou · 2010-07-14T09:06:44Z

On Wednesday, 14 July 2010 at 01:45 +0000, Terry J. Reedy wrote:

> Add a parameter that defaults to using the heuristic but allows
> turning it off. Perhaps better, but code that used the new API would
> crash if run on 2.7.0

Yes, but this is an exceptional situation. We normally don't add new
APIs in bugfix versions. We'll have to live with it.

[...]
> Ugly, but perhaps crazy brilliant. Use of such a hack would obviously
> be temporary. Perhaps its use could be made to issue a -3 warning if
> such were enabled.

It's still incredibly ugly. Besides, code written for 2.7.1 might not
"blow up" with 2.7, but it will still have different behaviour.
If you are using the new parameter, it's because you *need* it, hence
different behaviour will be unacceptable; therefore, better to raise an
error as the API change proposal does.

terryjreedy · 2010-07-23T18:31:32Z

For 2.6 and 3.1, this is a documentation only issue.
For 2.7, this is a doc + behavior issue.
For 3.2, this is a doc + behavior + new feature issue.

For 2.6.6 (release candidate due Aug 2, 10 days), I propose to add the following paragraph after the current 'Timing:' paragraph in the SequenceMatcher entry ('Heuristic:' should be bold-faced, like 'Timing:')

Heuristic: To speed matching, items that appear more than 1% of the time in sequences of at least 200 items are treated as junk. This has the unfortunate side-effect of giving bad results for sequences constructed from a small set of items. An option to turn off the heuristic will be added to a future version.

I would have said 'to 2.7.1' but that has not happened yet. I thought about putting the heuristic paragraph first, but I think it fits better after the discussion of quadratic run time. I think it should be a separate paragraph and not tacked on the end of the previous paragraph so people will be more likely to take notice.

I have marked this a release blocker because at least 6 issues have been filed for this bug and so I think it important that the explanation be added to the next released doc. I plan to temporarily reassign this to docs@python in a few days.

elibendersky · 2010-07-24T04:59:16Z

Here's a patch for Doc/library/difflib.rst of the 2.6 branch, following Terry's suggested addition to the docs of the SequenceMatcher class.

Tested 'make html'.

birkenfeld · 2010-07-31T07:06:02Z

Deferring to after 3.2a1.

birkenfeld · 2010-07-31T08:00:47Z

Committed 2.6 patch in r83314.

warsaw · 2010-08-02T16:17:14Z

Georg committed this patch to the 2.6 tree, and besides, this doesn't seem like a blocking issue, so I'm kicking 2.6 off the list and knocking the priority down.

terryjreedy · 2010-09-01T21:32:58Z

While refactoring the code for 2.7, I discovered that the description of the heuristic for 2.6 and in the code comments is off by 1. "items that appear more than 1% of the time" should actually be "items whose duplicates (after the first) appear more than 1% of the time". The discrepancy arises because in the following code

        for i, elt in enumerate(b):
            if elt in b2j:
                indices = b2j[elt]
                if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]

len(indices) is retrieved *before* the index i of the current elt is added. Whatever one might think the heuristic 'should' have been (and by the nature of heuristics, there is no right answer), the default behavior must remain as it is, so we adjusted the code and doc to match that.
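The off-by-one can be observed from the outside on versions that expose the `bpopular` attribute (added in Python 3.2 by the follow-on issue). The sketch below assumes such a version; `popular_in` is a helper name invented for this example.

```python
from difflib import SequenceMatcher

def popular_in(b):
    # bpopular holds the items the heuristic dropped from b (Python 3.2+).
    return SequenceMatcher(None, [], b).bpopular

# 200 items in total; every integer occurs exactly once.
three_x = ["x"] * 3 + list(range(197))
four_x = ["x"] * 4 + list(range(196))

print(popular_in(three_x))  # 3 of 200 is more than 1%, yet "x" is kept
print(popular_in(four_x))   # the fourth occurrence tips "x" into popular
```

Three occurrences in 200 items already exceed 1%, but only the fourth makes the item popular, which matches the "duplicates after the first" wording.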

elibendersky · 2010-09-03T04:45:52Z

Attaching a patch (developed jointly with Terry Reedy) for 2.7 that adds an 'autojunk' parameter to SequenceMatcher's constructor. The parameter is True by default which retains the current behavior in 2.6 and earlier, but can be set by the user to False to disable the popularity heuristic. The patch also fixes some documentation inconsistencies that Terry raised in this message.

Notes:

Tests run successfully. Added new test class in test_difflib for testing with the new autojunk parameter False
Patch generated vs. Hg mirror
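A short usage sketch of the new parameter, reusing one of the minimal examples from earlier in this issue. It assumes a version that includes the patch (2.7.1+ or 3.2+).

```python
from difflib import SequenceMatcher

a = "x" + "y" * 199
b = "y" * 200

default_ratio = SequenceMatcher(None, a, b).ratio()
fixed_ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()

print(default_ratio)  # heuristic on: "y" is dropped as popular
print(fixed_ratio)    # heuristic off: the run of "y" is matched
```

Passing `autojunk` as a keyword keeps the call readable; on 2.7.0 and earlier the same call raises a TypeError because the parameter does not exist there.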

terryjreedy · 2010-09-07T16:02:28Z

The patch changes the internal function that constructs the dict mapping b items to indexes to read as follows:
create b2j mapping
if isjunk function, move junk items to junk set
if autojunk, move popular items to popular set
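The three steps above can be sketched as a standalone function. This is an approximation written for illustration, not the patch itself: `build_b2j` is a hypothetical name, and the frequency cutoff is an assumption meant to mirror the existing behavior.

```python
def build_b2j(b, isjunk=None, autojunk=True):
    """Sketch of the three steps; returns (b2j, junk, popular)."""
    # 1. Map every item of b to the list of positions where it occurs.
    b2j = {}
    for i, elt in enumerate(b):
        b2j.setdefault(elt, []).append(i)

    # 2. Move items rejected by the caller's isjunk function to the junk set.
    junk = set()
    if isjunk is not None:
        junk = {elt for elt in b2j if isjunk(elt)}
        for elt in junk:
            del b2j[elt]

    # 3. Optionally move very frequent items to the popular set.
    popular = set()
    n = len(b)
    if autojunk and n >= 200:
        ntest = n // 100 + 1  # assumed cutoff, see lead-in
        popular = {elt for elt, idxs in b2j.items() if len(idxs) > ntest}
        for elt in popular:
            del b2j[elt]
    return b2j, junk, popular

b = "y" * 200 + " "
print(build_b2j(b)[2])                     # popular items with the heuristic on
print(build_b2j(b, autojunk=False)[2])     # nothing is popular when it is off
print(build_b2j(b, lambda c: c == " ")[1]) # items flagged by isjunk
```

Separating the steps means the full mapping is always built first, so turning the heuristic off is a matter of skipping step 3.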

I helped write and test the 2.7 patch and verify that default behavior remains unchanged. I believe it is ready to commit.

3.1 and 3.2 patches will follow.

elibendersky · 2010-11-08T05:10:24Z

Adding a documentation patch for 3.1 which is similar to the 2.6 documentation patch that's been committed by Georg into 2.6

terryjreedy · 2010-11-10T18:20:56Z

Tim told me to continue with this as he has no time.
rev86401 - apply 3.1 doc fix

I cannot apply the 2.7 patch. It has different header lines. In particular, TortoiseSVN cannot fetch nonexistent revision "Mon Aug 30 06:37:52 2010 +0300". Please regenerate against current 2.7 with the method used for 2.6/3.1.

elibendersky · 2010-11-11T08:24:05Z

Attaching a new patch for 2.7 freshly generated vs. current 2.7 maintenance branch from SVN.

terryjreedy · 2010-11-12T00:22:41Z

bpo-2986.fix27.5.patch applied, with version note added to doc, as
rev86418

Only thing left is patch for 3.2, which Eli and I will produce.

terryjreedy · 2010-11-12T21:10:58Z

r86437 - correct and replicate version-added message

elibendersky · 2010-11-20T05:45:30Z

Terry, when is the deadline for producing the patch for 3.2? Perhaps we should at least submit the 2.7 patch for now so that it goes in for sure?

hodgestar · 2010-11-20T15:18:19Z

I made the minor changes needed to get Eli Bendersky's patch to apply against 3.2. Diff attached.

terryjreedy · 2010-11-20T17:55:13Z

Deadline is probably next Fri. However I will apply this or slight revision thereof in a couple of days to make sure this much is in. I have to fixup some work stuff today.

elibendersky · 2010-11-21T11:04:43Z

Simon's patch fix for 3.2 looks good to me - applies cleanly to py3k and tests pass.

terryjreedy · 2010-11-25T06:27:28Z

Since I am not sure I will be able to do any more before the 3.2b1 feature freeze, I went ahead with the minimal patch after checking the differences from the 2.7 version and redoing the Misc/News entry.
(I suspect putting a new entry immediately after the appropriate heading, instead of between other headings, is probably least likely to fatally conflict with intervening changes.) r86745 Thank you Eli and Simon.

Leaving this open for possible further changes.

hodgestar · 2010-11-25T06:48:02Z

My vote is that this bug be closed and a new feature request be opened. Failing that, it would be good to have a concise description of what else we would like done (and the priority should be downgraded, I guess).

elibendersky · 2010-11-25T06:59:23Z

Terry, I agree with Simon re closing and opening a new feature request. This issue has too much baggage in it, and we can always link to it. A new feature request should be opened strictly for 3.2.

If you want I can close this issue and open a new one, but I'm waiting for your approval.

terryjreedy · 2010-11-25T20:23:37Z

Agreed. bpo-10534. This is really a 'follow-on' rather than 'superseder',
but the forward reference should be easy for anyone to find.

hagna mannequin added the extension-modules label May 27, 2008

birkenfeld assigned tim-one Mar 29, 2009

bitdancer added docs, stdlib, and type-feature labels and removed the extension-modules label May 28, 2009

terryjreedy added release-blocker and type-bug labels and removed the type-feature label Jul 23, 2010

birkenfeld added deferred-blocker and removed release-blocker labels Jul 31, 2010

birkenfeld added release-blocker and removed deferred-blocker labels Jul 31, 2010

warsaw removed the release-blocker label Aug 2, 2010

terryjreedy assigned terryjreedy and unassigned tim-one Nov 10, 2010

terryjreedy added the type-feature label and removed the type-bug label Nov 25, 2010

terryjreedy closed this as completed Nov 25, 2010

terryjreedy added the type-bug label and removed the type-feature label Nov 25, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

sweeneyde mentioned this issue Aug 28, 2022

difflib.SequenceMatcher.find_longest_match() returns wrong result #96363

Closed
