-
-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
difflib.SequenceMatcher not matching long sequences #47235
Comments
The following code shows no matches though the strings clearly match. from difflib import *
a =
'''3904320338155955662857322172779218727992471109386112515279452352973279311752006856588512503244702012502812653160306927721351031250270279878152125021081471125246894603319162986283456469448293252335442814953964029718671705515246437056879456095915444174665464026255415736754542680178373675412998898571410483714801783736754144828361714801783736754133068408714801783736754140859665714801783736754153851004471480178373675415715864371410690714801783736754147488890714801783736205957668017837367545448801783104170539154677705102536314736754477780178373675415217103227148017837367541737811137714801783736754172791151671480178373675417692995271480178373675417575983571480178373675417398965871480178310417055026467770551235573705687945609591544562532964082675415736300610425832914520311514810301595721999571547897879113780178373618951021983280377781981989237498913678981414213198924949892679989164882577810944751102884217048258978791137801783104170511836542073627327981801279360326159714801783736171798080178310415420736447510213871790638471586131412631592131012571210126718031314200414571314893700123874777987006697747115770067074789312578013869801783104120529166337056879456095918495136604565251349544838956219513495753741344870733943253617458316356794745831634651172458316348316144586052838244151360641656349118903581890331689038658903263218549028909605134957536316060'''
b =
'''4634320338155955662857322172779218727992471109386112515279452352973279311752006856588512503244702012502812653160306927721351031250270279878152125021081471125246894603319162986283456469448293252335442814953964029718671705515246437056879456095915444174665464026255415736754542680178373675412998898571410483714801783736754144828361714801783736754133068408714801783736754140859665714801783736754153851004471480178373675415715864371410690714801783736754147488890714801783736205957668017837367545448801783104170539154677705102536314736754477780178373675413182108117148017837367541737811137714801783736754172791151671480178373675417692995271480178373675417575983571480178373675417398965871480178310417055026467770551235573705687945609591544562532964082675415736300610425832914520311514810301595721999571547897879113780178373618951021983280377781981989237498913678981414213198924949892679989164882577810944751102884217048258978791137801783104170511836542073627327981801279360326159714801783736171798080178310415420736447510213871790638471412131420041457131485122165131466702097131466731723131466741536131466751581131466771649131466761975131467212090131467261974131467231858131467201556131467212538131467221553131467221943131467231748131466711452131467271787131412578013869801783104154307361718482280178373638585436251621338931320893185072980138084820801545115716861861152948618615002682261422349251058108327767521397977810837298017831041205291663370568794560959184951366045652513495448389562195134957537413448707339432536174583163'''
lst = [(a,b)]
for a, b in lst:
print "---------------------------"
s = SequenceMatcher(None, a, b)
print "length of a is %d" % len(a)
print "length of b is %d" % len(b)
print s.find_longest_match(0, len(a), 0, len(b))
print s.ratio()
for block in s.get_matching_blocks():
m = a[block[0]:block[0]+block[2]]
print "a[%d] and b[%d] match for %d elements and it is \"%s\"" %
(block[0], block[1], block[2], m) |
Tim, I think you've had some enlightening comments about difflib issues |
From the source, it seems that there is undocumented behavior to So, in this case, difflib is treating each individual digit as an A snippet which demonstrates this: from difflib import SequenceMatcher
for i in range(1, 202)[::10]:
a = "a" * i
b = "b" + "a" * i
s = SequenceMatcher(None, a, b)
print s.find_longest_match(0, len(a), 0, len(b)) Up til i=200, the strings match, but afterwards they do not because "a" Strangely, if you get rid of the "b" at the beginning of b, they The comments from difflib.py say some interesting things: This seems to mean that you won't actually get an accurate diff in |
On Mon, 30 Mar 2009 at 00:40, Mike Rotondo wrote:
A better way, I think, would be to provide a way to turn |
The popularity heuristic could be tuned to depend on the number N of |
I just stumbled on some seemingly different unexpected behaviour of besides this In my case, disabling the "popular" heuristics as mentioned by John Machin in seems to have solved the problem; with a modified version of difflib containing: if 0: # disable popular heuristics
if n >= 200 and len(indices) * 100 > n:
populardict[elt] = 1
del indices[:] the comparison catches the differences in the test strings as expected - i.e. one character addition and deletion only. It is likely, that some other use cases for difflib may rely on the "popular"-heuristics but it also seems useful to have some control over this behaviour, which might not be appropriate in all cases. regards, |
This appears to be one of at least three duplicate issues: bpo-1528074, bpo-2986, and bpo-4622. I am closing two, leaving 2986 open, and merging the nearly disjoint nosy lists. (If no longer interested, you can delete yourself from 2986.) bpo-1711800 appears to be slightly different (if not, it could be closed also.) Whether or not a new feature is ever added (earliest, now, 3.2), it appears that the docs need improvement to at least explain the current behavior. If someone who understands the issue could open a separate doc issue (for 2.6/7/3.1/2) with a suggested addition, that would be great. |
The discussion on bpo-152807 references two other closed tracker issues: The patch simply removes the internal heuristic. I think a better patch would be to make it optional, with a tunable popularity threshold. I say 'supposedly intentional' because the code comments only justify the popularity hack for code line comparison and give no indication of awareness that it disables SequenceMatcher for general purpose use, and in particular, for non-toy finite character set comparisons of the type (ascii) used in all the examples. |
The new "junk heuristic" has been added to difflib.py in SVN revision 26661 in 2002 (which is, incidentally, the last revision to modify difflib.py). Its commit log says: --------------------------------------------- While I like what I've seen of the effects so far, I still consider |
[Also posted to pydev for additional input, with Subject line Summary: difflib.SeqeunceMatcher was developed, documented, and originally operated as "a flexible class for comparing pairs of sequences of any [hashable] type". An "experimental" heuristic was added in 2.3a1 to speed up its application to sequences of code lines, which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet. The regression is easy to fix. The docs were never changed to reflect the effect of the heuristic, but should be, with whatever additional change is made. In the commit message for revision 26661, which added the heuristic, Tim Peters wrote "While I like what I've seen of the effects so far, I still consider this experimental. Please give it a try!" Several people who have tried it discovered the problem with small alphabets and posted to the tracker. Issues bpo-1528074, bpo-1678339. bpo-1678345, and bpo-4622 are now-closed duplicates of bpo-2986. The heuristic needs revision. Open questions (discussed after the examples): what exactly to do, which versions to do it too, and who will do it. --- from difflib import SequenceMatcher as SM # base example # make 'y' junk # Increment b by 1 char # Reverse a and b, which increments b The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content. A more realistic example than the above is comparing DNA gene sequences. Without the heuristic SequenceMatcher.get_opcodes() reports an appropriate sequence of matches and edits and .ratio works as documented and expected. For 1000/2000/6000 bases, the times on a old Athlon 2800 machine are <1/2/12 seconds. Since 6000 is longer than most genes, this is a realistic and practical use. With the heuristic, everything is junk and there is only one match, ''=='' augmented by the initial prefix of matching bases. This is followed by one edit: replace the rest of the first sequence with the rest of the second sequence. A much faster way to find the first mismatch would be --- 1: what change should be make. Proposed fix: Disentangle the heuristic from the calculation of the internal b2j dict that maps items to indexes in the second sequence b. Only apply the heuristic (or not) afterward. Version A: Modify the heuristic to only eliminate common items when there are more than, say, 100 items (when len(b2j)> 100 where b2j is first calculated without popularity deletions). The would leave DNA, protein, and printable ascii+[\n\r\t] sequences alone. On the other hand, realistic sequences of more than 200 code lines should have at least 100 different lines, and so the heuristic should continue to be applied when it (mostly?) 'should' be. This change leaves the API unchanged and does not require a user decision. Version B: add a parameter to .__init__ to make the heuristic optional. If the default were True ('use it'), then the code would run the same as now (even when bad). With the heuristic turned off, users would be able to get the .ratio they may expect and need. On the other hand, users would have to understand the heuristic to know when and when not to use it. Version C: A more radical alternative would be to make one or more of the tuning parameters user settable, with one setting turning it off.
I see the proposal as partial reversion of a change that sometimes causes a regression, in order to fix the regression. Such would usually be called a bugfix. Other tracker reviewers claim this issue is a feature request, not a bugfix. Either way, 3.2 gets the fix. The practical issue is whether at least 2.7(.1) should get the fix, or whether the bug should forever continue in 2.x.
Eli will write a patch and I will check it. However, Georg Brandel assigned the issue to Tim Peters, with a request for comment, but Tim never responded. Is there an active committer who will grab the issue and do a commit review when a patch is ready? |
I guess, I am not supposed to post to python-dev - not being a python developer, hopefully it is appropriate to add a comment here - only based on my current usage of (a modified) difflib.SequenceMatcher. I would vote for extending the SequenceMatcher API to enable adjustments (leaving the default values as the current ones) - enable/disable popular check, set the thresholds for string length and "popular" frequency (and eventually other parameters, which might be added). Are there some restrictions on API changes in a library due to a moratorium - even if the default behaviour remains unchanged? vbr |
Anyone can post on Python-dev, but non-developers should do so judiciously and with respect for the purpose of the list. It is also polite to introduce oneself with the first post. In any case, Tim Peters has approved making some change. The remaining question is exactly what. There is no problem with extending the API in 3.2. The debate there is over 2.7. My fourth proposal, detailed on pydev, is to introduce a fourth paramater, 'common', to set the frequency threshold to None or int 1-99. |
We could extend the API as long as it stays backwards-compatible (that |
My proposal F, to expose the common frequency threshold as a fourth positional parameter with default 1, would do that: repeat current behavior. We should, and Eli and I would, add some of the anomalous cases to the test suite and verily that the default is to reproduce the current anomalies, and that passing None changes the result. Any opinions, anyone, on 'common', 'thresh', 'threshold', or anything else as the new parameter name? We will have to explain in the doc patch that the parameter is new in 2.7.1 to fix a partial bug and that giving any explicit value will make code not run with 2.7 (.0). Exposing the set of common values as an instance attribute, as I proposed on pydev, would be a new feature not needed to fix the bug. So it should be limited to 3.2. |
[copied from pydev post] Summary: adding an autojunk heuristic to difflib without also adding a way to turn it off was a bug because it disabled running code. 2.6 and 3.1 each have, most likely, one final version each. Don't fix for these but add something to the docs explaining the problem and future fix. 2.7 will have several more versions over several years and will be used by newcomers who might encounter the problem but not know to diagnose it and patch a private copy of the module. So it should have a fix. Solutions thought of so far.
Tim Peters
Ugly, but perhaps crazy brilliant. Use of such a hack would obviously be temporary. Perhaps its use could be made to issue a -3 warning if such were enabled. I would simplify the suggestion to something like Tim and Antoine: if you two can agree on what to do for 2.7, Eli and I will code it. This suggestion amounts to a suggestion that the fix for 2.7 be decoupled from a better fix for 3.2. I agree. The latter can be discussed once 2.7 is settled. |
Le mercredi 14 juillet 2010 à 01:45 +0000, Terry J. Reedy a écrit :
Yes, but this is an exceptional situation. We normally don't add new
It's still incredibly ugly. Besides, code written for 2.7.1 might not |
For 2.6 and 3.1, this is a documentation only issue. For 2.6.6 (release candidate due Aug 2, 10 days), I propose to add the following paragraph after the current 'Timing:' paragraph in the SequenceMatcher entry ('Heuristic:' should be bold-faced, like 'Timing:') Heuristic: To speed matching, items that appear more than 1% of the time in sequences of at least 200 items are treated as junk. This has the unfortunate side-effect of giving bad results for sequences constructed from a small set of items. An option to turn off the heuristic will be added to a future version. I would have said 'to 2.7.1' but that has not happened yet. I thought about putting the heuristic paragraph first, but I think it fits better after the discussion of quadratic run time. I think it should be a separate paragraph and not tacked on the end of the previous paragraph so people will be more likely to take notice. I have marked this a release blocker because at least 6 issues have been filed for this bug and so I think it important that the explanation be added to the next released doc. I plan to temporarily reassign this to docs@python in a few days. |
Here's a patch for Doc/library/difflib.rst of the 2.6 branch, following Terry's suggested addition to the docs of the SequenceMatcher class. Tested 'make html'. |
Deferring to after 3.2a1. |
Committed 2.6 patch in r83314. |
Georg committed this patch to the 2.6 tree, and besides, this is doesn't seem like a blocking issue, so I'm kicking 2.6 off the list and knocking the priority down. |
While refactoring the code for 2.7, I discovered that the description of the heuristic for 2.6 and in the code comments is off by 1. "items that appear more than 1% of the time" should actually be "items whose duplicates (after the first) appear more than 1% of the time". The discrepancy arises because in the following code for i, elt in enumerate(b):
if elt in b2j:
indices = b2j[elt]
if n >= 200 and len(indices) * 100 > n:
populardict[elt] = 1
del indices[:]
else:
indices.append(i)
else:
b2j[elt] = [i] len(indices) is retrieved *before* the index i of the current elt is added. Whatever one might think the heuristic 'should' have been (and by the nature of heuristics, there is no right answer), the default behavior must remain as it is, so we adjusted the code and doc to match that. |
Attaching a patch (developed jointly with Terry Reedy) for 2.7 that adds an 'autojunk' parameter to SequenceMatcher's constructor. The parameter is True by default which retains the current behavior in 2.6 and earlier, but can be set by the user to False to disable the popularity heuristic. The patch also fixes some documentation inconsistencies that Terry raised in this message. Notes:
|
The patch changes the internal function that constructs the dict mapping b items to indexes to read as follows: I helped write and test the 2.7 patch and verify that default behavior remains unchanged. I believe it is ready to commit. 3.1 and 3.2 patches will follow. |
Adding a documentation patch for 3.1 which is similar to the 2.6 documentation patch that's been committed by Georg into 2.6 |
Tim told me to continue with this as he has no time. I cannot apply 2.7 patch. I has different header lines. In particular, TortoiseSVN cannot fetch nonexistent revision "Mon Aug 30 06:37:52 2010 +0300". Please regenerate against current 2.7 with method used for 2.6/3.1. |
Attaching a new patch for 2.7 freshly generated vs. current 2.7 maintenance branch from SVN. |
bpo-2986.fix27.5.patch applied, with version note added to doc, as Only thing left is patch for 3.2, which Eli and I will produce. |
r86437 - correct and replicate version-added message |
Terry, when is the deadline for producing the patch for 3.2? Perhaps we should at least submit the 2.7 patch for now so that it goes in for sure? |
I made the minor changes needed to get Eli Bendersky's patch to apply against 3.2. Diff attached. |
Deadline is probably next Fri. However I will apply this or slight revision thereof in a couple of days to make sure this much is in. I have to fixup some work stuff today. |
Simon's patch fix for 3.2 looks good to me - applies cleanly to py3k and tests pass. |
Since I am not sure I will be able to do any more before the 3.2b1 feature freeze, I went ahead with the minimal patch after checking the differences from the 2.7 version and redoing the Misc/News entry. Leaving this open for possible further changes. |
My vote is that this bug be closed and a new feature request be opened. Failing that, it would be good to have a concise description of what else we would like done (and the priority should be downgraded, I guess). |
Terry, I agree with Simon re closing and opening a new feature request. This issue has too much baggage in it, and you we always link to it. A new feature request should be opened strictly for 3.2 If you want I can close this issue and open a new one, but I'm waiting for your approval. |
Agreed. bpo-10534. This is really a 'follow-on' rather than 'superseder', |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: