
difflib.py Differ.compare is too slow [for degenerate cases] #119105

Closed
pulkin opened this issue May 16, 2024 · 36 comments · Fixed by #119131, #119376 or #119492

pulkin commented May 16, 2024

Bug report

Bug description:

import difflib

a = ["0123456789\n"] * 1_000
b = ["01234a56789\n"] * 1_000
list(difflib.Differ().compare(a, b))  # very slow and will probably hit max recursion depth

The case is pathological in the sense of many lines with the same exact diff / ratio.

The issue is that, in the current implementation, _fancy_replace takes the first pair of lines (among those with the same ratio) as a split point and then calls itself recursively for all lines starting at 2, then 3, 4, etc. This repeats 1_000 times, resulting in massive recursion depth and O(N^3) complexity scaling.

For an average random case it should split anywhere within the range of 1_000 lines, giving a complexity scaling of O(N^2 log N).

I personally encountered this while diffing CSV files where one of the files has a column added, which apparently results in the same ratio for every line in the file.
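A rough timing harness along these lines (sizes arbitrary, kept small to stay under the recursion limit) can be used to see the growth; timings are machine-dependent, the point is the roughly cubic growth rate:

import difflib
import time

for n in (25, 50, 100):
    a = ["0123456789\n"] * n
    b = ["01234a56789\n"] * n
    start = time.perf_counter()
    list(difflib.Differ().compare(a, b))
    print(n, round(time.perf_counter() - start, 3))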

Proposal

Fixing this is not so hard by adding some heuristics (WIP) pulkin@31e1ed0

The idea is very straightforward: while doing the _fancy_replace magic, if you see many diffs with the same exact ratio, pick the one closest to the middle of chunks rather than the first one (which can be the worst possible choice). This is done by promoting the ratios of those pairs of lines that are closer to the middle of the chunks.

The _drag_to_center function turns a line number into a weight added to the ratio (twice: for a and for b). The weight is zero at both ends of the chunk and maximal in the middle (a quadratic poly was chosen for simplicity). The magnitude of the weight, "_gravity", is small enough to only affect ratios that are exactly the same: it relies on the assumption that we probably have fewer than 500k symbols per line, such that the steps in the ratio are greater than 1e-6. If this assumption fails, some diffs may become different (not necessarily worse).
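Roughly, the shape is like the sketch below (illustrative only; the exact constants and names in the WIP commit may differ):

def _drag_to_center(index, lo, hi, _gravity=1e-6):
    # Quadratic bump: zero at both ends of [lo, hi), maximal at the middle.
    span = hi - lo
    if span <= 1:
        return 0.0
    x = (index - lo) / (span - 1)          # 0.0 at the left end, 1.0 at the right end
    return _gravity * 4.0 * x * (1.0 - x)  # peaks at x == 0.5, i.e. the midpoint

The promoted score for a pair of lines is then ratio + _drag_to_center(i, alo, ahi) + _drag_to_center(j, blo, bhi).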

Performance impact for non-pathological cases is probably minimal.

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

Linux

pulkin added the type-bug (An unexpected behavior, bug, or error) label May 16, 2024
tim-one added the performance (Performance or resource usage) and type-feature (A feature request or enhancement) labels and removed the type-bug label May 16, 2024
tim-one commented May 16, 2024

Changed label from "type-bug" to "performance" and "type-feature". difflib doesn't make any promises about speed, and finding "bad cases" is expected - that's not "bug" territory.

In your proposal, please add some high-level comments about the meaning of "_gravity" and what range of values might be useful. The name on its own is just a mystery.

tim-one commented May 16, 2024

BTW, I agree this class of inputs is worth addressing 😄. Please open a Pull Request?

Offhand, it seems to me that this would be a faster and much more obvious way to get a value that's 0 at i=0 and i=hi-1, and a maximum of 0.5 at the midpoint:

min(i - lo, hi - 1 - i) / (hi - lo)

pulkin commented May 16, 2024

Sure, that will probably do just as well; I was just avoiding function calls.

tim-one commented May 16, 2024

Micro-optimizations like that are increasingly out of style in core Python code. They generally don't help (or even hurt) when running under PyPy, and CPython is making real progress at speeding the common cases too.

Relatedly, nobody ever abuses the default-argument mechanism to add local_name=some_global_name_or_const "arguments" anymore, unless it's absolutely time-critical in a very heavily used library.

Otherwise, clarity is valued more.

tim-one commented May 16, 2024

Let's stay away from hard-coded constants, too - especially ones with "mystery names" 😉. Constants frequently bite us later, and are rarely needed.

In this case, .ratio() is defined to return 2.0*M / T, where M is the number of matches and T is the total number of characters. So code like this, once at the start, is cheap to determine what's actually required:

lena = ahi - alo
lenb = bhi - blo
smallest_ratio_delta = 2 * 0.99 / (lena + lenb)

Defining lena and lenb locals may also simplify later code. Multiplying by 0.99 is to make this a lower bound, to be on the safe side. Again resist micro-optimization: while years ago it helped a little bit to avoid the multiply and type in "1.98" by hand, those days are long over. Even CPython does that arithmetic at compile-time now.

Then if you have two "midrange bonus" functions that max out at 0.5 for i and j being at the midpoints of their ranges,

midrange_total_bonus = (midrange_a_bonus + midrange_b_bonus) * smallest_ratio_delta 

will yield a score bonus less than the difference between any two distinct raw scores (the sum of the two bonuses can't be greater than 1.0).
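A rough, self-contained sketch of how those pieces fit together (index ranges and raw ratios made up for illustration):

alo, ahi = 0, 10
blo, bhi = 0, 12
lena = ahi - alo
lenb = bhi - blo
smallest_ratio_delta = 2 * 0.99 / (lena + lenb)

def midrange_bonus(i, lo, hi):
    # 0 at either end of the index range, largest at the midpoint.
    return min(i - lo, hi - 1 - i) / (hi - lo)

for i, j, raw_ratio in [(0, 0, 0.8), (5, 6, 0.8), (9, 11, 0.8)]:
    bonus = (midrange_bonus(i, alo, ahi) + midrange_bonus(j, blo, bhi)) * smallest_ratio_delta
    print(i, j, raw_ratio + bonus)   # the centered pair wins among equal raw ratios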

tim-one commented May 17, 2024

"Strength reduction" looks like a worthwhile optimization. Short course: all function calls, and even multiplies, can be removed from the inner loop, by keeping the bonus as a running total, and adjusting it near the top of the loop like so:

                bonus += a_inc if i <= ihalf else -a_inc

where the other names are function-level invariants:

        ihalf = (alo + ahi - 1) / 2
        a_inc = smallest_ratio_delta / lena

So the bonus keeps going up around the loop, until the midpoint is reached, after which it keeps going down.

Similarly for the outer loop. Some details may be tricky. Better names could be picked 😉.
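A tiny standalone sketch of the a-side running total (made-up numbers, not library code):

alo, ahi = 0, 9
lena = ahi - alo
smallest_ratio_delta = 0.09            # stand-in for the real bound
a_inc = smallest_ratio_delta / lena
ihalf = (alo + ahi - 1) / 2

bonus = 0.0
for i in range(alo, ahi):
    # Adjust at the very top of the loop, before anything can `continue` past it.
    bonus += a_inc if i <= ihalf else -a_inc
    print(i, round(bonus, 3))          # rises up to the midpoint, then falls back off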

If you don't want to bother, let me know here, and I'll open a pull request.

pulkin commented May 17, 2024

Unfortunately, smallest_ratio_delta is not about alo or ahi: it is about the smallest difference across unrelated comparisons (so I was also wrong with my 500k-symbol estimation). It has to be much smaller; this one is probably correct (an upper bound):

len_a = max(map(len, a))
len_b = max(map(len, b))
smallest_ratio_delta = 2 * 0.99 / (len_a + len_b) ** 2

I plan to play around and open a PR. My first one here!

tim-one commented May 17, 2024

Wow - I was hallucinating badly, wasn't I ☹️?

It's all yours - I look forward to the PR 😄

Unless I'm still hallucinating, the square in the new expression isn't needed, is it? If the longest elements ever compared have lengths len_a and len_b, the smallest non-zero .ratio() possible is 2 over their sum.

pulkin commented May 17, 2024

No problem: I often skip over large chunks and forget to come back and explain.

Given the ratio 2 * some_integer / (len(ai) + len(bj)),

the difference between two arbitrary ratios is simply

  2 * some_integer / (len(ai1) + len(bj1))
- 2 * other_integer / (len(ai2) + len(bj2)) =

2 * [some_integer * (...) - other_integer * (...)]
/ [(len(ai1) + len(bj1)) * (len(ai2) + len(bj2))] =

2 * still_some_integer
/ [(len(ai1) + len(bj1)) * (len(ai2) + len(bj2))]

Trying to minimize the absolute value of the above (while requiring it to be non-zero), we discard the details of the numerator except that it is a non-zero integer: thus, we take still_some_integer = 1. For the denominator we should technically take the largest and next-largest len(ai) + len(bj), but we just take the largest possible sum len(ai) + len(bj) twice, which simplifies the computation and only slightly worsens our lower bound.

Example (my rough understanding rather than the actual computation):

a1, b1 = "abc", "a"
a2, b2 = "abcd", "a"
ratio_1 = 2 * 1/4
ratio_2 = 2 * 1/5
ratio_1 - ratio_2 = 2 * 1/20  # our estimation is 2 * 1/25 instead
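The same numbers fall out of SequenceMatcher directly:

import difflib

r1 = difflib.SequenceMatcher(None, "abc", "a").ratio()   # 2 * 1/4 = 0.5
r2 = difflib.SequenceMatcher(None, "abcd", "a").ratio()  # 2 * 1/5 = 0.4
print(r1 - r2)               # ~0.1 == 2 * 1/20
print(2 / (4 + 1) ** 2)      # 0.08: the cruder bound, safely below the true delta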

pulkin changed the title from "difflib.py Differ.compare is too slow [for pathological cases]" to "difflib.py Differ.compare is too slow [for degenerate cases]" May 17, 2024
tim-one commented May 17, 2024

Thanks! Got it now. The smallest delta between .ratio()s of two sequences of the same lengths may indeed be much larger than the delta between .ratio()s of sequences with different (but no larger) lengths.

Your more careful analysis looks right. Using the square to simplify reasoning is fine. It's not like we're pushing the limits of float precision here 😉.

pulkin commented May 17, 2024

I am slightly concerned by how small the weights can become (but only slightly). We now try to make differences as small as 1 / line_size ** 2 / number_of_lines. For a file with 10k lines and 1k symbols per line this is 1e-11, which is ok-ish, yet not so distant from the double-precision limit of 1e-17 on a logarithmic scale.

tim-one commented May 17, 2024

If we end up computing differences so small that they end up getting totally lost to rounding when added to the ratio, I don't think it much matters - just a waste of time.

The interesting cases are near the midpoints of the index ranges, where your factor number_of_lines is close to 1 instead.

Alternatives include using fractions.Fraction to retain exact ratios and weights internally, or (faster) to use decimal configured to use "sufficient" precision.

tim-one commented May 17, 2024

Unless I'm hallucinating again, there's a simpler way, which doesn't add new floating-point stress, or even require computing "epsilon". Store the ratio instead as a 2-tuple, (ratio_just_as_now, distsum). Here distsum is an int giving the sum of the distance of i from its closest endpoint, and likewise for j.

Tuple comparison does the rest. If the first components differ (ratios we use today differ), that settles it. Else (ratios the same), the larger is whichever has the larger distsum.

Quick coding of that worked fine. Strength reduction in the inner loop becomes just

                distsum += 1 if i <= ihalf else -1

It's fine to use a more long-winded way in the outer loop, though. In real life, the most important line of the inner loop is the real_quick_ratio() "early out", and I'm loath to add cycles that slow down getting to that very fast test (under PyPy, the method call may even be inlined).
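Abstracting away the rest of _fancy_replace, a minimal sketch of how the tuple comparison settles ties (ranges and ratios made up):

alo, ahi = 0, 10
blo, bhi = 0, 10

def distsum(i, j):
    # Sum of each index's distance from its nearest endpoint.
    return min(i - alo, ahi - 1 - i) + min(j - blo, bhi - 1 - j)

candidates = [(0.8, 0, 0), (0.8, 5, 5), (0.8, 9, 9), (0.9, 1, 1)]  # (raw ratio, i, j)
best = max(candidates, key=lambda c: (c[0], distsum(c[1], c[2])))
print(best)   # (0.9, 1, 1): a strictly larger ratio always wins; ties go toward the center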

tim-one commented May 18, 2024

I think I'll back off on "strength reduction". The full-blown min(i - alo, ahi-1 - i) is certainly easier to grasp, and not all that much slower. Although I'd personally precompute ahi-1 😉 - saving trips around the eval loop can still matter in innermost loops.

pulkin commented May 18, 2024

I updated the branch with rolling weights. I think I like it the most; the only subtle point is that the weight is now an integer: instead of computing delta_a / len_a + delta_b / len_b, I compute only the numerator, delta_a * len_b + delta_b * len_a. I thought about further simplifying it to delta_a + delta_b: is normalization that important?

I've put an assert weight == 0 at the end of the loop (not in the branch) to check if everything is correct.

tim-one commented May 18, 2024

I thought about further simplifying it to delta_a + delta_b: is normalization that important?

I don't even know of a case where "normalization" is helpful 😉. Do you? Suppose we have two sequences of wildly different lengths (so normalization could make the most difference) but all element pairs have the same .ratio(). Without normalization, the best-scoring pair will come from the midpoints of both sequences. Nothing can improve on that, but adding normalization makes it more of a puzzle to figure out what happens.

I've put an assert weight == 0 at the end of the loop (not in the branch)

Why not in the code? Checking for 0 is cheap.

Note a subtlety with "rolling weights": in the innermost loop, the adjustment has to be done very near the top of the loop. Else there's a conditional (very rarely executed) continue that jumps back to the top of the loop and so skips that iteration's adjustment. Something the aforementioned assert would have caught 😉.

tim-one commented May 18, 2024

FYI, this is the cheapest way I've thought of to get a non-rolling weight:

# Function-level invariants.
midi = (alo + ahi-1) / 2
midj = (blo + bhi-1) / 2

    # Outer loop.
    b_weight = -abs(j - midj)
    ...
        # Inner loop.
        weight = b_weight - abs(i - midi)

It's a float (but with fractional part 0 or 0.5). If there are an odd number of possible indices, it's exactly 0 at the middle one, and gets smaller(more negative) at the same rate for indices moving away from the midpoint.

If there are an even number of indices, the two center ones get the largest weight, -0.5.

I don't think signs, or possible fractions of 0.5, matter. As in your original post, "convex" is what matters: max out at the central indices, and get smaller the farther from the center.

tim-one commented May 18, 2024

I think reasoning rules out that normalization can do any good. Consider what happens without it, and assuming weights are "convex".

The only interesting cases are where the weights make a difference, at identical raw .ratio()s. Restrict attention to only those (i, j) pairs that reach the maximum raw .ratio(). If (I, J) is one of those pairs where I and J are both closest to their respective index midpoints, by convexity the weights of I and of J alone are each as large as possible. No other (i, j) pair can exceed its weight.

Convexity is scale-invariant, so normalization doesn't hurt this: we can multiply the i scores by any non-zero constant, and the j scores by any other, and it doesn't make any difference to that outcome.

So don't bother with multiplying at all 😄.

pulkin commented May 19, 2024

Suppose we have two sequences of wildly different lengths (so normalization could make most difference) but all element pairs have the same .ratio().

I agree: there is no normalization needed in this case. But suppose we have just two competitors with indices short_1, long_1 and short_2, long_2. We are generally more concerned about equally splitting the shorter text, so accurate weighting may be important. Without weights, the values of long_* are in general bigger and thus more impactful. We may want the opposite.

If (I, J) is one of those pairs where I and J are both closest to their respective index midpoints, by convexity the weights of I and of J alone are each as large as possible.

This one also caught my attention! But this is not always the case, as I hopefully explained above. If we are choosing the best_i from a set of candidates {i} and the best_j from a set of candidates {j} separately (as happens in the code snippet above), then we do not need any weights. But if we are choosing the best pair from the set {(i, j)}, which is not necessarily a product of some sets {i} and {j}, then the result does depend on the assigned weights. And the issue is not the form of the weight function; it is rather that the search space may not be trivial.

And this may not be the end of the story. After some thinking I came up with this statement:

Deep enough into the recursion tree we are guaranteed to have comparable (by size) sub-sequences (provided the specific choice of the weight function -abs(i - mid_a) - abs(j - mid_b) for example). Thus, from the point of fixing the complexity scaling, weights are not important indeed.

The O(N^3) complexity scaling and the recursion blow-up may happen only if

  • the search space {i, j} includes a specific structure (i1, j1), (i2, j2), ... (in, jn) where both indices are strictly growing, i1 < i2 < i3 < ... < in and j1 < j2 < j3 < ... < jn, and
  • the weight function promotes the pairs sequentially: we split at (i1, j1), then at (i2, j2), etc.

And it is particularly difficult to find an example where the weight function -abs(i - mid_a) - abs(j - mid_b) would do such a cruel thing! The best I came up with is where the long sequence is exponentially bigger than the short one, so that "deep enough" is actually deeper than the number of elements in the short sequence. But it is still nearly optimal from the long sequence's point of view.

pulkin commented May 19, 2024

On a related note, there might be some room for further improvement in the above. I think in the case discussed here it is much faster to split into more than two chunks at the same recursion depth. Technically, we should find the longest strictly growing sequence (i1, j1), (i2, j2), ... (in, jn) such that i1 < i2 < i3 < ... < in and j1 < j2 < j3 < ... < jn and do recursive calls n+1 times rather than 2 times only. For the case in the header it is going to be a massive improvement.

Even more, we could avoid recursive calls completely by computing all pairs above the 0.75 threshold and finding the longest chain above (some research needed with regard to fast ratio lower bounds optimizations).

Yet, I believe this PR is good enough.

tim-one commented May 19, 2024

You're right again 😄. I over-generalized from your specific test case, assuming the full Cartesian product of the index ranges would be in play in bad cases.

Well, maybe they are in the worst current cases. The sparser the subset of the full product in play, the harder it looks to contrive bad cases. I'll settle for a massive improvement in the worst cases.

I think in the case discussed here it is much faster to split into more than two chunks at the same recursion depth.

Absolutely so. The recursive calls are doing purely redundant work, finding the same max ratio at (a subset of) the same line pairs over & over & over again.

I knew that at the time, but an obvious quick way to exploit it didn't come to mind. As is, short-circuiting so that only a brand new max could make it through the gauntlet of "if" tests paid off a lot. Relaxing that to also find pairs that equal the current max was unattractive - and "bad cases" never showed up in real life. But, at the time, the library was overwhelmingly used only to compare files of Python and C code, and human-written prose.

If you'd like to pursue it, have at it!

Yet, I believe this PR is good enough.

No question that it's a huge improvement as is.

tim-one linked a pull request May 19, 2024 that will close this issue
tim-one commented May 19, 2024

Something like this (wholly untested) might do the trick, preserving short-circuiting as much as possible without missing an "equals the best" case. For each j, it remembers the smallest i with an equal ratio, provided that i is greater than the last such i remembered (and not since thrown away by finding a strictly larger ratio).

from operator import gt, ge
max_pairs = [] # index pairs achieving best_ratio
maxi = -1 # `i` index of last pair in max_pairs

for j in range_b:
    searching = True
    for i in range_a:
        cmp = ge if searching and i > maxi else gt
        if cmp(real_quick_ratio, best_ratio) and ...:
            searching = False
            if ratio > best_ratio:
                max_pairs.clear()
                best_ratio = ratio
            max_pairs.append((i, j))
            maxi = i
    # The indices in max_pairs all have the same, maximal ratio,
    # and are strictly ascending in both positions.

It's still a "greedy" approach. In your original test case, I expect max_pairs to end up as [(alo, blo), (alo + 1, blo + 1), (alo + 2, blo + 2), ...].

tim-one added a commit that referenced this issue May 19, 2024
Code from https://github.com/pulkin, in PR
#119131

Greatly speeds `Differ` when there are many identically scoring pairs, by splitting the recursion near the inputs' midpoints instead of degenerating (as now) into just peeling off the first two lines.

Co-authored-by: Tim Peters <tim.peters@gmail.com>
tim-one commented May 20, 2024

we could avoid recursive calls completely by computing all pairs above the 0.75 threshold and finding the longest chain above

I'm not sanguine about that. The "synch pair" establishes a very high wall, with no cross-talk between the "before" and "after" segments of the inputs. This is mostly driven by the linear nature of diff output, restricted to "matches", "insert", "delete", and "replace", from first to last lines of both inputs.

For example, suppose we have two thousand-line inputs. The first line of a is a very close match to the last line of b, but all other pairs are only (say) 90% similar. Ideally it could be useful to show how those 90% similar lines are related to each other, but the output has no way to do that. Instead the best(*) that can be done is:

  • The first 999 lines of b were inserted.
  • Then there's one very similar line.
  • Then the last 999 lines of a were deleted.

Any ratio less than the unique very best is irrelevant to that outcome.

If the synch pair occurred more toward the middle of the inputs, the recursive calls could do a lot of redundant work, but again ratios between the "before" elements of one input and the "after" elements of the other are useless after the synch pair is picked.

There's also that difflib intends never to create a data structure whose size is O(len(a) * len(b)). In the pseudo-code posted before this, len(max_pairs) <= min(ahi - alo, bhi - blo), which is minor.

(*) Of course that's arguable. Could be that it's the "very close match" that should be discounted, in favor of producing markup for the 999 "pretty close" matches. But that would be a Big Change, and this isn't Hard Science™.

pulkin commented May 20, 2024

All true. Still, I think it would be worthwhile to study other diff libs.

Just to mention, the code merged does change some diff outputs. For example, this one will find only one match, instead of two (before).

cat
dog
close match 1
close match 2

vs

close match 3
close match 4
kitten
puppy

If we want to restore it (and keep the recursion under control), we can take the approach above for ratio == max(ratio) while keeping the original (greedy) algorithm. Then, we may consider loosening it towards other ratio >= .75 and using a more educated guess to form the chain (i1, j1) < (i2, j2) < ... < (in, jn). If we properly implement the foundation, then neither change is so massive any longer, whether they go in or not.

I, too, have concerns about considering all ratio >= .75 at once. Not that the diff changes, but computing it might take a lot more time because the "slow" .ratio() will be called more often in this case.

tim-one commented May 20, 2024

I'm not determined to keep all results the same, but your example bothers me enough that I'm going to re-open this issue.

The fundamental goal of this library was to create diffs that, to the extent possible, "make sense" to human eyes. Because, at the time, all Unix-y diff programs too often produced diffs for Python's C code that were more baffling than helpful (e.g., synching up on garbage like blank lines and lone curly braces far apart in the inputs).

Locality seems to matter most to people, so the core approach builds on finding long stretches of consecutive matching lines. Another that matters to people is "top to bottom". It just makes "no sense" to my eyes that given two blocks of closely matching lines, the program would match the first line of one input's block with the last line of the other's. "Well, that's because they're both closest to the center of their index ranges" sounds like a geek making up technobabble excuses for absurd output 😉.

I believe the pseudocode I gave would restore "first to first, second to second, ..." results, so I'm keen now to pursue it.

About other diff libraries, over the years I haven't found that studying their details was helpful. At a high level, if I had it to do over again I'd start with git's "histogram" diff. Another thing that catches eyes is rare lines, and "histogram" addresses that directly. I think difflib's "junk" heuristics were a mostly-failed attempt toward the same end, by ignoring "too common" elements instead of identifying the rare ones. People generally don't know to use it, and even I rarely made the effort required.

But "histogram" isn't a full diff algorithm either, just an early pass for identifying high-value "synch points". Producing diffs between synch points has to be done by some other algorithm.

And for some tasks, like comparing genome sequences, "histogram" is useless (it won't find any rare elements). Then again, difflib is mostly useless for that too (and completely useless unless "autojunk" is disabled - in which case its generally quadratic runtime is too slow for that task).

tim-one commented May 22, 2024

I opened a PR implementing the "track all optimal pairs" idea.

#119376

pulkin commented May 22, 2024

I did some automated testing on various orderings of unique and similar groups of strings and your code looks equivalent to the packaged one output-wise.

I have a small request (rather related to the original implementation): the eqi and eqj things are not super obvious to understand. I am not 100% sure, but I think the code logic is roughly equivalent to this:

best_ratio = 1
for j in ...:
    for i in ...:
        if ai == bj:
            if best_ratio != 1:
                continue  # we already have a close match; ignore the exact one
            ratio = 1
        else:
            set_seq(...)
            _cutoff = cutoff if best_ratio == 1 else best_ratio
            if crqr() < _cutoff or rqr() < _cutoff or (ratio := qr()) < _cutoff:
                continue  # ratio is smaller
        if ratio != best_ratio:
            # start max_pairs over
        else:
            # do i, j checks
            max_pairs.append((i, j))

There is an obvious subtlety with 0.75 ratio: do we accept that?

tim-one commented May 22, 2024

your code looks equivalent to the packaged one output-wise.

Thanks! That was the intent, and I think it's pretty clear it "should" act the same.

eqi will always be annoying to understand. I'm not really inclined to swap out one set of convolutions for another 😉. It's one reason why I'd probably ditch the "junk" concept if I had it to do over again - it's always an annoying special case.

I'm also keen to keep making a distinction between ">" and ">=". Because, using the lines from your example:

>>> from difflib import SequenceMatcher as S
>>> m = S(None, "0123456789\n", "01234a56789\n")
>>> m.ratio()
0.9565217391304348
>>> m.quick_ratio()
0.9565217391304348
>>> m.real_quick_ratio()
0.9565217391304348

That is, the "quick upper bounds" are in fact good to the last bit. search_equal will be False for about half the pairs compared, and using strict ">" for those lets them short-circuit on the first (real_quick_ratio) compare. Avoiding calls to the other - far more expensive - ones is valuable (in a constant-factor sense, not O()).

There is an obvious subtlety with 0.75 ratio: do we accept that?

I do. Those were basically pulled out of a hat. Interline markup can get more annoying than helpful unless the lines are "quite similar", and the cutoffs were adjusted by eyeball to quantify "quite". They're neither touchy nor critical. But, for backward compatibility, can never be changed 😉.

Since what's in the PR now has been tested and seems to work fine, I'll merge it later tonight. If you or I want to pursue other ideas beyond it, that's fine - just make another PR.

tim-one commented May 22, 2024

search_equal will be False for about half the pairs compared

I should add that's a fuzzy "average case" claim. In your test case, a million pairs are compared, but search_equal is True for only a thousand of them. The overwhelming majority use strict >, and so get out early on the first, very cheap, "real quick ratio" test.

tim-one added a commit that referenced this issue May 22, 2024
…s] (#119376)

Track all pairs achieving the best ratio in Differ(). This repairs the "very deep recursion and cubic time" bad cases in a way that preserves previous output.
tim-one commented May 23, 2024

Here's a different case that's still cubic-time, and blows the recursion stack. The first line pair has a unique best ratio, and the same is true at each recursion level:

import difflib
a = []
b = []
N = 1000
for i in range(N):
    s = "0" * (N - i)
    s2 = s + "x"
    a.append(s + "\n")
    b.append(s2 + "\n")
list(difflib.Differ().compare(a, b))

I don't have a slick idea to get around that while preserving current results (well, for this specific input, sure - I mean "in general").

Half-baked: for each j, remember the smallest i (if any) achieving the highest ratio above the cutoff. By construction, a list of such (i, j) pairs is increasing in the 2nd index. Then pick a longest-possible subsequence (not necessarily contiguous) of those that's strictly increasing in the 1st index too.

Wouldn't always eliminate non-trivial recursive calls. In this test case, and in the original, it would. How it may affect current results is clear as mud to me 😄.

EDIT: why not build the list with strictly increasing i to begin with (as is done now)? Suppose the first line of b is similar to the last line of a. Then the first pair added would be (ahi - 1, blo), and we could never add another (we already maxed out the 1st index position).
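For concreteness, the "pick a longest strictly increasing subsequence" step is the classic LIS problem; a self-contained sketch via patience sorting (not PR code, names made up):

from bisect import bisect_left

def longest_increasing_chain(pairs):
    # `pairs` are (i, j), already sorted by j. Return a longest subsequence
    # whose i values are strictly increasing.
    tail_i = []                   # smallest ending i of an increasing chain of length k+1
    tail_at = []                  # index into `pairs` of that chain's last element
    prev = [None] * len(pairs)    # back-pointers for reconstruction
    for n, (i, _j) in enumerate(pairs):
        k = bisect_left(tail_i, i)
        prev[n] = tail_at[k - 1] if k else None
        if k == len(tail_i):
            tail_i.append(i)
            tail_at.append(n)
        else:
            tail_i[k] = i
            tail_at[k] = n
    chain, n = [], tail_at[-1] if tail_at else None
    while n is not None:
        chain.append(pairs[n])
        n = prev[n]
    return chain[::-1]

print(longest_increasing_chain([(3, 0), (0, 1), (1, 2), (2, 3), (5, 4)]))
# -> [(0, 1), (1, 2), (2, 3), (5, 4)]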

pulkin commented May 23, 2024

Yes, I was aware of this one. It is special in the sense of arbitrarily long strings: if you limit the string length then asymptotically it is still O(N^2). Frankly speaking, for the application I am looking into I will probably try implementing the fast ONP one, keeping string comparison intact, because O(N^2) is still very slow for 500 Mb text files.

tim-one commented May 23, 2024

Over my head 😉. I don't know what "ONP" means(*), or "keeping string comparison intact".

In general, I don't know of a worst-case sub-quadratic algorithm for "longest common subsequence" problems. Suffix trees can, in theory, solve "longest contiguous common subsequence" problems in linear time, but are exceedingly delicate with high constant factors.

That said, it's easy to believe difflib isn't suitable for your application. It was aimed at human-friendly output, pretty much regardless of cost. "Shortest edit script" sucks at the "human-friendly" part.

BTW, you can avoid _fancy_replace() entirely by using SequenceMatcher directly. You can get an "edit script" much faster then, but no attempt is made to identify "similar" lines. They're identical or they aren't.

(*) As in "An O(NP) Sequence Comparison Algorithm" by Wu, Manber, & Myers?

tim-one commented May 23, 2024

you can avoid _fancy_replace() entirely by using SequenceMatcher directly

And it seems possible that's what you really wanted all along. If you're thinking of implementing a Myers diff algorithm, then you don't care about intraline markup. That can make a huge difference.

Here's your original test, but with a million lines in each input:

>>> import difflib
>>> a = ["0123456789\n"] * 1_000_000
>>> b = ["01234a56789\n"] * 1_000_000
>>> s = difflib.SequenceMatcher(None, a, b) # appears instantaneous
>>> list(s.get_matching_blocks()) # also very fast
[Match(a=1000000, b=1000000, size=0)] # see the docs - this always has a size-0 dummy at the end
>>> for x in s.get_opcodes(): # and here's your "edit script"
...     print(x)
('replace', 0, 1000000, 0, 1000000)
>>>

Any kind of "bad case" you make up for _fancy_replace() is most likely to run in linear time under SequenceMatcher.

Under the covers, Differ.compare() does the same thing, but goes on to replace SequenceMatcher's "replace" opcode with its own _fancy_replace output.

One final "speed hint": PyPy generally runs difflib functions much faster than CPython does. In your original test, using difflib's original code, PyPy finishes in about 5 seconds on my box, without overflowing the stack. I don't know all the reasons for that (in part, function inlining cuts recursion depth).

tim-one commented May 24, 2024

FYI, in the latest PR, _fancy_replace() is no longer recursive, and the number of ratios it computes is worst-case O(min(bhi - blo, ahi - alo)). It's certainly possible to contrive cases where it delivers different results, but so far none that I care about. But it's still much faster to use SequenceMatcher directly if you don't care about intraline markup.

pulkin commented May 24, 2024

Yes: I am thinking about using SequenceMatcher. I did not look into it yet, but it is tempting for sure, in that I could abstract out dealing with different dimensions (characters in a line, lines in a file, and maybe even files in a folder).

But the straightforward usage that you showed is not very useful in my case: I do care about close-but-not-exact matches. Roughly speaking, I would like to simplify it to this:

from dataclasses import dataclass
from typing import Any
import difflib

@dataclass
class CloseIsEqualWrapper:
    token: Any

    def __eq__(self, other):
        assert isinstance(other, CloseIsEqualWrapper)
        return difflib.SequenceMatcher(a=self.token, b=other.token).ratio() >= .75

a = list(map(CloseIsEqualWrapper, a))
b = list(map(CloseIsEqualWrapper, b))

print(list(difflib.SequenceMatcher(None, a, b).get_matching_blocks()))

Then, if SequenceMatcher is not fast enough I will be able to isolate the performance problems to the sequence comparison without worrying whether it is line or file comparison.

tim-one commented May 24, 2024

Alas, I doubt that can be made to work. Dicts are critical to how matching works, and defining __eq__ as "close to" breaks just about every precondition dicts rely on. For example, any two values that compare equal must have the same hash code. Which you can always satisfy by, e.g., defining __hash__ to return some constant - but then all keys collide, and dict lookup becomes an especially slow way of doing linear search.

There's also that "close to" doesn't define an equivalence relation, because it's not transitive. That A is close to B, and B is close to C, doesn't imply that A is close to C.

ABCDE close_to VBCDE close_to VWCDE close_to VWXDE close_to VWXYE close_to VWXYZ

Any two strings of sufficient length can be transformed to the other via such a chain.

And there's also - as the docs note - that .ratio() on its own isn't even symmetric. Order can matter. For example, compare "taste" to "tease". In that order, the ratio is 0.8; swap them & it falls to 0.4. This is a consequence of the documented way in which "longest matching block" is defined, and then the "very high wall" between "before and after" portions of the block kicks in.

I expect you'd do better by not trying to trick SequenceMatcher. Copy (or subclass) the Differ class and supply your own idea of _fancy_replace.

Of course, if you're coding a Myers-style algorithm from scratch, none of the above matters. Then I'd recommend the classic Levenshtein distance (which is symmetric) as a measure of similarity. Or a wrapper around .ratio() can be made symmetric by, e.g., trying both orders and returning the larger (or smaller) ratio, or by doing it just once with the lesser (or greater) input first.
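For example, a symmetric wrapper could be as simple as this (sketch, not library API):

import difflib

def symmetric_ratio(x, y):
    # Try both orders and keep the larger; .ratio() itself is order-sensitive.
    return max(difflib.SequenceMatcher(None, x, y).ratio(),
               difflib.SequenceMatcher(None, y, x).ratio())

print(difflib.SequenceMatcher(None, "taste", "tease").ratio())  # 0.8
print(difflib.SequenceMatcher(None, "tease", "taste").ratio())  # 0.4
print(symmetric_ratio("taste", "tease"))                        # 0.8 either way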

tim-one added a commit that referenced this issue May 25, 2024
``_fancy_replace()`` is no longer recursive, and a single call does a worst-case linear number of ratio() computations instead of quadratic. This renders toothless a universe of pathological cases. Some inputs may produce different output, but that's rare, and I didn't find a case where the final diff appeared to be of materially worse quality. To the contrary, by refusing to even consider synching on lines "far apart", there was more easy-to-digest locality in the output.