bytes.find consistently hangs in a particular scenario #86138
Sorry for the vague title. I'm not sure how to succinctly describe this issue. The following code:
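(The original snippet was not preserved in the migration; based on the values quoted later in the thread, it was presumably equivalent to this sketch:)

with open("data.bin", "rb") as f:
    data = f.read()

base = 15403807 * b"\xff"
longer = base + b"\xff"

print(data.find(base))    # returns 15403808 near-instantly
print(data.find(longer))  # appears to hang, despite matching at the same position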
Always hangs on the second call to find. It might complete eventually, but I've left it running and never seen it do so. Because of the structure of data.bin, it should find the same position as the first call to find. The first call to find completes and prints near instantly, which makes the pathological performance of the second (which is only searching for one b"\xff" more than the first) even more mystifying. I attempted to upload the data.bin file I was working with as an attachment here, but it failed multiple times. I assume it's too large for an attachment; it's a 32MiB file consisting only of 00 bytes and FF bytes. Since I couldn't attach it, I uploaded it to a gist. I hope that's okay. https://gist.github.com/Zeturic/7d0480a94352968c1fe92aa62e8adeaf I wasn't able to trigger the pathological runtime behavior with other sequences of bytes, which is why I uploaded it in the first place. For example, if it is randomly generated, it doesn't trigger it. I've verified that this happens on multiple versions of CPython (as well as PyPy) and on multiple computers / operating systems. |
Can reproduce on Alpine Linux, with CPython 3.8.2 (running under WSLv2), so it's not just you. CPU usage is high; seems like it must be stuck in an infinite loop. |
Also reproduced on 64-bit Win10 with just-released 3.9.0. Note that string search tries to incorporate a number of tricks (pieces of Boyer-Moore, Sunday, etc) to speed searches. The "skip tables" here are probably computing a 0 by mistake. The file here contains only 2 distinct byte values, and the relatively huge string to search for has only 1, so there's plenty of opportunity for those tricks to get confused ;-) |
What is the content of "data.bin"? How can we reproduce the issue? |
It's available from the GitHub gist that Kevin posted. Here is a direct link that you can curl/wget: |
Adding some hasty printf-debugging to fastsearch.h, I see this:

>>> with open('data.bin', 'rb') as f:
...     s = f.read()
...
>>> base = 15403807 * b'\xff'
>>> longer = base + b'\xff'
>>> base in s
Bloom has 0x00: 0
Bloom has 0xff: -2147483648
skip = 0
Checking bloom filter for next char 0x00... not found in pattern. Skipping a bunch.
Candidate at 15403808
Trying the same base, I also get this:

>>> with open('data.bin', 'rb') as f:
...     s = f.read()
...
>>> base = 15403807 * b'\xff'
>>> s1 = s[1:]
>>> base in s1

So maybe part of the issue is just that the algorithm is getting particularly unlucky here. The fact that "skip" is 0 in these cases seems to just be a limitation of the algorithm: computing the entire Boyer-Moore table would be better in cases like these, but that was a deliberate tradeoff to keep the preprocessing cheap. |
Does the code really hang? Or does it just complete if you wait long enough? I understand that your concern is that bytes.find() is inefficient for a very specific pattern. |
Indeed, this is just a very unlucky case.

>>> n = len(longer)
>>> from collections import Counter
>>> Counter(s[:n])
Counter({0: 9056995, 255: 6346813})
>>> s[n-30:n+30].replace(b'\x00', b'.').replace(b'\xff', b'@')
b'..............................@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
>>> Counter(s[n:])
Counter({255: 18150624})

When checking "base", we're in this situation:

Whereas when checking "longer", we're in this situation:
I'm attaching reproducer.py, which replicates this from scratch without loading data from a file. |
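(A rough sketch of the kind of synthetic input such a reproducer can build; the constants here are illustrative, not the ones in the attached file. On interpreters without the later fixes, the second search should take orders of magnitude longer than the first, while on CPython 3.10+ both complete quickly.)

import time

run = 4096
mixed = (b"\xff" * (run - 1) + b"\x00") * 2000   # mixed region: long 0xff runs broken by 0x00
tail = b"\xff" * (len(mixed) + run)              # one huge trailing run of 0xff
haystack = mixed + tail

base = b"\xff" * (len(mixed) - 1)   # fits only inside the tail; found quickly
longer = base + b"\xff"             # one byte longer; pathological on the old algorithm

for needle in (base, longer):
    start = time.perf_counter()
    pos = haystack.find(needle)
    print(len(needle), pos, f"{time.perf_counter() - start:.3f}s")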
Good sleuthing, Dennis! Yes, Fredrik was not willing to add "potentially expensive" (in time or in space) tricks: http://effbot.org/zone/stringlib.htm So worst-case time is proportional to the product of the arguments' lengths, and the cases here appear to, essentially, hit that. It _was_ a goal that it always be at least as fast as the dirt-dumb search algorithm it replaced, and in good real-life (not just contrived) cases to be much faster. It met the goals it had. |
Just FYI, the original test program did get the right answer for the second search on my box - after about 3 1/2 hours :-) |
BTW, this initialization in the FASTSEARCH code appears to me to be a mistake:

skip = mlast - 1;

That's "mistake" in the sense of "not quite what was intended, and so confusing", not in the sense of "leads to a wrong result". I believe the intended value is one larger (skip = mlast), which would let the search shift one extra position when the needle's last character appears nowhere else in the needle.
|
The attached fastsearch.diff removes the speed hit in the original test case and in the constructed one. I don't know whether it "should be" applied, and really can't make time to dig into it. The rationale: when the last characters of the pattern string and the current search window don't match, this is what happens: if the "Bloom filter" can prove that the character one beyond the search window isn't in the pattern at all, the search window is advanced by len(pattern) + 1 positions. Else it's advanced only 1 position. What changes here: if the Bloom filter thinks the character one beyond the search window may be in the pattern, this goes on to see whether the Bloom filter thinks the last character in the search window may be in the pattern (by case assumption, we already know it's not the _last_ character in the pattern). If not, the search window can be advanced by len(pattern) positions. So it adds an extra check in the last-characters-don't-match and next-character-may-be-in-the-pattern case: if the last character is not in the pattern, it's irrelevant that the next character may be - there's no possible overall match including the last character, so we can move the search window beyond it. The question is whether this costs more than it saves overall. It saves literally hours of CPU time for one search in this report's case ;-) But worst-case time remains O(m*n) regardless, just somewhat harder to provoke. |
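(To make that concrete, here is a small Python model of the default algorithm with the proposed extra check folded in; a sketch for illustration, not the literal C diff. The 64-bit mask stands in for the C code's "Bloom filter".)

def find_with_extra_check(s, p):
    # s, p are bytes; returns the index of the first match or -1.
    n, m = len(s), len(p)
    if m == 0:
        return 0
    if m > n:
        return -1
    mask = 0
    skip = m - 2                      # mirrors "skip = mlast - 1" in fastsearch.h
    for i in range(m - 1):
        mask |= 1 << (p[i] & 63)      # record which (byte & 63) values occur in p
        if p[i] == p[m - 1]:
            skip = m - 2 - i          # shift from the rightmost earlier occurrence of the last byte
    mask |= 1 << (p[m - 1] & 63)

    def in_bloom(c):
        return (mask >> (c & 63)) & 1

    i = 0
    while i <= n - m:
        if s[i + m - 1] == p[m - 1]:              # last bytes match: try a full match
            if s[i:i + m - 1] == p[:m - 1]:
                return i
            if i + m < n and not in_bloom(s[i + m]):
                i += m + 1                        # byte after window not in p: jump past it
            else:
                i += skip + 1                     # Horspool-style shift on the last byte
        else:                                     # last bytes do not match
            if i + m < n and not in_bloom(s[i + m]):
                i += m + 1                        # existing rule
            elif not in_bloom(s[i + m - 1]):
                i += m                            # the extra check proposed above
            else:
                i += 1
    return -1

The elif branch is the only addition; on data.bin-style inputs it lets the window leap past a mismatching last byte that never occurs in the needle, instead of crawling forward one position at a time.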
I agree that skip could do 1 better.

---

I don't think I'm convinced: the second check fixes only the very specific case when s[len(p):].startswith(p).
In fact, it's even slower than before due to the double-checking.

---

I'd be curious how something like Crochemore and Perrin's "two-way" algorithm would do (constant space, compares < 2n-m):
Apparently it's been used as glibc's memmem() since 2008. There, it additionally computes a table, but only for long needles where it really pays off:
Although there certainly seems to be a complexity rabbithole here that would be easy to over-do. |
Ya, the case for the diff is at best marginal. Note that while it may be theoretically provable that the extra test would make the worst cases slower, that's almost certainly not measurable. The extra test would almost never be executed in the worst cases: in those the last characters of the needle and haystack DO match, and we spend almost all the total time in the inner character-comparison loop. The extra test is only in the branch where the last characters _don't_ match. In that branch, it's already certain that the last haystack character does not match the last needle character, so in a probabilistic sense, for "random" data that increases the odds that the last haystack character isn't in the needle at all. In which case the payoff is relatively large compared to the cost of the test (can skip len(needle) positions at once, instead of only 1). I don't believe this can be out-thought. It would require timing on "typical" real-life searches. Which are likely impossible to obtain ;-) Offhand do you know what the best timing for two-way search is in a pattern-not-found case? The algorithm is too complicated for that to be apparent to me at a glance. As Fredrik said in the post of his I linked to, he was very keen to have an algorithm with best case sublinear time. For example, the one we have takes best-case not-found O(n/m) time (where n is the haystack length and m the needle length). For example, searching for 'B' * 1_000_000 in 'A' * 10_000_000 fails after just 9 character comparisons (well, not counting the million less 1 compares to initialize skip). |
Looking at the glibc implementation, in the top-level "else" clause:

period = MAX (suffix, needle_len - suffix) + 1;

so period > needle_len / 2. Then during each iteration of the j-loop (where j is the index into the haystack),

Here's a sublinear example for the two-way algorithm:

N = 10**7
haystack = "ABC" * N
needle1 = "ABC" * (N // 10) + "DDD"
needle2 = "ABC" * 10 + "DDD"

=== Results ===

This case is surprisingly better than the builtin:

N = 10**7
haystack = "ABC" * N
needle1 = "DDD" + "ABC" * (N // 10)
needle2 = "DDD" + "ABC" * 10

This was measured using the attached fastsearch.py. There are some manipulations in there like string slicing that would make more sense as pointer math in C, so especially accounting for that, I'm guessing it would be pretty competitive in a lot of cases. A lot of the implementations look like they use a complete Boyer-Moore table to go sublinear in even more cases, which it seems we want to avoid. I'm not certain, but it's conceivable that the existing techniques of using just one delta-value or the "Bloom filter" could be thrown in there to get some of the same improvements. Although that could be redundant -- I'm not sure. |
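(For anyone who wants to rerun these inputs against the builtin, a minimal harness along these lines works; the attached fastsearch.py and the elided "=== Results ===" tables are not reproduced here.)

import time

def time_find(haystack, needle, label):
    start = time.perf_counter()
    pos = haystack.find(needle)
    print(f"{label}: result={pos}, {time.perf_counter() - start:.3f}s")

N = 10**7
haystack = "ABC" * N
time_find(haystack, "ABC" * (N // 10) + "DDD", "needle1, DDD suffix")
time_find(haystack, "ABC" * 10 + "DDD", "needle2, DDD suffix")
time_find(haystack, "DDD" + "ABC" * (N // 10), "needle1, DDD prefix")
time_find(haystack, "DDD" + "ABC" * 10, "needle2, DDD prefix")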
Impressive, Dennis! Nice work. FYI, on the OP's original test data, your fastsearch() completes each search in under 20 seconds using CPython, and in well under 1 second using PyPy. Unfortunately, that's so promising it can't just be dismissed offhandedly ;-) Looks like something quite worth pursuing here! |
Here is a C implementation of the two-way algorithm that should work as a drop-in replacement for Objects/stringlib/fastsearch.h. Benchmarking so far, it looks like it is a bit slower in a lot of cases. But it's also a bit faster in a some other cases and oodles faster in the really bad cases. I wonder if there's a good heuristic cutoff (for the needle size?) where the two-way usually becomes better. |
PR 22679 is a draft that does the two-way algorithm but also adds both of the tricks from Fredrik's implementation: a bit-set "bloom filter" and remembering the skip-distance between some pair of characters. |
Yes, that's correct.
I'm definitely open to switching things up, but I'm not totally sure there would be a difference. In Sunday, the pattern is somehow scanned at each step, and then upon finding a mismatch, we look up in the table the first character outside the window. With the PR as written, when looking up in the table the last character of the window, we haven't actually produced a mismatch yet; the 'continue' statements jump ahead until the last characters match (mod 128), and then the two-way scanning begins, which determines how the window gets shifted after that. But after this two-way-determined shift, the last character in the new window immediately gets looked up again in the table at the top of the loop, effectively letting the two-way shift immediately be extended to a longer shift. I suppose that the line |
I posted the example thinking that having a concrete walkthrough might be good for discussion, and it looks like it was. ;-) This makes me curious how a simplified-but-not-as-simplified-as-the-status-quo Sunday algorithm would fare: using the Sunday algorithm, but with a shift lookup modulo 128 rather than a boolean lookup. It might not be provably O(n), but it might be faster for some cases. Although if the common case is already fast enough, maybe we do want a provably-linear algorithm for the big stuff. |
I confess I _assumed_ all along that you were generalizing the current code's Sunday trick to 7-bit equivalence classes (up from 32 bits total) and 64K possible shift counts (up from just 2 total possibilities: 1 or len(needle)+1). The Sunday trick couldn't care less where or when the mismatch occurs, just that a mismatch occurred somewhere. In my head I was picturing the paper's code, not the PR. Whenever it makes a shift, it could compare it to the "Sunday-ish shift", and pick the larger. That should have no effect on worst case O() behavior - it would always shift at least as much as Crochemore-Perrin when a mismatch was hit. I can't say how it would relate to details of the PR's spelling, because I'm still trying to fully understand the paper ;-) I don't believe I can usefully review the code before then. |
Unfortunately, due to licensing issues, it looks like I'll have to ditch the PR and start from scratch: it was too heavily based on the glibc implementation, which has a stricter license. It may take a good deal of time before I have a potential replacement, but I'll try to work on a "clean-room" implementation. Lesson learned! |
I attached a new PR, with a lot of the same ideas. The major differences between this and the last PR:
I'll post some more benchmarks soon, but preliminarily, it seems like this swapping of the character to first be matched is better for some needles, worse for others, which makes sense. Stringbench.py for example has some needles that have a unique character at the end, which prefers the status quo and old PR, but other strings work better for this PR. |
Note that Sunday doesn't care (at all) where mismatches occur. The "natural" way to add Sunday: follow pure C-P unless/until it finds a mismatching position. Pure C-P then computes a specific shift. Nothing about that changes. But something is added: also compute the shift Sunday suggests, and pick the larger of that and what C-P computed. C-P and Sunday both have cases where they can justify "supernaturally large" shifts (i.e., as long as len(needle), or even that +1 for Sunday), and they're not always the same cases. |
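(The Sunday shift Tim describes can be modeled in a few lines; this sketch uses a dict keyed by character, whereas the PRs discussed here use a fixed-size table indexed by the character modulo a small power of two.)

def sunday_table(needle):
    # For the character one past the current window: how far the window can
    # slide so that character lines up with its rightmost occurrence in the
    # needle.  Characters absent from the needle allow a slide of len(needle)+1.
    m = len(needle)
    return {ch: m - i for i, ch in enumerate(needle)}

needle = "ABCAB"
table = sunday_table(needle)
print(table.get("A", len(needle) + 1))  # 2: line up the rightmost 'A'
print(table.get("Z", len(needle) + 1))  # 6: 'Z' never occurs, slide the whole window + 1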
This is the approach in the PR:

# JUMP_BOTH
while ...:
    if haystack[j + cut] != needle[cut]:
        # Sunday skip
        j += table[haystack[j + len(needle)]]
        continue
    j += rest_of_the_two_way_algorithm()

If I understand correctly, this is what you're suggesting:

# JUMP_MAX
while ...:
    shift = rest_of_the_two_way_algorithm()
    j += max(shift, table[haystack[j + len(needle)]])

I implemented this, and on my Zipf benchmark JUMP_BOTH was faster on 410 cases (at most 1.86x faster), while JUMP_MAX was faster on 179 cases (at most 1.3x faster).

Some things to notice about the JUMP_BOTH approach:

The typical purpose of the Sunday skip (as far as I can tell) is that in Sunday's algorithm, we find a mismatch, then have no a priori idea of how far ahead to skip, other than knowing that it has to be at least 1, so we check the character 1 to the right of the window. Another way of thinking about this would be to increment the window by 1, look at the last character in the new window, and jump ahead to line it up. However, with the hybrid method of the PR, we *do* have some a priori skip, bigger than 1, known before we check the table. So instead of doing the maximum of the two-way skip and the Sunday skip, why not do both? As in: do the two-way shift, then extend that with a Sunday shift in the next iteration. I tried another version also with this in mind, something like:

# JUMP_AFTER
while ...:
    # Guaranteed to jump to a mismatched position
    j += rest_of_the_two_way_algorithm() - 1
    j += table[haystack[j + len(needle)]]

But this resulted in slightly worse performance: JUMP_BOTH was faster in 344 cases (at most 1.9x faster), while JUMP_AFTER was faster in 265 cases (at most 1.32x faster). My guess for the explanation of this is that since most of the time is spent on Sunday shifts lining up the cut character, it's beneficial for the compiler to generate specialized code for that case. |
Here are those zipf-distributed benchmarks for PR 22904: https://pastebin.com/raw/qBaMi2dm Ignoring differences of <5%, there are 33 cases that get slower, but 477 cases that got faster. Here's a stringbench.py run for PR 22904: https://pastebin.com/raw/ABm32bA0 It looks like the stringbench times get a bit worse on a few cases, but I would attribute that to the benchmarks having many "difficult" cases with a unique character at the end of the needle, such as: s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E"), which the status quo already handles as well as possible, whereas the PR best handles the case where some middle "cut" character is unique. Who knows how common these cases are. |
After spending quite a while setting up my dev machine for (hopefully) reliable benchmarking, I can say with a high level of certainty that the latest PR (GH-22904, commit fe9e9d9) runs stringbench.py significantly faster than the master branch (commit fb5db7e). See attached results. I've also run the zipf-distributed random string benchmarks provided by Dennis, with a slight addition of

My conclusion is that the current two-way implementation is certainly better for needles of length over 100, but for shorter needles the situation is still unclear.

System details:

Steps taken to prepare system for performance tuning:
|
I should also mention that I configured and compiled Python for each of the above-mentioned versions using the --enable-optimizations flag.
I am attempting to better understand the performance characteristics to determine where a cutoff should go. Attached is a colorful table of benchmarks comparing the existing algorithm to the PR with the cutoff changed to
I used the following script, and the raw results are attached.

import pyperf
from itertools import product

# zipf_string(length) builds a random string of that length drawn from a
# Zipf-distributed alphabet; its definition is in the attached script.

runner = pyperf.Runner()

ms = [12, 16, 24, 32, 48, 64, 96, 128,
      192, 256, 384, 512, 768, 1024, 1536]
ns = [2000, 3000, 4000, 6000, 8000,
      12000, 16000, 24000, 32000, 48000,
      64000, 96000]

for n, m in product(ns, ms):
    runner.timeit(f"needle={m}, haystack={n}",
                  setup="needle = zipf_string(m); haystack = zipf_string(n)",
                  stmt="needle in haystack",
                  globals=locals() | globals()) |
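(zipf_string lives in the attached script; a stand-in with the same spirit could look like the sketch below. The exact alphabet and distribution are guesses - the thread elsewhere mentions Zipf-distributed words, while this simpler letter-based version just mimics the frequency skew.)

import random
import string

def zipf_string(length, alphabet=string.ascii_lowercase, seed=None):
    # Draw characters with Zipf-like weights: the k-th letter of the alphabet
    # is roughly 1/k as likely as the first.
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) for k in range(len(alphabet))]
    return "".join(rng.choices(alphabet, weights=weights, k=length))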
I'm sorry I haven't been able to give more time to this. I love what's been done, but am just overwhelmed :-( The main thing here is to end quadratic-time disasters, without doing significant damage in other cases. Toward that end it would be fine to use "very high" cutoffs, and save tuning for later. It's clear that we _can_ get significant benefit even in many non-disastrous cases, but unclear exactly how to exploit that. But marking the issue report as "fixed" doesn't require resolving that - and I want to see this improvement in the next Python release. An idea that hasn't been tried: in the world of sorting, quicksort is usually the fastest method - but it's prone to quadratic-time disasters. OTOH, heapsort never suffers disasters, but is almost always significantly slower in ordinary cases. So the more-or-less straightforward "introsort" compromises, by starting with a plain quicksort, but with "introspection" code added to measure how quickly it's making progress. If quicksort appears to be thrashing, it switches to heapsort. It's possible an "introsearch" would work well too. Start with pure brute force (not even with the current preprocessing), and switch to something else if it's making slow progress. That's easy to measure: compare the number of character comparisons we've made so far to how many character positions we've advanced the search window so far. If that ratio exceeds (say) 3, we're heading out of linear-time territory, so should switch to a different approach. With no preprocessing at all, that's likely to be faster than even the current code for very short needles, and in all cases where the needle is found at the start of the haystack. This context is trickier, though, in that the current code, and pure C&P, can also enjoy sub-linear time cases. It's always something ;-) |
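(A rough sketch of that introspection idea in pure Python, with str.find standing in for the guaranteed-linear fallback and the ratio of 3 taken from the comment above.)

def introsearch(haystack, needle):
    # Plain brute force, but count character comparisons; once we have done
    # much more work than the number of window positions advanced, hand the
    # rest of the search to a linear-time algorithm.
    comparisons = 0
    for start in range(len(haystack) - len(needle) + 1):
        for j in range(len(needle)):
            comparisons += 1
            if haystack[start + j] != needle[j]:
                break
        else:
            return start
        if comparisons > 3 * (start + 1):
            # Drifting out of linear-time territory: switch algorithms.
            return haystack.find(needle, start)
    return -1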
But that also suggests a different approach: start with the current code, but add introspection to switch to your enhancement of C&P if the current code is drifting out of linear-time territory. |
This feels reasonable to me -- I changed the cutoff to the more cautious |
For convenience, attached is a quick and dirty Tkinter GUI that lets you step through the Crochemore/Perrin Algorithm on your choice of inputs, just for play/discovery. A good illustration of the memory for periodic needles can be found by testing: The GUI program does not implement the Boyer-Moore/Horspool/Sunday-style shift-table. In the current PR 22904, this table is used in exactly those situations where the GUI says "Matched 0, so jump ahead by 1". |
PR 22904 now adds a text document explaining how the Two-Way algorithm works in much more detail. I was looking at more benchmarking results, and I came to a couple of conclusions about cutoffs. There's a consistent benefit to using the two-way algorithm whenever both strings are long enough and the needle is less than 20% the length of the haystack. If that percentage is higher, the preprocessing cost starts to dominate, and the Two-Way algorithm is slower than the status quo on average in such cases. This doesn't change the fact that in cases like the OP's original example, the Two-way algorithm can be thousands of times faster than the status quo, even though the needle is a significant percentage of the length of the haystack. So to handle that, I tried an adaptive/introspective algorithm like Tim suggested: it counts the number of matched characters that don't lead to full matches, and once that total exceeds O(m), the Two-Way preprocessing cost will surely be worth it. The only way I could figure out how to not damage the speed of the usual small-string cases is to duplicate that code. See comparison.jpg for how much better the adaptive version is as opposed to always using two-way on this Zipf benchmark, while still guaranteeing O(m + n). |
Thanks for pointing that out Tim! Turns out PyPy had copied that mindlessly and I just fixed it. (I'm also generally following along with this issue, I plan to implement the two-way algorithm for PyPy as well, once you all have decided on a heuristic. We are occasionally in a slightly easier situation, because for constant-enough needles we can have the JIT do the pre-work on the needle during code generation.) |
Any chance PR 22904 can make it into 3.10 before the May 03 feature freeze? The text document in that PR has some notes on how the algorithm works, so that may be a good place to start reviewing if anyone is interested. |
If Tim approves we might get it into alpha 6 which goes out Monday. |
I'm very sorry for not keeping up with this - my health has been poor, and I just haven't been able to make enough time. Last time I looked to a non-trivial depth, I was quite happy, and just quibbling about possible tradeoffs. I can't honestly commit to doing better in the near future, so where does that leave us? I'd personally "risk it". I just added Raymond to the nosy list, on the off chance he can make some "emergency time" to give a more informed yes/no. He's routinely sucked up weeks of my life with random review requests, so I figure turnabout is fair play ;-) |
In "Fast String Searching" (1991), A. Hume and D. Sunday emphasize that most of the runtime of string-search algorithms occurs in a code path that skips over immediately-disqualified alignments. As such, that paper recommends extracting a hot "Horspool" code path and keeping it small. For that reason, I'm posting a PR which boils down that fast path to the following: while (skip > 0 && window_last < haystack_end) {
window_last += skip;
skip = table[(*window_last) & 63];
} This uses two memory accesses, whereas the last implementation used three (the extra to get window[cut]). This requires the skip table to change slightly, since it is indexed by the last character in the current haystack window rather than one character later ("fast" versus "sd1" in the paper). The paper also recommends unrolling this loop 3x, but my benchmarks found no benefit to unrolling. The PR also refactors some of the main fastsearch code into separate functions, and computes and then uses a similar gap/skip integer used already in the default implementation (Hume and Sunday call this "md2"). It retains a linear-time constant space worst-case with the Two-Way algorithm. There are a bunch of benchmarks here, both on Zipf-distributed random strings and on a few c/rst/python/binary files in the cpython repo. They compare the current main branch to the PR:
Summary of those results: On the Zipf strings: On the "Real world" source files: |
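(A Python model of the shift table that drives that hot loop; the 64-entry size and the exact values are assumptions based on the "& 63" above, not a copy of the PR.)

def build_fast_table(needle, mask=63):
    # Indexed by the *last* character of the current window rather than the
    # character one past it ("fast" vs. "sd1" in Hume & Sunday): the slide that
    # lines that character up with its rightmost occurrence in the needle.
    # The needle's own last character maps to 0, which ends the hot loop.
    m = len(needle)
    table = [m] * (mask + 1)          # characters that never occur: slide a whole window
    for i, ch in enumerate(needle):
        table[ch & mask] = m - 1 - i  # rightmost occurrences overwrite with smaller slides
    return table

needle = b"abcab"
table = build_fast_table(needle)
print(table[ord("b") & 63])  # 0: 'b' is the needle's last character
print(table[ord("a") & 63])  # 1: rightmost 'a' is one position from the end
print(table[ord("z") & 63])  # 5: 'z' never occurs in the needle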
I checked the original example in this issue, and the newest change in #71278 makes the second, originally-hanging search finish quickly as well. |
Looks like this can be closed now, the original issue is fixed, the original patch is merged for 3.10, and the improved patch is merged for 3.11. Thanks! ✨ 🍰 ✨ |