bpo-41972: Tweak fastsearch.h string search algorithms #27091

sweeneyde · 2021-07-12T03:56:23Z

https://bugs.python.org/issue41972

ambv

I admit going through _two_way with the double loops and gotos caused a little head-spinning. I'll run some Hypothesis tests on the new version.

Note that this can only go to 3.11 now, as it doesn't qualify as a bug fix and we're past the beta stage for 3.10.

Objects/stringlib/fastsearch.h

ambv · 2021-07-15T10:35:49Z

    if (mode != FAST_RSEARCH) {
-        if (m >= 100 && w >= 2000 && w / m >= 5) {
+        if (n < 3000 || (m < 100 && n < 30000) || m < 6) {


How did you arrive at those numbers?

I'll consolidate the benchmarks I used and post them here. In the meantime, this is a chart comparing the default implementation to the existing two-way implementation. The equivalent chart for this implementation is similar, and I'm essentially trying to cut out the green bits of that image.

For now, though, there are some benchmarks of the differences this PR makes here.
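Read as a predicate on the haystack length `n` and the needle length `m`, the new cutoff in the diff can be sketched as follows. The function name `prefers_default_find` is hypothetical, and the reading that a true result selects the simpler default search (rather than two-way) is an assumption drawn from this discussion, not something the diff excerpt itself shows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the "+" line of the diff.
   Assumption: a result of 1 means the default search is used, and 0
   means the input is large enough for two-way to pay for its
   preprocessing. */
static int
prefers_default_find(ptrdiff_t n, ptrdiff_t m)
{
    assert(n >= 0 && m >= 0);
    return n < 3000 || (m < 100 && n < 30000) || m < 6;
}
```

Under that reading, short haystacks, short needles in medium haystacks, and very short needles all stay on the default path.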

sweeneyde · 2021-07-15T18:31:25Z

There's an overview of the two-way algorithm at https://github.com/python/cpython/blob/main/Objects/stringlib/stringlib_find_two_way_notes.txt

If you want to be able to test more rigorously, you could make the table and its contents smaller with something like this:

#define MAX_SHIFT 3

#define TABLE_SIZE_BITS 2u
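To make the effect of those two macros concrete, here is a minimal sketch of filling a compressed bad-character table keyed on the low bits of each needle byte. The function name `build_shift_table` and the `TABLE_SIZE` helper macro are assumptions for illustration; the actual preprocessing in fastsearch.h may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SHIFT 3
#define TABLE_SIZE_BITS 2u
#define TABLE_SIZE (1u << TABLE_SIZE_BITS)   /* assumed helper: 4 slots */
#define TABLE_MASK (TABLE_SIZE - 1u)         /* keeps the low 2 bits */

/* Hypothetical sketch: table[c & TABLE_MASK] is how far the window may
   advance when byte c is aligned with the last needle position. */
static void
build_shift_table(const unsigned char *needle, ptrdiff_t m,
                  unsigned char table[TABLE_SIZE])
{
    assert(m > 0);
    ptrdiff_t not_found = m < MAX_SHIFT ? m : MAX_SHIFT;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        table[i] = (unsigned char)not_found;
    }
    /* Only the last `not_found` bytes can produce shifts below the cap.
       Later bytes overwrite earlier ones, so colliding slots keep the
       smaller, safe shift. */
    for (ptrdiff_t i = m - not_found; i < m; i++) {
        table[needle[i] & TABLE_MASK] = (unsigned char)(m - 1 - i);
    }
}
```

With only four slots, many distinct bytes share a slot, so zero shifts and the verification path that follows them come up far more often, which is presumably why shrinking the table helps randomized tests reach edge cases.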

sweeneyde · 2021-07-16T02:50:36Z

Here's the chart of the ratios of the runtimes of the two algorithms, with cyan/black text to distinguish where these thresholds lie. This was run using zipf_benchmarks.py in the gist I posted before.

ambv · 2021-07-16T18:30:30Z

I was running Hypothesis smoke tests on this PR for most of today and all's well.

Do the benchmarks you provided include the changes in your last two commits?

sweeneyde · 2021-07-16T21:49:53Z

I re-ran some benchmarks comparing the main branch to commit 4d7d102 and they're here: https://gist.github.com/sweeneyde/c370dcee453a2bcb34157261fb79650e

These use the same benchmarks as I posted before: one randomly generated character-by-character with a Zipf distribution, and then a set of benchmarks comparing binary data, C code, Python code, and reStructuredText.

Summary of random string results:
* Geometric mean: 1.05x faster
* 542 cases faster (at best 3.45x faster)
* 166 cases slower (at worst 1.73x slower)

Summary of "real life" benchmarks:
* Geometric mean: 1.22x faster
* 372 cases faster (at best 14.12x faster)
* 129 cases slower (at worst 3.95x slower)

(My most recent commits only tweaked the cutoffs, and the table in my last comment ran with no run-time cutoffs.)

ambv · 2021-07-17T09:54:56Z

The recent results still look pretty good but are visibly worse than the ones you presented on July 12 on the issue. What do you think the reason for that is?

sweeneyde · 2021-07-17T22:44:45Z

> The recent results still look pretty good but are visibly worse than the ones you presented on July 12 on the issue. What do you think the reason for that is?

There was a bug in my original implementation where it could read past the end of the haystack buffer. The new and old assembly listings are below (Microsoft Visual C++ 2019, Release x64 Build).

Assembly for commit 4d7d102 (it looks like the compiler unrolled the loop 2x):

   440:             for (;;) {
   441:                 LOG_LINEUP();
   442:                 Py_ssize_t shift = table[(*window_last) & TABLE_MASK];
00007FF937A5BFA0  movsx       rax,byte ptr [r9]  
00007FF937A5BFA4  and         eax,3Fh  
00007FF937A5BFA7  movzx       ecx,byte ptr [rax+r15]  
   443:                 window_last += shift;
00007FF937A5BFAC  add         r9,rcx  
   444:                 if (shift == 0) {
00007FF937A5BFAF  test        rcx,rcx  
00007FF937A5BFB2  je          $windowloop+2Dh (07FF937A5BFCDh)  
   445:                     break;
   446:                 }
   447:                 if (window_last >= haystack_end) {
00007FF937A5BFB4  cmp         r9,rbp  
00007FF937A5BFB7  jae         stringlib__two_way+13Ch (07FF937A5BF4Ch)  
   440:             for (;;) {
   441:                 LOG_LINEUP();
   442:                 Py_ssize_t shift = table[(*window_last) & TABLE_MASK];
00007FF937A5BFB9  movsx       rax,byte ptr [r9]  
00007FF937A5BFBD  and         eax,3Fh  
00007FF937A5BFC0  movzx       ecx,byte ptr [rax+r15]  
   443:                 window_last += shift;
00007FF937A5BFC5  add         r9,rcx  
   444:                 if (shift == 0) {
00007FF937A5BFC8  test        rcx,rcx  
00007FF937A5BFCB  jne         $windowloop+14h (07FF937A5BFB4h)  
   448:                     return -1;
   449:                 }
   450:                 LOG("Horspool skip");
   451:             }

Old assembly from commit d3f8438 (incorrect, accessed past the end of the haystack buffer):

   432:             while (shift > 0 && window_last < haystack_end) {
00007FF93489BF6C  test        rdx,rdx  
00007FF93489BF6F  je          $windowloop+33h (07FF93489BF93h)  
00007FF93489BF71  cmp         rax,rsi  
00007FF93489BF74  jae         stringlib__two_way+13Dh (07FF93489BF0Dh)  
   433:                 LOG("Horspool skip.\n");
   434:                 window_last += shift;
00007FF93489BF76  add         rax,rdx  
   435:                 shift = table[(*window_last) & TABLE_MASK];
00007FF93489BF79  movsx       rcx,byte ptr [rax]  
00007FF93489BF7D  and         ecx,3Fh  
00007FF93489BF80  movzx       edx,byte ptr [rcx+r14]  
00007FF93489BF85  test        rdx,rdx  
00007FF93489BF88  jne         $windowloop+11h (07FF93489BF71h)  
   436:                 LOG_LINEUP();
   437:             }
   438:             if (window_last >= haystack_end) {
00007FF93489BF8A  cmp         rax,rsi  
00007FF93489BF8D  jae         stringlib__two_way+13Dh (07FF93489BF0Dh)  
   439:                 break; // return -1
   440:             }

It appears that the 10 Zipf results that slowed down the most between these versions all had needles that ended in 'D', which I had originally randomly set to be the most frequently occurring character (about 25% of all characters). For needles that end in common characters, less time is spent in the Horspool skip table loop, so unrolling made things worse rather than better. Similarly, the benchmarks on binary data that slowed down the most all had needles ending in null bytes, which were very common in the file I looked at.

I can look at reverting commit b8df3e1 and see how that changes things.
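For readers comparing the two listings, the corrected skip loop can be sketched as a small self-contained search. This is a hedged illustration, not the CPython code: `horspool_sketch` is a hypothetical name, the table fill is an assumption, and indices replace the `window_last` pointer so that overshooting the end stays well-defined. The loop shape follows the C source lines shown in the listing for 4d7d102, and the 64-entry table matches the `and eax,3Fh` mask there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64u                 /* matches the `and eax,3Fh` mask */
#define TABLE_MASK (TABLE_SIZE - 1u)

/* Hypothetical sketch: index of the first occurrence of needle in
   haystack, or -1 if there is none. */
static ptrdiff_t
horspool_sketch(const unsigned char *haystack, ptrdiff_t n,
                const unsigned char *needle, ptrdiff_t m)
{
    assert(n >= 0 && m >= 0);
    if (m == 0) {
        return 0;
    }
    if (n < m) {
        return -1;
    }
    /* Assumed table fill: shift 0 marks "same low bits as the needle's
       last byte"; shifts are capped so they fit in a byte. */
    uint8_t table[TABLE_SIZE];
    ptrdiff_t not_found = m < UINT8_MAX ? m : UINT8_MAX;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        table[i] = (uint8_t)not_found;
    }
    for (ptrdiff_t i = m - not_found; i < m; i++) {
        table[needle[i] & TABLE_MASK] = (uint8_t)(m - 1 - i);
    }

    ptrdiff_t last = m - 1;            /* index of the window's last byte */
    while (last < n) {
        for (;;) {
            ptrdiff_t shift = table[haystack[last] & TABLE_MASK];
            last += shift;
            if (shift == 0) {
                break;                 /* candidate: verify below */
            }
            if (last >= n) {
                return -1;             /* checked before the next read */
            }
        }
        if (memcmp(haystack + last - (m - 1), needle, (size_t)m) == 0) {
            return last - (m - 1);
        }
        last++;                        /* false candidate: slide by one */
    }
    return -1;
}
```

The difference from the d3f8438 loop is the order of operations: there, `*window_last` is loaded immediately after `window_last += shift`, before any comparison against `haystack_end`, whereas here every read of `haystack[last]` is preceded by a bounds check.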

sweeneyde · 2021-07-18T02:23:08Z

Reverting commit b8df3e1 was a wash: "1.02x slower" on the zipf benchmarks and "1.00x faster" on the real-life benchmarks.

bpo-41972: Tweak fastsearch.h string search algorithms (pythonGH-27091)

sweeneyde · 2021-07-19T14:36:31Z

Thanks @ambv, and congrats on the new role!

ambv · 2021-07-19T14:37:09Z

Thanks for the kind words, and thanks for your change!

sweeneyde added 11 commits July 6, 2021 16:29

Don't force inlining

90a6a0c

Use a Boyer-Moore skip table

19fa661

Switch to two-way in more cases

d2219d5

Tighter comments

3b8760f

Fix the direction of the cutoff inequality :)

5d86b8d

Separate a gap loop

2e9a9b1

update cutoffs

c84314c

tweak cutoffs, fix i - cut + 1 jump

7517c60

Refactor into smaller functions

e5ef33b

Focus the table on the last character of the pattern, giving a tighter inner loop.

a01fb70

tweak cutoffs

6be0c32

the-knights-who-say-ni added the CLA signed label Jul 12, 2021

bedevere-bot added the awaiting review label Jul 12, 2021

blurb-it bot and others added 6 commits July 12, 2021 04:06

📜🤖 Added by blurb_it.

f9fe807

Fix sanitizer warnings

25f9b56

Merge branch 'fastersearch' of https://github.com/sweeneyde/cpython into fastersearch

d3f8438

Refactor for sanitizer

4379ea6

Don't goto declarations

e336b79

Emphasize to the compiler that the loop is a loop

b8df3e1

ambv reviewed Jul 15, 2021

Use Py_LOCAL_INLINE

869f7d0

Use Py_LOCAL_INLINE again

747a39a

sweeneyde added 2 commits July 15, 2021 23:22

Tweak thresholds based on most recent comparison.

72143ef

Fix greater/less than sign.

4d7d102

ambv merged commit d01dceb into python:main Jul 19, 2021

bedevere-bot removed the awaiting review label Jul 19, 2021

sthagen added a commit to sthagen/python-cpython that referenced this pull request Jul 19, 2021

Merge pull request #539 from python/main

7cb0dbb

sweeneyde deleted the fastersearch branch September 8, 2021 01:25