
⚡ Much faster ResponseReader performance (backports #642) #651

Merged
nevans merged 5 commits into v0.4-stable from backport/v0.4/response_reader-nonlinear-performance
Apr 22, 2026

Conversation

@nevans
Collaborator

@nevans nevans commented Apr 22, 2026

This backports the first several commits from #642 to v0.4-stable. The remaining commits also improve performance, but these are all that is needed to improve the algorithmic complexity from O(n²) to O(n).

The remaining commits also introduce a minor breaking change.

nevans added 5 commits April 21, 2026 09:55
TruffleRuby handles `io.gets(CRLF, limit)` differently when the limit
cuts in the middle of the terminator.  It's helpful behavior, but it's
different enough to break the tests.
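
To illustrate what "the limit cuts in the middle of the terminator" means, here is a minimal sketch using `StringIO` from the standard library. The input string and the limit of 4 are made up for illustration. The expectation in the comment describes CRuby only; what TruffleRuby returns is not stated in the commit message and is not assumed here.

```ruby
require "stringio"

CRLF = "\r\n"

# Hypothetical input: the first line is "abc" followed by CRLF.
io = StringIO.new("abc\r\ndef\r\n")

# A limit of 4 bytes ends between "\r" and "\n".
# On CRuby the limit wins: the result is "abc\r" and the "\n" stays unread.
first = io.gets(CRLF, 4)

puts first.inspect
```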

A very large response with many small repeated literals can trigger
super-linear time.  This happens because the regular expression that
checks for literal continuation matches from the beginning of the buffer
every time.
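
A minimal sketch of the pattern described above, assuming a simplified reader. `build_response`, `read_naive`, and the `LITERAL` constant are hypothetical names for illustration, not the library's actual code. Every line that ends with a literal marker triggers a match against the whole accumulated buffer, so a response with many small literals repeats work proportional to everything read so far.

```ruby
require "stringio"

CRLF    = "\r\n"
LITERAL = /\{(\d+)\}\r\n\z/ # literal continuation marker at the end of the buffer

# Builds a hypothetical response containing `count` literals of `size` bytes.
def build_response(count, size)
  body = count.times.map { "{#{size}}\r\n#{"x" * size} " }.join
  "* 1 FETCH (#{body})\r\n"
end

# Naive reader: after each line, the regexp is applied to the full buffer,
# and a forward match starts from index 0 every time.
def read_naive(io)
  buff = String.new
  loop do
    line = io.gets(CRLF) or break
    buff << line
    break unless LITERAL =~ buff # cost grows with buff.bytesize
    buff << io.read($1.to_i).to_s # append the literal payload
  end
  buff
end

response = build_response(3, 5)
puts read_naive(StringIO.new(response)) == response
```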

This could be mitigated by searching from an offset, based on what has
already been processed, or only searching the most recent line (before
merging it with the buffer), but that is still `O(n)` on line length.

The regexp is anchored to the end of the string, so searching in reverse
from the end of the string should be `O(1)`.  This is accomplished by
converting `=~` to `rindex`.
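
As a rough sketch of that conversion (the method names and the regexp constant are illustrative assumptions, not the actual diff): `String#rindex` accepts a regexp, tries match positions from the end of the string backwards, and sets `$~` just like `=~`, so `$1` is still available afterwards.

```ruby
LITERAL = /\{(\d+)\}\r\n\z/

# Forward search: the regexp engine starts at index 0.
def literal_size_forward(buff)
  LITERAL =~ buff && $1.to_i
end

# Reverse search: candidate positions are tried from the end backwards,
# so a trailing "{n}\r\n" is found after only a few attempts.
def literal_size_reverse(buff)
  buff.rindex(LITERAL) && $1.to_i
end

buff = "* 1 FETCH (BODY[] {25}\r\n"
puts literal_size_forward(buff) == literal_size_reverse(buff)
```

When nothing matches, the reverse search still has to walk back through the whole string before giving up, which is consistent with the slowdown noted for the "no literals" scenario.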

Note that this _does_ slow down the "no literals" scenario.

```
$ benchmark-driver benchmarks/response_reader.yml --filter KiB
Warming up --------------------------------------
  1KiB with no literals   143.564k i/s -    153.197k times in 1.067099s (6.97μs/i)
 10KiB with no literals    27.394k i/s -     28.864k times in 1.053670s (36.50μs/i)
100KiB with no literals     2.926k i/s -      3.157k times in 1.079109s (341.81μs/i)
  1KiB of  25B literals     2.786k i/s -      2.970k times in 1.066159s (358.98μs/i)
 10KiB of  25B literals    263.498 i/s -     286.000 times in 1.085396s (3.80ms/i)
100KiB of  25B literals     19.470 i/s -      20.000 times in 1.027203s (51.36ms/i)
  1KiB of   0B literals    530.014 i/s -     530.000 times in 0.999974s (1.89ms/i)
 10KiB of   0B literals     45.239 i/s -      50.000 times in 1.105233s (22.10ms/i)
100KiB of   0B literals      3.075 i/s -       4.000 times in 1.300721s (325.18ms/i)
Calculating -------------------------------------
                             local        YJIT
  1KiB with no literals   137.049k    159.971k i/s -    430.691k times in 3.142607s 2.692304s
 10KiB with no literals    27.272k     28.101k i/s -     82.181k times in 3.013413s 2.924470s
100KiB with no literals     2.941k      2.937k i/s -      8.776k times in 2.984095s 2.988129s
  1KiB of  25B literals     2.803k      4.136k i/s -      8.357k times in 2.981249s 2.020772s
 10KiB of  25B literals    262.978     385.394 i/s -     790.000 times in 3.004055s 2.049850s
100KiB of  25B literals     18.355      22.549 i/s -      58.000 times in 3.159962s 2.572152s
  1KiB of   0B literals    505.733     759.572 i/s -      1.590k times in 3.143953s 2.093285s
 10KiB of   0B literals     45.414      67.569 i/s -     135.000 times in 2.972648s 1.997962s
100KiB of   0B literals      2.722       3.510 i/s -       9.000 times in 3.306786s 2.564007s

Comparison:
               1KiB with no literals
                   YJIT:    159971.1 i/s
                  local:    137049.0 i/s - 1.17x  slower

              10KiB with no literals
                   YJIT:     28101.2 i/s
                  local:     27271.7 i/s - 1.03x  slower

             100KiB with no literals
                  local:      2940.9 i/s
                   YJIT:      2937.0 i/s - 1.00x  slower

               1KiB of  25B literals
                   YJIT:      4135.5 i/s
                  local:      2803.2 i/s - 1.48x  slower

              10KiB of  25B literals
                   YJIT:       385.4 i/s
                  local:       263.0 i/s - 1.47x  slower

             100KiB of  25B literals
                   YJIT:        22.5 i/s
                  local:        18.4 i/s - 1.23x  slower

               1KiB of   0B literals
                   YJIT:       759.6 i/s
                  local:       505.7 i/s - 1.50x  slower

              10KiB of   0B literals
                   YJIT:        67.6 i/s
                  local:        45.4 i/s - 1.49x  slower

             100KiB of   0B literals
                   YJIT:         3.5 i/s
                  local:         2.7 i/s - 1.29x  slower
```

For responses that are larger than 10KiB, the benchmarks do take another
dip.  Despite that, I believe the algorithm _is_ still linear, and that
the performance hit on large responses is probably due to the large
strings inducing memory locality (paging/caching) bottlenecks.
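
One way to sanity-check the linearity claim is to convert the "local" throughput numbers for the 25B-literal cases in the comparison above into a cost per KiB. This is only arithmetic on the reported figures, not a new measurement. A quadratic algorithm would show the per-KiB cost growing about 10x with each 10x increase in input size. Here it goes from roughly 357μs to 380μs to 543μs per KiB, which is about 1.5x over a 100x larger input.

```ruby
# "local" iterations per second for the 25B-literal benchmarks, keyed by
# response size in KiB (values copied from the comparison output above).
ips = { 1 => 2803.2, 10 => 263.0, 100 => 18.4 }

# Microseconds spent per KiB of response at each size.
per_kib_us = ips.to_h { |kib, rate| [kib, 1_000_000.0 / rate / kib] }

# How much the per-KiB cost grew between the 1KiB and 100KiB cases.
growth = per_kib_us[100] / per_kib_us[1]

per_kib_us.each { |kib, us| puts format("%3dKiB: %.1f us per KiB", kib, us) }
puts format("per-KiB cost grew %.2fx for a 100x larger input", growth)
```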

I was surprised at how much slower `buff.rindex` is (vs `=~`) when it
doesn't match.  This speeds up that case significantly (it's now faster
than it was prior to the `rindex` change), with only a small impact in
the case when it does match.
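
The commit message does not say how the non-matching case was sped up. One plausible approach, shown here purely as an assumption, is a cheap suffix check before the reverse regexp search, so that lines which cannot end in a literal marker never reach `rindex`. The method name and constant below are hypothetical.

```ruby
LITERAL = /\{(\d+)\}\r\n\z/

# Assumed approach: reject the common "no literal" case with a fixed-size
# suffix comparison, and only run the regexp when the suffix could match.
def literal_size(buff)
  return nil unless buff.end_with?("}\r\n")
  buff.rindex(LITERAL) && $1.to_i
end

puts literal_size("* 1 FETCH (BODY[] {25}\r\n").inspect
puts literal_size("* OK done\r\n").inspect
```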

```
$ benchmark-driver benchmarks/response_reader.yml --filter KiB
Warming up --------------------------------------
  1KiB with no literals   202.754k i/s -    210.144k times in 1.036449s (4.93μs/i)
 10KiB with no literals    55.683k i/s -     57.541k times in 1.033362s (17.96μs/i)
100KiB with no literals     6.654k i/s -      7.176k times in 1.078491s (150.29μs/i)
  1KiB of  25B literals     2.780k i/s -      2.959k times in 1.064363s (359.70μs/i)
 10KiB of  25B literals    260.357 i/s -     286.000 times in 1.098491s (3.84ms/i)
100KiB of  25B literals     19.485 i/s -      20.000 times in 1.026445s (51.32ms/i)
  1KiB of   0B literals    506.675 i/s -     550.000 times in 1.085508s (1.97ms/i)
 10KiB of   0B literals     44.384 i/s -      45.000 times in 1.013872s (22.53ms/i)
100KiB of   0B literals      3.063 i/s -       4.000 times in 1.305939s (326.48ms/i)
Calculating -------------------------------------
                             local        YJIT
  1KiB with no literals   194.355k    247.756k i/s -    608.261k times in 3.129645s 2.455086s
 10KiB with no literals    55.733k     58.585k i/s -    167.049k times in 2.997311s 2.851414s
100KiB with no literals     6.553k      6.453k i/s -     19.961k times in 3.045870s 3.093461s
  1KiB of  25B literals     2.732k      4.061k i/s -      8.340k times in 3.052737s 2.053682s
 10KiB of  25B literals    256.552     379.524 i/s -     781.000 times in 3.044220s 2.057840s
100KiB of  25B literals     17.804      23.286 i/s -      58.000 times in 3.257733s 2.490779s
  1KiB of   0B literals    467.714     703.446 i/s -      1.520k times in 3.249846s 2.160791s
 10KiB of   0B literals     45.376      65.876 i/s -     133.000 times in 2.931045s 2.018955s
100KiB of   0B literals      3.072       3.840 i/s -       9.000 times in 2.929458s 2.343586s

Comparison:
               1KiB with no literals
                   YJIT:    247755.5 i/s
                  local:    194354.7 i/s - 1.27x  slower

              10KiB with no literals
                   YJIT:     58584.6 i/s
                  local:     55733.0 i/s - 1.05x  slower

             100KiB with no literals
                  local:      6553.5 i/s
                   YJIT:      6452.6 i/s - 1.02x  slower

               1KiB of  25B literals
                   YJIT:      4061.0 i/s
                  local:      2732.0 i/s - 1.49x  slower

              10KiB of  25B literals
                   YJIT:       379.5 i/s
                  local:       256.6 i/s - 1.48x  slower

             100KiB of  25B literals
                   YJIT:        23.3 i/s
                  local:        17.8 i/s - 1.31x  slower

               1KiB of   0B literals
                   YJIT:       703.4 i/s
                  local:       467.7 i/s - 1.50x  slower

              10KiB of   0B literals
                   YJIT:        65.9 i/s
                  local:        45.4 i/s - 1.45x  slower

             100KiB of   0B literals
                   YJIT:         3.8 i/s
                  local:         3.1 i/s - 1.25x  slower
```

Unfortunately, neither `#rindex` nor even `#end_with?` appears to be
truly `O(1)` for very large strings (100K+).  I'm guessing that this is
due to memory locality and caching issues.  But, by parsing the literal
from the latest `line` (rather than the full buffer), we mostly avoid
that problem.

Also, by explicitly parsing literal_size immediately after reading the
line, we don't need to parse it again in `#done?`.
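
A minimal sketch of that idea, assuming a simplified reader. `SketchReader` and its methods are hypothetical names for illustration and do not mirror the real class. The literal size is parsed once from the line that was just read, and the same value decides both whether to read a literal and whether the response is complete.

```ruby
require "stringio"

# Hypothetical, simplified reader; not the library's implementation.
class SketchReader
  CRLF    = "\r\n"
  LITERAL = /\{(\d+)\}\r\n\z/

  def initialize(io)
    @io = io
  end

  # Reads one complete response, including any literals it contains.
  def read_response
    buff = String.new
    loop do
      line = @io.gets(CRLF) or break
      buff << line
      size = parse_literal_size(line) # inspects only the newest line
      break unless size               # no literal marker: response is done
      buff << @io.read(size).to_s     # append the literal payload
    end
    buff
  end

  private

  # Returns the literal size announced at the end of `line`, or nil.
  def parse_literal_size(line)
    return nil unless line.end_with?("}\r\n")
    line.rindex(LITERAL) && $1.to_i
  end
end

data   = "* 1 FETCH (BODY[] {5}\r\nhello FLAGS {0}\r\n)\r\n* OK\r\n"
reader = SketchReader.new(StringIO.new(data))
puts reader.read_response.inspect
puts reader.read_response.inspect
```

Because each search is bounded by the length of a single line, its cost no longer depends on how large the accumulated buffer has become.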

```
$ benchmark-driver benchmarks/response_reader.yml --filter KiB
Warming up --------------------------------------
  1KiB with no literals   202.846k i/s -    214.181k times in 1.055878s (4.93μs/i)
 10KiB with no literals    55.699k i/s -     57.354k times in 1.029717s (17.95μs/i)
100KiB with no literals     6.622k i/s -      6.688k times in 1.009943s (151.01μs/i)
  1KiB of  25B literals     3.428k i/s -      3.751k times in 1.094065s (291.67μs/i)
 10KiB of  25B literals    342.733 i/s -     350.000 times in 1.021202s (2.92ms/i)
100KiB of  25B literals     34.343 i/s -      36.000 times in 1.048234s (29.12ms/i)
  1KiB of   0B literals    683.800 i/s -     690.000 times in 1.009066s (1.46ms/i)
 10KiB of   0B literals     69.186 i/s -      70.000 times in 1.011759s (14.45ms/i)
100KiB of   0B literals      6.914 i/s -       7.000 times in 1.012449s (144.64ms/i)
Calculating -------------------------------------
                             local        YJIT
  1KiB with no literals   193.622k    250.330k i/s -    608.539k times in 3.142929s 2.430944s
 10KiB with no literals    55.944k     58.881k i/s -    167.096k times in 2.986849s 2.837843s
100KiB with no literals     6.550k      6.480k i/s -     19.866k times in 3.033041s 3.065821s
  1KiB of  25B literals     3.445k      5.520k i/s -     10.285k times in 2.985693s 1.863057s
 10KiB of  25B literals    338.578     548.670 i/s -      1.028k times in 3.036224s 1.873620s
100KiB of  25B literals     33.829      55.860 i/s -     103.000 times in 3.044728s 1.843900s
  1KiB of   0B literals    626.970      1.103k i/s -      2.051k times in 3.271287s 1.860275s
 10KiB of   0B literals     66.065     108.347 i/s -     207.000 times in 3.133301s 1.910523s
100KiB of   0B literals      6.720       8.159 i/s -      20.000 times in 2.976273s 2.451265s

Comparison:
               1KiB with no literals
                   YJIT:    250330.3 i/s
                  local:    193621.6 i/s - 1.29x  slower

              10KiB with no literals
                   YJIT:     58881.3 i/s
                  local:     55943.9 i/s - 1.05x  slower

             100KiB with no literals
                  local:      6549.9 i/s
                   YJIT:      6479.8 i/s - 1.01x  slower

               1KiB of  25B literals
                   YJIT:      5520.5 i/s
                  local:      3444.8 i/s - 1.60x  slower

              10KiB of  25B literals
                   YJIT:       548.7 i/s
                  local:       338.6 i/s - 1.62x  slower

             100KiB of  25B literals
                   YJIT:        55.9 i/s
                  local:        33.8 i/s - 1.65x  slower

               1KiB of   0B literals
                   YJIT:      1102.5 i/s
                  local:       627.0 i/s - 1.76x  slower

              10KiB of   0B literals
                   YJIT:       108.3 i/s
                  local:        66.1 i/s - 1.64x  slower

             100KiB of   0B literals
                   YJIT:         8.2 i/s
                  local:         6.7 i/s - 1.21x  slower
```
@nevans nevans added labels Apr 22, 2026: backport (This issue or PR is for a stable release branch), performance (related to CPU use, memory use, latency, etc)
@nevans nevans merged commit 88d9523 into v0.4-stable Apr 22, 2026
34 checks passed
@nevans nevans deleted the backport/v0.4/response_reader-nonlinear-performance branch April 22, 2026 21:43
@nevans nevans added labels Apr 22, 2026: bug (Something isn't working), security vulnerability patch (Pull requests that address security vulnerabilities)