⚡ Much faster ResponseReader performance (backports #642) #651
Merged
nevans merged 5 commits into v0.4-stable on Apr 22, 2026
Conversation
TruffleRuby handles `io.gets(CRLF, limit)` differently when the limit cuts in the middle of the terminator. It's helpful behavior, but it's different enough to break the tests.
A very large response with many small repeated literals can trigger
super-linear (quadratic) run time. This happens because the regular
expression that checks for literal continuation re-scans from the
beginning of the buffer every time.
This could be mitigated by searching from an offset, based on what has
already been processed, or only searching the most recent line (before
merging it with the buffer), but that is still `O(n)` on line length.
The regexp is anchored to the end of the string, so searching in reverse
from the end of the string should be `O(1)`. This is accomplished by
converting `=~` to `rindex`.
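A minimal sketch of the idea, with illustrative names and regexp (not the actual `Net::IMAP::ResponseReader` code):

```ruby
# Illustrative sketch only -- the constant name and regexp are
# assumptions, not the actual net-imap source.
LITERAL_TAIL = /\{(\d+)\}\r\n\z/

# Before: even though the regexp is anchored with \z, matching scans
# forward from the start of the buffer, so each check is O(n).
def literal_size_via_match(buff)
  md = LITERAL_TAIL.match(buff)
  md && Integer(md[1])
end

# After: String#rindex searches backward from the end of the string, so
# when a literal terminator is present near the end it is found quickly;
# only the short tail is then matched against the anchored regexp.
def literal_size_via_rindex(buff)
  open = buff.rindex("{") or return nil
  md = LITERAL_TAIL.match(buff, open)
  md && Integer(md[1])
end
```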
Note that this _does_ slow down the "no literals" scenario.
```
$ benchmark-driver benchmarks/response_reader.yml --filter KiB
Warming up --------------------------------------
1KiB with no literals 143.564k i/s - 153.197k times in 1.067099s (6.97μs/i)
10KiB with no literals 27.394k i/s - 28.864k times in 1.053670s (36.50μs/i)
100KiB with no literals 2.926k i/s - 3.157k times in 1.079109s (341.81μs/i)
1KiB of 25B literals 2.786k i/s - 2.970k times in 1.066159s (358.98μs/i)
10KiB of 25B literals 263.498 i/s - 286.000 times in 1.085396s (3.80ms/i)
100KiB of 25B literals 19.470 i/s - 20.000 times in 1.027203s (51.36ms/i)
1KiB of 0B literals 530.014 i/s - 530.000 times in 0.999974s (1.89ms/i)
10KiB of 0B literals 45.239 i/s - 50.000 times in 1.105233s (22.10ms/i)
100KiB of 0B literals 3.075 i/s - 4.000 times in 1.300721s (325.18ms/i)
Calculating -------------------------------------
local YJIT
1KiB with no literals 137.049k 159.971k i/s - 430.691k times in 3.142607s 2.692304s
10KiB with no literals 27.272k 28.101k i/s - 82.181k times in 3.013413s 2.924470s
100KiB with no literals 2.941k 2.937k i/s - 8.776k times in 2.984095s 2.988129s
1KiB of 25B literals 2.803k 4.136k i/s - 8.357k times in 2.981249s 2.020772s
10KiB of 25B literals 262.978 385.394 i/s - 790.000 times in 3.004055s 2.049850s
100KiB of 25B literals 18.355 22.549 i/s - 58.000 times in 3.159962s 2.572152s
1KiB of 0B literals 505.733 759.572 i/s - 1.590k times in 3.143953s 2.093285s
10KiB of 0B literals 45.414 67.569 i/s - 135.000 times in 2.972648s 1.997962s
100KiB of 0B literals 2.722 3.510 i/s - 9.000 times in 3.306786s 2.564007s
Comparison:
1KiB with no literals
YJIT: 159971.1 i/s
local: 137049.0 i/s - 1.17x slower
10KiB with no literals
YJIT: 28101.2 i/s
local: 27271.7 i/s - 1.03x slower
100KiB with no literals
local: 2940.9 i/s
YJIT: 2937.0 i/s - 1.00x slower
1KiB of 25B literals
YJIT: 4135.5 i/s
local: 2803.2 i/s - 1.48x slower
10KiB of 25B literals
YJIT: 385.4 i/s
local: 263.0 i/s - 1.47x slower
100KiB of 25B literals
YJIT: 22.5 i/s
local: 18.4 i/s - 1.23x slower
1KiB of 0B literals
YJIT: 759.6 i/s
local: 505.7 i/s - 1.50x slower
10KiB of 0B literals
YJIT: 67.6 i/s
local: 45.4 i/s - 1.49x slower
100KiB of 0B literals
YJIT: 3.5 i/s
local: 2.7 i/s - 1.29x slower
```
For responses larger than 10KiB, the benchmarks do take another dip.
Despite that, I believe the algorithm _is_ still linear, and that the
performance hit on large responses is probably due to the large strings
running into memory-locality (paging/caching) bottlenecks.
I was surprised at how much slower `buff.rindex` is (vs `=~`) when it
doesn't match. This speeds up that case significantly (it's now faster
than it was prior to the `rindex` change), with only a small impact in
the case when it does match.
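One way to short-circuit the non-matching case is to check the last few bytes first. This is a hedged sketch with illustrative names, not necessarily the exact guard used in the patch:

```ruby
# Sketch: bail out cheaply when the buffer cannot possibly end with a
# literal marker, so rindex never scans backward through a large
# non-matching string.  Names are illustrative assumptions.
def literal_size(buff)
  # String#end_with? only inspects the final bytes, so this guard is
  # cheap for the common "no literal" case.
  return nil unless buff.end_with?("}\r\n")
  open = buff.rindex("{") or return nil
  md = /\{(\d+)\}\r\n\z/.match(buff, open)
  md && Integer(md[1])
end
```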
```
$ benchmark-driver benchmarks/response_reader.yml --filter KiB
Warming up --------------------------------------
1KiB with no literals 202.754k i/s - 210.144k times in 1.036449s (4.93μs/i)
10KiB with no literals 55.683k i/s - 57.541k times in 1.033362s (17.96μs/i)
100KiB with no literals 6.654k i/s - 7.176k times in 1.078491s (150.29μs/i)
1KiB of 25B literals 2.780k i/s - 2.959k times in 1.064363s (359.70μs/i)
10KiB of 25B literals 260.357 i/s - 286.000 times in 1.098491s (3.84ms/i)
100KiB of 25B literals 19.485 i/s - 20.000 times in 1.026445s (51.32ms/i)
1KiB of 0B literals 506.675 i/s - 550.000 times in 1.085508s (1.97ms/i)
10KiB of 0B literals 44.384 i/s - 45.000 times in 1.013872s (22.53ms/i)
100KiB of 0B literals 3.063 i/s - 4.000 times in 1.305939s (326.48ms/i)
Calculating -------------------------------------
local YJIT
1KiB with no literals 194.355k 247.756k i/s - 608.261k times in 3.129645s 2.455086s
10KiB with no literals 55.733k 58.585k i/s - 167.049k times in 2.997311s 2.851414s
100KiB with no literals 6.553k 6.453k i/s - 19.961k times in 3.045870s 3.093461s
1KiB of 25B literals 2.732k 4.061k i/s - 8.340k times in 3.052737s 2.053682s
10KiB of 25B literals 256.552 379.524 i/s - 781.000 times in 3.044220s 2.057840s
100KiB of 25B literals 17.804 23.286 i/s - 58.000 times in 3.257733s 2.490779s
1KiB of 0B literals 467.714 703.446 i/s - 1.520k times in 3.249846s 2.160791s
10KiB of 0B literals 45.376 65.876 i/s - 133.000 times in 2.931045s 2.018955s
100KiB of 0B literals 3.072 3.840 i/s - 9.000 times in 2.929458s 2.343586s
Comparison:
1KiB with no literals
YJIT: 247755.5 i/s
local: 194354.7 i/s - 1.27x slower
10KiB with no literals
YJIT: 58584.6 i/s
local: 55733.0 i/s - 1.05x slower
100KiB with no literals
local: 6553.5 i/s
YJIT: 6452.6 i/s - 1.02x slower
1KiB of 25B literals
YJIT: 4061.0 i/s
local: 2732.0 i/s - 1.49x slower
10KiB of 25B literals
YJIT: 379.5 i/s
local: 256.6 i/s - 1.48x slower
100KiB of 25B literals
YJIT: 23.3 i/s
local: 17.8 i/s - 1.31x slower
1KiB of 0B literals
YJIT: 703.4 i/s
local: 467.7 i/s - 1.50x slower
10KiB of 0B literals
YJIT: 65.9 i/s
local: 45.4 i/s - 1.45x slower
100KiB of 0B literals
YJIT: 3.8 i/s
local: 3.1 i/s - 1.25x slower
```
Unfortunately, neither `#rindex` nor even `#end_with?` appears to be
truly `O(1)` for very large strings (100KiB+). I'm guessing that this is
due to memory locality and caching issues. But by parsing the literal
from the latest `line` (rather than the full buffer), we mostly avoid
that problem.
Also, by explicitly parsing `literal_size` immediately after reading the
line, we don't need to parse it again in `#done?`.
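The shape of that change can be sketched with a tiny reader class; the names (`TinyReader`, `read_line`, `@literal_size`) are illustrative assumptions, not the actual implementation:

```ruby
require "stringio"

# Illustrative sketch, not the real Net::IMAP::ResponseReader: the
# literal size is parsed once from the short, just-read line instead of
# re-scanning the whole accumulated buffer, and it is cached so #done?
# doesn't have to parse anything.
class TinyReader
  CRLF = "\r\n"

  def initialize(io)
    @io = io
    @buff = +""
  end

  def read_line
    line = @io.gets(CRLF) or raise EOFError, "unexpected EOF"
    # `line` is short, so matching its tail stays cheap no matter how
    # large @buff has grown.
    md = /\{(\d+)\}\r\n\z/.match(line)
    @literal_size = md && Integer(md[1])
    @buff << line
  end

  # A literal continuation is pending only if the last line announced
  # one; no re-parsing of the buffer is needed here.
  def done?
    @literal_size.nil?
  end
end
```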
```
$ benchmark-driver benchmarks/response_reader.yml --filter KiB
Warming up --------------------------------------
1KiB with no literals 202.846k i/s - 214.181k times in 1.055878s (4.93μs/i)
10KiB with no literals 55.699k i/s - 57.354k times in 1.029717s (17.95μs/i)
100KiB with no literals 6.622k i/s - 6.688k times in 1.009943s (151.01μs/i)
1KiB of 25B literals 3.428k i/s - 3.751k times in 1.094065s (291.67μs/i)
10KiB of 25B literals 342.733 i/s - 350.000 times in 1.021202s (2.92ms/i)
100KiB of 25B literals 34.343 i/s - 36.000 times in 1.048234s (29.12ms/i)
1KiB of 0B literals 683.800 i/s - 690.000 times in 1.009066s (1.46ms/i)
10KiB of 0B literals 69.186 i/s - 70.000 times in 1.011759s (14.45ms/i)
100KiB of 0B literals 6.914 i/s - 7.000 times in 1.012449s (144.64ms/i)
Calculating -------------------------------------
local YJIT
1KiB with no literals 193.622k 250.330k i/s - 608.539k times in 3.142929s 2.430944s
10KiB with no literals 55.944k 58.881k i/s - 167.096k times in 2.986849s 2.837843s
100KiB with no literals 6.550k 6.480k i/s - 19.866k times in 3.033041s 3.065821s
1KiB of 25B literals 3.445k 5.520k i/s - 10.285k times in 2.985693s 1.863057s
10KiB of 25B literals 338.578 548.670 i/s - 1.028k times in 3.036224s 1.873620s
100KiB of 25B literals 33.829 55.860 i/s - 103.000 times in 3.044728s 1.843900s
1KiB of 0B literals 626.970 1.103k i/s - 2.051k times in 3.271287s 1.860275s
10KiB of 0B literals 66.065 108.347 i/s - 207.000 times in 3.133301s 1.910523s
100KiB of 0B literals 6.720 8.159 i/s - 20.000 times in 2.976273s 2.451265s
Comparison:
1KiB with no literals
YJIT: 250330.3 i/s
local: 193621.6 i/s - 1.29x slower
10KiB with no literals
YJIT: 58881.3 i/s
local: 55943.9 i/s - 1.05x slower
100KiB with no literals
local: 6549.9 i/s
YJIT: 6479.8 i/s - 1.01x slower
1KiB of 25B literals
YJIT: 5520.5 i/s
local: 3444.8 i/s - 1.60x slower
10KiB of 25B literals
YJIT: 548.7 i/s
local: 338.6 i/s - 1.62x slower
100KiB of 25B literals
YJIT: 55.9 i/s
local: 33.8 i/s - 1.65x slower
1KiB of 0B literals
YJIT: 1102.5 i/s
local: 627.0 i/s - 1.76x slower
10KiB of 0B literals
YJIT: 108.3 i/s
local: 66.1 i/s - 1.64x slower
100KiB of 0B literals
YJIT: 8.2 i/s
local: 6.7 i/s - 1.21x slower
```
This backports the first several commits from #642 to v0.4-stable. While
the other commits also improve performance, these are all that is needed
to improve the algorithmic complexity from O(n²) to O(n). The remaining
commits also introduce a minor breaking change.