perf(h2,tls): hybrid emit selector — DRAIN small bodies, GATHER large (#30) #32
Merged
Documents the two HTTP/2 TLS emit paths and the per-pass selector that sits between them, with the per-strategy memcpy / allocation arithmetic and the bench numbers driving the threshold. Companion to the Phase 1 implementation work on the same issue.
Two TLS emit paths now coexist behind an adaptive selector:

  DRAIN:  drain nghttp2 via mem_send into a 16 KiB stack buffer and
          BIO_write straight into the plaintext BIO. No records[] /
          body_refs[] gather machinery, no per-pass emalloc churn.
          Wins on short responses where alloc/zval_ptr_dtor cost
          dominates.

  GATHER: drive nghttp2 via session_send + NO_COPY callbacks, fold
          frames into records[] (with body_refs[] keeping bodies
          alive), then memcpy everything into stage[] and ship with
          one SSL_write_ex. Wins on bodies that fill at least one
          TLS record (amortises cipher setup; only one memcpy of
          the body instead of two: mem_send + BIO_write).
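
The body-copy arithmetic behind the two paths can be sketched as a small model. This is only an illustration of the counts stated above (two body copies for DRAIN, one for GATHER); the names `path_sketch_t` and `sketch_body_copy_bytes` are hypothetical and do not come from the patch, and frame headers and any copies inside the TLS library are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum for this sketch only; not the patch's own type. */
typedef enum { PATH_DRAIN, PATH_GATHER } path_sketch_t;

/* Body bytes memcpy'd per response, following the description above:
 *   DRAIN  copies the body twice (mem_send into the stack buffer,
 *          then BIO_write into the plaintext BIO);
 *   GATHER copies it once (into stage[]).
 * Frame headers and TLS-internal copies are not modelled. */
static size_t sketch_body_copy_bytes(path_sketch_t path, size_t body_len)
{
    assert(path == PATH_DRAIN || path == PATH_GATHER);
    return (path == PATH_DRAIN ? 2u : 1u) * body_len;
}
```

For a 4 KiB body this model gives 8192 copied bytes on DRAIN versus 4096 on GATHER. It says nothing about allocation cost, which is the other half of the trade-off.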
Selector lives on http2_session_t::large_streams_pending. Each submit
site (dynamic submit_response / submit_response_streaming, static
buffered + streaming submit) pins the counter when the response body
exceeds H2_TLS_HYBRID_LARGE_THRESHOLD (2 KiB); cb_on_stream_close
unpins. Streaming responses with unknown total size are pessimistically
treated as large. http2_session_emit takes DRAIN while the counter is
zero, GATHER otherwise.
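
A minimal sketch of the pin/unpin logic described above, assuming a per-stream flag records whether the stream pinned the counter. Only `large_streams_pending` and `H2_TLS_HYBRID_LARGE_THRESHOLD` (2 KiB) come from the description; the struct and function names are hypothetical stand-ins, not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold value taken from the description (2 KiB). */
#define H2_TLS_HYBRID_LARGE_THRESHOLD 2048u

typedef enum { EMIT_DRAIN, EMIT_GATHER } emit_path_t;

/* Hypothetical stand-in for the relevant field of http2_session_t. */
typedef struct {
    unsigned large_streams_pending;
} session_sketch_t;

/* Hypothetical per-stream flag so that close only unpins what submit pinned. */
typedef struct {
    bool pinned_large;
} stream_sketch_t;

/* Submit site: pin when the body exceeds the threshold, or when the
 * total size is unknown (streaming), which is treated as large. */
static void sketch_on_submit(session_sketch_t *s, stream_sketch_t *st,
                             size_t body_len, bool size_known)
{
    st->pinned_large = !size_known || body_len > H2_TLS_HYBRID_LARGE_THRESHOLD;
    if (st->pinned_large) {
        s->large_streams_pending++;
    }
}

/* Stream close: unpin if this stream had pinned the counter. */
static void sketch_on_stream_close(session_sketch_t *s, stream_sketch_t *st)
{
    if (st->pinned_large) {
        assert(s->large_streams_pending > 0);
        s->large_streams_pending--;
        st->pinned_large = false;
    }
}

/* Per-pass selector: DRAIN while nothing large is pending, GATHER otherwise. */
static emit_path_t sketch_pick_emit_path(const session_sketch_t *s)
{
    return s->large_streams_pending == 0 ? EMIT_DRAIN : EMIT_GATHER;
}
```

With a counter, one large stream in flight moves the whole session to GATHER until that stream closes, so small responses multiplexed alongside it also take the GATHER path for those passes.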
Override the selector with TRUE_ASYNC_H2_TLS_EMIT_MODE = drain | gather
| hybrid (default) for A/B testing; env is read once and cached.
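
One way the read-once override could look, as a sketch only. The environment variable name and its three values come from the description; the function names, the exact-match lowercase comparison, and the fallback to hybrid for unrecognised values are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { EMIT_MODE_HYBRID, EMIT_MODE_DRAIN, EMIT_MODE_GATHER } emit_mode_t;

/* Map the env string to a mode. Unset means hybrid (the documented
 * default); treating unrecognised values as hybrid is an assumption. */
static emit_mode_t sketch_parse_emit_mode(const char *value)
{
    if (value == NULL) {
        return EMIT_MODE_HYBRID;
    }
    if (strcmp(value, "drain") == 0) {
        return EMIT_MODE_DRAIN;
    }
    if (strcmp(value, "gather") == 0) {
        return EMIT_MODE_GATHER;
    }
    return EMIT_MODE_HYBRID;
}

/* Read the environment once and cache the result. The plain static
 * cache is not thread-safe; a real implementation would need to
 * initialise it before worker threads start, or guard it. */
static emit_mode_t sketch_emit_mode(void)
{
    static int cached = -1;
    if (cached < 0) {
        cached = (int)sketch_parse_emit_mode(getenv("TRUE_ASYNC_H2_TLS_EMIT_MODE"));
    }
    assert(cached >= 0);
    return (emit_mode_t)cached;
}
```

Because the value is cached, changing the variable after the first call has no effect on the running process, which suits A/B runs where each mode gets its own server start.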
Bench (release PHP, h2 TLS, c=100 m=32, h2load -t 1, 10s × N median):

  body          gather   drain   hybrid   result
  static 100B   125k     146k    145k     drain win (~17%)
  static 1K     111k     120k    ~120k    drain win (~9%)
  static 4K     83k      76k     ~83k     gather win (~10%)
  static 16K    55k      40k     61k      gather win
  static 64K    17k      12k     17k      gather win
  dyn 3B        204k     264k    268k     drain win
  dyn 16K       70k      54k     75k      gather win
  dyn 64K       20k      13k     19k      gather win

Profile diff at static 4K (perf record -F999 -g): gather lowers
memmove from 8.57% to 6.75% (one body memcpy vs two in DRAIN), at
the cost of +1.14pp _emalloc for the gather scratch arrays — net
−0.7pp CPU translates to the ~10% RPS win.
phpt: server/h2 26/26, server/static+tls 27/28 (pre-existing
004-static-workers failure, unrelated).
Coverage: total lines 77.12% → 77.16% (+0.04 pp)
Profile (perf record -F 999 -g, static 4K): gather lowers `memmove` from 8.6% to 6.8% (one body memcpy vs two in DRAIN) at the cost of +1.1pp `_emalloc` for the gather scratch arrays; the net −0.7pp CPU translates to the ~10% RPS win.