perf(decode): inline sequence executor for direct path + auto-route decode_all (z000033 −24%, high-entropy-1m parity)#263
Conversation
|
Warning Review limit reached
More reviews will be available in 42 minutes and 30 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (12)
📝 WalkthroughWalkthroughAdds an x86_64 "inline donor" decode fast path: a BufferBackend capability and unsafe hook, SIMD overlap-copy primitives, UserSliceBackend implementation, and executor wiring with safety guards and fallback to legacy repeat. ChangesInline donor execution path
Sequence Diagram(s)sequenceDiagram
participant Executor as execute_one_sequence_pipelined
participant DecodeBuffer
participant BufferBackend
participant ExecPrims as exec_sequence_donor::x86
Executor->>DecodeBuffer: snapshot literal cursor and compute high/limits
Executor->>DecodeBuffer: DecodeBuffer::buffer_mut()
DecodeBuffer->>BufferBackend: donor_exec_one_sequence(lit_ptr, lit_len, offset, match_len)
BufferBackend->>ExecPrims: call copy16/wildcopy/overlap_copy8 helpers
ExecPrims->>BufferBackend: perform writes into backend slice
BufferBackend->>DecodeBuffer: (caller) advance_output_counter(n)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR ports donor zstd’s ZSTD_execSequence shape into the Rust decoder’s direct-write hot path (DecodeBuffer<UserSliceBackend>), aiming to reduce per-sequence dispatch overhead by inlining literal+match copying as one straight-line routine on x86/x86_64.
Changes:
- Add an opt-in
BufferBackend::donor_exec_one_sequencefast path (compile-time selected viaSUPPORTS_INLINE_DONOR_EXEC) and dispatch to it from the pipelined sequence executor. - Introduce an x86/x86_64 SSE2 helper module (
exec_sequence_donor) implementing donor-stylecopy16/ wildcopy / overlapCopy8 primitives. - Add
DecodeBufferhelpers to expose the backend mutably and to advance output/hash bookkeeping when bypassingpush/repeat.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
zstd/src/decoding/user_slice_buf.rs |
Implements the donor-style execSequence routine for UserSliceBackend and opts into the compile-time fast path. |
zstd/src/decoding/sequence_section_decoder.rs |
Adds the const-dispatched donor-exec branch inside execute_one_sequence_pipelined. |
zstd/src/decoding/mod.rs |
Wires in the new exec_sequence_donor module. |
zstd/src/decoding/exec_sequence_donor.rs |
New x86/x86_64 donor helper implementations for 16-byte copy, wildcopy, and overlapCopy8. |
zstd/src/decoding/decode_buffer.rs |
Adds backend access + bookkeeping helper (advance_output_counter) for donor-exec writes. |
zstd/src/decoding/buffer_backend.rs |
Extends the backend trait with the opt-in const and the unsafe donor-exec hook. |
… guard, explicit overshoot assert
…opy8 net offset, doc accuracy, hash assert
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/buffer_backend.rs`:
- Around line 78-90: The unsafe contract for the literal-copy hook is missing
the required read-side slack: callers must guarantee enough literal bytes for
the unconditional copy16/tail wildcopy that the donor path performs. Update the
safety doc for the method in buffer_backend.rs (the function implementing the
literal-copy hook used by donor_exec_one_sequence and
execute_one_sequence_pipelined) to add a bullet stating callers must ensure the
literal buffer has at least the extra read slack (e.g., high + 15 or
WILDCOPY_OVERLENGTH) so the unconditional copy16/wildcopy load cannot read
out-of-bounds; reference donor_exec_one_sequence,
execute_one_sequence_pipelined, and copy16 in the comment so future callers know
this requirement.
In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 178-214: The checkpoint/restore logic must also save and restore
the frame hasher to avoid committing rolled-back bytes: update
DecodeBufferCheckpoint to include a snapshot of the hasher state (the same
hasher used by advance_output_counter / self.hash) and modify
try_restore_checkpoint() to restore that saved hasher state when rewinding tail
and total_output_counter; alternatively, delay mutating self.hash in
advance_output_counter for donor-path writes until the caller confirms the block
is committed—choose one approach and apply it consistently by updating the
DecodeBufferCheckpoint type, its creation sites, and try_restore_checkpoint()
(and any code that constructs/restores checkpoints) so hash state remains
consistent with tail/total_output_counter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 38720835-8a1c-41d2-bfe4-ecae01c180da
📒 Files selected for processing (6)
zstd/src/decoding/buffer_backend.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/exec_sequence_donor.rszstd/src/decoding/mod.rszstd/src/decoding/sequence_section_decoder.rszstd/src/decoding/user_slice_buf.rs
…actual hash sites
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
zstd/src/decoding/buffer_backend.rs (1)
86-100:⚠️ Potential issue | 🟠 Major | ⚡ Quick winLine 91 still under-specifies the empty-literals case.
UserSliceBackend::donor_exec_one_sequencedoes an unconditionalcopy16before anylit_lengthcheck, solits.is_empty()still needs a 16-byte readable window. The currentlits.len() + 15wording only guarantees 15 bytes in that case. Please either document this asmax(lits.len() + 15, 16)or makelit_length == 0an explicit “must fall back” precondition.Suggested doc fix
- /// has ≥ `lits.len() + 15` initialised bytes addressable - /// from `lits.as_ptr()`. The current dispatch site + /// has at least `core::cmp::max(lits.len() + 15, 16)` + /// initialised bytes addressable from `lits.as_ptr()`. + /// In particular, `lits.is_empty()` must not use this path + /// unless the caller can still provide a 16-byte readable + /// window. The current dispatch site🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/decoding/buffer_backend.rs` around lines 86 - 100, The doc comment under the read-side slack description under-specifies the empty-literals case: update the contract to require a 16-byte readable window when lits can be empty by either (a) changing the wording from "lits.len() + 15" to "max(lits.len() + 15, 16)" so UserSliceBackend::donor_exec_one_sequence's unconditional copy16 has a safe read, or (b) add an explicit precondition that callers must treat lit_length == 0 as a guaranteed fallback and never call donor_exec_one_sequence when lits.is_empty(); reference the unconditional copy16 call and the donor_exec_one_sequence/sequence_section_decoder::execute_one_sequence_pipelined dispatch so reviewers can locate and verify the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 541-543: The donor_path_safe boolean currently requires
high.checked_add(15).is_some_and(|b| b <= lit_len) for all sequences, which
incorrectly blocks the inline donor path for short literal tails; change the
logic so the high+15 <= lit_len check is only enforced when the remaining
literal length after lit_cur_before is > 16 (i.e., when exec_sequence_donor
actually needs the literal wildcopy tail), otherwise skip that high-bound check;
update the expression that computes donor_path_safe (which uses
B::SUPPORTS_INLINE_DONOR_EXEC, lit_cur_before, lit_len, and high) to reflect
this conditional requirement so the inline donor path remains allowed for the
common 1..=16 literal case while preserving the copy16 guard behavior.
---
Duplicate comments:
In `@zstd/src/decoding/buffer_backend.rs`:
- Around line 86-100: The doc comment under the read-side slack description
under-specifies the empty-literals case: update the contract to require a
16-byte readable window when lits can be empty by either (a) changing the
wording from "lits.len() + 15" to "max(lits.len() + 15, 16)" so
UserSliceBackend::donor_exec_one_sequence's unconditional copy16 has a safe
read, or (b) add an explicit precondition that callers must treat lit_length ==
0 as a guaranteed fallback and never call donor_exec_one_sequence when
lits.is_empty(); reference the unconditional copy16 call and the
donor_exec_one_sequence/sequence_section_decoder::execute_one_sequence_pipelined
dispatch so reviewers can locate and verify the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4a8e6886-b1ae-4c00-94b6-6f519af2991b
📒 Files selected for processing (3)
zstd/src/decoding/buffer_backend.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/sequence_section_decoder.rs
… tighten donor exec docs
…−8.19%, low-entropy-1m −7.05%) (#267) * perf(decode): K-cascade through decode_and_execute_sequences (Tier 9) * perf(decode): wrap BMI2/AVX2/VBMI2 dispatch arms with target_feature trampolines
1. usize overflow on 32-bit targets in execute_one_sequence and
execute_one_sequence_pipelined: lit_cur + seq.ll as usize
could wrap on adversarial input and the subsequent
get_unchecked(lit_cur..high) would slice OOB (UB). Both
sites now use checked_add(...).filter(|&h| h <= lit_len)
so wrap-on-overflow surfaces as
ExecuteSequencesError::NotEnoughBytesForSequence instead of
undefined behaviour.
2. Tail-literals push in decode_and_execute_sequences_impl still
went through the infallible buffer.push(rest). Routes the
tail push through buffer.try_push so a malformed block whose
unclaimed tail-literal length overshoots the fixed-capacity
backend surfaces as ExecuteSequencesError::OutputBufferOverflow
instead of panicking via UserSliceBackend::extend's release-mode
assert.
3. Stale doc references after the recent identifier renames:
- errors.rs DecodeBufferError::BackendOverflow rewritten to
describe the variant as kept for binary compatibility and
point at the richer OutputBufferOverflow for new code.
- user_slice_buf.rs UserSliceBackend::extend comment block
rewritten to reflect the new fallible dispatch
(try_extend / try_push / try_reserve cover the safe public
APIs).
- frame_decoder.rs test renamed
decode_all_matches_decode_all_on_single_segment_frame to
decode_all_legacy_drain_matches_direct_path_on_single_segment_frame
with a comment that describes the two distinct internal
paths the test exercises.
… docs 1. \`execute_sequences_fields\` (the RLE-mode sequence executor used by the fused decoder's fallback when sequence-section decode hits the RLE branch) still wrote literals through the infallible \`buffer.push(...)\`. On a malformed RLE-driven sequence stream whose literal claims overshoot the user's fixed-capacity slice that would panic via the per-call \`assert!\` inside \`UserSliceBackend::extend\`. Route both the per-sequence literal push and the tail-literals push through \`buffer.try_push\` so the overshoot surfaces as \`ExecuteSequencesError::OutputBufferOverflow\` and bubbles up as \`FrameDecoderError\` instead. 2. \`UserSliceBackend\` module docs rewritten: the previous "DoS surface on malformed Compressed blocks" section claimed the direct path could panic and pointed at a tracked follow-up. With the safety refactor landed in this PR (try_push / try_reserve / exec_sequence_inline returning Result), the section now describes the actual contract: safe public APIs route through the fallible write surface and surface overshoot as ExecuteSequencesError::OutputBufferOverflow / DecodeBufferError::OutputBufferOverflow; the infallible entry points remain as defense-in-depth for callers that have already validated capacity at a higher layer. 3. \`BackendOverflow\` doc updated: it previously claimed the decoder converts \`BackendOverflow\` into \`FrameContentSizeMismatch\` at the decode_all boundary. The actual mapping is now \`OutputBufferOverflow\` (in ExecuteSequencesError or DecodeBufferError) wrapped into FrameDecoderError::FailedToReadBlockBody.
…s_fields contract
…length 1. \`UserSliceBackend::exec_sequence_inline\` reported \`requested\` as \`cap_required - tail\`, which includes the +15-byte wildcopy overshoot. Other overflow producers (\`BackendOverflow\` from \`try_*\` paths) report the logical write length. Align: split the capacity check into two stages — \`total = lit_length + match_length\` is the logical value reported in \`OutputBufferOverflow.requested\`; the overshoot stays in the internal \`cap_required\` used only for the bounds check. 2. \`wildcopy_overlap_8byte_stride\` doc comment said each iter reads \`src + off + 8\`; actual access pattern is \`src + off\` (8 bytes). Corrected to match the implementation.
Summary
Verbatim port of the C zstd
ZSTD_execSequencebody (zstd_decompress_block.c:1008-1105) into a new inline sequence-execution path onUserSliceBackend, plus follow-on refactors that:FrameDecoderdecode entry points onto one internal code path (one direct fast path + one legacy fallback by per-frame eligibility).decode_all,decode_all_to_vec) surfaceFrameDecoderErroron malformed input instead of panicking.WILDCOPY_OVERLENGTHslack indecode_all_to_vecso the direct path is reachable without callers having to know about the slack contract.The inline-execution body bypasses the
DecodeBuffer::push+repeatabstraction chain (5+ layer dispatch throughextend_from_within_unchecked→copy_bytes_overshooting→single_op_copy_16etc.) in favour of a straight-line shape:Literal copy — unconditional
_mm_storeu_si12816-byte SSE2 store followed by a 16-byte SIMD wildcopy loop for thelitLength > 16tail. For the typical 1..=16 byte litLength the second copy never fires.Match copy fast path (offset ≥ 16) — single wildcopy with
no_overlapsemantics, 16-byte SIMD loop.Match copy short-offset (offset 1..=15) —
overlap_copy8spreading (dec32table / dec64table for offset < 8, plaincopy8for 8..15) followed bywildcopy(overlap_src_before_dst)8-byte stride for the remainingmatchLength - 8bytes.Dispatched at compile time via a const
SUPPORTS_INLINE_SEQUENCE_EXEC: boolon theBufferBackendtrait —UserSliceBackendreturnstrueonx86_64(SSE2 is the architectural baseline there), every other backend (FlatBuf, RingBuffer) and target falls through to the existing legacy chain. 32-bitx86(i586/i686) is excluded because the SSE2 intrinsics are emitted without a#[target_feature]gate and those targets don't always carry SSE2 in their baseline.Safety: fallible writes on every direct-path entry
After the routing landed, two paths still went through release-mode
assert!onUserSliceBackendoverflow:exec_sequence_inline) hadassert!on per-sequence capacity.buffer.push+repeat_inner→extend_from_within_unchecked) hadassert!too.Both are now fallible:
BufferBackend::exec_sequence_inlinereturnsResult<(), ExecuteSequencesError>. The newOutputBufferOverflowvariant carries the (tail,requested,capacity) triple.BufferBackend::try_reserve(n)method. Default impl (growable backends) delegates toreserve()infallibly.UserSliceBackendoverrides with a lineartail + n <= capcheck.DecodeBuffer::repeat_innercallstry_reserveand propagatesBackendOverflow→DecodeBufferError::OutputBufferOverflow.buffer.push(lits)callsites inexecute_one_sequence_pipelinedmove tobuffer.try_push(lits)?.Net effect: a malformed-frame corrupted sequence whose match or literal length overshoots the user's slice surfaces as
FrameDecoderErrorinstead of a panic.Path unification
With safety closed,
decode_allanddecode_all_to_vecnow route through the direct (UserSliceBackend) path automatically when the per-frame eligibility gate holds (FCS > 0, no active dict, remaining output hasWILDCOPY_OVERLENGTHslack). Ineligible frames keep falling through to the per-block decode + read drain loop.decode_all_to_vecreserves theWILDCOPY_OVERLENGTHslack internally so callers don't need to know about it —output.len()on return is the actual decompressed size, NOT the inflated capacity. The slack costs at most 32 bytes of one-timeVec::reserve.The previously-public
decode_to_slice_trusted(and its privatedecode_single_frame_legacy_drainhelper) are dropped —decode_all's internal dispatch produces identical output through the samerun_direct_decodehelper. Thepure_rust_directdecompress bench arm is dropped too —pure_rustnow allocates withWILDCOPY_OVERLENGTHslack and exercises the same direct path, making the two arms identical.Bench (i9-9900K,
886a35devs pre-PRmain)pure_rust(decode_all) hits the direct path automatically across all eligible fixtures:main(pre-PR)886a35dedecompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rustdecompress/level_-1_fast/high-entropy-1m/c_stream/matrix/pure_rustdecompress/level_-1_fast/low-entropy-1m/c_stream/matrix/pure_rustThe 24-40 % wall-clock reduction reaches all
decode_all/decode_all_to_veccallers automatically without their having to know about WILDCOPY slack.Why this layer wins
Earlier attempts (#256 libc fallback replacement, #262 8-byte stride, this branch's earlier chunk-cap=16) all targeted micro-optimisations INSIDE
copy_bytes_overshooting's dispatch tree and either lost to Intel ERMS or to the existing 16-byte SSE2 path oursingle_op_copy_16already emits. The bottleneck is not the per-byte SIMD width — it's the dispatch tree itself.This port collapses the entire literal+match-copy work into one inlined unsafe block per sequence with NO trait dispatch, NO
copy_bytes_overshootingcompare chain, NOsingle_op_copy_16indirection. The SIMD store is the same 16-byte SSE2_mm_storeu_si128either way; the saving is the surrounding compare/branch chain folded out.Testing
cargo nextest run -p structured-zstd --release— 636/637 pass (1 pre-existing release-mode debug_assert test, unrelated).cargo clippy --all-targets -- -D warningsclean.decode_all_*family inframe_decoder.rs) cover the literal-copy + match-copy contract — the inline executor produces byte-identical output to the legacypush + repeatchain.copy16,wildcopy_no_overlap,wildcopy_overlap_8byte_stride,overlap_copy8) covered by unit tests inexec_sequence_inline.rs.UserSliceBackend::exec_sequence_inlinebody covered by direct unit tests inuser_slice_buf.rs(short-literal + long-offset, long-literal wildcopy tail, short-offset overlap-copy).Part of #247.
Summary by CodeRabbit