perf(decode): inline sequence executor for direct path + auto-route decode_all (z000033 −24%, high-entropy-1m parity) by polaz · Pull Request #263 · structured-world/structured-zstd

polaz · 2026-05-25T19:01:20Z

Summary

Verbatim port of the C zstd ZSTD_execSequence body (zstd_decompress_block.c:1008-1105) into a new inline sequence-execution path on UserSliceBackend, plus follow-on refactors that:

Unify all FrameDecoder decode entry points onto one internal code path (one direct fast path + one legacy fallback by per-frame eligibility).
Make the direct path's write surface fully fallible — safe public decode APIs (decode_all, decode_all_to_vec) surface FrameDecoderError on malformed input instead of panicking.
Auto-reserve WILDCOPY_OVERLENGTH slack in decode_all_to_vec so the direct path is reachable without callers having to know about the slack contract.

The inline-execution body bypasses the DecodeBuffer::push + repeat abstraction chain (5+ layer dispatch through extend_from_within_unchecked → copy_bytes_overshooting → single_op_copy_16 etc.) in favour of a straight-line shape:

Literal copy — unconditional _mm_storeu_si128 16-byte SSE2 store followed by a 16-byte SIMD wildcopy loop for the litLength > 16 tail. For the typical 1..=16 byte litLength the second copy never fires.
Match copy fast path (offset ≥ 16) — single wildcopy with no_overlap semantics, 16-byte SIMD loop.
Match copy short-offset (offset 1..=15) — overlap_copy8 spreading (dec32table / dec64table for offset < 8, plain copy8 for 8..15) followed by wildcopy(overlap_src_before_dst) 8-byte stride for the remaining matchLength - 8 bytes.
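The overlap handling in the short-offset case exists because a match may read bytes that the same match has just written. A minimal scalar sketch of those copy semantics follows; it is a safe reference for what the copy must produce, not the SIMD code in this PR, and `copy_match_reference` is a hypothetical name.

```rust
/// Scalar reference for LZ77-style match copy: append `match_len` bytes,
/// each read from `offset` bytes behind the current end of `out`.
/// Hypothetical helper for illustration, not part of structured-zstd.
fn copy_match_reference(
    out: &mut Vec<u8>,
    offset: usize,
    match_len: usize,
) -> Result<(), &'static str> {
    if offset == 0 || offset > out.len() {
        return Err("offset out of range");
    }
    out.reserve(match_len);
    for _ in 0..match_len {
        // Re-read the length on every iteration so that bytes written
        // earlier in this match become valid sources (the overlap case).
        let byte = out[out.len() - offset];
        out.push(byte);
    }
    Ok(())
}
```

With `offset < match_len` the source pattern repeats, which is why a plain 16-byte block copy is only valid once the offset is at least as wide as the store.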

Dispatch happens at compile time via a const SUPPORTS_INLINE_SEQUENCE_EXEC: bool on the BufferBackend trait. UserSliceBackend sets it to true on x86_64 (SSE2 is the architectural baseline there); every other backend (FlatBuf, RingBuffer) and every other target falls through to the existing legacy chain. 32-bit x86 (i586/i686) is excluded because the SSE2 intrinsics are emitted without a #[target_feature] gate and those targets don't always carry SSE2 in their baseline.
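The compile-time dispatch can be sketched with an associated const that defaults to `false`. The trait and type names below are simplified, hypothetical stand-ins for `BufferBackend` and its implementors; only the const name is taken from the description.

```rust
/// Hypothetical stand-in for the `BufferBackend` trait described above.
trait Backend {
    /// Capability flag resolved at compile time; defaults to the legacy path.
    const SUPPORTS_INLINE_SEQUENCE_EXEC: bool = false;
}

struct GrowableBackend;
struct SliceBackend;

// Growable backends keep the default and stay on the legacy chain.
impl Backend for GrowableBackend {}

// The slice backend opts in only where SSE2 is part of the baseline.
impl Backend for SliceBackend {
    const SUPPORTS_INLINE_SEQUENCE_EXEC: bool = cfg!(target_arch = "x86_64");
}

/// The branch is on a constant, so each monomorphised copy of this
/// function keeps only one arm after optimisation.
fn route<B: Backend>() -> &'static str {
    if B::SUPPORTS_INLINE_SEQUENCE_EXEC {
        "inline"
    } else {
        "legacy"
    }
}
```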

Safety: fallible writes on every direct-path entry

After the routing landed, two paths still went through release-mode assert! on UserSliceBackend overflow:

The donor inline body itself (exec_sequence_inline) had assert! on per-sequence capacity.
The fallback path used by ineligible sequences (buffer.push + repeat_inner → extend_from_within_unchecked) had assert! too.

Both are now fallible:

BufferBackend::exec_sequence_inline returns Result<(), ExecuteSequencesError>. The new OutputBufferOverflow variant carries the (tail, requested, capacity) triple.
A new fallible BufferBackend::try_reserve(n) method. The default impl (used by growable backends) delegates to reserve(), which cannot fail. UserSliceBackend overrides it with a bounds check, tail + n <= cap.
DecodeBuffer::repeat_inner calls try_reserve and propagates BackendOverflow → DecodeBufferError::OutputBufferOverflow.
The fallback buffer.push(lits) callsites in execute_one_sequence_pipelined move to buffer.try_push(lits)?.
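A minimal sketch of the fallible write surface, assuming a much smaller trait than the real `BufferBackend`: the default `try_reserve` delegates to `reserve`, and a fixed-capacity backend overrides it with a bounds check. `Overflow`, `Growable` and `FixedSlice` are hypothetical names; only the `(tail, requested, capacity)` triple comes from the description.

```rust
#[derive(Debug, PartialEq)]
struct Overflow {
    tail: usize,
    requested: usize,
    capacity: usize,
}

trait Backend {
    fn reserve(&mut self, n: usize);

    /// Default for growable backends: reserving cannot fail.
    fn try_reserve(&mut self, n: usize) -> Result<(), Overflow> {
        self.reserve(n);
        Ok(())
    }
}

struct Growable(Vec<u8>);

impl Backend for Growable {
    fn reserve(&mut self, n: usize) {
        self.0.reserve(n);
    }
}

struct FixedSlice<'a> {
    buf: &'a mut [u8],
    tail: usize,
}

impl Backend for FixedSlice<'_> {
    fn reserve(&mut self, n: usize) {
        // Infallible entry point kept for callers that validated capacity.
        assert!(self.try_reserve(n).is_ok(), "fixed slice overflow");
    }

    fn try_reserve(&mut self, n: usize) -> Result<(), Overflow> {
        let capacity = self.buf.len();
        match self.tail.checked_add(n) {
            Some(end) if end <= capacity => Ok(()),
            _ => Err(Overflow { tail: self.tail, requested: n, capacity }),
        }
    }
}

impl FixedSlice<'_> {
    /// Fallible push: check first, then write, so a failure leaves `tail` untouched.
    fn try_push(&mut self, data: &[u8]) -> Result<(), Overflow> {
        self.try_reserve(data.len())?;
        self.buf[self.tail..self.tail + data.len()].copy_from_slice(data);
        self.tail += data.len();
        Ok(())
    }
}
```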

Net effect: in a malformed frame, a corrupted sequence whose match or literal length overshoots the user's slice now surfaces as FrameDecoderError instead of a panic.

Path unification

With safety closed, decode_all and decode_all_to_vec now route through the direct (UserSliceBackend) path automatically when the per-frame eligibility gate holds (FCS > 0, no active dict, remaining output has WILDCOPY_OVERLENGTH slack). Ineligible frames keep falling through to the per-block decode + read drain loop.
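The per-frame eligibility gate can be pictured roughly as below. This is a sketch under assumptions: the function name and signature are hypothetical, `WILDCOPY_OVERLENGTH` is taken as 32 (consistent with the "at most 32 bytes" figure in this description), and "has slack" is interpreted as the remaining output holding the frame content size plus the slack.

```rust
/// Assumed value; the real constant lives in the crate.
const WILDCOPY_OVERLENGTH: usize = 32;

/// Hypothetical gate: true when a frame may take the direct path.
fn direct_path_eligible(
    frame_content_size: Option<u64>,
    dict_active: bool,
    remaining_output: usize,
) -> bool {
    // The frame must declare a non-zero content size.
    let fcs = match frame_content_size {
        Some(n) if n > 0 => n,
        _ => return false,
    };
    // Frames decoded with an active dictionary stay on the legacy path.
    if dict_active {
        return false;
    }
    // The output must hold the content plus the wildcopy slack,
    // with every addition and conversion checked.
    usize::try_from(fcs)
        .ok()
        .and_then(|n| n.checked_add(WILDCOPY_OVERLENGTH))
        .map_or(false, |needed| needed <= remaining_output)
}
```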

decode_all_to_vec reserves the WILDCOPY_OVERLENGTH slack internally so callers don't need to know about it — output.len() on return is the actual decompressed size, NOT the inflated capacity. The slack costs at most 32 bytes of one-time Vec::reserve.
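The difference between reserved capacity and reported length can be shown with a plain `Vec`. This is an illustrative sketch only: `append_with_slack` is a hypothetical helper, the slack value of 32 is assumed, and the real decoder writes through the direct path instead of `extend_from_slice`.

```rust
/// Assumed value; the real constant lives in the crate.
const WILDCOPY_OVERLENGTH: usize = 32;

/// Hypothetical helper: reserve room for the payload plus slack, then
/// append only the payload. The slack is capacity, never length.
fn append_with_slack(output: &mut Vec<u8>, decoded: &[u8]) {
    output.reserve(decoded.len() + WILDCOPY_OVERLENGTH);
    output.extend_from_slice(decoded);
}
```

After the call, `output.len()` grows by exactly `decoded.len()`, while `output.capacity()` is at least the new length plus the slack.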

The previously-public decode_to_slice_trusted (and its private decode_single_frame_legacy_drain helper) are dropped — decode_all's internal dispatch produces identical output through the same run_direct_decode helper. The pure_rust_direct decompress bench arm is dropped too — pure_rust now allocates with WILDCOPY_OVERLENGTH slack and exercises the same direct path, making the two arms identical.

Bench (i9-9900K, `886a35de` vs pre-PR `main`)

pure_rust (decode_all) hits the direct path automatically across all eligible fixtures:

fixture	`main` (pre-PR)	`886a35de`	change	× donor
`decompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust`	2.034 ms	1.553 ms	−23.6%	2.40× (was 3.14×)
`decompress/level_-1_fast/high-entropy-1m/c_stream/matrix/pure_rust`	147.0 µs	101.4 µs	−31.0%	1.03× (parity)
`decompress/level_-1_fast/low-entropy-1m/c_stream/matrix/pure_rust`	355.3 µs	213.7 µs	−39.9%	1.45× (was 2.42×)

The 24-40 % wall-clock reduction reaches all decode_all / decode_all_to_vec callers automatically without their having to know about WILDCOPY slack.

Why this layer wins

Earlier attempts (#256 libc fallback replacement, #262 8-byte stride, this branch's earlier chunk-cap=16) all targeted micro-optimisations INSIDE copy_bytes_overshooting's dispatch tree and lost either to Intel ERMS or to the existing 16-byte SSE2 path that our single_op_copy_16 already emits. The bottleneck is not the per-byte SIMD width; it is the dispatch tree itself.

This port collapses the entire literal+match-copy work into one inlined unsafe block per sequence with NO trait dispatch, NO copy_bytes_overshooting compare chain, NO single_op_copy_16 indirection. The SIMD store is the same 16-byte SSE2 _mm_storeu_si128 either way; the saving is the surrounding compare/branch chain folded out.

Testing

cargo nextest run -p structured-zstd --release — 636/637 pass (1 pre-existing release-mode debug_assert test, unrelated).
cargo clippy --all-targets -- -D warnings clean.
Direct-decode regression tests (the decode_all_* family in frame_decoder.rs) cover the literal-copy + match-copy contract — the inline executor produces byte-identical output to the legacy push + repeat chain.
SIMD helpers (copy16, wildcopy_no_overlap, wildcopy_overlap_8byte_stride, overlap_copy8) covered by unit tests in exec_sequence_inline.rs.
UserSliceBackend::exec_sequence_inline body covered by direct unit tests in user_slice_buf.rs (short-literal + long-offset, long-literal wildcopy tail, short-offset overlap-copy).

Part of #247.

Summary by CodeRabbit

Performance Improvements
- Faster decompression on x86_64 via a new inline sequence execution path for literal + match copying.
Stability / Correctness
- Direct-path writes are now fully fallible — malformed frames surface structured errors instead of panicking.
- Rolling frame-hash state is now snapshotted and restored during speculative decoding, preventing failed paths from affecting final frame digests.
Compatibility
- Non-x86_64 platforms continue using the existing decode path; optimized path is enabled only where supported.

coderabbitai · 2026-05-25T19:01:27Z

📥 Commits

Reviewing files that changed from the base of the PR and between 10cc5d8 and a217299.

📒 Files selected for processing (12)

zstd/src/decoding/block_decoder.rs
zstd/src/decoding/buffer_backend.rs
zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/errors.rs
zstd/src/decoding/exec_sequence_inline.rs
zstd/src/decoding/frame_decoder.rs
zstd/src/decoding/mod.rs
zstd/src/decoding/scratch.rs
zstd/src/decoding/sequence_execution.rs
zstd/src/decoding/sequence_section_decoder.rs
zstd/src/decoding/user_slice_buf.rs
zstd/src/lib.rs

Walkthrough

Adds an x86_64 "inline donor" decode fast path: a BufferBackend capability and unsafe hook, SIMD overlap-copy primitives, UserSliceBackend implementation, and executor wiring with safety guards and fallback to legacy repeat.

Changes

Inline donor execution path

Layer / File(s)	Summary
Backend trait design and counter helpers `zstd/src/decoding/buffer_backend.rs`, `zstd/src/decoding/decode_buffer.rs`	`BufferBackend` adds `SUPPORTS_INLINE_DONOR_EXEC` const and unsafe `donor_exec_one_sequence` hook (default `unreachable!()`). `DecodeBuffer` checkpoint docs clarified (hashing not snapshotted) and adds `buffer_mut()` and `advance_output_counter()` helpers.
x86_64 SIMD copy and overlap-copy helpers `zstd/src/decoding/exec_sequence_donor.rs`, `zstd/src/decoding/mod.rs`	New x86_64-gated `exec_sequence_donor` module provides `copy16`, `wildcopy_no_overlap`, `wildcopy_overlap_8byte_stride`, and `overlap_copy8` (donor-style offset-dependent spread logic) and unit tests; module is declared from `decoding/mod.rs`.
UserSliceBackend inline donor implementation `zstd/src/decoding/user_slice_buf.rs`	`UserSliceBackend` opts into `SUPPORTS_INLINE_DONOR_EXEC` on x86_64 and implements `donor_exec_one_sequence` using SIMD primitives, handling literal and match copying, overlap cases, capacity assertion, and tail advancement. Includes x86_64-only tests.
Pipelined executor fast path and kernel dispatch `zstd/src/decoding/sequence_section_decoder.rs`	`decode_and_execute_sequences` dispatches per CPU kernel and threads `K` into `BitReaderReversed`. Pipelined executor snapshots literal cursor and, when safety checks and `B::SUPPORTS_INLINE_DONOR_EXEC` permit, calls `buffer_mut().donor_exec_one_sequence(...)` and advances counters; otherwise falls back to `push(lits)` + `repeat_lookahead_prefetched`.
Frame routing & direct-decode adjustments `zstd/src/decoding/frame_decoder.rs`	Per-frame eligibility gate for direct-decode, refactored internal `run_direct_decode`, tighter overflow/error accounting, post-decode produced-size checks, and test updates to use `decode_all`.
FSE decoder kernel generics & tests `zstd/src/fse/fse_decoder.rs`, `zstd/src/fse/mod.rs`	`FSEDecoder` methods become generic over `CpuKernel` and accept `BitReaderReversed<'_, K>`; test helper constructs `BitReaderReversed::<ScalarKernel>::new(...)`.
Benches and error reporting `zstd/benches/compare_ffi.rs`, `zstd/src/decoding/errors.rs`	Benchmark target buffer resized with `WILDCOPY_OVERLENGTH` slack; `ExecuteSequencesError` adds `DonorPathBufferOverflow { tail, requested, capacity }` and `Display` message.

Sequence Diagram(s)

sequenceDiagram
  participant Executor as execute_one_sequence_pipelined
  participant DecodeBuffer
  participant BufferBackend
  participant ExecPrims as exec_sequence_donor::x86
  Executor->>DecodeBuffer: snapshot literal cursor and compute high/limits
  Executor->>DecodeBuffer: DecodeBuffer::buffer_mut()
  DecodeBuffer->>BufferBackend: donor_exec_one_sequence(lit_ptr, lit_len, offset, match_len)
  BufferBackend->>ExecPrims: call copy16/wildcopy/overlap_copy8 helpers
  ExecPrims->>BufferBackend: perform writes into backend slice
  BufferBackend->>DecodeBuffer: (caller) advance_output_counter(n)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

perf(decode): collapse FrameDecoder → block_decoder → ... 9-layer chain to donor's 3-layer ZSTD_decompressBlock_internal shape #265: Shares implementation goals for an inline donor decode path and user-slice backend donor exec hook.
perf: top-level CpuKernel dispatch at FrameDecoder/FrameCompressor entry + FSE Entry layout #247: Related CPU-kernel dispatch and FSE/generic kernel wiring touched by this change.

Possibly related PRs

structured-world/structured-zstd#261: Changes the WILDCOPY_OVERLENGTH that this PR relies on for donor-path capacity checks.
structured-world/structured-zstd#227: Introduced executor plumbing reused by the donor fast-path fallback chain.
structured-world/structured-zstd#254: Added CpuKernel/BitReader genericization that this PR extends.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title 'perf(decode): inline sequence executor for direct path + auto-route decode_all (z000033 −24%, high-entropy-1m parity)' directly describes the main change: a performance optimization adding an inline donor sequence executor for the direct decode path with automatic routing in decode_all.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

codecov · 2026-05-25T19:11:03Z

Codecov Report

❌ Patch coverage is 70.96774% with 135 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/decoding/sequence_section_decoder.rs	46.05%	82 Missing ⚠️
zstd/src/decoding/errors.rs	0.00%	17 Missing ⚠️
zstd/src/decoding/user_slice_buf.rs	86.23%	15 Missing ⚠️
zstd/src/decoding/decode_buffer.rs	7.69%	12 Missing ⚠️
zstd/src/decoding/buffer_backend.rs	33.33%	8 Missing ⚠️
zstd/src/decoding/frame_decoder.rs	98.14%	1 Missing ⚠️

Copilot

Pull request overview

This PR ports donor zstd’s ZSTD_execSequence shape into the Rust decoder’s direct-write hot path (DecodeBuffer<UserSliceBackend>), aiming to reduce per-sequence dispatch overhead by inlining literal+match copying as one straight-line routine on x86/x86_64.

Changes:

Add an opt-in BufferBackend::donor_exec_one_sequence fast path (compile-time selected via SUPPORTS_INLINE_DONOR_EXEC) and dispatch to it from the pipelined sequence executor.
Introduce an x86/x86_64 SSE2 helper module (exec_sequence_donor) implementing donor-style copy16 / wildcopy / overlapCopy8 primitives.
Add DecodeBuffer helpers to expose the backend mutably and to advance output/hash bookkeeping when bypassing push/repeat.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

File	Description
`zstd/src/decoding/user_slice_buf.rs`	Implements the donor-style execSequence routine for `UserSliceBackend` and opts into the compile-time fast path.
`zstd/src/decoding/sequence_section_decoder.rs`	Adds the const-dispatched donor-exec branch inside `execute_one_sequence_pipelined`.
`zstd/src/decoding/mod.rs`	Wires in the new `exec_sequence_donor` module.
`zstd/src/decoding/exec_sequence_donor.rs`	New x86/x86_64 donor helper implementations for 16-byte copy, wildcopy, and overlapCopy8.
`zstd/src/decoding/decode_buffer.rs`	Adds backend access + bookkeeping helper (`advance_output_counter`) for donor-exec writes.
`zstd/src/decoding/buffer_backend.rs`	Extends the backend trait with the opt-in const and the unsafe donor-exec hook.

… guard, explicit overshoot assert

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

…opy8 net offset, doc accuracy, hash assert

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/buffer_backend.rs`:
- Around line 78-90: The unsafe contract for the literal-copy hook is missing
the required read-side slack: callers must guarantee enough literal bytes for
the unconditional copy16/tail wildcopy that the donor path performs. Update the
safety doc for the method in buffer_backend.rs (the function implementing the
literal-copy hook used by donor_exec_one_sequence and
execute_one_sequence_pipelined) to add a bullet stating callers must ensure the
literal buffer has at least the extra read slack (e.g., high + 15 or
WILDCOPY_OVERLENGTH) so the unconditional copy16/wildcopy load cannot read
out-of-bounds; reference donor_exec_one_sequence,
execute_one_sequence_pipelined, and copy16 in the comment so future callers know
this requirement.

In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 178-214: The checkpoint/restore logic must also save and restore
the frame hasher to avoid committing rolled-back bytes: update
DecodeBufferCheckpoint to include a snapshot of the hasher state (the same
hasher used by advance_output_counter / self.hash) and modify
try_restore_checkpoint() to restore that saved hasher state when rewinding tail
and total_output_counter; alternatively, delay mutating self.hash in
advance_output_counter for donor-path writes until the caller confirms the block
is committed—choose one approach and apply it consistently by updating the
DecodeBufferCheckpoint type, its creation sites, and try_restore_checkpoint()
(and any code that constructs/restores checkpoints) so hash state remains
consistent with tail/total_output_counter.
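The first option in the comment above (snapshotting the hasher alongside the tail and output counter) can be sketched as follows. All names are hypothetical, and an FNV-1a step stands in for the real frame hasher purely so the example is self-contained.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Checkpoint {
    tail: usize,
    total_output: u64,
    hash_state: u64,
}

struct Buffer {
    data: Vec<u8>,
    total_output: u64,
    hash_state: u64,
}

impl Buffer {
    fn new() -> Self {
        // FNV-1a 64-bit offset basis; a stand-in, not the real frame hash.
        Buffer { data: Vec::new(), total_output: 0, hash_state: 0xcbf2_9ce4_8422_2325 }
    }

    fn push(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.hash_state = (self.hash_state ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3);
        }
        self.data.extend_from_slice(bytes);
        self.total_output += bytes.len() as u64;
    }

    fn checkpoint(&self) -> Checkpoint {
        Checkpoint {
            tail: self.data.len(),
            total_output: self.total_output,
            hash_state: self.hash_state,
        }
    }

    /// Rewind output, counter and hash together so rolled-back bytes
    /// never reach the final digest.
    fn restore(&mut self, cp: Checkpoint) {
        self.data.truncate(cp.tail);
        self.total_output = cp.total_output;
        self.hash_state = cp.hash_state;
    }
}
```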

📥 Commits

Reviewing files that changed from the base of the PR and between da4aaa8 and 375322b.

📒 Files selected for processing (6)

zstd/src/decoding/buffer_backend.rs
zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/exec_sequence_donor.rs
zstd/src/decoding/mod.rs
zstd/src/decoding/sequence_section_decoder.rs
zstd/src/decoding/user_slice_buf.rs

…slack contract

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

…actual hash sites

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

zstd/src/decoding/buffer_backend.rs (1)

86-100: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Line 91 still under-specifies the empty-literals case.

UserSliceBackend::donor_exec_one_sequence does an unconditional copy16 before any lit_length check, so lits.is_empty() still needs a 16-byte readable window. The current lits.len() + 15 wording only guarantees 15 bytes in that case. Please either document this as max(lits.len() + 15, 16) or make lit_length == 0 an explicit “must fall back” precondition.

Suggested doc fix

-    ///   has ≥ `lits.len() + 15` initialised bytes addressable
-    ///   from `lits.as_ptr()`. The current dispatch site
+    ///   has at least `core::cmp::max(lits.len() + 15, 16)`
+    ///   initialised bytes addressable from `lits.as_ptr()`.
+    ///   In particular, `lits.is_empty()` must not use this path
+    ///   unless the caller can still provide a 16-byte readable
+    ///   window. The current dispatch site
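The arithmetic behind the suggested wording can be checked with a tiny helper. `required_readable_bytes` is a hypothetical name; it encodes only the `max(lits.len() + 15, 16)` bound proposed above.

```rust
/// Hypothetical helper: minimum number of initialised bytes that must be
/// readable from the literal pointer, per the suggested doc wording.
fn required_readable_bytes(lit_len: usize) -> usize {
    // `lit_len + 15` covers the wildcopy tail; the floor of 16 covers the
    // unconditional 16-byte load when there are zero literals.
    core::cmp::max(lit_len.saturating_add(15), 16)
}
```

For `lit_len == 0` the old wording would have allowed 15 bytes; with the floor the requirement is 16, and from `lit_len == 1` onward both formulas agree.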

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/buffer_backend.rs` around lines 86 - 100, The doc comment
under the read-side slack description under-specifies the empty-literals case:
update the contract to require a 16-byte readable window when lits can be empty
by either (a) changing the wording from "lits.len() + 15" to "max(lits.len() +
15, 16)" so UserSliceBackend::donor_exec_one_sequence's unconditional copy16 has
a safe read, or (b) add an explicit precondition that callers must treat
lit_length == 0 as a guaranteed fallback and never call donor_exec_one_sequence
when lits.is_empty(); reference the unconditional copy16 call and the
donor_exec_one_sequence/sequence_section_decoder::execute_one_sequence_pipelined
dispatch so reviewers can locate and verify the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 541-543: The donor_path_safe boolean currently requires
high.checked_add(15).is_some_and(|b| b <= lit_len) for all sequences, which
incorrectly blocks the inline donor path for short literal tails; change the
logic so the high+15 <= lit_len check is only enforced when the remaining
literal length after lit_cur_before is > 16 (i.e., when exec_sequence_donor
actually needs the literal wildcopy tail), otherwise skip that high-bound check;
update the expression that computes donor_path_safe (which uses
B::SUPPORTS_INLINE_DONOR_EXEC, lit_cur_before, lit_len, and high) to reflect
this conditional requirement so the inline donor path remains allowed for the
common 1..=16 literal case while preserving the copy16 guard behavior.
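One way to read the requested gate is sketched below. This is an interpretation, not the code in the PR: the function name is hypothetical, and it assumes the copy16 guard is "16 readable bytes from the literal cursor" while the `high + 15` bound applies only when the literal length exceeds 16.

```rust
/// Hypothetical literal-read gate. `lit_cur` is the cursor before the
/// sequence, `ll` the sequence's literal length, `lit_buf_len` the number
/// of readable bytes in the literal buffer.
fn literal_reads_in_bounds(lit_cur: usize, ll: usize, lit_buf_len: usize) -> bool {
    // The literals themselves must exist, with overflow checked.
    let high = match lit_cur.checked_add(ll) {
        Some(h) if h <= lit_buf_len => h,
        _ => return false,
    };
    // The unconditional 16-byte load always reads [lit_cur, lit_cur + 16).
    let copy16_ok = lit_cur
        .checked_add(16)
        .map_or(false, |end| end <= lit_buf_len);
    // The wildcopy tail only runs for ll > 16, so only then require slack.
    let tail_ok = ll <= 16
        || high.checked_add(15).map_or(false, |bound| bound <= lit_buf_len);
    copy16_ok && tail_ok
}
```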

📥 Commits

Reviewing files that changed from the base of the PR and between 375322b and 3bb2bf8.

📒 Files selected for processing (3)

zstd/src/decoding/buffer_backend.rs
zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/sequence_section_decoder.rs

…opy not called)

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

… tighten donor exec docs

…−8.19%, low-entropy-1m −7.05%) (#267)
* perf(decode): K-cascade through decode_and_execute_sequences (Tier 9)
* perf(decode): wrap BMI2/AVX2/VBMI2 dispatch arms with target_feature trampolines

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

1. usize overflow on 32-bit targets in execute_one_sequence and execute_one_sequence_pipelined: lit_cur + seq.ll as usize could wrap on adversarial input and the subsequent get_unchecked(lit_cur..high) would slice OOB (UB). Both sites now use checked_add(...).filter(|&h| h <= lit_len) so wrap-on-overflow surfaces as ExecuteSequencesError::NotEnoughBytesForSequence instead of undefined behaviour.

2. Tail-literals push in decode_and_execute_sequences_impl still went through the infallible buffer.push(rest). Routes the tail push through buffer.try_push so a malformed block whose unclaimed tail-literal length overshoots the fixed-capacity backend surfaces as ExecuteSequencesError::OutputBufferOverflow instead of panicking via UserSliceBackend::extend's release-mode assert.

3. Stale doc references after the recent identifier renames:
- errors.rs DecodeBufferError::BackendOverflow rewritten to describe the variant as kept for binary compatibility and point at the richer OutputBufferOverflow for new code.
- user_slice_buf.rs UserSliceBackend::extend comment block rewritten to reflect the new fallible dispatch (try_extend / try_push / try_reserve cover the safe public APIs).
- frame_decoder.rs test renamed decode_all_matches_decode_all_on_single_segment_frame to decode_all_legacy_drain_matches_direct_path_on_single_segment_frame with a comment that describes the two distinct internal paths the test exercises.
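The `checked_add(...).filter(...)` pattern from item 1 can be isolated in a few lines. `literal_end` is a hypothetical helper name; the real call sites map the `None` case to `ExecuteSequencesError::NotEnoughBytesForSequence`.

```rust
/// Hypothetical helper: end index of a sequence's literals, or `None` if
/// the addition wraps or the range runs past the literal buffer.
fn literal_end(lit_cur: usize, ll: u32, lit_len: usize) -> Option<usize> {
    lit_cur
        // `None` on wrap-around instead of a silently small index.
        .checked_add(ll as usize)
        // Reject ends that fall outside the literal buffer.
        .filter(|&high| high <= lit_len)
}
```

Only a `Some(high)` result makes `lit_cur..high` a valid range, so any unchecked slicing that follows stays in bounds.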

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

… docs

1. `execute_sequences_fields` (the RLE-mode sequence executor used by the fused decoder's fallback when sequence-section decode hits the RLE branch) still wrote literals through the infallible `buffer.push(...)`. On a malformed RLE-driven sequence stream whose literal claims overshoot the user's fixed-capacity slice that would panic via the per-call `assert!` inside `UserSliceBackend::extend`. Route both the per-sequence literal push and the tail-literals push through `buffer.try_push` so the overshoot surfaces as `ExecuteSequencesError::OutputBufferOverflow` and bubbles up as `FrameDecoderError` instead.

2. `UserSliceBackend` module docs rewritten: the previous "DoS surface on malformed Compressed blocks" section claimed the direct path could panic and pointed at a tracked follow-up. With the safety refactor landed in this PR (try_push / try_reserve / exec_sequence_inline returning Result), the section now describes the actual contract: safe public APIs route through the fallible write surface and surface overshoot as ExecuteSequencesError::OutputBufferOverflow / DecodeBufferError::OutputBufferOverflow; the infallible entry points remain as defense-in-depth for callers that have already validated capacity at a higher layer.

3. `BackendOverflow` doc updated: it previously claimed the decoder converts `BackendOverflow` into `FrameContentSizeMismatch` at the decode_all boundary. The actual mapping is now `OutputBufferOverflow` (in ExecuteSequencesError or DecodeBufferError) wrapped into FrameDecoderError::FailedToReadBlockBody.

…s_fields contract

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

…length

1. `UserSliceBackend::exec_sequence_inline` reported `requested` as `cap_required - tail`, which includes the +15-byte wildcopy overshoot. Other overflow producers (`BackendOverflow` from `try_*` paths) report the logical write length. Align: split the capacity check into two stages. `total = lit_length + match_length` is the logical value reported in `OutputBufferOverflow.requested`; the overshoot stays in the internal `cap_required` used only for the bounds check.

2. `wildcopy_overlap_8byte_stride` doc comment said each iter reads `src + off + 8`; actual access pattern is `src + off` (8 bytes). Corrected to match the implementation.
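The two-stage check from item 1 might look like the sketch below. The function name, signature and error struct are hypothetical simplifications; the 15-byte overshoot and the `(tail, requested, capacity)` fields come from the commit message.

```rust
/// Wildcopy may write up to this many bytes past the logical end.
const WILDCOPY_OVERSHOOT: usize = 15;

#[derive(Debug, PartialEq)]
struct OutputBufferOverflow {
    tail: usize,
    requested: usize,
    capacity: usize,
}

/// Hypothetical capacity check: returns the logical write length on success.
fn check_sequence_capacity(
    tail: usize,
    lit_length: usize,
    match_length: usize,
    capacity: usize,
) -> Result<usize, OutputBufferOverflow> {
    // Stage 1: logical write length, the value reported to the caller.
    let total = lit_length.checked_add(match_length);
    // Stage 2: internal bound including the overshoot, never reported.
    let cap_required = total
        .and_then(|t| tail.checked_add(t))
        .and_then(|end| end.checked_add(WILDCOPY_OVERSHOOT));
    match (total, cap_required) {
        (Some(total), Some(required)) if required <= capacity => Ok(total),
        _ => Err(OutputBufferOverflow {
            tail,
            // Saturate if the logical length itself overflowed.
            requested: total.unwrap_or(usize::MAX),
            capacity,
        }),
    }
}
```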

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.

perf(decode): donor ZSTD_execSequence inline port for UserSliceBackend

71ece87

fix(decode): donor exec port — x86_64-only gate, lit-source 16B slack…

c6a76a4

… guard, explicit overshoot assert

This was referenced May 25, 2026

perf(decode): HUF 4-stream burst donor verbatim port (HUF_decompress4X1_usingDTable_internal_fast_c_loop) #264

Closed

perf(decode): collapse FrameDecoder → block_decoder → ... 9-layer chain to donor's 3-layer ZSTD_decompressBlock_internal shape #265

Closed

fix(decode): tighten donor exec port — full-tail lit slack, overlap_c…

375322b

…opy8 net offset, doc accuracy, hash assert

fix(decode): checkpoint snapshots frame hash + document literal-read …

2d5337e

…slack contract

polaz mentioned this pull request May 25, 2026

perf(decode): Tier 9 K-cascade + target_feature trampolines (z000033 −8.19%, low-entropy-1m −7.05%) #267

Merged

fix(decode): donor gate covers ll==0 over-read; checkpoint doc lists …

3bb2bf8

…actual hash sites

fix(decode): donor gate skips high+15 check for lit_length<=16 (wildc…

00e8a6c

…opy not called)

polaz added 2 commits May 26, 2026 00:38

fix(decode): drop redundant incremental hash (final pass overwrites);…

bb1467d

… tighten donor exec docs

perf(decode): Tier 9 K-cascade + target_feature trampolines (z000033 …

28f6188

…−8.19%, low-entropy-1m −7.05%) (#267)
* perf(decode): K-cascade through decode_and_execute_sequences (Tier 9)
* perf(decode): wrap BMI2/AVX2/VBMI2 dispatch arms with target_feature trampolines

polaz requested review from Copilot and removed request for Copilot May 26, 2026 10:55