perf(decode): pack LL/ML metadata + hot-path micro-opts #197

Merged: polaz merged 19 commits into main from feat/#178-decode-speed-2 on May 19, 2026

polaz (Member) commented May 19, 2026

Summary

Follow-on to #194 (merged). Continues the decode-speed work for issue #178.

Micro-optimisations on the FSE / HUF hot path

  • Pack LL/ML base+extra_bits into one u32 lookup (c2eed0b6): replace the pair of LL_BASE: [u32; 36] + LL_EXTRA_BITS: [u8; 36] arrays with a single LL_META: [u32; 36] (low 24 bits = baseline, high 8 bits = extra_bits). One u32 load + shift + mask vs two distinct cache-line touches. Compile-time validation via const fn reconstructs the packed table from donor-spec arrays so the contents cannot drift silently.
  • debug_assert + get_unchecked on lookup_ll/ml_code hot path (058b8d51): FSE table-construction invariants (max_symbol = MAX_LITERAL_LENGTH_CODE / MAX_MATCH_LENGTH_CODE, RLE byte range-check) already guarantee code <= 35 / code <= 52; drop the release-mode bounds check.
  • Collapse aarch64 HUF dispatch (a695a60d): the three aarch64_neon / aarch64_sve decode_symbol_and_advance kernels were copy-paste clones of the scalar body. The match was paying a 3-arm dispatch for three identical bodies on every single-symbol decode. Cfg-split: x86 keeps the runtime match (real BMI2 vs scalar perf delta), aarch64 calls scalar directly with no match.
  • Inline extend_from_within_unchecked (4daf2926): match-copy hot path (DecodeBuffer::repeat) is the second-hottest decode entry; #[inline] lets the compiler fold the head < tail flat-layout fast path into the caller.
  • Fuse reserve + flat-extend on RingBuffer::extend (41b357e6): hoist a single fused fast path with head <= tail && len < cap - tail to the top of extend — both reserve() and the wrap-dispatch are skipped on the dominant literal-push shape.
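
The aarch64 HUF dispatch collapse in the list above can be pictured with a minimal cfg-split sketch. Everything here is hypothetical: `decode_scalar`, `decode_bmi2_like`, and the `use_bmi2` flag are stand-ins, not the crate's kernels, and the "BMI2" body is an ordinary shift-and-mask so the block builds on any target. It assumes `bits` is in 1..=8.

```rust
/// Stand-in scalar kernel: the top `bits` bits of `state` are the symbol.
/// Assumes 1 <= bits <= 8.
#[inline(always)]
fn decode_scalar(state: u64, bits: u32) -> u8 {
    (state >> (64 - bits)) as u8
}

/// Stand-in for a kernel that really differs on x86 (hypothetical body).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[inline(always)]
fn decode_bmi2_like(state: u64, bits: u32) -> u8 {
    ((state >> (64 - bits)) & ((1u64 << bits) - 1)) as u8
}

/// x86 keeps a runtime branch because the two kernels are distinct.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[inline(always)]
fn decode_symbol(use_bmi2: bool, state: u64, bits: u32) -> u8 {
    if use_bmi2 {
        decode_bmi2_like(state, bits)
    } else {
        decode_scalar(state, bits)
    }
}

/// Every other target calls the scalar body directly, with no dispatch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
#[inline(always)]
fn decode_symbol(_use_bmi2: bool, state: u64, bits: u32) -> u8 {
    decode_scalar(state, bits)
}
```

Because the choice is made by `cfg`, the non-x86 build contains no branch at all rather than a branch whose arms happen to be identical.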

Flat-buffer backend (compile-time generic, backlog item #132)

Phases 1–4 land the BufferBackend trait, FlatBuf (Vec<u8>-backed no-wrap impl), and a compile-time-monomorphised DecodeBuffer<B: BufferBackend = RingBuffer> cascaded through DecoderScratch<B> and every block / sequence decoder helper. FrameDecoderState.decoder_scratch becomes a DecoderScratchKind enum that picks Ring or Flat at reset() time based on FrameHeader.descriptor.single_segment_flag(); the match fires once per FrameDecoder entry point (decode_block_content, drain, can_drain, …), never inside the hot push/repeat loop.

Why compile-time generic instead of a runtime enum: the earlier
runtime enum BufferStorage { Ring, Flat } attempt paid match
overhead in every push/repeat and measured +43–58% regression on
small-frame decompress. The generic split monomorphises each backend
independently and erases wrap dispatch from the flat side entirely
at compile time.
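
A minimal sketch of the compile-time generic shape, under stated assumptions: `Backend`, `Ring`, `Flat`, and `Buffer` are hypothetical stand-ins for BufferBackend, RingBuffer, FlatBuf, and DecodeBuffer, and the trait has only two illustrative methods rather than the real interface.

```rust
use std::collections::VecDeque;

/// Hypothetical storage-side interface (the real trait has more methods).
trait Backend {
    fn push(&mut self, data: &[u8]);
    fn len(&self) -> usize;
}

#[derive(Default)]
struct Ring {
    data: VecDeque<u8>,
}

#[derive(Default)]
struct Flat {
    data: Vec<u8>,
}

impl Backend for Ring {
    fn push(&mut self, data: &[u8]) {
        self.data.extend(data.iter().copied());
    }
    fn len(&self) -> usize {
        self.data.len()
    }
}

impl Backend for Flat {
    fn push(&mut self, data: &[u8]) {
        // No wrap handling exists on this side at all.
        self.data.extend_from_slice(data);
    }
    fn len(&self) -> usize {
        self.data.len()
    }
}

/// Generic over the backend, with the ring as the default type parameter.
struct Buffer<B: Backend = Ring> {
    backend: B,
}

impl<B: Backend> Buffer<B> {
    fn from_backend(backend: B) -> Self {
        Self { backend }
    }

    /// Statically dispatched: each instantiation gets its own copy of this
    /// method, so there is no per-push `match` on the backend kind.
    #[inline]
    fn push(&mut self, data: &[u8]) {
        self.backend.push(data);
    }

    fn len(&self) -> usize {
        self.backend.len()
    }
}
```

`Buffer` written without a type argument resolves to `Buffer<Ring>`, which is how existing callers can stay on the ring backend without source changes.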

Critical detail: DecoderScratchKind::reset reuses the existing scratch allocations (FSE / HUF tables, sequence vec) when the backend kind is unchanged across frames. An earlier iteration that replaced the whole scratch on every reset measured +255% regression on small frames; the reuse path keeps the small-frame cost flat.
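
The reuse-on-reset idea can be sketched as follows. This is an illustration only: `ScratchKind`, `Scratch`, `RingBuf`, and `FlatBuf` are hypothetical simplifications, with a single `tables` vector standing in for the FSE / HUF tables and sequence vec.

```rust
#[derive(Default)]
struct RingBuf(Vec<u8>);
#[derive(Default)]
struct FlatBuf(Vec<u8>);

/// Hypothetical scratch: `tables` stands in for all reusable allocations.
struct Scratch<B> {
    tables: Vec<u32>,
    buffer: B,
}

enum ScratchKind {
    Ring(Scratch<RingBuf>),
    Flat(Scratch<FlatBuf>),
}

impl ScratchKind {
    fn new(want_flat: bool) -> Self {
        if want_flat {
            ScratchKind::Flat(Scratch { tables: Vec::new(), buffer: FlatBuf::default() })
        } else {
            ScratchKind::Ring(Scratch { tables: Vec::new(), buffer: RingBuf::default() })
        }
    }

    fn is_flat(&self) -> bool {
        matches!(self, ScratchKind::Flat(_))
    }

    fn tables_mut(&mut self) -> &mut Vec<u32> {
        match self {
            ScratchKind::Ring(s) => &mut s.tables,
            ScratchKind::Flat(s) => &mut s.tables,
        }
    }

    /// Called once per frame; `want_flat` models the single-segment flag.
    fn reset(&mut self, want_flat: bool) {
        if self.is_flat() != want_flat {
            // Backend kind changed: build a fresh scratch.
            *self = ScratchKind::new(want_flat);
            return;
        }
        // Same kind: `clear` drops contents but keeps the allocations.
        match self {
            ScratchKind::Ring(s) => {
                s.tables.clear();
                s.buffer.0.clear();
            }
            ScratchKind::Flat(s) => {
                s.tables.clear();
                s.buffer.0.clear();
            }
        }
    }
}
```

The point of the same-kind arm is that a stream of small frames with an unchanged backend never reallocates, which is the cost the description attributes the earlier regression to.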

FlatBuf::reserve semantics fix (12343cd3): Vec::reserve(additional) is "additional bytes beyond len", not "delta from capacity"; on a multi-frame stream where frame 2 has window_size > frame 1's prior FlatBuf capacity, the previous gap-from-capacity formula under-reserved by the existing capacity. Pinned the failing libFuzzer artifact (crash-e33ba082…) as multi_frame_flat_buf_path_does_not_panic in src/tests/fuzz_regressions.rs.

Tests

  • 514/514 lib tests pass (506 prior + 7 new FlatBuf unit tests + 1 new fuzz regression multi_frame_flat_buf_path_does_not_panic)
  • Clippy clean on default + --no-default-features + --no-default-features --features hash
  • 4/4 fuzz regression tests pass

Measured wins (M1, decompress L=-7 fast)

Compared to pre-PR baseline (origin/main after #194 merged):

| Scenario | Before | After | Δ |
|---|---|---|---|
| small-1k-random rust_stream | 172 ns | 131 ns | −24% |
| small-10k-random rust_stream | 1.46 µs | 1.11 µs | −24% |
| small-4k-log-lines c_stream | 1.80 µs | 1.28 µs | −29% |
| decodecorpus z000033 c_stream | 1.19 ms | 928 µs | −22% |
| decodecorpus z000033 rust_stream | 5.22 ms | 4.07 ms | −22% |
| low-entropy-1m c_stream | 316 µs | 252 µs | −20% |
| large-log-stream rust_stream | 18.0 ms | 14.1 ms | −22% |

FFI gap on decodecorpus z000033: c_stream 2.73× → 2.11×, rust_stream 4.62× → 3.60×.

Source of most of the win: 41b357e6 (RingBuffer fused reserve + flat-extend), which collapsed two separate dispatches into a single hoisted fast path for the dominant literal-push shape.
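
As a rough illustration of a hoisted fused fast path, here is a toy ring buffer. It is not the crate's RingBuffer: `ToyRing` and its methods are hypothetical, and its slow path simply linearises into a larger allocation instead of doing a real reserve plus wrap-aware copy.

```rust
/// Toy ring buffer (hypothetical). `buf.len()` plays the role of `cap`;
/// live bytes sit in `head..tail`, or wrap when `tail < head`.
struct ToyRing {
    buf: Vec<u8>,
    head: usize,
    tail: usize,
}

impl ToyRing {
    fn with_capacity(cap: usize) -> Self {
        Self { buf: vec![0; cap], head: 0, tail: 0 }
    }

    /// Copy out the live bytes in logical order.
    fn contents(&self) -> Vec<u8> {
        if self.head <= self.tail {
            self.buf[self.head..self.tail].to_vec()
        } else {
            let mut out = self.buf[self.head..].to_vec();
            out.extend_from_slice(&self.buf[..self.tail]);
            out
        }
    }

    /// Append `data`; returns true when the fused fast path handled it.
    #[inline]
    fn extend(&mut self, data: &[u8]) -> bool {
        let cap = self.buf.len();
        // One check covers both "no growth needed" and "no wrap needed".
        if self.head <= self.tail && data.len() < cap - self.tail {
            let end = self.tail + data.len();
            self.buf[self.tail..end].copy_from_slice(data);
            self.tail = end;
            return true;
        }
        self.extend_slow(data);
        false
    }

    /// Simplified slow path: rebuild a larger contiguous buffer.
    #[cold]
    fn extend_slow(&mut self, data: &[u8]) {
        let mut live = self.contents();
        live.extend_from_slice(data);
        let len = live.len();
        live.resize((len + 1) * 2, 0);
        self.buf = live;
        self.head = 0;
        self.tail = len;
    }
}
```

The fast-path condition mirrors the one quoted in the description (`head <= tail && len < cap - tail`); everything outside it goes to a cold function so the common literal push stays a single compare and copy.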

Part of #178.

Summary by CodeRabbit

  • New Features

    • Added example programs for file compression and decode performance profiling.
  • Bug Fixes

    • Added a regression test to prevent multi-frame decoding panics.
  • Refactor

    • Reworked decoder internals to support pluggable output backends and a new flat-buffer path for single-segment frames, improving decoding efficiency and memory behavior.
    • Simplified and optimized Huffman/sequence decoding paths and buffer handling for greater performance and robustness.
  • Chores

    • Updated benchmark build profile settings.


Replace the previous pair of separate const arrays (LL_BASE [u32; 36] +
LL_EXTRA_BITS [u8; 36] = 180 B over two cache-line regions, indexed
twice per sequence) with a single packed table (LL_META [u32; 36] =
144 B, indexed once). Layout: low 24 bits = baseline (max 65536 fits
trivially), high 8 bits = extra_bits (max 16). Same for ML.

The hot path now performs one u32 load plus a shift and a mask to
recover both fields, instead of two distinct loads from two cache-line
regions. This saves one memory access per LL lookup and one per ML
lookup, i.e. two accesses per sequence, which compounds over thousands
of sequences per block.

Validity of the packing is enforced at compile time by reconstructing
the table via a const fn from the original donor-spec arrays, so the
literal table contents and the spec do not drift apart silently.
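
A minimal sketch of the packing layout described above, assuming the stated split (low 24 bits baseline, high 8 bits extra_bits). The four-entry `DEMO_*` arrays and the helper names `pack_one`, `pack_table`, and `unpack_one` are illustrative only; the real tables have 36 (LL) and 53 (ML) entries.

```rust
/// Illustrative stand-ins for the two source arrays (not the full tables).
const DEMO_BASE: [u32; 4] = [0, 16, 64, 65536];
const DEMO_EXTRA_BITS: [u8; 4] = [0, 1, 6, 16];

/// Pack one pair: extra_bits in the high 8 bits, baseline in the low 24.
const fn pack_one(baseline: u32, extra_bits: u8) -> u32 {
    ((extra_bits as u32) << 24) | (baseline & 0x00FF_FFFF)
}

/// Build the packed table at compile time from the two source arrays.
const fn pack_table<const N: usize>(base: &[u32; N], extra: &[u8; N]) -> [u32; N] {
    let mut out = [0u32; N];
    let mut i = 0;
    while i < N {
        out[i] = pack_one(base[i], extra[i]);
        i += 1;
    }
    out
}

/// One u32 per code instead of a u32 plus a u8 in separate arrays.
const DEMO_META: [u32; 4] = pack_table(&DEMO_BASE, &DEMO_EXTRA_BITS);

/// Recover (baseline, extra_bits) with one mask and one shift.
#[inline(always)]
const fn unpack_one(packed: u32) -> (u32, u8) {
    (packed & 0x00FF_FFFF, (packed >> 24) as u8)
}
```

Deriving the packed table from the source arrays, rather than writing packed literals by hand, is what keeps the two representations from drifting apart.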
Copilot AI review requested due to automatic review settings May 19, 2026 15:27
coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ebc057fa-a5e6-4cc9-8ab0-c51e74064770

📥 Commits

Reviewing files that changed from the base of the PR and between 98d44f7 and 0b42ce7.

📒 Files selected for processing (1)
  • zstd/src/decoding/flat_buf.rs

Walkthrough

Adds a BufferBackend trait and WILDCOPY_OVERLENGTH; implements a Vec-backed FlatBuf and RingBuffer BufferBackend impl; makes DecodeBuffer/DecoderScratch generic over backends and wires Frame/Block decoding to backend-dispatched helpers. Packs LL/ML metadata into u32 tables and updates sequence lookups. Simplifies Huff0 single-symbol dispatch.

Changes

Buffer backend and decode buffer genericization

| Layer / File(s) | Summary |
|---|---|
| BufferBackend trait and constant: zstd/src/decoding/buffer_backend.rs, zstd/src/decoding/mod.rs | Adds WILDCOPY_OVERLENGTH and the pub(crate) trait BufferBackend API used by decoders. |
| Generic DecodeBuffer and test updates: zstd/src/decoding/decode_buffer.rs, zstd/src/decoding/mod.rs | Make DecodeBuffer<B: BufferBackend = RingBuffer> generic, add from_backend, generalize Read impl and DrainGuard, and update tests to DecodeBuffer::<RingBuffer>::new(...). |
| FlatBuf backend implementation and tests: zstd/src/decoding/flat_buf.rs | Add FlatBuf Vec-backed backend implementing BufferBackend with head/tail semantics, unchecked within-buffer copy, extend/read helpers, and unit tests. |
| RingBuffer BufferBackend impl and inlining/Safety docs: zstd/src/decoding/ringbuffer.rs | Add impl super::buffer_backend::BufferBackend for RingBuffer with inline-forwarding wrappers; add #[inline] attributes and expand SAFETY docs for unchecked extend helpers. |
| DecoderScratch generic wiring: zstd/src/decoding/scratch.rs | Make DecoderScratch<B: BufferBackend = RingBuffer> and update its buffer field and impl to use DecodeBuffer<B>. |
| Frame/Block dispatch wiring: zstd/src/decoding/frame_decoder.rs, zstd/src/decoding/block_decoder.rs | Introduce DecoderScratchKind and update FrameDecoderState/block decode functions to dispatch to backend-tagged scratch methods and route decode_block_content through the new generic helpers. |
| Cargo, examples, and tests: Cargo.toml, zstd/examples/*, zstd/src/tests/fuzz_regressions.rs | Add bench profile debug setting, new examples encode_l4 and profile_decode, and a fuzz regression test covering multi-frame flat-buf path. |

Packed Metadata for Sequence Decoder Lookup Tables

| Layer / File(s) | Summary |
|---|---|
| Packed metadata tables and pack/unpack helpers: zstd/src/decoding/sequence_section_decoder.rs | Introduce LL_META and ML_META packed u32 representations and add const fn pack_code_meta / unpack_code_meta. |
| Updated lookup functions and generic sequence decode: zstd/src/decoding/sequence_section_decoder.rs | decode_and_execute_sequences and execute_one_sequence are generic over B: BufferBackend; lookup_ll_code and lookup_ml_code now debug_assert! bounds and return (baseline, extra_bits) by unpacking a single packed entry via unsafe get_unchecked. |

Huffman Single-Symbol Dispatch

| Layer / File(s) | Summary |
|---|---|
| Dispatch logic for single-symbol decode: zstd/src/huff0/huff0_decoder.rs | HuffmanDecoder::decode_symbol_and_advance retains x86/x86_64 SIMD/BMI2 selection for the single-symbol path and routes non-x86 targets to the scalar implementation. |
| Remove dead aarch64 single-symbol kernels: zstd/src/huff0/huff0_decoder.rs | Delete aarch64 NEON/SVE single-symbol decode implementations and add comments noting scalar fallback for those targets. |

Sequence Diagram(s)

sequenceDiagram
  participant FrameDecoder
  participant DecoderScratchKind
  participant DecodeBuffer
  participant BufferBackend
  FrameDecoder->>DecoderScratchKind: decode_block_content(...)
  DecoderScratchKind->>DecodeBuffer: forward decode
  DecodeBuffer->>BufferBackend: extend/extend_from_within_unchecked/as_slices
  BufferBackend-->>DecodeBuffer: storage/slots
  DecodeBuffer-->>DecoderScratchKind: decode result / bytes_written
  DecoderScratchKind-->>FrameDecoder: block decode result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I packed the bits in tidy rows,

Baseline and extras snug in one u32 pose,
FlatBuf keeps the head while RingBuffer rolls,
Scalar hums where SIMD lost its roles,
A rabbit hops — the decoder flows.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'perf(decode): pack LL/ML metadata + hot-path micro-opts' directly and specifically summarizes the main change: packing literal/match metadata into compact arrays and optimizing the decoding hot path, which are the primary performance-focused changes in this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Copilot AI left a comment

Pull request overview

This PR continues decode hot-path optimization work by reducing per-sequence table lookups in the FSE-driven sequence decoder. It packs literal-length (LL) and match-length (ML) baseline + extra-bits metadata into single u32 entries, so each symbol lookup becomes one load plus bit ops instead of two separate table loads.

Changes:

  • Replace LL_BASE/LL_EXTRA_BITS with a packed LL_META table and add unpack helper.
  • Replace ML_BASE/ML_EXTRA_BITS with a packed ML_META table and add shared pack helper.
  • Update lookup_ll_code / lookup_ml_code to use the packed metadata tables.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 93.01075% with 26 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/decoding/frame_decoder.rs | 88.69% | 13 Missing ⚠️ |
| zstd/src/decoding/sequence_section_decoder.rs | 54.54% | 10 Missing ⚠️ |
| zstd/src/decoding/flat_buf.rs | 97.85% | 3 Missing ⚠️ |


polaz added 2 commits May 19, 2026 18:31
Address Copilot finding on PR #197: pack_code_meta's doc comment
claimed compile-time validation but no asserts were actually present.
Add const-evaluated asserts that
  - bases[i] & 0xFF00_0000 == 0   (baseline must fit in 24 bits)
  - extra_bits[i] <= 16            (zstd format §3.1.1.3.2.1.1 limit)

Any future spec extension that violates either invariant now fails
the build instead of silently clobbering the packed payload.
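
The two const-evaluated checks named above could look roughly like this. It is a sketch under assumptions, not the PR's code: `pack_checked` is a hypothetical name, and the two-entry table exists only to show the mechanism.

```rust
/// Hypothetical packing helper with the two invariants checked during
/// constant evaluation. A violated assert aborts compilation when the
/// function is evaluated in a `const` item.
const fn pack_checked<const N: usize>(bases: &[u32; N], extra_bits: &[u8; N]) -> [u32; N] {
    let mut out = [0u32; N];
    let mut i = 0;
    while i < N {
        // Baseline must leave the top 8 bits free for extra_bits.
        assert!(bases[i] & 0xFF00_0000 == 0, "baseline must fit in 24 bits");
        // Upper bound on extra_bits, as stated in the commit message.
        assert!(extra_bits[i] <= 16, "extra_bits above the format limit");
        out[i] = ((extra_bits[i] as u32) << 24) | bases[i];
        i += 1;
    }
    out
}

/// Evaluated at compile time, so the asserts run during the build.
const OK_TABLE: [u32; 2] = pack_checked(&[3, 65536], &[0, 16]);

// Uncommenting the next line is expected to fail the build, because the
// baseline 0x0100_0000 does not fit in 24 bits:
// const BAD_TABLE: [u32; 1] = pack_checked(&[0x0100_0000], &[0]);
```

The checks cost nothing at run time: they execute only while the compiler evaluates the `const` item.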

extend_from_within_unchecked and its branchless variant are the
hottest path from DecodeBuffer::repeat — every non-repcode,
non-overlapping match calls one of them. They were missing #[inline],
so cross-crate consumers (the codepath through DecodeBuffer) could
not fold the flat-layout (head < tail) fast path into the caller and
paid a real function-call hop per match copy.

For frames fitting in the window — the dominant case, and especially
the Fast-encoded blocks that dominate L=-7..L=1 — this is the
difference between one inlined SIMD copy and a non-inlined dispatch
through free_slice_parts-shape branches.

First step toward backlog item #132 (full flat-buffer DecodeBuffer
mode). Full dual-mode refactor stays out of this PR; this inline
gate alone unblocks the existing flat fast paths in extend() and
reserve() so they actually compose with repeat().
Copilot AI review requested due to automatic review settings May 19, 2026 15:44
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

zstd/src/decoding/sequence_section_decoder.rs:458

  • lookup_ml_code has the same pattern as lookup_ll_code: a runtime assert! in a hot loop even though the ML decoder table is bounded by MAX_MATCH_LENGTH_CODE at construction time. Consider using debug_assert! + unchecked access (or otherwise removing the runtime assert) so malformed inputs don’t panic and the per-sequence path avoids an extra conditional branch.
#[inline(always)]
fn lookup_ml_code(code: u8) -> (u32, u8) {
    let idx = code as usize;
    assert!(idx < ML_META.len(), "Illegal match length code was: {code}");
    unpack_code_meta(ML_META[idx])
}

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Address Copilot finding on PR #197: lookup_ll_code and lookup_ml_code
were paying a runtime assert!() (a release-mode bounds-checked indexed
load) for an invariant that the FSE table construction already enforces
upstream.

The LL FSE table is built with max_symbol == MAX_LITERAL_LENGTH_CODE
(35); build_decoding_table returns FSETableError::TooManySymbols if
read_probabilities yields more entries than that, and the RLE byte
path is range-checked in maybe_update_fse_tables. ML uses the same
shape with max_symbol == MAX_MATCH_LENGTH_CODE (52). A code reaching
either lookup is therefore always in 0..=35 / 0..=52.

Switch assert! -> debug_assert!() and use get_unchecked. This removes one
well-predicted but still-present branch per LL/ML symbol from the hot decode
loop.
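
A minimal sketch of the debug_assert! plus get_unchecked pattern, assuming a three-entry stand-in table. `DEMO_META` and `lookup_demo` are hypothetical; the soundness of the unchecked load rests entirely on callers upholding the range invariant, which in the PR is argued from FSE table construction.

```rust
/// Illustrative packed table: low 24 bits baseline, high 8 bits extra_bits.
const DEMO_META: [u32; 3] = [0, (1 << 24) | 16, (16 << 24) | 65536];

/// Look up (baseline, extra_bits) for `code`.
///
/// Callers must guarantee `code < DEMO_META.len()`. Debug builds verify
/// this; release builds perform no bounds check.
#[inline(always)]
fn lookup_demo(code: u8) -> (u32, u8) {
    let idx = code as usize;
    debug_assert!(idx < DEMO_META.len(), "illegal code: {code}");
    // SAFETY: relies on the caller-side invariant documented above.
    let packed = unsafe { *DEMO_META.get_unchecked(idx) };
    (packed & 0x00FF_FFFF, (packed >> 24) as u8)
}
```

The trade-off is explicit: an out-of-range code is caught in debug and test builds, but would be undefined behaviour in release if the upstream invariant were ever broken.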
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

polaz added 2 commits May 19, 2026 19:32
…FlatBuf

Lays compile-time-monomorphised scaffolding for backlog item #132
(flat-buffer mode for frames the ring would never wrap on). The
earlier dynamic-dispatch attempt (BufferStorage enum) measured a
+43-58% regression on small-frame decompress because every push /
repeat paid a runtime match. The generic split here keeps both
backends as separate compiled instantiations; wrap dispatch is
erased from the flat path at compile time rather than branched at
runtime.

No behavioural change yet: every caller is still pinned to
DecodeBuffer with the RingBuffer default type parameter.

- buffer_backend.rs: BufferBackend trait — storage-side interface
  DecodeBuffer needs (extend, repeat-shape, drain-shape, rollback).
- flat_buf.rs: FlatBuf Vec-backed no-wrap impl + 7 unit tests
  (append, extend_and_fill, extend_from_within_unchecked,
  drop_first_n head advance + match-source persistence, set_tail
  rollback, clear). Capacity sized once at frame reset with
  WILDCOPY_OVERLENGTH trailing slack so SIMD overshoots stay inside
  the allocation.
- ringbuffer.rs: impl BufferBackend for RingBuffer — thin forwarder.
- decode_buffer.rs: parameterised over B: BufferBackend with
  RingBuffer default; from_backend(buffer, window_size) constructor
  wraps a pre-sized FlatBuf for Phase 2; DrainGuard becomes generic.
- mod.rs: register buffer_backend + flat_buf modules.

The flat backend stays under allow(dead_code) until Phase 2 wires
FrameDecoder.reset to switch on Single_Segment_flag. That is a separate
PR, because the caller cascade through DecoderScratch / block_decoder /
sequence_section_decoder / sequence_execution all needs to become
generic over B.

513/513 lib tests pass (506 prior + 7 new FlatBuf), clippy clean on
default + --no-default-features + --no-default-features --features
hash.

Cascade the compile-time generic from Phase 1 (844f761) up to
DecoderScratch. The struct is now parameterised over a
BufferBackend with the same RingBuffer default, so every existing
caller stays on the historical type via inference and no behavioural
change ships in this commit either.

Wiring FrameDecoder.reset to actually instantiate
DecoderScratch<FlatBuf> on Single_Segment_flag frames is the next
phase — that requires the FrameDecoderState type to become a small
enum that holds either the Ring or Flat variant and pattern-match
in the top-level FrameDecoder methods, so it stays as a separate
commit to keep this one as a pure infrastructure cascade.

513/513 lib tests pass, clippy clean on all three feature modes.
Copilot AI review requested due to automatic review settings May 19, 2026 16:34
…ence decoders

block_decoder::decode_block_content, block_decoder::decompress_block,
sequence_section_decoder::decode_and_execute_sequences,
sequence_section_decoder::execute_one_sequence,
sequence_execution::execute_sequences_fields — all become generic
over B: BufferBackend. The compiler now monomorphises the full
hot-path stack per backend; with FrameDecoderState still pinned to
the RingBuffer default no behavioural change ships, but the wiring
for the flat path is now mechanical (single FrameDecoderState
variant flip in the next phase).

513/513 lib tests pass, clippy clean.
coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/decoding/ringbuffer.rs (1)

16-16: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use a single WILDCOPY_OVERLENGTH source across backends.

This file keeps a separate WILDCOPY_OVERLENGTH from buffer_backend.rs; drift between them would invalidate shared wildcopy safety assumptions. Please import and reuse the shared constant here too.

Also applies to: 815-880

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/ringbuffer.rs` at line 16, Remove the local
WILDCOPY_OVERLENGTH constant in ringbuffer.rs and instead import and reuse the
shared constant from buffer_backend.rs (reference symbols: WILDCOPY_OVERLENGTH,
ringbuffer.rs, buffer_backend.rs); update any uses in ringbuffer.rs (including
the wildcopy-related logic around lines ~815-880) to reference the imported
WILDCOPY_OVERLENGTH so the backend and ringbuffer share the same safety
assumption.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 63-68: FlatBuf::reserve currently computes the reserve amount as
needed - self.buf.capacity(), which is wrong because Vec::reserve takes
"additional from len" not "from capacity"; update the calculation in
reserve(&mut self, n: usize) to compute additional =
needed.saturating_sub(self.buf.len()) and call self.buf.reserve(additional +
WILDCOPY_OVERLENGTH) (or ensure the reserved amount is at least n +
WILDCOPY_OVERLENGTH) so that when len < capacity we still guarantee at least n
writable bytes for the unsafe copies using self.buf and preserve the
WILDCOPY_OVERLENGTH slack.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1026fcbd-162a-4fcc-b0eb-c2ec75ac4e8d

📥 Commits

Reviewing files that changed from the base of the PR and between 058b8d5 and e6c7143.

📒 Files selected for processing (6)
  • zstd/src/decoding/buffer_backend.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/flat_buf.rs
  • zstd/src/decoding/mod.rs
  • zstd/src/decoding/ringbuffer.rs
  • zstd/src/decoding/scratch.rs

Comment thread zstd/src/decoding/flat_buf.rs Outdated
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/decode_buffer.rs Outdated
Comment thread zstd/src/decoding/buffer_backend.rs Outdated
Comment thread zstd/src/decoding/buffer_backend.rs Outdated
Comment thread zstd/src/decoding/flat_buf.rs Outdated
polaz added 2 commits May 19, 2026 20:17
…3 review

Phase 4 of backlog item #132. FrameDecoderState.decoder_scratch is
now a DecoderScratchKind enum that picks between
DecoderScratch<RingBuffer> and DecoderScratch<FlatBuf> based on
FrameHeader.descriptor.single_segment_flag() at reset time.

Match dispatch fires once per FrameDecoder entry point
(decode_block_content, drain, can_drain, ...) — never inside the hot
push/repeat loop, which stays fully monomorphised through the
DecodeBuffer<B> generic landed in Phases 1-3.

Critical: DecoderScratchKind::reset reuses the existing scratch
allocations (FSE / HUF tables, sequence vec) when the backend kind
is unchanged across frames. The first iteration replaced the whole
scratch on every reset and measured +255% regression on small
frames; the reuse path keeps the small-frame cost flat.

Benchmark on M1 vs pre-Phase-4 baseline:
- small-1k-random: 172 -> 166 ns (-3.5%)
- small-4k-log-lines c_stream: 1.80 -> 1.65 us (-8%)
- large-log-stream rust_stream: 18.0 -> 17.0 ms (-5%)
- decodecorpus z000033: within +/- 1% noise

Also addressed CodeRabbit / Copilot round-3 review threads:

- flat_buf::reserve math: Vec::reserve is "additional from len",
  not "delta from capacity" — fix to use saturating_sub on the gap
  so an alloc actually happens when len < capacity < len+n.
- sequence_section_decoder::LL_META doc: now says "rely on
  debug_assert + unsafe get_unchecked + upstream gates" instead of
  the no-longer-existing release-mode runtime check.
- decode_buffer::DecodeBuffer flat doc: "selected by
  FrameDecoder" -> "intended for selection" until Phase 4 lands.
- buffer_backend::WILDCOPY_OVERLENGTH: single shared constant, both
  RingBuffer and FlatBuf import from buffer_backend so the slack
  contract cannot drift.
- buffer_backend::cap doc: clarified it's the per-instance
  realloc-detection sentinel, not a portable size — RingBuffer's
  cap excludes the slack, FlatBuf's includes it; the checkpoint
  only compares equality on the same instance so the asymmetry is
  fine.

513/513 lib tests pass, clippy clean on default + --no-default-features
+ --no-default-features --features hash.

libFuzzer artifact crash-e33ba082... exercised a multi-frame stream
where frame 2 reset to a flat backend with a larger window_size than
frame 1's pre-existing FlatBuf capacity. The reserve(n) math from
both the original Phase 1 attempt and CodeRabbit's "additional = n
- available" suggestion under-reserves on that case: when len == 0
and the existing capacity is non-zero, `(n - available)` shrinks
the additional argument by exactly `available`, so Vec::reserve
only ensures `capacity >= (n - available) + slack`, which is short
by `available`.

Subsequent extend_from_within_unchecked then panicked on the
`dst_off + len <= self.buf.capacity()` debug assert.

Fix: call `self.buf.reserve(n + WILDCOPY_OVERLENGTH)` directly.
Vec::reserve's contract is "ensures capacity >= len + additional",
which is exactly the contract callers need; the gap-from-capacity
forms above were re-deriving the same condition incorrectly.
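
The arithmetic can be checked with a small sketch. `SLACK` is a hypothetical stand-in for WILDCOPY_OVERLENGTH (its real value is not shown here), and `min_capacity_gap_formula` is a reconstruction of the "n - available" shape described above, not the original code. It computes only what `Vec::reserve` guarantees, since the allocator may hand back more.

```rust
/// Hypothetical slack size; stands in for WILDCOPY_OVERLENGTH.
const SLACK: usize = 32;

/// Minimum capacity guaranteed after the gap-from-capacity formula:
/// `reserve(additional)` only promises `capacity >= len + additional`.
fn min_capacity_gap_formula(len: usize, cap: usize, n: usize) -> usize {
    let available = cap - len;
    let additional = n.saturating_sub(available) + SLACK;
    len + additional
}

/// The direct form: ask for exactly what the caller needs beyond `len`.
fn reserve_direct(buf: &mut Vec<u8>, n: usize) {
    buf.reserve(n + SLACK);
}
```

With `len = 0`, an existing capacity of 64, and `n = 256`, the gap formula guarantees only 0 + (256 - 64) + 32 = 224 bytes, which is 64 short of the 288 the caller needs. The direct form guarantees `capacity >= len + n + SLACK` by the documented contract of `Vec::reserve`.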

Pinned the failing input as a regression test
(`multi_frame_flat_buf_path_does_not_panic`) in
src/tests/fuzz_regressions.rs alongside the existing
`malformed_block_does_not_panic_via_restore_checkpoint` regression
so any future iteration of the flat backend has to re-pass it.

514/514 lib tests pass, clippy clean.
Copilot AI review requested due to automatic review settings May 19, 2026 17:39
coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 63-74: The reserve method currently calls Vec::reserve without
zeroing the newly allocated slack; update FlatBuf::reserve to remember the
previous capacity (or previous capacity - len available), call
self.buf.reserve(...), then zero (set to 0) the newly allocated bytes up to the
new capacity including the WILDCOPY_OVERLENGTH slack so the buffer's trailing
bytes remain defined (matching with_capacity's behavior) — ensure this zeroing
covers exactly the additional bytes added by the reserve and does not change
len, and keep references to FlatBuf, reserve, WILDCOPY_OVERLENGTH and
extend_from_within_unchecked to guide placement of the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 382e1a7e-c8e8-4864-941f-d6be89fdee69

📥 Commits

Reviewing files that changed from the base of the PR and between 058b8d5 and 3536fb1.

📒 Files selected for processing (10)
  • zstd/src/decoding/block_decoder.rs
  • zstd/src/decoding/buffer_backend.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/flat_buf.rs
  • zstd/src/decoding/frame_decoder.rs
  • zstd/src/decoding/mod.rs
  • zstd/src/decoding/ringbuffer.rs
  • zstd/src/decoding/scratch.rs
  • zstd/src/decoding/sequence_execution.rs
  • zstd/src/decoding/sequence_section_decoder.rs

Comment thread zstd/src/decoding/flat_buf.rs Outdated
Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Comment thread zstd/src/decoding/decode_buffer.rs Outdated
Comment thread zstd/src/decoding/mod.rs Outdated
Comment thread zstd/src/decoding/scratch.rs Outdated
Comment thread zstd/src/tests/fuzz_regressions.rs Outdated
coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
zstd/src/decoding/flat_buf.rs (1)

63-83: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

reserve() still breaks the initialized-slack invariant after a grow.

with_capacity() explicitly zero-fills the trailing WILDCOPY_OVERLENGTH region so overshoot reads stay defined, but reserve() only increases Vec capacity and never initializes the newly added bytes. After the first grow, the backend can reintroduce UB on the same wildcopy paths this file is documenting around.

Suggested fix
     #[inline]
     fn reserve(&mut self, n: usize) {
+        let old_cap = self.buf.capacity();
         // `Vec::reserve(additional)` guarantees
         // `capacity >= len + additional`; passing
         // `n + WILDCOPY_OVERLENGTH` is the exact contract callers
@@
         self.buf.reserve(n.saturating_add(WILDCOPY_OVERLENGTH));
+        let new_cap = self.buf.capacity();
+        if new_cap > old_cap {
+            unsafe {
+                ptr::write_bytes(self.buf.as_mut_ptr().add(old_cap), 0, new_cap - old_cap);
+            }
+        }
     }
Does Rust's standard-library `Vec::reserve` initialize the newly reserved memory, or does it only grow capacity while leaving the added region uninitialized? Please answer from the official `Vec` documentation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/flat_buf.rs` around lines 63 - 83, The reserve() method on
FlatBuf grows self.buf's capacity but leaves newly reserved bytes uninitialized,
breaking the initialized-slack invariant that with_capacity() enforces for
WILDCOPY_OVERLENGTH; update reserve() (the method on the type with field
self.buf in flat_buf.rs) to initialize (zero-fill or otherwise set) the newly
allocated slack region up to WILDCOPY_OVERLENGTH after calling Vec::reserve so
overshoot reads remain defined—mirror the initialization approach used in
with_capacity(), and ensure any downstream callers like
extend_from_within_unchecked can rely on the initialized slack.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/tests/fuzz_regressions.rs`:
- Around line 182-185: The test currently uses `if let Ok(mut decoder) =
crate::decoding::StreamingDecoder::new(data)` which silently skips the body when
construction fails; change it to require construction to succeed (e.g. `let mut
decoder = crate::decoding::StreamingDecoder::new(data).unwrap();`) so the test
fails if `new` errors, then keep ignoring only the `read_to_end` result (e.g.
`let _ = decoder.read_to_end(&mut output);`) so the flat-buffer panic path is
exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: afc2450a-5a58-4fca-8b90-17206399fe66

📥 Commits

Reviewing files that changed from the base of the PR and between 3536fb1 and 12343cd.

📒 Files selected for processing (3)
  • Cargo.toml
  • zstd/src/decoding/flat_buf.rs
  • zstd/src/tests/fuzz_regressions.rs

Comment thread zstd/src/tests/fuzz_regressions.rs Outdated
Two unrelated changes folded into one commit because the perf opt
and the doc/test fixes shipped together while addressing review
threads:

- ringbuffer: fuse reserve + flat-extend on the common literal push.
  Profile (decodecorpus-z000033 L=-7 c_stream, 5GB workload via
  samply at 4999Hz) put RingBuffer::extend at 15% self-time; a
  share of that was the redundant `self.reserve(len)` call
  dispatched before the existing flat fast path. Hoist a fused
  fast path to the top of `extend`: when head <= tail AND
  `len < cap - tail`, both reserve() AND the wrap-dispatch are
  skipped. The strict `<` keeps tail < cap so invariant 4 holds
  without the post-write normalisation branch; the original
  `<=` flat path is kept as a second branch (behind reserve)
  for the `tail+len==cap` boundary.
- Cargo.toml: enable `debug = "line-tables-only"` on the bench
  profile so samply can resolve hot function names.
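The fused fast path described in the ringbuffer item can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the crate's `RingBuffer`: the `Ring` type, its fields, the one-free-slot convention, and the helper names `extend_slow`, `drop_front`, and `to_vec` are all hypothetical.

```rust
/// Hypothetical minimal ring buffer, for illustration only.
/// `buf.len()` plays the role of `cap`; one slot always stays free so that
/// `head == tail` unambiguously means "empty" and `tail < cap` always holds.
pub struct Ring {
    buf: Vec<u8>,
    head: usize,
    tail: usize,
}

impl Ring {
    pub fn with_capacity(cap: usize) -> Self {
        Ring { buf: vec![0; cap.max(1)], head: 0, tail: 0 }
    }

    pub fn len(&self) -> usize {
        let cap = self.buf.len();
        (self.tail + cap - self.head) % cap
    }

    pub fn extend(&mut self, data: &[u8]) {
        let cap = self.buf.len();
        let len = data.len();
        // Fused fast path: the layout is flat and the write ends strictly
        // before `cap`, so neither reserve() nor wrap handling is needed and
        // `tail` stays below `cap` without a normalisation branch.
        if self.head <= self.tail && len < cap - self.tail {
            self.buf[self.tail..self.tail + len].copy_from_slice(data);
            self.tail += len;
            return;
        }
        self.extend_slow(data);
    }

    /// General path: make room first, then write byte by byte with wrapping.
    fn extend_slow(&mut self, data: &[u8]) {
        self.reserve(data.len());
        let cap = self.buf.len();
        for &b in data {
            self.buf[self.tail] = b;
            self.tail = (self.tail + 1) % cap;
        }
    }

    /// Grow (re-linearising the contents) until `extra` more bytes fit.
    fn reserve(&mut self, extra: usize) {
        let used = self.len();
        if used + extra < self.buf.len() {
            return;
        }
        let mut linear = self.to_vec();
        linear.resize((used + extra + 1).next_power_of_two(), 0);
        self.buf = linear;
        self.head = 0;
        self.tail = used;
    }

    pub fn drop_front(&mut self, n: usize) {
        assert!(n <= self.len());
        self.head = (self.head + n) % self.buf.len();
    }

    pub fn to_vec(&self) -> Vec<u8> {
        let cap = self.buf.len();
        (0..self.len()).map(|i| self.buf[(self.head + i) % cap]).collect()
    }
}
```

The strict `<` in the fast-path condition is what lets the sketch skip any post-write wrap of `tail`; a write that ends exactly at `cap` falls through to the slower path instead.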

Round-5 review-thread fixes:

- flat_buf::with_capacity: drop the zero-init of the slack region
  entirely. FlatBuf only WRITES past `len` (extend, extend_and_fill,
  extend_from_within_unchecked) before any matching `set_len`, and
  only READS bytes inside `head..buf.len()` (as_slices, drain
  helpers). The trailing slack region is therefore intentionally
  uninitialised — no UB, no per-frame O(cap) zero pass.
- decode_buffer / mod / scratch docs: drop the "Phase 2 wiring
  pending" / "intended for selection" wording — FrameDecoder
  already instantiates DecoderScratch<FlatBuf> via
  DecoderScratchKind on single-segment frames.
- mod.rs: remove `#[allow(dead_code)]` on flat_buf (genuinely
  used now).
- fuzz_regressions::multi_frame_flat_buf_path_does_not_panic:
  `if let Ok(decoder) = StreamingDecoder::new(data)` →
  `.expect("…")` so a future regression that broke ctor for this
  artifact cannot silently turn the test into a no-op.

514/514 lib tests pass, clippy clean on default +
--no-default-features + --no-default-features --features hash.
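The slack-region reasoning in the `flat_buf::with_capacity` item above (write past `len` first, commit with `set_len` afterwards, never read beyond `buf.len()`) can be illustrated with a small sketch. The helper name `append_via_spare` is hypothetical, and it uses the standard `Vec::spare_capacity_mut` API rather than whatever the crate does internally.

```rust
/// Hypothetical helper: append `src` by writing into the Vec's
/// uninitialised spare capacity and only then committing the new length.
pub fn append_via_spare(buf: &mut Vec<u8>, src: &[u8]) {
    buf.reserve(src.len());
    // `spare_capacity_mut` yields `&mut [MaybeUninit<u8>]`, so the
    // uninitialised bytes are never exposed as plain `u8`.
    let spare = buf.spare_capacity_mut();
    for (slot, &b) in spare.iter_mut().zip(src) {
        slot.write(b);
    }
    // SAFETY: the first `src.len()` spare slots were initialised just above,
    // and `reserve` guarantees capacity >= len + src.len().
    unsafe { buf.set_len(buf.len() + src.len()) };
}
```

Nothing here reads the spare region before it is written, which is the property that makes leaving it uninitialised acceptable.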
Address Copilot finding on PR #197: FlatBuf::extend_from_reader was
using Vec::resize(old + fill_length, 0) to grow length, then
immediately overwriting the freshly-zeroed bytes via read_exact.
Raw blocks decode up to 128 KiB through this path, so the eager
memset doubled the write traffic on the raw-block reset shape.

Switch to a read-into-spare-capacity pattern: reserve(fill_length),
build a `&mut [u8]` slice over the spare capacity via
from_raw_parts_mut, hand it to read_exact, and only call set_len
after the read returns Ok. Failed reads leave length unchanged,
matching the prior truncate-on-error observable behaviour.

SAFETY notes added inline cover the two unsafe blocks: the read_slot
construction (capacity guarantee + sole-writer contract until
set_len commits) and the set_len itself (initialisation contract).

515/515 lib tests pass, clippy clean on default + --tests +
--no-default-features + --no-default-features --features hash.

polaz commented May 19, 2026

@coderabbitai Re: decode_buffer.rs:83-93 outside-diff (regression test for from_backend clear) — already addressed in commit 8468e03: from_backend_clears_prepopulated_backend test added to decode_buffer::tests module. Test feeds a populated RingBuffer into from_backend, asserts len() == 0 after construction, pushes fresh bytes and verifies drain returns exactly those bytes.

coderabbitai Bot commented May 19, 2026

@polaz — confirmed, the test coverage is solid. from_backend_clears_prepopulated_backend hits exactly the right invariants: verifying len() == 0 post-construction proves the clear happened, and the push-then-drain round-trip confirms the backend is in a clean, usable state afterward. No further action needed here.

@polaz polaz requested a review from Copilot May 19, 2026 19:21

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/decoding/flat_buf.rs Outdated
`new_tail` only ever comes from a previous `tail()` call (== `buf.len()`),
and `buf.len()` is monotonically non-decreasing between the checkpoint
and the rollback (the caller only writes between snapshot and restore).
Asserting `<= self.buf.capacity()` was a loose upper bound that would
silently let a caller bug pass through `set_len` into uninitialised slack
bytes — UB. Tightening to `<= self.buf.len()` catches that in debug
builds without changing release semantics.
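A rough sketch of the tightened rollback check, assuming a flat `Vec<u8>`-backed buffer. The type `Flat` and the method names `push` and `rollback_to` are hypothetical; only `tail()` and the `<= self.buf.len()` bound come from the commit message above.

```rust
/// Hypothetical flat buffer with checkpoint/rollback, illustration only.
pub struct Flat {
    buf: Vec<u8>,
}

impl Flat {
    pub fn new() -> Self {
        Flat { buf: Vec::new() }
    }

    /// Checkpoint: the current logical end of the buffer.
    pub fn tail(&self) -> usize {
        self.buf.len()
    }

    pub fn push(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.buf
    }

    /// Roll back to an earlier checkpoint.
    ///
    /// # Safety
    /// `new_tail` must come from a previous `tail()` call on this buffer
    /// with only appends in between, so that `new_tail <= self.buf.len()`.
    pub unsafe fn rollback_to(&mut self, new_tail: usize) {
        // Tight bound: `<= capacity()` would also accept lengths that reach
        // into the uninitialised slack; `<= len()` rejects them in debug builds.
        debug_assert!(new_tail <= self.buf.len(), "rollback past current tail");
        // SAFETY: per the contract above, bytes in 0..new_tail are initialised.
        unsafe { self.buf.set_len(new_tail) };
    }
}
```

Where the unsafe contract is unwanted, `Vec::truncate` would give a safe variant, since it ignores lengths greater than the current one.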

polaz commented May 19, 2026

@coderabbitai Re: decode_buffer.rs:83-93 outside-diff — already landed in 8468e03 as from_backend_clears_prepopulated_backend (decode_buffer.rs:551). Same shape as the suggested diff: pre-populate the backend, call from_backend, assert len()==0, then push+drain to confirm it's usable. No new test needed.

coderabbitai Bot commented May 19, 2026

@polaz — Already noted and confirmed in my previous reply. The test at decode_buffer.rs:551 covers the right invariants (len()==0 post-construction + push/drain round-trip), and no new test is needed. Nothing further to action here.

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/huff0/huff0_decoder.rs Outdated
The two `#[cfg]`-gated tail expressions (the x86 match and the
non-x86 scalar fallback) are both valid as `ExpressionWithBlock`
statements without a separating `;` — `match {...}` and `{...}` are
both block-terminated and the parser accepts them back-to-back.
Wrapping the x86 match in an outer `{}` makes the two branches
symmetrical (both are `#[cfg(...)] { ... }`) and removes the
recurring reviewer false-positive about the construct looking like
two expression statements without a separator. Generated code is
unchanged.
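The symmetrical shape described above can be sketched like this. The function `kernel_name`, its `has_bmi2` parameter, and the returned strings are placeholders for illustration, not the crate's actual dispatch.

```rust
/// Hypothetical dispatcher; kernel names are placeholders, not the crate's.
pub fn kernel_name(has_bmi2: bool) -> &'static str {
    // x86: keep the runtime dispatch, wrapped in an outer block so both
    // cfg branches have the same `#[cfg(...)] { ... }` shape.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        match has_bmi2 {
            true => "bmi2",
            false => "scalar",
        }
    }
    // Everything else: go straight to the scalar body, no match at all.
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    {
        let _ = has_bmi2;
        "scalar"
    }
}
```

Exactly one of the two blocks survives cfg stripping on any given target, so whichever remains is the function's tail expression.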

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/decoding/flat_buf.rs Outdated
with_capacity and BufferBackend::reserve both budget
WILDCOPY_OVERLENGTH slack past the target capacity. The previous
self.buf.reserve(fill_length) call inside extend_from_reader skipped
that slack — read_exact today does not overshoot so this is not a
correctness bug, but it leaves a footgun if the raw-block path ever
picks up SIMD/wildcopy writes. Routing through self.reserve makes
the slack invariant uniform across all growth paths.
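One way to picture the uniform slack invariant is a single growth entry point that every append goes through. This is a sketch only: `SlackBuf`, `headroom`, and `push` are hypothetical names, and the value of `WILDCOPY_OVERLENGTH` is assumed here and may not match the crate's constant.

```rust
/// Assumed value for illustration; the crate's actual constant may differ.
const WILDCOPY_OVERLENGTH: usize = 32;

/// Hypothetical flat buffer that always keeps overshoot headroom.
pub struct SlackBuf {
    buf: Vec<u8>,
}

impl SlackBuf {
    pub fn new() -> Self {
        SlackBuf { buf: Vec::new() }
    }

    /// Single growth entry point: always budgets the wildcopy slack on top
    /// of the requested `n` bytes.
    pub fn reserve(&mut self, n: usize) {
        self.buf.reserve(n.saturating_add(WILDCOPY_OVERLENGTH));
    }

    /// Bytes that can be written past `len` without reallocating.
    pub fn headroom(&self) -> usize {
        self.buf.capacity() - self.buf.len()
    }

    /// Appends go through `self.reserve`, never `self.buf.reserve`, so the
    /// slack survives any reallocation triggered here.
    pub fn push(&mut self, data: &[u8]) {
        self.reserve(data.len());
        self.buf.extend_from_slice(data);
    }
}
```

Because `Vec::reserve(additional)` guarantees `capacity >= len + additional`, at least `WILDCOPY_OVERLENGTH` bytes of headroom remain after each `push` in this sketch.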

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/decoding/flat_buf.rs
Comment thread zstd/src/decoding/flat_buf.rs Outdated
…xtend_from_reader

Forming `&mut [u8]` via `slice::from_raw_parts_mut` over uninitialised
Vec spare capacity is UB regardless of whether a subsequent write
initialises it: `&mut T` must always reference initialised memory of
the target type. The earlier shape skipped the zero-fill before
read_exact, trading soundness for a saving of roughly one memset per
128 KiB raw block. That saving is negligible next to the UB risk from
any future Read impl that reads the buffer before writing to it.

Switches extend_from_reader to Vec::resize(.., 0) + indexed slice
form. Also drops the redundant explicit BufferBackend import from
the test mod (use super::* already brings it in scope).
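The resize-then-read shape can be sketched in safe Rust as below. This is an illustration under assumptions: it is written as a free function over `std::io::Read` and a bare `Vec<u8>`, whereas the crate's `FlatBuf::extend_from_reader` may use a different reader trait and signature.

```rust
use std::io::{self, Read};

/// Hypothetical free-function version of the pattern: zero-fill first, then
/// let `read_exact` overwrite the freshly initialised tail.
pub fn extend_from_reader<R: Read>(
    buf: &mut Vec<u8>,
    src: &mut R,
    fill_length: usize,
) -> io::Result<()> {
    let old = buf.len();
    // Every byte handed to the reader is initialised, so a plain indexed
    // slice is enough and no unsafe code is needed.
    buf.resize(old + fill_length, 0);
    if let Err(e) = src.read_exact(&mut buf[old..]) {
        // Failed reads leave the observable length unchanged.
        buf.truncate(old);
        return Err(e);
    }
    Ok(())
}
```

The price is one extra memset over the new region per call, which is the trade-off the commit message above accepts.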

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 136-145: The issue is that extend() and extend_and_fill() grow
self.buf directly and can break the WILDCOPY_OVERLENGTH slack invariant; update
both methods (extend and extend_and_fill in flat_buf.rs) to route all growth
through self.buf.reserve() (using the same slack size used by
with_capacity()/extend_from_reader()) before extending or resizing so that any
reallocation preserves the documented overlength headroom required for wildcopy
semantics.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 254087a6-6a0c-4638-b798-a1c1e8c1c1fb

📥 Commits

Reviewing files that changed from the base of the PR and between 98d44f7 and 0b42ce7.

📒 Files selected for processing (1)
  • zstd/src/decoding/flat_buf.rs

Comment thread zstd/src/decoding/flat_buf.rs

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 57e968f into main May 19, 2026
29 checks passed
@polaz polaz deleted the feat/#178-decode-speed-2 branch May 19, 2026 21:14
@sw-release-bot sw-release-bot Bot mentioned this pull request May 19, 2026