perf(decode): pack LL/ML metadata + hot-path micro-opts by polaz · Pull Request #197 · structured-world/structured-zstd

polaz · 2026-05-19T15:27:11Z

Summary

Follow-on to #194 (merged). Continues the decode-speed work for issue #178.

Micro-optimisations on the FSE / HUF hot path

Pack LL/ML base+extra_bits into one u32 lookup (c2eed0b6): replace the pair of LL_BASE: [u32; 36] + LL_EXTRA_BITS: [u8; 36] arrays with a single LL_META: [u32; 36] (low 24 bits = baseline, high 8 bits = extra_bits). One u32 load + shift + mask vs two distinct cache-line touches. Compile-time validation via const fn reconstructs the packed table from donor-spec arrays so the contents cannot drift silently.
debug_assert + get_unchecked on lookup_ll/ml_code hot path (058b8d51): FSE table-construction invariants (max_symbol = MAX_LITERAL_LENGTH_CODE / MAX_MATCH_LENGTH_CODE, RLE byte range-check) already guarantee code <= 35 / code <= 52; drop the release-mode bounds check.
Collapse aarch64 HUF dispatch (a695a60d): the three aarch64_neon / aarch64_sve decode_symbol_and_advance kernels were copy-paste clones of the scalar body. The match was paying a 3-arm dispatch for three identical bodies on every single-symbol decode. Cfg-split: x86 keeps the runtime match (real BMI2 vs scalar perf delta), aarch64 calls scalar directly with no match.
Inline extend_from_within_unchecked (4daf2926): match-copy hot path (DecodeBuffer::repeat) is the second-hottest decode entry; #[inline] lets the compiler fold the head < tail flat-layout fast path into the caller.
Fuse reserve + flat-extend on RingBuffer::extend (41b357e6): hoist a single fused fast path with head <= tail && len < cap - tail to the top of extend — both reserve() and the wrap-dispatch are skipped on the dominant literal-push shape.

Flat-buffer backend (compile-time generic, backlog item #132)

Phases 1–4 land the BufferBackend trait, FlatBuf (Vec<u8>-backed no-wrap impl), and a compile-time-monomorphised DecodeBuffer<B: BufferBackend = RingBuffer> cascaded through DecoderScratch<B> and every block / sequence decoder helper. FrameDecoderState.decoder_scratch becomes a DecoderScratchKind enum that picks Ring or Flat at reset() time based on FrameHeader.descriptor.single_segment_flag(); the match fires once per FrameDecoder entry point (decode_block_content, drain, can_drain, …), never inside the hot push/repeat loop.

Why compile-time generic instead of a runtime enum: the earlier
runtime enum BufferStorage { Ring, Flat } attempt paid match
overhead in every push/repeat and measured +43–58% regression on
small-frame decompress. The generic split monomorphises each backend
independently and erases wrap dispatch from the flat side entirely
at compile time.

Critical detail: DecoderScratchKind::reset reuses the existing scratch allocations (FSE / HUF tables, sequence vec) when the backend kind is unchanged across frames. An earlier iteration that replaced the whole scratch on every reset measured +255% regression on small frames; the reuse path keeps the small-frame cost flat.

FlatBuf::reserve semantics fix (12343cd3): Vec::reserve(additional) is "additional bytes beyond len", not "delta from capacity"; on a multi-frame stream where frame 2 has window_size > frame 1's prior FlatBuf capacity, the previous gap-from-capacity formula under-reserved by the existing capacity. Pinned the failing libFuzzer artifact (crash-e33ba082…) as multi_frame_flat_buf_path_does_not_panic in src/tests/fuzz_regressions.rs.

Tests

514/514 lib tests pass (506 prior + 7 new FlatBuf unit tests + 1 new fuzz regression multi_frame_flat_buf_path_does_not_panic)
Clippy clean on default + --no-default-features + --no-default-features --features hash
4/4 fuzz regression tests pass

Measured wins (M1, decompress L=-7 fast)

Compared to pre-PR baseline (origin/main after #194 merged):

Scenario	Before	After	Δ
`small-1k-random rust_stream`	172 ns	131 ns	−24%
`small-10k-random rust_stream`	1.46 µs	1.11 µs	−24%
`small-4k-log-lines c_stream`	1.80 µs	1.28 µs	−29%
`decodecorpus z000033 c_stream`	1.19 ms	928 µs	−22%
`decodecorpus z000033 rust_stream`	5.22 ms	4.07 ms	−22%
`low-entropy-1m c_stream`	316 µs	252 µs	−20%
`large-log-stream rust_stream`	18.0 ms	14.1 ms	−22%

FFI gap on decodecorpus z000033: c_stream 2.73× → 2.11×, rust_stream 4.62× → 3.60×.

Source of most of the win: 41b357e6 (RingBuffer fused reserve + flat-extend), which collapsed two separate dispatches into a single hoisted fast path for the dominant literal-push shape.

Part of #178.

Summary by CodeRabbit

New Features
- Added example programs for file compression and decode performance profiling.
Bug Fixes
- Added a regression test to prevent multi-frame decoding panics.
Refactor
- Reworked decoder internals to support pluggable output backends and a new flat-buffer path for single-segment frames, improving decoding efficiency and memory behavior.
- Simplified and optimized Huffman/sequence decoding paths and buffer handling for greater performance and robustness.
Chores
- Updated benchmark build profile settings.

Replace the previous pair of separate const arrays (LL_BASE [u32; 36] + LL_EXTRA_BITS [u8; 36] = 180 B over two cache-line regions, indexed twice per sequence) with a single packed table (LL_META [u32; 36] = 144 B, indexed once). Layout: low 24 bits = baseline (max 65536 fits trivially), high 8 bits = extra_bits (max 16). Same for ML. Hot path now performs one u32 load + a shift + mask to recover both fields, instead of two distinct loads from two cache-line regions. Saves one memory access per LL/ML pair, i.e. two accesses per sequence — compounds over thousands of sequences per block. Validity of the packing is enforced at compile time by reconstructing the table via a const fn from the original donor-spec arrays, so the literal table contents and the spec do not drift apart silently.

coderabbitai · 2026-05-19T15:27:25Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ebc057fa-a5e6-4cc9-8ab0-c51e74064770

📥 Commits

Reviewing files that changed from the base of the PR and between 98d44f7 and 0b42ce7.

📒 Files selected for processing (1)

zstd/src/decoding/flat_buf.rs

📝 Walkthrough

Walkthrough

Adds a BufferBackend trait and WILDCOPY_OVERLENGTH; implements a Vec-backed FlatBuf and RingBuffer BufferBackend impl; makes DecodeBuffer/DecoderScratch generic over backends and wires Frame/Block decoding to backend-dispatched helpers. Packs LL/ML metadata into u32 tables and updates sequence lookups. Simplifies Huff0 single-symbol dispatch.

Changes

Buffer backend and decode buffer genericization

Layer / File(s)	Summary
BufferBackend trait and constant `zstd/src/decoding/buffer_backend.rs`, `zstd/src/decoding/mod.rs`	Adds `WILDCOPY_OVERLENGTH` and the `pub(crate) trait BufferBackend` API used by decoders.
Generic DecodeBuffer and test updates `zstd/src/decoding/decode_buffer.rs`, `zstd/src/decoding/mod.rs`	Make `DecodeBuffer<B: BufferBackend = RingBuffer>` generic, add `from_backend`, generalize `Read` impl and `DrainGuard`, and update tests to `DecodeBuffer::<RingBuffer>::new(...)`.
FlatBuf backend implementation and tests `zstd/src/decoding/flat_buf.rs`	Add `FlatBuf` Vec-backed backend implementing `BufferBackend` with head/tail semantics, unchecked within-buffer copy, extend/read helpers, and unit tests.
RingBuffer BufferBackend impl and inlining/Safety docs `zstd/src/decoding/ringbuffer.rs`	Add `impl super::buffer_backend::BufferBackend for RingBuffer` with inline-forwarding wrappers; add `#[inline]` attributes and expand SAFETY docs for unchecked extend helpers.
DecoderScratch generic wiring `zstd/src/decoding/scratch.rs`	Make `DecoderScratch<B: BufferBackend = RingBuffer>` and update its `buffer` field and impl to use `DecodeBuffer<B>`.
Frame/Block dispatch wiring `zstd/src/decoding/frame_decoder.rs`, `zstd/src/decoding/block_decoder.rs`	Introduce `DecoderScratchKind` and update `FrameDecoderState`/block decode functions to dispatch to backend-tagged scratch methods and route `decode_block_content` through the new generic helpers.
Cargo, examples, and tests `Cargo.toml`, `zstd/examples/*`, `zstd/src/tests/fuzz_regressions.rs`	Add bench profile debug setting, new examples `encode_l4` and `profile_decode`, and a fuzz regression test covering multi-frame flat-buf path.

Packed Metadata for Sequence Decoder Lookup Tables

Layer / File(s)	Summary
Packed metadata tables and pack/unpack helpers `zstd/src/decoding/sequence_section_decoder.rs`	Introduce `LL_META` and `ML_META` packed u32 representations and add `const fn pack_code_meta` / `unpack_code_meta`.
Updated lookup functions and generic sequence decode `zstd/src/decoding/sequence_section_decoder.rs`	`decode_and_execute_sequences` and `execute_one_sequence` are generic over `B: BufferBackend`; `lookup_ll_code` and `lookup_ml_code` now `debug_assert!` bounds and return `(baseline, extra_bits)` by unpacking a single packed entry via `unsafe get_unchecked`.

Huffman Single-Symbol Dispatch

Layer / File(s)	Summary
Dispatch logic for single-symbol decode `zstd/src/huff0/huff0_decoder.rs`	`HuffmanDecoder::decode_symbol_and_advance` retains x86/x86_64 SIMD/BMI2 selection for the single-symbol path and routes non-x86 targets to the scalar implementation.
Remove dead aarch64 single-symbol kernels `zstd/src/huff0/huff0_decoder.rs`	Delete aarch64 NEON/SVE single-symbol decode implementations and add comments noting scalar fallback for those targets.

Sequence Diagram(s)

sequenceDiagram
  participant FrameDecoder
  participant DecoderScratchKind
  participant DecodeBuffer
  participant BufferBackend
  FrameDecoder->>DecoderScratchKind: decode_block_content(...)
  DecoderScratchKind->>DecodeBuffer: forward decode
  DecodeBuffer->>BufferBackend: extend/extend_from_within_unchecked/as_slices
  BufferBackend-->>DecodeBuffer: storage/slots
  DecodeBuffer-->>DecoderScratchKind: decode result / bytes_written
  DecoderScratchKind-->>FrameDecoder: block decode result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

perf(decode): collapse simd_copy CopyFn indirection + FSE LUT/unchecked + no_std CI gate #189: Matches lookup-table refactor and WILDCOPY_OVERLENGTH/wildcopy contract work present in this PR.
feat(encoding+decoding): FrameEmitInfo block-layout introspection + opt-in per-block XXH64 sidecar #173: Overlaps frame-level checksum and block-boundary plumbing refactors routed through DecoderScratchKind.

Possibly related PRs

structured-world/structured-zstd#194: Overlapping work on decode_and_execute_sequences and sequence lookup logic.
structured-world/structured-zstd#92: Related Huff0 SIMD/single-symbol dispatch changes.
structured-world/structured-zstd#114: Related buffer-management and DecodeBuffer plumbing refactors.

Poem

🐰 I packed the bits in tidy rows,

Baseline and extras snug in one u32 pose,
FlatBuf keeps the head while RingBuffer rolls,
Scalar hums where SIMD lost its roles,
A rabbit hops — the decoder flows.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'perf(decode): pack LL/ML metadata + hot-path micro-opts' directly and specifically summarizes the main change: packing literal/match metadata into compact arrays and optimizing the decoding hot path, which are the primary performance-focused changes in this PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/#178-decode-speed-2

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR continues decode hot-path optimization work by reducing per-sequence table lookups in the FSE-driven sequence decoder. It packs literal-length (LL) and match-length (ML) baseline + extra-bits metadata into single u32 entries, so each symbol lookup becomes one load plus bit ops instead of two separate table loads.

Changes:

Replace LL_BASE/LL_EXTRA_BITS with a packed LL_META table and add unpack helper.
Replace ML_BASE/ML_EXTRA_BITS with a packed ML_META table and add shared pack helper.
Update lookup_ll_code / lookup_ml_code to use the packed metadata tables.

codecov · 2026-05-19T15:30:23Z

Codecov Report

❌ Patch coverage is 93.01075% with 26 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/decoding/frame_decoder.rs	88.69%	13 Missing ⚠️
zstd/src/decoding/sequence_section_decoder.rs	54.54%	10 Missing ⚠️
zstd/src/decoding/flat_buf.rs	97.85%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

Address Copilot finding on PR #197: pack_code_meta's doc comment claimed compile-time validation but no asserts were actually present. Add const-evaluated asserts that - bases[i] & 0xFF00_0000 == 0 (baseline must fit in 24 bits) - extra_bits[i] <= 16 (zstd format §3.1.1.3.2.1.1 limit) Any future spec extension that violates either invariant now fails the build instead of silently clobbering the packed payload.

extend_from_within_unchecked and its branchless variant are the hottest path from DecodeBuffer::repeat — every non-repcode, non-overlapping match calls one of them. They were missing #[inline], so cross-crate consumers (the codepath through DecodeBuffer) could not fold the flat-layout (head < tail) fast path into the caller and paid a real function-call hop per match copy. For frames fitting in the window — the dominant case, and especially the Fast-encoded blocks that dominate L=-7..L=1 — this is the difference between one inlined SIMD copy and a non-inlined dispatch through free_slice_parts-shape branches. First step toward backlog item #132 (full flat-buffer DecodeBuffer mode). Full dual-mode refactor stays out of this PR; this inline gate alone unblocks the existing flat fast paths in extend() and reserve() so they actually compose with repeat().

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

zstd/src/decoding/sequence_section_decoder.rs:458

lookup_ml_code has the same pattern as lookup_ll_code: a runtime assert! in a hot loop even though the ML decoder table is bounded by MAX_MATCH_LENGTH_CODE at construction time. Consider using debug_assert! + unchecked access (or otherwise removing the runtime assert) so malformed inputs don’t panic and the per-sequence path avoids an extra conditional branch.

#[inline(always)]
fn lookup_ml_code(code: u8) -> (u32, u8) {
    let idx = code as usize;
    assert!(idx < ML_META.len(), "Illegal match length code was: {code}");
    unpack_code_meta(ML_META[idx])

Address Copilot finding on PR #197: lookup_ll_code and lookup_ml_code were paying a runtime assert!() (a release-mode bounds-checked indexed load) for an invariant that the FSE table construction already enforces upstream. The LL FSE table is built with max_symbol == MAX_LITERAL_LENGTH_CODE (35); build_decoding_table returns FSETableError::TooManySymbols if read_probabilities yields more entries than that, and the RLE byte path is range-checked in maybe_update_fse_tables. ML uses the same shape with max_symbol == MAX_MATCH_LENGTH_CODE (52). A code reaching either lookup is invariant in 0..=35 / 0..=52. Switch assert! -> debug_assert!() and use get_unchecked. Removes one predicted-but-still-branch per LL/ML symbol from the hot decode loop.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

…FlatBuf Lays compile-time-monomorphised scaffolding for backlog item #132 (flat-buffer mode for frames the ring would never wrap on). The earlier dynamic-dispatch attempt (BufferStorage enum) measured a +43-58% regression on small-frame decompress because every push / repeat paid a runtime match. The generic split here keeps both backends as separate compiled instantiations; wrap dispatch is erased from the flat path at compile time rather than branched at runtime. No behavioural change yet — every caller still pinned to DecodeBuffer with the RingBuffer trait default. - buffer_backend.rs: BufferBackend trait — storage-side interface DecodeBuffer needs (extend, repeat-shape, drain-shape, rollback). - flat_buf.rs: FlatBuf Vec-backed no-wrap impl + 7 unit tests (append, extend_and_fill, extend_from_within_unchecked, drop_first_n head advance + match-source persistence, set_tail rollback, clear). Capacity sized once at frame reset with WILDCOPY_OVERLENGTH trailing slack so SIMD overshoots stay inside the allocation. - ringbuffer.rs: impl BufferBackend for RingBuffer — thin forwarder. - decode_buffer.rs: parameterised over B: BufferBackend with RingBuffer default; from_backend(buffer, window_size) constructor wraps a pre-sized FlatBuf for Phase 2; DrainGuard becomes generic. - mod.rs: register buffer_backend + flat_buf modules. Flat backend allow(dead_code) until Phase 2 wires FrameDecoder.reset to switch on Single_Segment_flag (separate PR — caller cascade through DecoderScratch / block_decoder / sequence_section_decoder / sequence_execution all needs to become generic over B). 513/513 lib tests pass (506 prior + 7 new FlatBuf), clippy clean on default + --no-default-features + --no-default-features --features hash.

Cascade the compile-time generic from Phase 1 (844f761) up to DecoderScratch. The struct is now parameterised over a BufferBackend with the same RingBuffer default, so every existing caller stays on the historical type via inference and no behavioural change ships in this commit either. Wiring FrameDecoder.reset to actually instantiate DecoderScratch<FlatBuf> on Single_Segment_flag frames is the next phase — that requires the FrameDecoderState type to become a small enum that holds either the Ring or Flat variant and pattern-match in the top-level FrameDecoder methods, so it stays as a separate commit to keep this one as a pure infrastructure cascade. 513/513 lib tests pass, clippy clean on all three feature modes.

…ence decoders block_decoder::decode_block_content, block_decoder::decompress_block, sequence_section_decoder::decode_and_execute_sequences, sequence_section_decoder::execute_one_sequence, sequence_execution::execute_sequences_fields — all become generic over B: BufferBackend. The compiler now monomorphises the full hot-path stack per backend; with FrameDecoderState still pinned to the RingBuffer default no behavioural change ships, but the wiring for the flat path is now mechanical (single FrameDecoderState variant flip in the next phase). 513/513 lib tests pass, clippy clean.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

zstd/src/decoding/ringbuffer.rs (1)
16-16: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use a single WILDCOPY_OVERLENGTH source across backends.

This file keeps a separate WILDCOPY_OVERLENGTH from buffer_backend.rs; drift between them would invalidate shared wildcopy safety assumptions. Please import and reuse the shared constant here too.

Also applies to: 815-880
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/ringbuffer.rs` at line 16, Remove the local
WILDCOPY_OVERLENGTH constant in ringbuffer.rs and instead import and reuse the
shared constant from buffer_backend.rs (reference symbols: WILDCOPY_OVERLENGTH,
ringbuffer.rs, buffer_backend.rs); update any uses in ringbuffer.rs (including
the wildcopy-related logic around lines ~815-880) to reference the imported
WILDCOPY_OVERLENGTH so the backend and ringbuffer share the same safety
assumption.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 63-68: FlatBuf::reserve currently computes the reserve amount as
needed - self.buf.capacity(), which is wrong because Vec::reserve takes
"additional from len" not "from capacity"; update the calculation in
reserve(&mut self, n: usize) to compute additional =
needed.saturating_sub(self.buf.len()) and call self.buf.reserve(additional +
WILDCOPY_OVERLENGTH) (or ensure the reserved amount is at least n +
WILDCOPY_OVERLENGTH) so that when len < capacity we still guarantee at least n
writable bytes for the unsafe copies using self.buf and preserve the
WILDCOPY_OVERLENGTH slack.

---

Outside diff comments:
In `@zstd/src/decoding/ringbuffer.rs`:
- Line 16: Remove the local WILDCOPY_OVERLENGTH constant in ringbuffer.rs and
instead import and reuse the shared constant from buffer_backend.rs (reference
symbols: WILDCOPY_OVERLENGTH, ringbuffer.rs, buffer_backend.rs); update any uses
in ringbuffer.rs (including the wildcopy-related logic around lines ~815-880) to
reference the imported WILDCOPY_OVERLENGTH so the backend and ringbuffer share
the same safety assumption.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1026fcbd-162a-4fcc-b0eb-c2ec75ac4e8d

📥 Commits

Reviewing files that changed from the base of the PR and between 058b8d5 and e6c7143.

📒 Files selected for processing (6)

zstd/src/decoding/buffer_backend.rs
zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/flat_buf.rs
zstd/src/decoding/mod.rs
zstd/src/decoding/ringbuffer.rs
zstd/src/decoding/scratch.rs

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

…3 review Phase 4 of backlog item #132. FrameDecoderState.decoder_scratch is now a DecoderScratchKind enum that picks between DecoderScratch<RingBuffer> and DecoderScratch<FlatBuf> based on FrameHeader.descriptor.single_segment_flag() at reset time. Match dispatch fires once per FrameDecoder entry point (decode_block_content, drain, can_drain, ...) — never inside the hot push/repeat loop, which stays fully monomorphised through the DecodeBuffer<B> generic landed in Phases 1-3. Critical: DecoderScratchKind::reset reuses the existing scratch allocations (FSE / HUF tables, sequence vec) when the backend kind is unchanged across frames. The first iteration replaced the whole scratch on every reset and measured +255% regression on small frames; the reuse path keeps the small-frame cost flat. Benchmark on M1 vs pre-Phase-4 baseline: - small-1k-random: 172 -> 166 ns (-3.5%) - small-4k-log-lines c_stream: 1.80 -> 1.65 us (-8%) - large-log-stream rust_stream: 18.0 -> 17.0 ms (-5%) - decodecorpus z000033: within +/- 1% noise Also addressed CodeRabbit / Copilot round-3 review threads: - flat_buf::reserve math: Vec::reserve is "additional from len", not "delta from capacity" — fix to use saturating_sub on the gap so an alloc actually happens when len < capacity < len+n. - sequence_section_decoder::LL_META doc: now says "rely on debug_assert + unsafe get_unchecked + upstream gates" instead of the no-longer-existing release-mode runtime check. - decode_buffer::DecodeBuffer flat doc: "selected by FrameDecoder" -> "intended for selection" until Phase 4 lands. - buffer_backend::WILDCOPY_OVERLENGTH: single shared constant, both RingBuffer and FlatBuf import from buffer_backend so the slack contract cannot drift. - buffer_backend::cap doc: clarified it's the per-instance realloc-detection sentinel, not a portable size — RingBuffer's cap excludes the slack, FlatBuf's includes it; the checkpoint only compares equality on the same instance so the asymmetry is fine. 513/513 lib tests pass, clippy clean on default + --no-default-features + --no-default-features --features hash.

libFuzzer artifact crash-e33ba082... exercised a multi-frame stream where frame 2 reset to a flat backend with a larger window_size than frame 1's pre-existing FlatBuf capacity. The reserve(n) math from both the original Phase 1 attempt and CodeRabbit's "additional = n - available" suggestion under-reserves on that case: when len == 0 and the existing capacity is non-zero, `(n - available)` shrinks the additional argument by exactly `available`, so Vec::reserve only ensures `capacity >= (n - available) + slack`, which is short by `available`. Subsequent extend_from_within_unchecked then panicked on the `dst_off + len <= self.buf.capacity()` debug assert. Fix: call `self.buf.reserve(n + WILDCOPY_OVERLENGTH)` directly. Vec::reserve's contract is "ensures capacity >= len + additional", which is exactly the contract callers need; the gap-from-capacity forms above were re-deriving the same condition incorrectly. Pinned the failing input as a regression test (`multi_frame_flat_buf_path_does_not_panic`) in src/tests/fuzz_regressions.rs alongside the existing `malformed_block_does_not_panic_via_restore_checkpoint` regression so any future iteration of the flat backend has to re-pass it. 514/514 lib tests pass, clippy clean.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 63-74: The reserve method currently calls Vec::reserve without
zeroing the newly allocated slack; update FlatBuf::reserve to remember the
previous capacity (or previous capacity - len available), call
self.buf.reserve(...), then zero (set to 0) the newly allocated bytes up to the
new capacity including the WILDCOPY_OVERLENGTH slack so the buffer's trailing
bytes remain defined (matching with_capacity's behavior) — ensure this zeroing
covers exactly the additional bytes added by the reserve and does not change
len, and keep references to FlatBuf, reserve, WILDCOPY_OVERLENGTH and
extend_from_within_unchecked to guide placement of the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 382e1a7e-c8e8-4864-941f-d6be89fdee69

📥 Commits

Reviewing files that changed from the base of the PR and between 058b8d5 and 3536fb1.

📒 Files selected for processing (10)

zstd/src/decoding/block_decoder.rs
zstd/src/decoding/buffer_backend.rs
zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/flat_buf.rs
zstd/src/decoding/frame_decoder.rs
zstd/src/decoding/mod.rs
zstd/src/decoding/ringbuffer.rs
zstd/src/decoding/scratch.rs
zstd/src/decoding/sequence_execution.rs
zstd/src/decoding/sequence_section_decoder.rs

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

zstd/src/decoding/flat_buf.rs (1)

63-83: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

reserve() still breaks the initialized-slack invariant after a grow.

with_capacity() explicitly zero-fills the trailing WILDCOPY_OVERLENGTH region so overshoot reads stay defined, but reserve() only increases Vec capacity and never initializes the newly added bytes. After the first grow, the backend can reintroduce UB on the same wildcopy paths this file is documenting around.

Suggested fix

     #[inline]
     fn reserve(&mut self, n: usize) {
+        let old_cap = self.buf.capacity();
         // `Vec::reserve(additional)` guarantees
         // `capacity >= len + additional`; passing
         // `n + WILDCOPY_OVERLENGTH` is the exact contract callers
@@
-        self.buf.reserve(n.saturating_add(WILDCOPY_OVERLENGTH));
+        self.buf.reserve(n.saturating_add(WILDCOPY_OVERLENGTH));
+        let new_cap = self.buf.capacity();
+        if new_cap > old_cap {
+            unsafe {
+                ptr::write_bytes(self.buf.as_mut_ptr().add(old_cap), 0, new_cap - old_cap);
+            }
+        }
     }

Does Rust's standard-library `Vec::reserve` initialize the newly reserved memory, or does it only grow capacity while leaving the added region uninitialized? Please answer from the official `Vec` documentation.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/flat_buf.rs` around lines 63 - 83, The reserve() method on
FlatBuf grows self.buf's capacity but leaves newly reserved bytes uninitialized,
breaking the initialized-slack invariant that with_capacity() enforces for
WILDCOPY_OVERLENGTH; update reserve() (the method on the type with field
self.buf in flat_buf.rs) to initialize (zero-fill or otherwise set) the newly
allocated slack region up to WILDCOPY_OVERLENGTH after calling Vec::reserve so
overshoot reads remain defined—mirror the initialization approach used in
with_capacity(), and ensure any downstream callers like
extend_from_within_unchecked can rely on the initialized slack.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/tests/fuzz_regressions.rs`:
- Around line 182-185: The test currently uses `if let Ok(mut decoder) =
crate::decoding::StreamingDecoder::new(data)` which silently skips the body when
construction fails; change it to require construction to succeed (e.g. `let mut
decoder = crate::decoding::StreamingDecoder::new(data).unwrap();`) so the test
fails if `new` errors, then keep ignoring only the `read_to_end` result (e.g.
`let _ = decoder.read_to_end(&mut output);`) so the flat-buffer panic path is
exercised.

---

Duplicate comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 63-83: The reserve() method on FlatBuf grows self.buf's capacity
but leaves newly reserved bytes uninitialized, breaking the initialized-slack
invariant that with_capacity() enforces for WILDCOPY_OVERLENGTH; update
reserve() (the method on the type with field self.buf in flat_buf.rs) to
initialize (zero-fill or otherwise set) the newly allocated slack region up to
WILDCOPY_OVERLENGTH after calling Vec::reserve so overshoot reads remain
defined—mirror the initialization approach used in with_capacity(), and ensure
any downstream callers like extend_from_within_unchecked can rely on the
initialized slack.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: afc2450a-5a58-4fca-8b90-17206399fe66

📥 Commits

Reviewing files that changed from the base of the PR and between 3536fb1 and 12343cd.

📒 Files selected for processing (3)

Cargo.toml
zstd/src/decoding/flat_buf.rs
zstd/src/tests/fuzz_regressions.rs

Two unrelated changes folded into one commit because the perf opt and the doc/test fixes shipped together while addressing review threads: - ringbuffer: fuse reserve + flat-extend on the common literal push. Profile (decodecorpus-z000033 L=-7 c_stream, 5GB workload via samply at 4999Hz) put RingBuffer::extend at 15% self-time; a share of that was the redundant `self.reserve(len)` call dispatched before the existing flat fast path. Hoist a fused fast path to the top of `extend`: when head <= tail AND `len < cap - tail`, both reserve() AND the wrap-dispatch are skipped. The strict `<` keeps tail < cap so invariant 4 holds without the post-write normalisation branch; the original `<=` flat path is kept as a second branch (behind reserve) for the `tail+len==cap` boundary. - Cargo.toml: enable `debug = "line-tables-only"` on the bench profile so samply can resolve hot function names. Round-5 review-thread fixes: - flat_buf::with_capacity: drop the zero-init of the slack region entirely. FlatBuf only WRITES past `len` (extend, extend_and_fill, extend_from_within_unchecked) before any matching `set_len`, and only READS bytes inside `head..buf.len()` (as_slices, drain helpers). The trailing slack region is therefore intentionally uninitialised — no UB, no per-frame O(cap) zero pass. - decode_buffer / mod / scratch docs: drop the "Phase 2 wiring pending" / "intended for selection" wording — FrameDecoder already instantiates DecoderScratch<FlatBuf> via DecoderScratchKind on single-segment frames. - mod.rs: remove `#[allow(dead_code)]` on flat_buf (genuinely used now). - fuzz_regressions::multi_frame_flat_buf_path_does_not_panic: `if let Ok(decoder) = StreamingDecoder::new(data)` → `.expect("…")` so a future regression that broke ctor for this artifact cannot silently turn the test into a no-op. 514/514 lib tests pass, clippy clean on default + --no-default-features + --no-default-features --features hash.

Address Copilot finding on PR #197: FlatBuf::extend_from_reader was using Vec::resize(old + fill_length, 0) to grow length, then immediately overwriting the freshly-zeroed bytes via read_exact. Raw blocks decode up to 128 KiB through this path, so the eager memset doubled the write traffic on the raw-block reset shape. Switch to a read-into-spare-capacity pattern: reserve(fill_length), build a `&mut [u8]` slice over the spare capacity via from_raw_parts_mut, hand it to read_exact, and only call set_len after the read returns Ok. Failed reads leave length unchanged, matching the prior truncate-on-error observable behaviour. SAFETY notes added inline cover the two unsafe blocks: the read_slot construction (capacity guarantee + sole-writer contract until set_len commits) and the set_len itself (initialisation contract). 515/515 lib tests pass, clippy clean on default + --tests + --no-default-features + --no-default-features --features hash.

polaz · 2026-05-19T19:20:26Z

@coderabbitai Re: decode_buffer.rs:83-93 outside-diff (regression test for from_backend clear) — already addressed in commit 8468e03: from_backend_clears_prepopulated_backend test added to decode_buffer::tests module. Test feeds a populated RingBuffer into from_backend, asserts len() == 0 after construction, pushes fresh bytes and verifies drain returns exactly those bytes.

coderabbitai · 2026-05-19T19:20:42Z

Tip

For best results, initiate chat on the files or code changes.

@polaz — confirmed, the test coverage is solid. from_backend_clears_prepopulated_backend hits exactly the right invariants: verifying len() == 0 post-construction proves the clear happened, and the push-then-drain round-trip confirms the backend is in a clean, usable state afterward. No further action needed here.

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

`new_tail` only ever comes from a previous `tail()` call (== `buf.len()`), and `buf.len()` is monotonically non-decreasing between the checkpoint and the rollback (the caller only writes between snapshot and restore). Asserting `<= self.buf.capacity()` was a loose upper bound that would silently let a caller bug pass through `set_len` into uninitialised slack bytes — UB. Tightening to `<= self.buf.len()` catches that in debug builds without changing release semantics.

polaz · 2026-05-19T19:38:55Z

@coderabbitai Re: decode_buffer.rs:83-93 outside-diff — already landed in 8468e03 as from_backend_clears_prepopulated_backend (decode_buffer.rs:551). Same shape as the suggested diff: pre-populate the backend, call from_backend, assert len()==0, then push+drain to confirm it's usable. No new test needed.

coderabbitai · 2026-05-19T19:39:16Z

Tip

For best results, initiate chat on the files or code changes.

@polaz — Already noted and confirmed in my previous reply. The test at decode_buffer.rs:551 covers the right invariants (len()==0 post-construction + push/drain round-trip), and no new test is needed. Nothing further to action here.

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

The two `#[cfg]`-gated tail expressions (the x86 match and the non-x86 scalar fallback) are both valid as `ExpressionWithBlock` statements without a separating `;` — `match {...}` and `{...}` are both block-terminated and the parser accepts them back-to-back. Wrapping the x86 match in an outer `{}` makes the two branches symmetrical (both are `#[cfg(...)] { ... }`) and removes the recurring reviewer false-positive about the construct looking like two expression statements without a separator. Generated code is unchanged.

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

with_capacity and BufferBackend::reserve both budget WILDCOPY_OVERLENGTH slack past the target capacity. The previous self.buf.reserve(fill_length) call inside extend_from_reader skipped that slack — read_exact today does not overshoot so this is not a correctness bug, but it leaves a footgun if the raw-block path ever picks up SIMD/wildcopy writes. Routing through self.reserve makes the slack invariant uniform across all growth paths.

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

…xtend_from_reader Forming `&mut [u8]` via `slice::from_raw_parts_mut` over uninitialised Vec spare capacity is UB regardless of whether a subsequent write initialises it — `&mut T` must always reference initialised memory of the target type. The earlier shape skipped zero-fill before read_exact, which paid soundness for ~one memset per 128 KiB raw block — net cost is negligible vs the UB risk on any future Read impl that touches the buffer before writing. Switches extend_from_reader to Vec::resize(.., 0) + indexed slice form. Also drops the redundant explicit BufferBackend import from the test mod (use super::* already brings it in scope).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 136-145: The issue is that extend() and extend_and_fill() grow
self.buf directly and can break the WILDCOPY_OVERLENGTH slack invariant; update
both methods (extend and extend_and_fill in flat_buf.rs) to route all growth
through self.buf.reserve() (using the same slack size used by
with_capacity()/extend_from_reader()) before extending or resizing so that any
reallocation preserves the documented overlength headroom required for wildcopy
semantics.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 254087a6-6a0c-4638-b798-a1c1e8c1c1fb

📥 Commits

Reviewing files that changed from the base of the PR and between 98d44f7 and 0b42ce7.

📒 Files selected for processing (1)

zstd/src/decoding/flat_buf.rs

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 19, 2026 15:27

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

polaz added 2 commits May 19, 2026 18:31

Copilot AI review requested due to automatic review settings May 19, 2026 15:44

Copilot started reviewing on behalf of polaz May 19, 2026 15:45 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

polaz requested a review from Copilot May 19, 2026 16:19

Copilot started reviewing on behalf of polaz May 19, 2026 16:19 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

polaz added 2 commits May 19, 2026 19:32

Copilot AI review requested due to automatic review settings May 19, 2026 16:34

Copilot started reviewing on behalf of polaz May 19, 2026 16:35 View session

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/flat_buf.rs Outdated

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/decode_buffer.rs Outdated

Comment thread zstd/src/decoding/buffer_backend.rs Outdated

Comment thread zstd/src/decoding/buffer_backend.rs Outdated

polaz commented May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/flat_buf.rs Outdated

polaz added 2 commits May 19, 2026 20:17

Copilot AI review requested due to automatic review settings May 19, 2026 17:39

Copilot started reviewing on behalf of polaz May 19, 2026 17:40 View session

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/flat_buf.rs Outdated

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/decode_buffer.rs Outdated

Comment thread zstd/src/decoding/mod.rs Outdated

Comment thread zstd/src/decoding/scratch.rs Outdated

Comment thread zstd/src/tests/fuzz_regressions.rs Outdated

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/tests/fuzz_regressions.rs Outdated

polaz mentioned this pull request May 19, 2026

Roadmap: structured-zstd feature parity with C zstd #28

Open

polaz requested a review from Copilot May 19, 2026 19:21

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/flat_buf.rs Outdated

polaz requested a review from Copilot May 19, 2026 19:41

Copilot started reviewing on behalf of polaz May 19, 2026 19:42 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/huff0/huff0_decoder.rs Outdated

polaz requested a review from Copilot May 19, 2026 20:04

Copilot started reviewing on behalf of polaz May 19, 2026 20:05 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/flat_buf.rs Outdated

polaz requested a review from Copilot May 19, 2026 20:19

Copilot started reviewing on behalf of polaz May 19, 2026 20:20 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/flat_buf.rs

Comment thread zstd/src/decoding/flat_buf.rs Outdated

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/flat_buf.rs

polaz requested a review from Copilot May 19, 2026 21:05

Copilot started reviewing on behalf of polaz May 19, 2026 21:06 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

polaz merged commit 57e968f into main May 19, 2026
29 checks passed

polaz deleted the feat/#178-decode-speed-2 branch May 19, 2026 21:14

sw-release-bot Bot mentioned this pull request May 19, 2026

chore: release v0.0.22 #156

Merged

This was referenced May 20, 2026

perf(decode): pack (base, extra_bits) into single u16 FSE Entry field #210

Closed

perf(fse): elide bounds check on init_state + update_state decode reads #214

Merged

Conversation

polaz commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Micro-optimisations on the FSE / HUF hot path

Flat-buffer backend (compile-time generic, backlog item #132)

Tests

Measured wins (M1, decompress L=-7 fast)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

codecov Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

polaz commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

polaz commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026

Uh oh!

polaz commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading

codecov Bot commented May 19, 2026 •

edited

Loading