perf(decode): pack LL/ML metadata + hot-path micro-opts#197
Conversation
Replace the previous pair of separate const arrays (LL_BASE [u32; 36] + LL_EXTRA_BITS [u8; 36] = 180 B over two cache-line regions, indexed twice per sequence) with a single packed table (LL_META [u32; 36] = 144 B, indexed once). Layout: low 24 bits = baseline (max 65536 fits trivially), high 8 bits = extra_bits (max 16). Same for ML. Hot path now performs one u32 load + a shift + mask to recover both fields, instead of two distinct loads from two cache-line regions. Saves one memory access per LL/ML pair, i.e. two accesses per sequence — compounds over thousands of sequences per block. Validity of the packing is enforced at compile time by reconstructing the table via a const fn from the original donor-spec arrays, so the literal table contents and the spec do not drift apart silently.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a BufferBackend trait and WILDCOPY_OVERLENGTH; implements a Vec-backed FlatBuf and RingBuffer BufferBackend impl; makes DecodeBuffer/DecoderScratch generic over backends and wires Frame/Block decoding to backend-dispatched helpers. Packs LL/ML metadata into u32 tables and updates sequence lookups. Simplifies Huff0 single-symbol dispatch. ChangesBuffer backend and decode buffer genericization
Packed Metadata for Sequence Decoder Lookup Tables
Huffman Single-Symbol Dispatch
Sequence Diagram(s)sequenceDiagram
participant FrameDecoder
participant DecoderScratchKind
participant DecodeBuffer
participant BufferBackend
FrameDecoder->>DecoderScratchKind: decode_block_content(...)
DecoderScratchKind->>DecodeBuffer: forward decode
DecodeBuffer->>BufferBackend: extend/extend_from_within_unchecked/as_slices
BufferBackend-->>DecodeBuffer: storage/slots
DecodeBuffer-->>DecoderScratchKind: decode result / bytes_written
DecoderScratchKind-->>FrameDecoder: block decode result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR continues decode hot-path optimization work by reducing per-sequence table lookups in the FSE-driven sequence decoder. It packs literal-length (LL) and match-length (ML) baseline + extra-bits metadata into single u32 entries, so each symbol lookup becomes one load plus bit ops instead of two separate table loads.
Changes:
- Replace
LL_BASE/LL_EXTRA_BITSwith a packedLL_METAtable and add unpack helper. - Replace
ML_BASE/ML_EXTRA_BITSwith a packedML_METAtable and add shared pack helper. - Update
lookup_ll_code/lookup_ml_codeto use the packed metadata tables.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Address Copilot finding on PR #197: pack_code_meta's doc comment claimed compile-time validation but no asserts were actually present. Add const-evaluated asserts that - bases[i] & 0xFF00_0000 == 0 (baseline must fit in 24 bits) - extra_bits[i] <= 16 (zstd format §3.1.1.3.2.1.1 limit) Any future spec extension that violates either invariant now fails the build instead of silently clobbering the packed payload.
extend_from_within_unchecked and its branchless variant are the hottest path from DecodeBuffer::repeat — every non-repcode, non-overlapping match calls one of them. They were missing #[inline], so cross-crate consumers (the codepath through DecodeBuffer) could not fold the flat-layout (head < tail) fast path into the caller and paid a real function-call hop per match copy. For frames fitting in the window — the dominant case, and especially the Fast-encoded blocks that dominate L=-7..L=1 — this is the difference between one inlined SIMD copy and a non-inlined dispatch through free_slice_parts-shape branches. First step toward backlog item #132 (full flat-buffer DecodeBuffer mode). Full dual-mode refactor stays out of this PR; this inline gate alone unblocks the existing flat fast paths in extend() and reserve() so they actually compose with repeat().
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
zstd/src/decoding/sequence_section_decoder.rs:458
lookup_ml_codehas the same pattern aslookup_ll_code: a runtimeassert!in a hot loop even though the ML decoder table is bounded byMAX_MATCH_LENGTH_CODEat construction time. Consider usingdebug_assert!+ unchecked access (or otherwise removing the runtime assert) so malformed inputs don’t panic and the per-sequence path avoids an extra conditional branch.
#[inline(always)]
fn lookup_ml_code(code: u8) -> (u32, u8) {
let idx = code as usize;
assert!(idx < ML_META.len(), "Illegal match length code was: {code}");
unpack_code_meta(ML_META[idx])
Address Copilot finding on PR #197: lookup_ll_code and lookup_ml_code were paying a runtime assert!() (a release-mode bounds-checked indexed load) for an invariant that the FSE table construction already enforces upstream. The LL FSE table is built with max_symbol == MAX_LITERAL_LENGTH_CODE (35); build_decoding_table returns FSETableError::TooManySymbols if read_probabilities yields more entries than that, and the RLE byte path is range-checked in maybe_update_fse_tables. ML uses the same shape with max_symbol == MAX_MATCH_LENGTH_CODE (52). A code reaching either lookup is invariant in 0..=35 / 0..=52. Switch assert! -> debug_assert!() and use get_unchecked. Removes one predicted-but-still-branch per LL/ML symbol from the hot decode loop.
…FlatBuf Lays compile-time-monomorphised scaffolding for backlog item #132 (flat-buffer mode for frames the ring would never wrap on). The earlier dynamic-dispatch attempt (BufferStorage enum) measured a +43-58% regression on small-frame decompress because every push / repeat paid a runtime match. The generic split here keeps both backends as separate compiled instantiations; wrap dispatch is erased from the flat path at compile time rather than branched at runtime. No behavioural change yet — every caller still pinned to DecodeBuffer with the RingBuffer trait default. - buffer_backend.rs: BufferBackend trait — storage-side interface DecodeBuffer needs (extend, repeat-shape, drain-shape, rollback). - flat_buf.rs: FlatBuf Vec-backed no-wrap impl + 7 unit tests (append, extend_and_fill, extend_from_within_unchecked, drop_first_n head advance + match-source persistence, set_tail rollback, clear). Capacity sized once at frame reset with WILDCOPY_OVERLENGTH trailing slack so SIMD overshoots stay inside the allocation. - ringbuffer.rs: impl BufferBackend for RingBuffer — thin forwarder. - decode_buffer.rs: parameterised over B: BufferBackend with RingBuffer default; from_backend(buffer, window_size) constructor wraps a pre-sized FlatBuf for Phase 2; DrainGuard becomes generic. - mod.rs: register buffer_backend + flat_buf modules. Flat backend allow(dead_code) until Phase 2 wires FrameDecoder.reset to switch on Single_Segment_flag (separate PR — caller cascade through DecoderScratch / block_decoder / sequence_section_decoder / sequence_execution all needs to become generic over B). 513/513 lib tests pass (506 prior + 7 new FlatBuf), clippy clean on default + --no-default-features + --no-default-features --features hash.
Cascade the compile-time generic from Phase 1 (844f761) up to DecoderScratch. The struct is now parameterised over a BufferBackend with the same RingBuffer default, so every existing caller stays on the historical type via inference and no behavioural change ships in this commit either. Wiring FrameDecoder.reset to actually instantiate DecoderScratch<FlatBuf> on Single_Segment_flag frames is the next phase — that requires the FrameDecoderState type to become a small enum that holds either the Ring or Flat variant and pattern-match in the top-level FrameDecoder methods, so it stays as a separate commit to keep this one as a pure infrastructure cascade. 513/513 lib tests pass, clippy clean on all three feature modes.
…ence decoders block_decoder::decode_block_content, block_decoder::decompress_block, sequence_section_decoder::decode_and_execute_sequences, sequence_section_decoder::execute_one_sequence, sequence_execution::execute_sequences_fields — all become generic over B: BufferBackend. The compiler now monomorphises the full hot-path stack per backend; with FrameDecoderState still pinned to the RingBuffer default no behavioural change ships, but the wiring for the flat path is now mechanical (single FrameDecoderState variant flip in the next phase). 513/513 lib tests pass, clippy clean.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/decoding/ringbuffer.rs (1)
16-16: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winUse a single
WILDCOPY_OVERLENGTHsource across backends.This file keeps a separate
WILDCOPY_OVERLENGTHfrombuffer_backend.rs; drift between them would invalidate shared wildcopy safety assumptions. Please import and reuse the shared constant here too.Also applies to: 815-880
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/decoding/ringbuffer.rs` at line 16, Remove the local WILDCOPY_OVERLENGTH constant in ringbuffer.rs and instead import and reuse the shared constant from buffer_backend.rs (reference symbols: WILDCOPY_OVERLENGTH, ringbuffer.rs, buffer_backend.rs); update any uses in ringbuffer.rs (including the wildcopy-related logic around lines ~815-880) to reference the imported WILDCOPY_OVERLENGTH so the backend and ringbuffer share the same safety assumption.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 63-68: FlatBuf::reserve currently computes the reserve amount as
needed - self.buf.capacity(), which is wrong because Vec::reserve takes
"additional from len" not "from capacity"; update the calculation in
reserve(&mut self, n: usize) to compute additional =
needed.saturating_sub(self.buf.len()) and call self.buf.reserve(additional +
WILDCOPY_OVERLENGTH) (or ensure the reserved amount is at least n +
WILDCOPY_OVERLENGTH) so that when len < capacity we still guarantee at least n
writable bytes for the unsafe copies using self.buf and preserve the
WILDCOPY_OVERLENGTH slack.
---
Outside diff comments:
In `@zstd/src/decoding/ringbuffer.rs`:
- Line 16: Remove the local WILDCOPY_OVERLENGTH constant in ringbuffer.rs and
instead import and reuse the shared constant from buffer_backend.rs (reference
symbols: WILDCOPY_OVERLENGTH, ringbuffer.rs, buffer_backend.rs); update any uses
in ringbuffer.rs (including the wildcopy-related logic around lines ~815-880) to
reference the imported WILDCOPY_OVERLENGTH so the backend and ringbuffer share
the same safety assumption.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1026fcbd-162a-4fcc-b0eb-c2ec75ac4e8d
📒 Files selected for processing (6)
zstd/src/decoding/buffer_backend.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/flat_buf.rszstd/src/decoding/mod.rszstd/src/decoding/ringbuffer.rszstd/src/decoding/scratch.rs
…3 review Phase 4 of backlog item #132. FrameDecoderState.decoder_scratch is now a DecoderScratchKind enum that picks between DecoderScratch<RingBuffer> and DecoderScratch<FlatBuf> based on FrameHeader.descriptor.single_segment_flag() at reset time. Match dispatch fires once per FrameDecoder entry point (decode_block_content, drain, can_drain, ...) — never inside the hot push/repeat loop, which stays fully monomorphised through the DecodeBuffer<B> generic landed in Phases 1-3. Critical: DecoderScratchKind::reset reuses the existing scratch allocations (FSE / HUF tables, sequence vec) when the backend kind is unchanged across frames. The first iteration replaced the whole scratch on every reset and measured +255% regression on small frames; the reuse path keeps the small-frame cost flat. Benchmark on M1 vs pre-Phase-4 baseline: - small-1k-random: 172 -> 166 ns (-3.5%) - small-4k-log-lines c_stream: 1.80 -> 1.65 us (-8%) - large-log-stream rust_stream: 18.0 -> 17.0 ms (-5%) - decodecorpus z000033: within +/- 1% noise Also addressed CodeRabbit / Copilot round-3 review threads: - flat_buf::reserve math: Vec::reserve is "additional from len", not "delta from capacity" — fix to use saturating_sub on the gap so an alloc actually happens when len < capacity < len+n. - sequence_section_decoder::LL_META doc: now says "rely on debug_assert + unsafe get_unchecked + upstream gates" instead of the no-longer-existing release-mode runtime check. - decode_buffer::DecodeBuffer flat doc: "selected by FrameDecoder" -> "intended for selection" until Phase 4 lands. - buffer_backend::WILDCOPY_OVERLENGTH: single shared constant, both RingBuffer and FlatBuf import from buffer_backend so the slack contract cannot drift. - buffer_backend::cap doc: clarified it's the per-instance realloc-detection sentinel, not a portable size — RingBuffer's cap excludes the slack, FlatBuf's includes it; the checkpoint only compares equality on the same instance so the asymmetry is fine. 513/513 lib tests pass, clippy clean on default + --no-default-features + --no-default-features --features hash.
libFuzzer artifact crash-e33ba082... exercised a multi-frame stream where frame 2 reset to a flat backend with a larger window_size than frame 1's pre-existing FlatBuf capacity. The reserve(n) math from both the original Phase 1 attempt and CodeRabbit's "additional = n - available" suggestion under-reserves on that case: when len == 0 and the existing capacity is non-zero, `(n - available)` shrinks the additional argument by exactly `available`, so Vec::reserve only ensures `capacity >= (n - available) + slack`, which is short by `available`. Subsequent extend_from_within_unchecked then panicked on the `dst_off + len <= self.buf.capacity()` debug assert. Fix: call `self.buf.reserve(n + WILDCOPY_OVERLENGTH)` directly. Vec::reserve's contract is "ensures capacity >= len + additional", which is exactly the contract callers need; the gap-from-capacity forms above were re-deriving the same condition incorrectly. Pinned the failing input as a regression test (`multi_frame_flat_buf_path_does_not_panic`) in src/tests/fuzz_regressions.rs alongside the existing `malformed_block_does_not_panic_via_restore_checkpoint` regression so any future iteration of the flat backend has to re-pass it. 514/514 lib tests pass, clippy clean.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 63-74: The reserve method currently calls Vec::reserve without
zeroing the newly allocated slack; update FlatBuf::reserve to remember the
previous capacity (or previous capacity - len available), call
self.buf.reserve(...), then zero (set to 0) the newly allocated bytes up to the
new capacity including the WILDCOPY_OVERLENGTH slack so the buffer's trailing
bytes remain defined (matching with_capacity's behavior) — ensure this zeroing
covers exactly the additional bytes added by the reserve and does not change
len, and keep references to FlatBuf, reserve, WILDCOPY_OVERLENGTH and
extend_from_within_unchecked to guide placement of the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 382e1a7e-c8e8-4864-941f-d6be89fdee69
📒 Files selected for processing (10)
zstd/src/decoding/block_decoder.rszstd/src/decoding/buffer_backend.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/flat_buf.rszstd/src/decoding/frame_decoder.rszstd/src/decoding/mod.rszstd/src/decoding/ringbuffer.rszstd/src/decoding/scratch.rszstd/src/decoding/sequence_execution.rszstd/src/decoding/sequence_section_decoder.rs
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
zstd/src/decoding/flat_buf.rs (1)
63-83:⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
reserve()still breaks the initialized-slack invariant after a grow.
with_capacity()explicitly zero-fills the trailingWILDCOPY_OVERLENGTHregion so overshoot reads stay defined, butreserve()only increasesVeccapacity and never initializes the newly added bytes. After the first grow, the backend can reintroduce UB on the same wildcopy paths this file is documenting around.Suggested fix
#[inline] fn reserve(&mut self, n: usize) { + let old_cap = self.buf.capacity(); // `Vec::reserve(additional)` guarantees // `capacity >= len + additional`; passing // `n + WILDCOPY_OVERLENGTH` is the exact contract callers @@ - self.buf.reserve(n.saturating_add(WILDCOPY_OVERLENGTH)); + self.buf.reserve(n.saturating_add(WILDCOPY_OVERLENGTH)); + let new_cap = self.buf.capacity(); + if new_cap > old_cap { + unsafe { + ptr::write_bytes(self.buf.as_mut_ptr().add(old_cap), 0, new_cap - old_cap); + } + } }Does Rust's standard-library `Vec::reserve` initialize the newly reserved memory, or does it only grow capacity while leaving the added region uninitialized? Please answer from the official `Vec` documentation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/decoding/flat_buf.rs` around lines 63 - 83, The reserve() method on FlatBuf grows self.buf's capacity but leaves newly reserved bytes uninitialized, breaking the initialized-slack invariant that with_capacity() enforces for WILDCOPY_OVERLENGTH; update reserve() (the method on the type with field self.buf in flat_buf.rs) to initialize (zero-fill or otherwise set) the newly allocated slack region up to WILDCOPY_OVERLENGTH after calling Vec::reserve so overshoot reads remain defined—mirror the initialization approach used in with_capacity(), and ensure any downstream callers like extend_from_within_unchecked can rely on the initialized slack.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/tests/fuzz_regressions.rs`:
- Around line 182-185: The test currently uses `if let Ok(mut decoder) =
crate::decoding::StreamingDecoder::new(data)` which silently skips the body when
construction fails; change it to require construction to succeed (e.g. `let mut
decoder = crate::decoding::StreamingDecoder::new(data).unwrap();`) so the test
fails if `new` errors, then keep ignoring only the `read_to_end` result (e.g.
`let _ = decoder.read_to_end(&mut output);`) so the flat-buffer panic path is
exercised.
---
Duplicate comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 63-83: The reserve() method on FlatBuf grows self.buf's capacity
but leaves newly reserved bytes uninitialized, breaking the initialized-slack
invariant that with_capacity() enforces for WILDCOPY_OVERLENGTH; update
reserve() (the method on the type with field self.buf in flat_buf.rs) to
initialize (zero-fill or otherwise set) the newly allocated slack region up to
WILDCOPY_OVERLENGTH after calling Vec::reserve so overshoot reads remain
defined—mirror the initialization approach used in with_capacity(), and ensure
any downstream callers like extend_from_within_unchecked can rely on the
initialized slack.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: afc2450a-5a58-4fca-8b90-17206399fe66
📒 Files selected for processing (3)
Cargo.tomlzstd/src/decoding/flat_buf.rszstd/src/tests/fuzz_regressions.rs
Two unrelated changes folded into one commit because the perf opt
and the doc/test fixes shipped together while addressing review
threads:
- ringbuffer: fuse reserve + flat-extend on the common literal push.
Profile (decodecorpus-z000033 L=-7 c_stream, 5GB workload via
samply at 4999Hz) put RingBuffer::extend at 15% self-time; a
share of that was the redundant `self.reserve(len)` call
dispatched before the existing flat fast path. Hoist a fused
fast path to the top of `extend`: when head <= tail AND
`len < cap - tail`, both reserve() AND the wrap-dispatch are
skipped. The strict `<` keeps tail < cap so invariant 4 holds
without the post-write normalisation branch; the original
`<=` flat path is kept as a second branch (behind reserve)
for the `tail+len==cap` boundary.
- Cargo.toml: enable `debug = "line-tables-only"` on the bench
profile so samply can resolve hot function names.
Round-5 review-thread fixes:
- flat_buf::with_capacity: drop the zero-init of the slack region
entirely. FlatBuf only WRITES past `len` (extend, extend_and_fill,
extend_from_within_unchecked) before any matching `set_len`, and
only READS bytes inside `head..buf.len()` (as_slices, drain
helpers). The trailing slack region is therefore intentionally
uninitialised — no UB, no per-frame O(cap) zero pass.
- decode_buffer / mod / scratch docs: drop the "Phase 2 wiring
pending" / "intended for selection" wording — FrameDecoder
already instantiates DecoderScratch<FlatBuf> via
DecoderScratchKind on single-segment frames.
- mod.rs: remove `#[allow(dead_code)]` on flat_buf (genuinely
used now).
- fuzz_regressions::multi_frame_flat_buf_path_does_not_panic:
`if let Ok(decoder) = StreamingDecoder::new(data)` →
`.expect("…")` so a future regression that broke ctor for this
artifact cannot silently turn the test into a no-op.
514/514 lib tests pass, clippy clean on default +
--no-default-features + --no-default-features --features hash.
Address Copilot finding on PR #197: FlatBuf::extend_from_reader was using Vec::resize(old + fill_length, 0) to grow length, then immediately overwriting the freshly-zeroed bytes via read_exact. Raw blocks decode up to 128 KiB through this path, so the eager memset doubled the write traffic on the raw-block reset shape. Switch to a read-into-spare-capacity pattern: reserve(fill_length), build a `&mut [u8]` slice over the spare capacity via from_raw_parts_mut, hand it to read_exact, and only call set_len after the read returns Ok. Failed reads leave length unchanged, matching the prior truncate-on-error observable behaviour. SAFETY notes added inline cover the two unsafe blocks: the read_slot construction (capacity guarantee + sole-writer contract until set_len commits) and the set_len itself (initialisation contract). 515/515 lib tests pass, clippy clean on default + --tests + --no-default-features + --no-default-features --features hash.
|
@coderabbitai Re: |
|
Tip For best results, initiate chat on the files or code changes.
|
`new_tail` only ever comes from a previous `tail()` call (== `buf.len()`), and `buf.len()` is monotonically non-decreasing between the checkpoint and the rollback (the caller only writes between snapshot and restore). Asserting `<= self.buf.capacity()` was a loose upper bound that would silently let a caller bug pass through `set_len` into uninitialised slack bytes — UB. Tightening to `<= self.buf.len()` catches that in debug builds without changing release semantics.
|
@coderabbitai Re: |
|
Tip For best results, initiate chat on the files or code changes.
|
The two `#[cfg]`-gated tail expressions (the x86 match and the
non-x86 scalar fallback) are both valid as `ExpressionWithBlock`
statements without a separating `;` — `match {...}` and `{...}` are
both block-terminated and the parser accepts them back-to-back.
Wrapping the x86 match in an outer `{}` makes the two branches
symmetrical (both are `#[cfg(...)] { ... }`) and removes the
recurring reviewer false-positive about the construct looking like
two expression statements without a separator. Generated code is
unchanged.
with_capacity and BufferBackend::reserve both budget WILDCOPY_OVERLENGTH slack past the target capacity. The previous self.buf.reserve(fill_length) call inside extend_from_reader skipped that slack — read_exact today does not overshoot so this is not a correctness bug, but it leaves a footgun if the raw-block path ever picks up SIMD/wildcopy writes. Routing through self.reserve makes the slack invariant uniform across all growth paths.
…xtend_from_reader Forming `&mut [u8]` via `slice::from_raw_parts_mut` over uninitialised Vec spare capacity is UB regardless of whether a subsequent write initialises it — `&mut T` must always reference initialised memory of the target type. The earlier shape skipped zero-fill before read_exact, which paid soundness for ~one memset per 128 KiB raw block — net cost is negligible vs the UB risk on any future Read impl that touches the buffer before writing. Switches extend_from_reader to Vec::resize(.., 0) + indexed slice form. Also drops the redundant explicit BufferBackend import from the test mod (use super::* already brings it in scope).
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/flat_buf.rs`:
- Around line 136-145: The issue is that extend() and extend_and_fill() grow
self.buf directly and can break the WILDCOPY_OVERLENGTH slack invariant; update
both methods (extend and extend_and_fill in flat_buf.rs) to route all growth
through self.buf.reserve() (using the same slack size used by
with_capacity()/extend_from_reader()) before extending or resizing so that any
reallocation preserves the documented overlength headroom required for wildcopy
semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 254087a6-6a0c-4638-b798-a1c1e8c1c1fb
📒 Files selected for processing (1)
zstd/src/decoding/flat_buf.rs
Summary
Follow-on to #194 (merged). Continues the decode-speed work for issue #178.
Micro-optimisations on the FSE / HUF hot path
c2eed0b6): replace the pair ofLL_BASE: [u32; 36]+LL_EXTRA_BITS: [u8; 36]arrays with a singleLL_META: [u32; 36](low 24 bits = baseline, high 8 bits = extra_bits). One u32 load + shift + mask vs two distinct cache-line touches. Compile-time validation via const fn reconstructs the packed table from donor-spec arrays so the contents cannot drift silently.debug_assert+get_uncheckedonlookup_ll/ml_codehot path (058b8d51): FSE table-construction invariants (max_symbol = MAX_LITERAL_LENGTH_CODE/MAX_MATCH_LENGTH_CODE, RLE byte range-check) already guaranteecode <= 35/code <= 52; drop the release-mode bounds check.a695a60d): the threeaarch64_neon/aarch64_svedecode_symbol_and_advancekernels were copy-paste clones of the scalar body. The match was paying a 3-arm dispatch for three identical bodies on every single-symbol decode. Cfg-split: x86 keeps the runtime match (real BMI2 vs scalar perf delta), aarch64 calls scalar directly with no match.extend_from_within_unchecked(4daf2926): match-copy hot path (DecodeBuffer::repeat) is the second-hottest decode entry;#[inline]lets the compiler fold thehead < tailflat-layout fast path into the caller.reserve+ flat-extend onRingBuffer::extend(41b357e6): hoist a single fused fast path withhead <= tail && len < cap - tailto the top ofextend— bothreserve()and the wrap-dispatch are skipped on the dominant literal-push shape.Flat-buffer backend (compile-time generic, backlog item #132)
Phases 1–4 land the
BufferBackendtrait,FlatBuf(Vec<u8>-backed no-wrap impl), and a compile-time-monomorphisedDecodeBuffer<B: BufferBackend = RingBuffer>cascaded throughDecoderScratch<B>and every block / sequence decoder helper.FrameDecoderState.decoder_scratchbecomes aDecoderScratchKindenum that picksRingorFlatatreset()time based onFrameHeader.descriptor.single_segment_flag(); the match fires once per FrameDecoder entry point (decode_block_content, drain, can_drain, …), never inside the hot push/repeat loop.Why compile-time generic instead of a runtime enum: the earlier
runtime
enum BufferStorage { Ring, Flat }attempt paid matchoverhead in every push/repeat and measured +43–58% regression on
small-frame decompress. The generic split monomorphises each backend
independently and erases wrap dispatch from the flat side entirely
at compile time.
Critical detail:
DecoderScratchKind::resetreuses the existing scratch allocations (FSE / HUF tables, sequence vec) when the backend kind is unchanged across frames. An earlier iteration that replaced the whole scratch on every reset measured +255% regression on small frames; the reuse path keeps the small-frame cost flat.FlatBuf::reservesemantics fix (12343cd3):Vec::reserve(additional)is "additional bytes beyond len", not "delta from capacity"; on a multi-frame stream where frame 2 haswindow_size > frame 1's prior FlatBuf capacity, the previous gap-from-capacity formula under-reserved by the existing capacity. Pinned the failing libFuzzer artifact (crash-e33ba082…) asmulti_frame_flat_buf_path_does_not_panicinsrc/tests/fuzz_regressions.rs.Tests
FlatBufunit tests + 1 new fuzz regressionmulti_frame_flat_buf_path_does_not_panic)--no-default-features+--no-default-features --features hashMeasured wins (M1, decompress L=-7 fast)
Compared to pre-PR baseline (origin/main after #194 merged):
small-1k-random rust_streamsmall-10k-random rust_streamsmall-4k-log-lines c_streamdecodecorpus z000033 c_streamdecodecorpus z000033 rust_streamlow-entropy-1m c_streamlarge-log-stream rust_streamFFI gap on
decodecorpus z000033: c_stream 2.73× → 2.11×, rust_stream 4.62× → 3.60×.Source of most of the win:
41b357e6(RingBuffer fused reserve + flat-extend), which collapsed two separate dispatches into a single hoisted fast path for the dominant literal-push shape.Part of #178.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Chores