perf(decode): fused sequence executor + SIMD/FSE hot-path cleanup + DoS-safe rollback by polaz · Pull Request #194 · structured-world/structured-zstd

polaz · 2026-05-19T14:12:41Z

Summary

Continuation of decode-speed work that was on PR #190 (closed because that PR's history accumulated a panic-guard variant of the malformed-input fix that a libFuzzer artifact (crash-bfb3bc55…) then turned into a DoS surface). This PR ships the same decode-speed improvements plus the graceful-rollback fix that closes that surface.

Closes #178

Decode speed

Baseline (z000033, Level 3 dfast, M1): 1,814,570 ns/iter → best 1,353,635 ns / median ~1,413,000 ns (≈ −22%). Throughput 553 → 739 MB/s. FFI gap 3.89× → 2.88×.

Notable changes

Fused sequence executor (decoding/sequence_section_decoder.rs): collapses decode_sequences + execute_sequences into a single loop, eliminating the Vec<Sequence> roundtrip on the hot path.
Transactional rollback for the fused executor via DecodeBuffer::checkpoint / try_restore_checkpoint. The rollback primitive snapshots the RingBuffer cap alongside tail so an intervening reallocation (only reachable on malformed input that decodes past MAX_BLOCK_SIZE) is detected and the rollback is gracefully skipped, not turned into a panic. The caller still surfaces a normal decode Err.
SIMD copy dispatcher collapsed; single 16-byte fast path; exact-tail path widened to 32 B (4× u64); AVX2 / AVX-512 kernels target-feature-gated for no_std.
BitReaderReversed::refill split into hot inline + #[cold] #[inline(never)] slow path; mask_lower_bits computed via shift instead of BIT_MASK[n] LUT.
RingBuffer head + start == cap now routes through the contiguous-second-slice path (was misrouted into split-source, producing a zero-length copy_bytes_overshooting against the slack region).
do_offset_history fast path for offset_value >= 4, matching donor ZSTD_updateRep.
no-std CI gate + clippy on --no-default-features and --no-default-features --features hash.

Tests

506/506 lib tests pass (505 + new malformed_block_does_not_panic_via_restore_checkpoint fuzz regression).
Fuzz artifact crash-bfb3bc55… (the input that took down PR perf(decode): collapse simd_copy dispatcher + FSE bounds-check elision + no_std CI gate #190's CI) is now an inline regression test in zstd/src/tests/fuzz_regressions.rs.
All clippy modes (--all-targets, --no-default-features, --no-default-features --features hash) clean.

Summary by CodeRabbit

New Features
- Added CI checks for no-standard-library builds to catch regressions on minimal targets.
Bug Fixes
- Safer rollback/checkpoint restore to avoid panics on malformed input and prevent buffer clobbering.
- Improved edge-case decoding robustness and validated offset/history handling.
Tests
- Added regression tests for malformed frames and capacity-edge copy behavior.
Chores
- Performance and SIMD-backed copy-path optimizations for decompression.

coderabbitai · 2026-05-19T14:12:56Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 367385f8-2384-4e71-bd18-25277e718169

📥 Commits

Reviewing files that changed from the base of the PR and between 83c52dd and 5478b51.

📒 Files selected for processing (3)

zstd/src/decoding/ringbuffer.rs
zstd/src/decoding/sequence_section_decoder.rs
zstd/src/fse/fse_decoder.rs

📝 Walkthrough

Walkthrough

This PR fuses sequence decoding and execution into an inline decode+execute pipeline, adds DecodeBuffer checkpoint/rollback, allocates 16-byte ringbuffer slack for safe overshooting SIMD copies, rewrites simd_copy dispatch, adjusts bitreader and FSE hot-paths, defers benchmark compression, and adds a no-std clippy CI job plus a malformed-block regression test.

Changes

Fused sequence decode+execute refactor

Layer / File(s)	Summary
Decode buffer checkpoint and rollback `zstd/src/decoding/decode_buffer.rs`	`DecodeBufferCheckpoint` captures ringbuffer tail, capacity, and output counter. `checkpoint()` and `try_restore_checkpoint()` enable conditional rollback on capacity match; `push` annotated `#[inline]`. Tests verify rollback and reallocation refusal.
RingBuffer wildcopy slack allocation and management `zstd/src/decoding/ringbuffer.rs`	Adds `WILDCOPY_OVERLENGTH = 16`, switches to `alloc_zeroed` for allocations with trailing slack, exposes `cap()`, `tail()`, and unsafe `set_tail()`, and updates deallocation layout to include slack.
RingBuffer extend with conditional wildcopy slack inflation `zstd/src/decoding/ringbuffer.rs`	Routes `extend` and `extend_from_within_unchecked` through `simd_copy::copy_bytes_overshooting`, adding a flat fast-path for non-wrapped copies and conditional slack inflation to avoid clobbering readable bytes; tightens branch conditions and adds a wrapped-layout regression test.
SIMD copy multi-path dispatcher `zstd/src/decoding/simd_copy.rs`	Rewrites `copy_bytes_overshooting` with dedicated fast paths: single-op ≤16-byte copy, exact-length ≤32 tail copy, macro-dispatched chunked SIMD for larger sizes, and scalar fallback choosing exact `copy_from_nonoverlapping` when overshoot doesn't fit. Adds `single_op_copy_16` and unit tests; removes `copy_strategy`.
Field-based sequence execution extractor `zstd/src/decoding/sequence_execution.rs`	Adds `execute_sequences_fields` operating on caller-provided `DecodeBuffer`, `literals_buffer`, `offset_hist`, and `sequences`; exposes `do_offset_history` as `pub(crate)` and removes prefetch helper.
Fused sequence decode+execute pipeline `zstd/src/decoding/sequence_section_decoder.rs`	Adds `decode_and_execute_sequences` that decodes LL/ML/OF inline from the reversed bitstream and executes each sequence immediately into `DecodeBuffer` and `offset_hist`, using checkpoint/rollback on post-loop validation and appending tail literals; retains RLE fallback and replaces match-arm lookups with indexed tables.
Block decoder fused pipeline wiring `zstd/src/decoding/block_decoder.rs`	Updates block decoder imports and replaces the decode-then-execute pipeline with a single `decode_and_execute_sequences` call passing field borrows from workspace.
Bit reader refill hot/cold path split `zstd/src/bit_io/bit_reader_reverse.rs`	`mask_lower_bits` non-BMI2 path now computes mask with `u64::MAX.checked_shr(...)`, and `BitReaderReversed::refill` is split into an `#[inline(always)]` hot path and a `#[cold]` `refill_slow` for edge cases; debug assertions relocated to the hot path.
FSE decoder unchecked lookups and endian-aware init `zstd/src/fse/fse_decoder.rs`	`update_state_fast` uses `get_unchecked` with safety doc; `build_decoding_table` now reserves and has `copy_symbols_into_decode` perform `set_len` + raw `ptr::write_unaligned` initialization on little-endian targets to fully initialize `Entry` storage.

Benchmark and regression testing updates

Layer / File(s)	Summary
Lazy compression materialization in benchmarks `zstd/benches/compare_ffi.rs`	`bench_decompress` defers compressed input creation: it computes `expected_len` per scenario and uses a `OnceCell<Vec<u8>>` + `materialize` closure to lazily compress and validate the compressed bytes once per invocation, shared by Rust and C-FFI paths.
Regression test and CI coverage `zstd/src/tests/fuzz_regressions.rs`, `.github/workflows/ci.yml`	Adds `malformed_block_does_not_panic_via_restore_checkpoint` test asserting malformed input returns `Err` without panic via `restore_checkpoint`. Adds `no-std` CI job that runs `cargo clippy --no-default-features` and `--features hash` under `zstd/` with warnings as errors.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

#178: perf(decode + encode-greedy)… — This PR implements Phase 1/2 primitives (fused sequence execution, simd_copy, ringbuffer slack, bitreader/FSE tweaks) relevant to closing the decode performance gap described in the issue.

Possibly related PRs

structured-world/structured-zstd#85: Related simd_copy refactor and ringbuffer adjustments around overshooting copy contracts.
structured-world/structured-zstd#43: Related benchmark changes to compare_ffi.rs and bench input wiring.
structured-world/structured-zstd#58: Overlaps on mask_lower_bits/BIT_MASK handling and may conflict on masking strategy.

Poem

A rabbit hops where bits unwind,
Fused streams dance and slack’s aligned,
Checkpoint saved, no panic sprung,
SIMD feet beat, a tiny tongue —
Decode and execute, swift and kind. 🐰✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'perf(decode): fused sequence executor + SIMD/FSE hot-path cleanup + DoS-safe rollback' clearly summarizes the main changes: fused sequence executor, SIMD/FSE optimizations, and rollback for DoS safety.
Linked Issues check	✅ Passed	The PR successfully implements Phase 2 objectives from issue `#178`: fused sequence executor (eliminate Vec roundtrip), SIMD/wildcopy optimization, bitstream/FSE hot-path tweaks, RingBuffer fixes, and no_std CI gates. All core decode performance improvements are present.
Out of Scope Changes check	✅ Passed	All changes align with Phase 2 decode optimization scope: sequence executor fusion, SIMD dispatch simplification, FSE hot-path refinement, RingBuffer slack allocation, bitstream reader refactoring, and DoS-safe rollback mechanism. No unrelated refactoring or feature additions detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/#178-decode-speed

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Continuation of decode-speed work from the closed PR #190, fusing the sequence decoder/executor onto one loop and tightening hot SIMD/FSE paths. The fused executor now records a transactional rollback point through DecodeBuffer::checkpoint/try_restore_checkpoint, so a malformed block that forces a RingBuffer reallocation between checkpoint and validation surfaces as a normal decode Err instead of panicking (closes the libFuzzer DoS surface from #178).

Changes:

New fused decode_and_execute_sequences plus rollback primitives on DecodeBuffer/RingBuffer, with a regression test for the libFuzzer crash.
simd_copy collapses the function-pointer dispatcher, adds a single-op 16-byte fast path and a 32-byte exact-tail path; RingBuffer routes pushes through it with explicit WILDCOPY_OVERLENGTH slack accounting.
Hot-path tightening in BitReaderReversed::refill (split hot/cold) and FSEDecoder::decode_symbol / FSETable build (raw set_len + unaligned writes); offset-history fast path; no-std CI gate.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
zstd/src/decoding/sequence_section_decoder.rs	Adds fused `decode_and_execute_sequences` with transactional rollback, base-table lookups, drops some validation checks.
zstd/src/decoding/sequence_execution.rs	Renames/recasts executor as `execute_sequences_fields`; adds `do_offset_history` fast path; removes prefetch helper.
zstd/src/decoding/decode_buffer.rs	New `checkpoint`/`try_restore_checkpoint` rollback primitives and tests.
zstd/src/decoding/ringbuffer.rs	`WILDCOPY_OVERLENGTH` slack in allocations, flat-extend fast path, slack accounting per call site, new tests.
zstd/src/decoding/simd_copy.rs	Collapses dispatcher; new single-op-16 and 32-byte exact-tail paths; `active_chunk_size_for_tests` rewritten.
zstd/src/decoding/block_decoder.rs	Switches to fused `decode_and_execute_sequences`, drops the explicit `execute_sequences` call.
zstd/src/bit_io/bit_reader_reverse.rs	Splits `refill` into hot/cold halves; computes `mask_lower_bits` via shift instead of LUT.
zstd/src/fse/fse_decoder.rs	`get_unchecked` in `decode_symbol`; `reserve`+`set_len` instead of `resize` to skip zero-fill; unaligned u32 write for trailing odd entry.
zstd/src/tests/fuzz_regressions.rs	Inline regression test for the libFuzzer DoS artifact.
zstd/benches/compare_ffi.rs	Lazy `OnceCell` materialization of compressed inputs for profiling.
.github/workflows/ci.yml	New `no-std` clippy job covering `--no-default-features` and `--no-default-features --features hash`.

Comments suppressed due to low confidence (1)

zstd/src/decoding/sequence_section_decoder.rs:1

The runtime of_code > MAX_OFFSET_CODE and offset == 0 checks have been dropped from both decode_one_sequence_inline and decode_sequences_with_rle, replaced by debug_assert! only. The comment asserts these invariants are guaranteed by build_decoding_table, but of_code is consumed two lines later as 1u32 << of_code: if any decoder path (FSE table construction, predefined-distribution import, or a future code change) ever yields of_code >= 32, this is a left-shift overflow — UB on untrusted input in release builds, and the previous explicit error return was the only thing keeping a corrupt FSE table from reaching it. Please either retain the runtime check on of_code (and on offset != 0) on this hot path or add a tight, audited invariant assertion at the table-construction site that is explicitly cross-referenced here so the chain of trust is greppable. Removing a defense-in-depth check on parsed bytes from an untrusted frame deserves an explicit decision rather than a debug-only assert.

use super::super::blocks::sequence_section::ModeType;

Previously bench_decompress ran compress_slice_to_vec + ffi_encode_to_vec upfront for every scenario × level × source triple, even when criterion's filter selected a single benchmark. cargo bench --profile-time <filter> then spent its entire window in encode setup instead of the decode hot path it was supposed to profile. Wrap the compressed-byte production in a OnceCell inside bench_decompress_source. Materialization now happens only when the bench_function closure runs (i.e., when criterion's filter matches), keeping samply / cargo flamegraph traces concentrated on the decode side. CI timings are unaffected — the timing loop already excluded setup; this change just narrows the work performed during ad-hoc filtered runs. Refs #189

The crate declares #![no_std] with std as an opt-out feature, but the existing test/clippy matrix only exercises --features hash,std, dict_builder. A regression that pulls std::* into a hot decoder path slips through unnoticed. Add a new no-std job covering the two realistic deployment shapes: - alloc-only (no features) - alloc + xxhash content checksum (--features hash) Both run cargo check and cargo clippy with -D warnings so a stray `use std::...` lights up immediately. Refs #189

Convert the lookup_ll_code / lookup_ml_code helpers from staircase match arms to const arrays mirroring donor zstd's LL_base / LL_bits / ML_base / ML_bits tables. The bounds-check assert preserves the original panic-on-illegal-code semantics; LLVM already lowered the match to a jump table so perf is unchanged on this corpus, but the LUT form is the donor-parity baseline future work can build on. Refs #189

FSEDecoder::update_state_fast indexes self.table.decode[next_state] on every sequence × 3 state advances. The bounds check fires every iteration because LLVM cannot prove the invariant that spans table-build and decode call sites: calc_baseline_and_numbits pairs new_state with num_bits such that new_state + (2^num_bits - 1) < table_size = self.table.decode.len(); get_bits_unchecked guarantees add < 2^num_bits; therefore next_state < self.table.decode.len() always holds for any valid FSE table — but the proof is cross-function and the bounds check stays. Replace the indexed read with get_unchecked and a SAFETY comment citing the invariant. No new unsafe entry point — the function is already unsafe-ish in the sense that the caller must call ensure_bits before invoking it (otherwise get_bits_unchecked reads stale bits). Refs #189

…st path Replace the CopyStrategy struct + CopyFn function-pointer indirection with an inlined arch-aware decision tree. Two structural wins: 1. The per-call indirect dispatch through `(strategy.copy)(...)` is gone — kernels are direct-called from the dispatcher, so on aarch64 (where NEON is the ABI baseline) the kernel inlines into copy_bytes_overshooting entirely. 2. New single_op_copy_16 fast path covers copy_at_least in 1..=16 with one SIMD store (or two overlapping u64 stores on archs without 128-bit vectors). Match-copy execution with offset 8..15 funneled into the previous scalar-strategy loop on every iteration — the profile showed simd_copy::copy_scalar at 273 ms (5.5%) vs copy_neon at 80 ms (1.6%) on `decompress/level_-1_fast/decodecorpus-z000033/ c_stream/matrix/pure_rust`; the new fast path collapses those 8..15 byte calls to one vector op. no_std semantics preserved: x86 runtime detection stays std-only, no_std x86 uses compile-time target_feature cfg, aarch64+neon is unconditional in both modes. CI no-std + no-std-hash jobs cover the build matrix. Local aarch64 (M1, decodecorpus-z000033 c_stream level_-1_fast): baseline 1,814,570 ns/iter this PR 1,737,827 ns/iter (median of 3, best 1,683,119) -> -4.2% median (-7.2% best); 553 → 575 MB/s All 86 simd_copy / ringbuffer / decode_buffer / cross_validation / roundtrip tests green. cargo clippy --no-default-features {-, +hash} clean. Refs #189

simd_copy::copy_bytes_overshooting falls back to copy_from_nonoverlapping (libc memmove) whenever `min(src.1, dst.1) < copy_at_least` rounded up to the active SIMD chunk. The fallback dominated the decode profile — _platform_memmove sat at 21.5% of CPU on decompress/level_-1_fast/ decodecorpus-z000033/c_stream/matrix/pure_rust because match copies near ring-buffer wraparound boundaries had tight capacities and tripped the fallback every iteration. The new single_op_copy_16 fast path also requires min_buffer_size >= 16, so any 8..15-byte chunk emitted by DecodeBuffer::repeat_in_chunks for offsets 8..15 skipped it for the same reason. Mirror donor zstd's WILDCOPY_OVERLENGTH = 16 contract by: 1. Allocating `cap + WILDCOPY_OVERLENGTH` bytes via alloc_zeroed. The slack region is physically present past `cap` but never indexed via `head`/`tail` (those still wrap on `cap`). alloc_zeroed instead of alloc so wildcopy reads observe defined values rather than UB on uninitialized memory. 2. Updating Drop and `reserve_amortized`'s `current_layout` to match. 3. Inflating the `src.1`/`dst.1` capacities passed to `copy_bytes_overshooting` at every call site where the inflated end stays inside the allocation: - src.1: every site (source pointer + WILDCOPY_OVERLENGTH lands in the slack region at worst). - dst.1: only the case-A first-copy site where dst ends at `cap` (slack absorbs overshoot). All other dst spans end at `head` and cannot be inflated — overshoot would clobber readable data. Decode time on the profiled scenario: baseline 1,814,570 ns/iter prev PR 1,737,827 ns/iter (single_op + dispatcher collapse) this commit 1,668,752 ns/iter (median of 4, range 1.667-1.678) -> -8.0% from baseline, -4.0% incremental -> FFI gap 3.89x -> 3.57x; 553 MB/s -> 600 MB/s All 556 nextest tests pass. no_std + clippy clean. Refs #189

- Pin the `clippy` component on the toolchain step (matches the existing `lint` job's explicit components: rustfmt, clippy spec — deterministic install, no auto-fetch on every run). - Drop the standalone `cargo check` steps: `cargo clippy` already runs the full compile + analysis pipeline, so a separate check was duplicated work for no extra signal. Refs #189

The Entry { new_state: 0, symbol: 0, num_bits: 0 } pre-fill from self.decode.resize() was redundant on little-endian targets: copy_symbols_into_decode already writes a full 4-byte Entry (with new_state = 0 and num_bits = 0 implicit in the packed u32/u64 layout) to every slot via raw ptr::write_unaligned calls. - Replace the resize with reserve + set_len on LE (BE keeps resize so the iter_mut() field-assignment path stays safe — no production BE target ships against this crate). - Extend copy_symbols_into_decode's trailing odd-entry path to write a full Entry via a u32 store; the previous field assignment relied on the resize-side zero-fill to initialize new_state and num_bits. Measured impact on the L-1 fast decodecorpus-z000033 c_stream path is ~0% — FSETable::build_decoding_table self-time is 171 ms / 3.4% of decode CPU and the resize zero-fill was only ~half of that. The win is real but small relative to the ring-buffer / simd_copy wins upstream of it. Keeping the change because the pattern (skip zero-fill when subsequent writes cover every byte) is a clean donor parity point and stops future regressions where a slower iter_mut loop would re-introduce the redundant pass. Refs #189

…tend through it Two coordinated changes that together cut _platform_memmove from 25.9% self-time (1264 ms) to 7.1% (343 ms) on decompress/level_-1_fast/ decodecorpus-z000033/c_stream/matrix/pure_rust. 1. simd_copy: add an exact-length tail path for copy_at_least <= 16 that fires when the single_op_copy_16 fast path can't (no WILDCOPY slack on the destination). Two branches: - 1..=8 bytes: byte-by-byte loop. Unrolls to ≤8 immediate-offset load/store pairs. - 9..=16 bytes: two overlapping unaligned u64 ops, writing exactly copy_at_least bytes with no overshoot past dst.0 + copy_at_least. Replaces the libc memmove call that the dispatcher used to fall back to for short copies near RingBuffer head boundaries (where head bounds dst.1 and there is no slack to inflate). 2. RingBuffer::extend: route both free-slice writes through simd_copy::copy_bytes_overshooting instead of raw copy_from_nonoverlapping. Profile showed 943 ms of memmove was coming from `execute_sequences → DecodeBuffer::push → extend` (literal-byte copies during sequence execution). The previous direct copy_from_nonoverlapping always lowered to a libc call; routing through simd_copy lets short literal pushes (the common case at negative levels) use the inline byte / overlapping-u64 paths above. dst.1 for the first free slice is inflated by WILDCOPY_OVERLENGTH because f1_ptr ends at `cap` (slack region is writable); the second free slice ends at `head` so its dst.1 stays exact. Decode time on the profiled scenario: baseline 1,814,570 ns/iter prior PR state 1,668,752 ns/iter this commit 1,653,976 ns/iter (median of 5) -> -8.9% from baseline, -0.9% incremental The bench-time delta is smaller than the profile-attribution shift suggests because the new inline path does roughly the same byte-level work memmove was doing — the win is in killing the libc call overhead on short copies. Larger copies (>16 bytes that previously hit memmove) still hit it; chasing those further has diminishing returns. All 86 simd_copy / ringbuffer / decode_buffer / cross_validation / roundtrip tests pass. no_std + clippy clean. Refs #189

…ng check decode_sequences_without_rle previously paid two per-iteration overheads on the hot path: 1. `if target.len() < num_sequences { state advance }` — used to skip the trailing state update on the final iteration. Materializing `target.len()` every iteration and branching cost ~2-3 ns per sequence on top of the FSE work. 2. `if br.bits_remaining() < 0 { return Err(NotEnoughBytes) }` — an isize multiply + 3 subtractions every iteration. Donor zstd only validates bitstream exhaustion once after the sequence loop. Refactor splits the loop into a hot (num_sequences - 1) body that always advances state, and a tail iteration that skips the advance. The bits_remaining check moves out of the hot path into a single post-loop validation that catches the same underflow case via the existing >= 0 comparison. Decode time on the profiled scenario: baseline 1,814,570 ns/iter prior PR state 1,653,976 ns/iter this commit 1,632,669 ns/iter (median of 5) -> -10.0% from baseline, -1.3% incremental -> FFI gap 3.89x → 3.50x; 553 MB/s → 612 MB/s Same change deferred for decode_sequences_with_rle — that path only fires on blocks that mix RLE-mode codes with FSE-encoded streams, which decodecorpus-z000033 (and most real corpora at fast levels) do not produce. Splitting it would add complexity for zero measurable gain on the targeted scenario. Refs #189

CodeRabbit flagged that extend_from_within_unchecked_branchless passes exact-fit (ptr, len) capacities through to copy_with_nobranch_check -> simd_copy::copy_bytes_overshooting, so it cannot fire the SIMD fast paths that require min(src.1, dst.1) >= 16 — short copies on that path always take the inline byte / overlapping-u64 fallback instead of single_op_copy_16. Inflating the capacities the way `extend_from_within_unchecked` does would require threading per-pointer head/tail context into `copy_with_nobranch_check`, which currently sees only raw pointers and chunk lengths. Defer that refactor: the branchless path is x86-gated via `decode_buffer::use_branchless_wildcopy`, and my profiling rig is aarch64 — there is no local x86 measurement to justify the extra plumbing right now. Document the trade-off in a doc-comment on the function so a future maintainer doesn't accidentally assume WILDCOPY semantics hold on this path. On aarch64, the unconditional `extend_from_within_unchecked` is used, so the slack contract is exercised end-to-end there. Refs #189

The previous tail path covered copies of 1..=16 bytes inline; copies of 17..=32 bytes fell through to the chunked SIMD dispatch, which then bailed (`min_buffer_size < copy_at_least.next_multiple_of(chunk)`) and ended up in libc memmove for buffers without WILDCOPY_OVERLENGTH slack on the destination side. Extend the inline path to 17..=32 by emitting four unaligned u64 ops: two adjacent stores for the first 16 bytes, then two overlapping stores for the trailing 1..=16 bytes. Eight loads + eight stores total, all branch-free, exactly `copy_at_least` bytes written without overshoot. This catches the literal-push size range (10..=24 bytes is typical on the profiled corpus) that the previous ≤16-byte cap missed. Decode time on `decompress/level_-1_fast/decodecorpus-z000033/ c_stream/matrix/pure_rust`: baseline 1,814,570 ns/iter prior PR state 1,632,669 ns/iter this commit 1,607,680 ns/iter (median of 5) -> -11.4% from baseline, -1.5% incremental -> FFI gap 3.89x → 3.40x; 553 MB/s → 629 MB/s All 86 simd_copy / ringbuffer / decode_buffer / cross_validation / roundtrip tests pass. Refs #189

…_len window RingBuffer::extend: The first-free-slice destination capacity passed to `simd_copy::copy_bytes_overshooting` was unconditionally inflated by WILDCOPY_OVERLENGTH. That is only safe when `tail >= head` — in the wrapped layout (`tail < head`) `free_slice_parts` returns the inner `(buf+tail, head-tail)` gap, and inflating into the WILDCOPY region would let wildcopy overshoot past `head` into still-readable buffer output. Today no overshoot path actually fires because `src.1` is exact (`in_f1 == copy_at_least`) and caps `min_buffer_size`, so the inflation never changed dispatch — but the latent UB hazard if anyone ever adds a `dst.1`-only overshoot fast path is real. Gate the inflation on `tail >= head` and rewrite the comment to make both reasoning paths explicit (slack-reachability AND src.1 cap). Add `extend_wrapped_layout_preserves_bytes_past_head`: a regression test that builds a wrapped layout via a new `build_tail_before_head` helper, fills the inner gap, and asserts the 16-byte window past `head` is untouched. FSETable::build_decoding_table: `self.decode.set_len(table_size)` ran immediately after `reserve`, leaving Entry slots uninitialized through the entire `table_symbols` resize + symbol-spreading loop. A panic in that window (allocator failure on `table_symbols.resize`) would unwind with reachable uninitialized Entry slots — currently UB-free because `Entry: Copy` has trivial drop, but a fragile contract for future maintainers. Move `set_len` to immediately precede `copy_symbols_into_decode`, after all panic-able allocations have finished. The uninitialized- slot window is now zero panic-able code wide; an unwind from the spreading loop leaves `self.decode.len() == 0` instead. All 557 nextest tests pass. Decode perf unchanged on the profiled scenario (median 1,594,500 ns/iter — the inflation never affected dispatch outcome under the current simd_copy contract). Refs #189

… test Clippy 1.95 flags `for i in 0..16 { sentinel[i] = ... }` as `needless_range_loop`. Rewrite the prefill-sampling loop in extend_wrapped_layout_preserves_bytes_past_head to use `sentinel.iter_mut().enumerate()`; the verification loop keeps the plain `for i in 0..16usize` form because its body indexes the raw ring-buffer pointer (not the loop's own slice), which clippy permits. Refs #189

The previous refill() carried a blanket #[cold] annotation. Donor analysis (libzstd BIT_reloadDStream_internal at bitstream.h:370-444) plus the sequence-decode profile flagged this as a regression: refill fires roughly every 2 sequences during sequence decode — it is the dominant mid-stream path, not a cold case. The #[cold] hint pessimised the branch predictor and forced LLVM to spill the hot 3-instruction body out of every caller's straight-line code. Reshape the function so: - Outer refill() is #[inline(always)]. Only the mid-stream hot path (`self.index >= bytes_consumed`: subtract index, mask bit count, load 8 bytes) lives here, folding into every get_bits / ensure_bits call site. - The three end-of-stream branches (`self.index > 0`, exhausted but bit_consumed < 64, fully exhausted) move into refill_slow(), marked #[cold] #[inline(never)]. Same semantics, just no longer parked on the hot path. Local aarch64 measurements are dominated by thermal / background-process noise right now (range 1.68 - 13.1 ms across consecutive runs of the same bench, vs the prior commit's stable 1.65 ms median); the lowest sample matches prior best so there is no regression. Keeping the change per the theoretically-correct-opts-stay-in-tree rule — final profile pass will revisit if the noise floor settles. All 120 bit_io / fse / cross_validation / roundtrip tests pass. Refs #189

Both per-iter validations in decode_sequences (with_rle main, without_rle main, without_rle tail — all three loop bodies) were mathematically unreachable under any successfully-constructed FSE table: - `of_code > MAX_OFFSET_CODE` (31): `AlignedFSETable::new(MAX_OFFSET_CODE)` caps `max_symbol` at table construction, and `build_decoding_table`'s `TooManySymbols` guard rejects any probabilities array exceeding `max_symbol + 1` entries. The symbols stored in the FSE state table index that probabilities array, so emitted symbols are guaranteed bounded. For RLE-mode tables, `maybe_update_fse_tables` validates the singleton byte against MAX_OFFSET_CODE at scratch setup. - `offset == 0`: `offset = obits + (1u32 << of_code)` and `1u32 << of_code` is ≥ 1 for any `of_code` in 0..=31 — i.e. the addend is at least 1. Replace both branches with `debug_assert!` so a future invariant break trips in tests without paying for the comparison on the hot path in release builds. Same change applied to the with_rle path, the without_rle main loop body, and the without_rle tail iteration (3 sites total) so the four branches × N sequences disappear from every FSE decode iteration. Local aarch64 measurements: median 1,611 ms / best 1,585 ms across 5 runs — matches prior best (1,586 ms after the exact-len-32 commit). Range is much tighter than the prior commit's thermal-noise window (5 - 107 us stddev here vs 5,628 us there). The theoretical win is ~1-3% per the donor analysis (4 branches × ~100k sequences on the profiled frame); on this run it tucked into the noise floor, but the change is kept per the project rule that theoretically-correct optimizations stay in tree pending a final clean-room profile. All 67 sequence_section_decoder / decode_sequences / cross_validation / roundtrip tests pass. Refs #189

…e helper The main loop and tail iteration of decode_sequences_without_rle carried two near-identical copies of the per-sequence decode body. Pull the shared work (FSE symbol fetch, LL/ML LUT, get_bits_triple, offset assembly, debug_asserts) into a private #[inline(always)] helper returning Sequence, called from both sites. LLVM still specializes each call site after inlining; the source-level duplication and divergence risk are gone. No measurable perf change on the profiled scenario (median 1,605 ms / range 1,600 - 1,627 ms across 5 runs, vs prior 1,611 ms median — within stddev). All 557 nextest tests pass. Refs #189

The hot loop in execute_sequences indexes `scratch.sequences[idx]` on every iteration to copy the Sequence struct into a local. LLVM cannot prove the access is in-bounds because `scratch` is a `&mut DecoderScratch` and the loop body calls `scratch.buffer.reserve(...)` / `scratch.buffer.push(...)` / `scratch.buffer.repeat(...)`, which the borrow checker treats as opaque mutations of the wider struct. Capture `scratch.sequences.len()` into a `sequences_len` local before the loop (so the iteration range is stable) and switch the per-iter read to `get_unchecked(idx)` with a SAFETY comment citing the idx < sequences_len invariant. Same Sequence value extracted, no behavioural change. Local aarch64 measurements: prior commit median 1,605,658 ns / range 1,600 - 1,627 ns this commit median 1,588,630 ns / range 1,582 - 1,594 ns -> -1.1% incremental, -12.5% cumulative from baseline 1,814,570 ns -> FFI gap 3.89x -> 3.36x; 553 MB/s -> 644 MB/s All 81 sequence / execute / roundtrip / cross_validation tests pass. Refs #189

Both `if seq.ll > 0 { push(literals) }` and `if seq.ml > 0 { repeat(...) }` in execute_sequences were guarding downstream calls that already handle the zero-length case: - `RingBuffer::extend` (called via `DecodeBuffer::push`) early-returns when the input slice is empty. - `DecodeBuffer::repeat` early-returns when match_length == 0. Drop the guards — when seq.ll/ml is zero the empty slice / zero match length flows through naturally without any byte-copy work. Saves two predictable branches per sequence on a hot path that runs once per emitted sequence. The literal-length bounds check (`high > literals_buffer.len()`) stays on the always-execute path, which is mathematically equivalent to the old check inside the `seq.ll > 0` branch — when ll == 0, `high == literals_copy_counter`, so the check resolves identically. Local aarch64 measurements: median 1,607 ns / range 1,582 - 1,620 ns across 5 runs, vs prior commit 1,589 ns median. Within run-to-run stddev (best 1,582 ns matches prior best 1,583 ns). Theoretically correct (~0.5-1% per branch elimination); kept per the rule that correct opts stay in tree pending a final clean-room profile. All 557 nextest tests pass. Refs #189

In --no-default-features builds on x86/x86_64 the dispatcher cfg-gates out every call site for `copy_avx2` and `copy_avx512`: #[cfg(target_feature = "avx512f")] try_chunk_kernel!(64, copy_avx512); #[cfg(target_feature = "avx2")] try_chunk_kernel!(32, copy_avx2); The default x86_64 Rust target does NOT enable +avx2 / +avx512f (those require explicit RUSTFLAGS), so on no-std builds without those flags both call sites cfg-out. Std builds use runtime detection through `detect_x86_caps`, which rustc cannot see through statically. CI no-std job tripped on `-D dead_code`. Tag both functions `#[allow(dead_code)]` with a doc comment explaining when they go live (std builds OR `target_feature` enabled). All three build modes verified locally: --no-default-features, --no-default-features --features hash, and --features hash,std, dict_builder all clippy-clean with -D warnings.

…cute_sequences Two small loop-body changes: 1. `literals_buffer.len()` was reread on every iteration to compute the bounds-check threshold. The buffer is not mutated by anything called inside the loop (push writes to `scratch.buffer`, not `scratch.literals_buffer`), so the length is loop-invariant. Capture it before the loop as `literals_buffer_len`. 2. `&scratch.literals_buffer[lit_cur..high]` triggers a bounds check even though we just validated `high <= literals_buffer_len` one line above. Switch to `get_unchecked(lit_cur..high)` with a SAFETY comment citing the monotone advance and the validation above. Local aarch64 measurements: range 1,622 - 1,661 ns across 5 runs (one 2,230 outlier ignored as cargo-clean cache miss / thermal — first run after full deps rebuild), best 1,622 ns matches prior best window. Stays in the noise floor relative to the prior commit's 1,587 ns median; theoretically saves one length read + one slice bounds check per sequence so kept per the project rule on noise-floor opts. All 557 nextest tests pass. Refs #189

Profile (decompress L-1 fast decodecorpus-z000033 c_stream) showed RingBuffer::extend self-time grew to 33.6% after the simd_copy routing refactor — that earlier change moved the byte-level work into extend's caller frame, but kept the 4-case free_slice_parts dispatch on the slow path for workloads that never wrap. Add a flat fast path at the top of extend that fires whenever the write is one contiguous copy from data to buf+tail: if head <= tail && tail + len <= cap { simd_copy(src, dst with WILDCOPY slack, len); tail += len; if tail == cap { tail = 0; } return; } For frames whose total decoded size fits in window_size (the common case on this bench and on most real-world streams), this path fires on every literal push. The slow path still handles wraparound. Also fold the CodeRabbit style nit on active_chunk_size_for_tests: explicit any(target_arch = x86, x86_64) guard added to all three no-std target_feature cfgs for symmetry with copy_bytes_overshooting. Decode time on the profiled scenario: prior commit ~1,637,000 ns this commit 1,542,705 ns (5-run median) -> -5.7% incremental -> -15.0% cumulative from baseline (1,814,570 ns) -> 649 MB/s best, FFI gap 3.89x -> 3.28x 557 nextest tests pass. no_std + std clippy clean. Refs #189

After the flat-extend refactor, the body of `extend` is short enough (~5 instructions plus the inline copy on the common no-wrap path) that the function-call overhead per literal push is a measurable fraction of the work itself. `execute_sequences -> DecodeBuffer::push -> RingBuffer::extend` was paying two stack-frame setups per sequence on a path where most of the work is a single SIMD store. Mark both `RingBuffer::extend` and `DecodeBuffer::push` `#[inline]` so LLVM folds them into `execute_sequences`. The slow-path branches inside `extend` (wraparound, capacity miss) stay non-inlined by LLVM's heuristic because they end in calls and panics. Local aarch64 measurements: prior commit (flat-extend) ~1,542,705 ns this commit best/median 1,484,378 / ~1,506,000 ns -> -2.4% incremental median -> -18.2% cumulative from baseline (1,814,570 ns) -> 676 MB/s best run All 557 nextest tests pass. Refs #189

…_n_plus_two The N+2 lookahead helper indexes scratch.sequences three times per sequence to pull ll_curr / ll_next / ll_n2. LLVM does not collapse the three bounds checks because the borrow checker treats the wider DecoderScratch reference as potentially mutated by sibling call sites in the loop body. Cache scratch.sequences.len() into a local, verify idx + 2 < seqs_len at the top, then read the three slots via get_unchecked. Three Vec bounds checks per sequence drop to zero. No measurable bench impact on the profiled scenario (median stays in 1,558 - 1,579 ns/iter window vs prior 1,567 ns median). Kept per the project rule on theoretically-correct noise-level opts. 557 nextest tests pass. Refs #189

…fset_history The original branchless-RULES-table dispatch covered all four classes (repcode 1/2/3 plus fresh offset) in one branchless expression. For typical workloads the dominant case is class=3 (offset_value >= 4 = fresh offset, not a repcode), where the work collapses to: actual = offset_value - 3 scratch = [actual, scratch[0], scratch[1]] Split into a fast path that handles offset_value >= 4 with an unconditional shift, and a #[cold] #[inline(never)] do_offset_history_repcode that runs the full RULES-table branchless dispatch for repcodes 1..=3. Donor zstd does the same split inside ZSTD_updateRep. Decode time on the profiled scenario: prior commit (prefetch get_unchecked) ~1,567,000 ns median this commit 1,413,000 ns median (best 1,406,952) -> -9.8% incremental median -> -22.1% cumulative from baseline (1,814,570 ns) -> 711 MB/s best, FFI gap 3.89x -> 3.01x All 557 nextest tests pass. Refs #189

…uence> roundtrip) Donor zstd's ZSTD_decompressSequences_body interleaves ZSTD_decodeSequence and ZSTD_execSequence in one loop, keeping the seq_t in registers. Our two-pass pipeline (decode_sequences -> scratch.sequences Vec -> execute_sequences) was paying ~24 B/seq * 2 of L1/L2 traffic on the Vec roundtrip plus per-iter Vec::push overhead. Add decode_and_execute_sequences in sequence_section_decoder which: - Runs FSE table setup (maybe_update_fse_tables) and bit-stream padding skip up front. - For RLE-mode blocks (LL/ML/OF RLE flags), falls back to the legacy two-pass path internally via decode_sequences_with_rle + execute_sequences_fields. Rare on real corpora; not worth duplicating. - Hot path: for each sequence, decode_one_sequence_inline returns a Sequence in registers, then execute_one_sequence runs the literal push + match repeat immediately — no scratch.sequences write/read. Takes field-level borrows of DecoderScratch fields (fse, buffer, offset_hist, literals_buffer, sequences for RLE fallback) so the call site in block_decoder can keep `raw` borrowing workspace.block_content_buffer alongside the mutable workspace borrows. Add execute_sequences_fields helper in sequence_execution that takes the same field-level borrows for the RLE fallback path. Legacy decode_sequences / execute_sequences kept with #[allow(dead_code)] for the public API surface and any external callers that iterate sequences explicitly. Decode time on the profiled scenario: prior commit (offset_history fast path) ~1,413,000 ns median this commit 1,383,841 ns median (best 1,377,041) -> -2.1% incremental median -> -24.1% cumulative from baseline (1,814,570 ns) -> 726 MB/s best, FFI gap 3.89x -> 2.93x All 557 nextest tests pass. clippy clean both std and no_std modes. Refs #189

The non-BMI2 fallback in mask_lower_bits indexed a 65-entry static table on every call. Replace with `u64::MAX.checked_shr(64 - n)`: one shift + one (predicted) cmov vs one L1 load (3-5 cycle latency). For the hot FSE bitstream decode path this fires 3x per sequence (once per LL/ML/OF extra-bits extraction in get_bits_triple), so saving the load latency compounds over thousands of sequences. `checked_shr` returns None for shift counts >= 64, which is exactly the n=0 case (64 - 0 = 64) and the invalid n > 64 case (wraps to a huge value via wrapping_sub). Mapping both to 0 gives the mathematically-correct empty mask for n=0 and a safe fallback for the debug_asserted-invalid range. BMI2 path unchanged (`_bzhi_u64` is already a single instruction). Local aarch64 measurements: prior commit (fusion) ~1,386,367 ns median this commit 1,365,211 ns median (best 1,353,635) -> -1.5% incremental -> -25.4% cumulative from baseline (1,814,570 ns) -> 739 MB/s best, FFI gap 3.89x -> 2.88x All 557 nextest tests pass. Refs #189

After the mask_lower_bits switch to compute-based masking, BIT_MASK is read only by: - the BMI2 PEXT triple-extract path (std + x86_64 only), and - the in-crate tests verifying mask values. cfg-gate the const to `any(test, all(feature = "std", target_arch = "x86_64"))` so no-std builds on non-x86 targets don't carry a `dead_code` error through `-D warnings` in CI. All three configurations clippy-clean with -D warnings: - --no-default-features - --no-default-features --features hash - --features hash,std,dict_builder 557 nextest tests pass. Refs #189

Two correctness fixes in decode_and_execute_sequences: 1. Clear rle_fallback_sequences on entry. The non-RLE fast path never touches the Vec, so without an explicit clear it would carry whatever entries the previous block left behind. Externally-observable scratch state must reflect the current block's data only. 2. Snapshot-restore the buffer write cursor and offset history before the per-iter side-effects, and roll back on post-loop bitstream validation failure. The legacy two-pass pipeline (decode_sequences first, execute_sequences after) reported NotEnoughBytesForNumSequences / ExtraBits before any block output was committed. The fused fast path commits each sequence's literal push + match repeat + offset history update immediately, so a malformed bitstream that only reveals itself at the end of the section was leaving partial output and mutated repeat history behind. Restore semantics: capture a DecodeBuffer checkpoint and clone the offset history at the top, and undo both on Err. New DecodeBuffer::{checkpoint, restore_checkpoint} primitives sit on a pub(super) RingBuffer::{tail, set_tail} pair so the rollback stays inside the decode module. Add regression test checkpoint_restore_undoes_pushes to lock in the primitive's behaviour — a future change that breaks tail-rewind semantics would surface here instead of as silent state leakage in the fused executor. Local aarch64 measurements: prior commit ~1,365,211 ns median this commit 1,390,764 ns median (5 runs) -> +1.9% from extra Vec::clear + 16-byte checkpoint snapshot per block; correctness fix is worth the small overhead -> -23.4% cumulative from baseline (1,814,570 ns) -> 727 MB/s, FFI gap 3.89x -> 2.96x 558 nextest tests pass (one new). clippy strict in std + no_std modes.

After the fused decode_and_execute_sequences became the sole entry point in block_decoder, the legacy paths were dead code held with allow(dead_code): - decode_sequences (pub entry, ~40 lines) - decode_sequences_without_rle (private fast path, ~127 lines) - execute_sequences (pub workspace-based two-pass, ~90 lines) - prefetch_literals_n_plus_two (only called by execute_sequences) Delete all four. decode_sequences_with_rle and execute_sequences_fields stay — they're invoked by the fused path's RLE-mode fallback. Removes ~292 lines of unreachable code that were dragging the patch-coverage denominator down (codecov flagged sequence_execution at 56% patch coverage, sequence_section_decoder at 72%; the dead paths accounted for most of the gap). 558 nextest tests pass. clippy strict in std + no_std modes. Bench median 1,413,205 ns/iter (within the 1.35-1.43 ms cluster from prior commits; the transactional checkpoint added overhead remains the biggest single regression vs the peak 1,353 ms run).

At head+start==cap the source effectively starts at index 0 after the wrap, which is the continuous-source layout. The prior strict '>' check routed this boundary through the split-source branch, producing a zero-length copy_bytes_overshooting call against the trailing slack region — correctness depended on the unsafe helper treating len == 0 as a hard no-op. Use '>=' so the boundary case takes the continuous path directly.

restore_checkpoint() documents an invariant that no reallocation may happen between checkpoint() and restore_checkpoint(), but the unsafe function does not enforce it. A malformed sequence section that decodes more than the upfront reserve(MAX_BLOCK_SIZE) can trigger RingBuffer::reserve_amortized between the two calls — and amortized growth compacts data (head=0, tail=s1+s2), invalidating the captured tail index. Without a cap guard the restored tail silently points at the wrong logical position in the new buffer. Test fails on current code (no guard yet) — the follow-up fix adds the cap snapshot + panic on mismatch.

restore_checkpoint() previously documented a 'no reallocation between checkpoint and restore' invariant but did not enforce it. A malformed sequence section can decode more than the upfront reserve(MAX_BLOCK_SIZE) and trigger RingBuffer::reserve_amortized, which compacts the buffer (head=0, tail=s1+s2) — invalidating the captured tail index. Restoring that stale tail silently produces wrong output, with no observable error. Snapshot the RingBuffer cap alongside tail in DecodeBufferCheckpoint and panic on mismatch. The check is a load-bearing correctness guard, not a debug-only assert: silent UB is worse than a controlled panic on untrusted input. Existing checkpoint_restore_undoes_pushes test updated to reserve() before the first push (mirroring the real fused-decoder call pattern, which always reserves MAX_BLOCK_SIZE up front). The new restore_checkpoint_after_realloc_panics regression test from the preceding commit now passes.

The previous comment claimed 'no reallocation has happened in between (buffer.reserve was called once before the checkpoint)' — but reserve(MAX_BLOCK_SIZE) only rules out reallocation on well-formed input. A malformed block that decodes past MAX_BLOCK_SIZE before the post-loop bitstream check can still trigger reserve_amortized. Document the actual invariant: restore_checkpoint() now enforces the no-reallocation guarantee via a cap-equality panic, which is preferred to silent wrong output on untrusted input.

Malformed block whose sequence section decodes more than the upfront reserve(MAX_BLOCK_SIZE) the fused executor performs. That forces RingBuffer::reserve_amortized between checkpoint() and the post-loop bitstream validity check; the panic-on-cap-mismatch guard then turned the malformed input into a hard abort. Test fails on current code (the prior panic guard fires) and passes once try_restore_checkpoint switches to a graceful no-restore return.

…ormed input Replace the unsafe restore_checkpoint(_) that panicked on cap mismatch with safe try_restore_checkpoint(_) -> bool. The cap-equality check is now load-bearing logic instead of an unconditional abort: - On well-formed input (the hot path) the upfront reserve(MAX_BLOCK_SIZE) rules out reallocation between checkpoint and post-loop validity check, so try_restore_checkpoint returns true and rolls back the speculative pushes / offset history exactly like before. - On malformed input that decodes past MAX_BLOCK_SIZE, reserve_amortized compacts the ring buffer and invalidates the captured tail. The function now returns false, no state is mutated, and the caller continues with the same Err return — the partial bytes are discarded with the corrupt frame. This closes the DoS surface where any decoder fed untrusted bytes could be aborted with the panic guard that the previous round introduced. Function is now safe (no unsafe at the call site).

Skip free_slice_lengths' branch + saturating_sub on the common case where the data region has not wrapped (head <= tail) and the write does not cross cap. In that layout free space is trivially 'cap - tail - 1 + head' >= 'cap - tail - 1', which dominates 'amount' whenever 'tail + amount < cap'. Mirrors the flat fast path already in RingBuffer::extend so push/repeat on frames that fit in the window pay zero dispatch overhead from reserve() either. Theoretical win on the hot path; M1 thermal noise masks the actual delta in bench output, kept per the iteration discipline rule (correct opt, no regression risk).

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

zstd/src/decoding/sequence_section_decoder.rs (2)
458-461: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect error variant for LL RLE code validation.

Line 459-460 returns MissingByteForRleMlTable when the literal length RLE code exceeds MAX_LITERAL_LENGTH_CODE. This should return an error variant that indicates an invalid literal length code, not a missing ML table byte. The validation logic is correct, but the error message is misleading.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/sequence_section_decoder.rs` around lines 458 - 461, The
check that validates the literal-length RLE byte (source[0] >
MAX_LITERAL_LENGTH_CODE) is using the wrong error variant; update the error
returned in that branch to the variant indicating an invalid literal-length code
(e.g. DecodeSequenceError::InvalidLiteralLengthCode) instead of
DecodeSequenceError::MissingByteForRleMlTable so that the validation of
scratch.ll_rle uses the correct error semantics for MAX_LITERAL_LENGTH_CODE
violations.
493-496: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect error variant for OF RLE code validation.

Line 494-495 returns MissingByteForRleMlTable when the offset RLE code exceeds MAX_OFFSET_CODE. This should return an error variant that indicates an invalid offset code. The validation logic is correct, but the error message is misleading for debugging.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/sequence_section_decoder.rs` around lines 493 - 496, The
check that validates the offset RLE code (if of_source[0] > MAX_OFFSET_CODE)
uses the wrong error variant; replace the returned
DecodeSequenceError::MissingByteForRleMlTable with the specific error variant
that denotes an invalid offset RLE code (e.g.,
DecodeSequenceError::InvalidOffsetRleCode or the existing variant name used for
invalid offsets) so the failure from this branch correctly signals an invalid
offset code; update the return in the block that precedes setting scratch.of_rle
to use that invalid-offset error variant.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 458-461: The check that validates the literal-length RLE byte
(source[0] > MAX_LITERAL_LENGTH_CODE) is using the wrong error variant; update
the error returned in that branch to the variant indicating an invalid
literal-length code (e.g. DecodeSequenceError::InvalidLiteralLengthCode) instead
of DecodeSequenceError::MissingByteForRleMlTable so that the validation of
scratch.ll_rle uses the correct error semantics for MAX_LITERAL_LENGTH_CODE
violations.
- Around line 493-496: The check that validates the offset RLE code (if
of_source[0] > MAX_OFFSET_CODE) uses the wrong error variant; replace the
returned DecodeSequenceError::MissingByteForRleMlTable with the specific error
variant that denotes an invalid offset RLE code (e.g.,
DecodeSequenceError::InvalidOffsetRleCode or the existing variant name used for
invalid offsets) so the failure from this branch correctly signals an invalid
offset code; update the return in the block that precedes setting scratch.of_rle
to use that invalid-offset error variant.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d5fd3293-1da4-4e40-977d-60d547a6e49c

📥 Commits

Reviewing files that changed from the base of the PR and between 8cfa86d and 4de7069.

📒 Files selected for processing (11)

.github/workflows/ci.yml
zstd/benches/compare_ffi.rs
zstd/src/bit_io/bit_reader_reverse.rs
zstd/src/decoding/block_decoder.rs
zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/ringbuffer.rs
zstd/src/decoding/sequence_execution.rs
zstd/src/decoding/sequence_section_decoder.rs
zstd/src/decoding/simd_copy.rs
zstd/src/fse/fse_decoder.rs
zstd/src/tests/fuzz_regressions.rs

- fuzz_regressions: assert decoder.read_to_end is_err() instead of swallowing the result — silent-pass shape (if let Ok / let _ = …) could no longer catch a regression that decoded the malformed block to a non-error result. - fse_decoder: move set_len into copy_symbols_into_decode, adjacent to the per-slot raw writes. Removes the non-local invariant requiring the caller's set_len to stay paired with the callee's init loop. - ringbuffer::set_tail: extend SAFETY to require head fixed since capture (full/empty discriminator + invariant 4 only stay consistent then). Document the single current caller and its precondition. - sequence_section_decoder: collapse the split [N-1 with-update] + [last without-update] loop into one loop with a trailing 'if i + 1 < num_sequences' state-update guard. Same generated code, one body, donor isLastSeq pattern. - decode_buffer test: Vec::new() instead of vec::Vec::new() (the latter resolves only via alloc::vec re-exporting Vec — looks like a typo). - bit_reader_reverse::mask_lower_bits: docstring now explicitly notes the input-validation debug_assert is on this function (not just get_bits) — Copilot misread the existing assert location. - simd_copy::single_op_copy_16: add a reachability matrix comment before the portable fallback so a future arch arm slotted between existing arms knows it must terminate with return or its code goes silently dead.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

zstd/src/decoding/ringbuffer.rs (1)
124-155: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard reserve capacity math against usize wraparound.

Line 133 and Line 154 use unchecked addition. A very large amount from safe callers can wrap in release, causing reserve() to skip growth or compute an undersized new_cap; the later unsafe writes in extend, extend_and_fill, or extend_from_reader then trust space that was never actually reserved. Please switch these checks to non-overflowing arithmetic (amount < self.cap - self.tail for the fast path, checked_add before growth sizing) and fail fast on overflow.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/ringbuffer.rs` around lines 124 - 155, The fast-path in
reserve and the new capacity math in reserve_amortized are using unchecked
additions that can wrap; update reserve's condition to use non-overflowing
comparison (e.g., check amount < self.cap.saturating_sub(self.tail) or amount <
self.cap - self.tail using a checked branch equivalent) so the flat-path can't
be bypassed by wraparound, and in reserve_amortized switch to checked_add (use
usize::checked_add when computing cap+amount and when adding
WILDCOPY_OVERLENGTH) and fail fast (panic with a clear "capacity overflow"
message) if any checked_add returns None; ensure the Layout::array call uses the
verified new capacity so late unsafe writes in
extend/extend_and_fill/extend_from_reader cannot rely on undersized allocation.
zstd/src/fse/fse_decoder.rs (1)
60-72: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Make update_state_fast crate-private to prevent undefined behavior.

This method exposes an unsafe contract as a public API with insufficient enforcement. Safe code can construct FSEDecoder::new(&FSETable::new(..)) and invoke update_state_fast, which dereferences get_unchecked(0) on an empty decode vector. Public mutable fields (decode, accuracy_log) further break the invariant. Change pub fn to pub(crate) fn to preserve internal use while eliminating the UB surface.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/fse/fse_decoder.rs` around lines 60 - 72, The public method
update_state_fast exposes an unsafe contract that can be violated by external
code (e.g., callers can create FSEDecoder::new(&FSETable::new(..)) and trigger
an out‑of‑bounds get_unchecked on table.decode); make update_state_fast
crate-private by changing its signature from pub fn update_state_fast(...) to
pub(crate) fn update_state_fast(...). Also ensure callers inside the crate still
use update_state_fast and that external access to the FSETable fields (decode,
accuracy_log) remains restricted by keeping them non‑pub or documenting internal
invariants so the unsafe indexing invariant cannot be violated from outside the
crate.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 177-188: The code currently unconditionally restores the repcode
history via "*offset_hist = saved_offset_hist" even when
"buffer.try_restore_checkpoint(buffer_checkpoint)" fails; change this so the
history is only rewound when try_restore_checkpoint returns true (i.e. the
rollback actually happened). Concretely, call
"buffer.try_restore_checkpoint(buffer_checkpoint)" into a boolean (or use its
existing result), and move the "*offset_hist = saved_offset_hist" assignment
inside the branch that checks that it returned true; leave "offset_hist"
untouched when the restore fails so the buffer and offset history remain
consistent for reuse.

---

Outside diff comments:
In `@zstd/src/decoding/ringbuffer.rs`:
- Around line 124-155: The fast-path in reserve and the new capacity math in
reserve_amortized are using unchecked additions that can wrap; update reserve's
condition to use non-overflowing comparison (e.g., check amount <
self.cap.saturating_sub(self.tail) or amount < self.cap - self.tail using a
checked branch equivalent) so the flat-path can't be bypassed by wraparound, and
in reserve_amortized switch to checked_add (use usize::checked_add when
computing cap+amount and when adding WILDCOPY_OVERLENGTH) and fail fast (panic
with a clear "capacity overflow" message) if any checked_add returns None;
ensure the Layout::array call uses the verified new capacity so late unsafe
writes in extend/extend_and_fill/extend_from_reader cannot rely on undersized
allocation.

In `@zstd/src/fse/fse_decoder.rs`:
- Around line 60-72: The public method update_state_fast exposes an unsafe
contract that can be violated by external code (e.g., callers can create
FSEDecoder::new(&FSETable::new(..)) and trigger an out‑of‑bounds get_unchecked
on table.decode); make update_state_fast crate-private by changing its signature
from pub fn update_state_fast(...) to pub(crate) fn update_state_fast(...). Also
ensure callers inside the crate still use update_state_fast and that external
access to the FSETable fields (decode, accuracy_log) remains restricted by
keeping them non‑pub or documenting internal invariants so the unsafe indexing
invariant cannot be violated from outside the crate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ffa04f02-07bd-45b0-92a4-8ff37992dad1

📥 Commits

Reviewing files that changed from the base of the PR and between 4de7069 and 83c52dd.

📒 Files selected for processing (11)

.github/workflows/ci.yml
zstd/benches/compare_ffi.rs
zstd/src/bit_io/bit_reader_reverse.rs
zstd/src/decoding/block_decoder.rs
zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/ringbuffer.rs
zstd/src/decoding/sequence_execution.rs
zstd/src/decoding/sequence_section_decoder.rs
zstd/src/decoding/simd_copy.rs
zstd/src/fse/fse_decoder.rs
zstd/src/tests/fuzz_regressions.rs

codecov · 2026-05-19T14:43:21Z

Codecov Report

❌ Patch coverage is 91.22056% with 41 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/decoding/simd_copy.rs	83.50%	16 Missing ⚠️
zstd/src/decoding/sequence_section_decoder.rs	89.70%	14 Missing ⚠️
zstd/src/decoding/sequence_execution.rs	90.69%	4 Missing ⚠️
zstd/src/fse/fse_decoder.rs	80.00%	4 Missing ⚠️
zstd/src/decoding/ringbuffer.rs	97.39%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

- sequence_section_decoder: only restore offset_hist when try_restore_checkpoint actually rolled back the buffer. The previous unconditional rewind left buffer + offset_hist out of sync on the realloc-detected branch (buffer keeps speculative bytes, repcode history rewinds) — workspace would be internally inconsistent for any subsequent reuse after the Err. - fse_decoder: tighten update_state_fast from pub to pub(crate). The method's unchecked decode-table read assumes a fully-built table reached via crate-internal call paths (FSEDecoder constructed with an empty FSETable would dereference get_unchecked(0) on an empty Vec, which is UB). External access cannot construct that invariant themselves, so pub(crate) is the right surface. - ringbuffer::reserve: switch fast-path math to cap.saturating_sub and reserve_amortized to checked_add. A caller passing amount == usize::MAX (e.g. via a malformed match_length) could previously wrap tail+amount or cap+amount in release and produce an undersized allocation that subsequent unsafe writes trusted — now overflows panic with a clear message instead.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 19, 2026 14:12

Copilot AI reviewed May 19, 2026

View reviewed changes

polaz added 27 commits May 19, 2026 17:24

polaz added 10 commits May 19, 2026 17:24

Copilot started reviewing on behalf of polaz May 19, 2026 14:26 View session

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

polaz force-pushed the feat/#178-decode-speed branch from 4de7069 to 83c52dd Compare May 19, 2026 14:31

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Copilot AI review requested due to automatic review settings May 19, 2026 14:56

Copilot started reviewing on behalf of polaz May 19, 2026 14:57 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

polaz merged commit 6730114 into main May 19, 2026
26 checks passed

polaz deleted the feat/#178-decode-speed branch May 19, 2026 15:19

sw-release-bot Bot mentioned this pull request May 19, 2026

chore: release v0.0.22 #156

Merged

This was referenced May 19, 2026

perf(decode): pack LL/ML metadata + hot-path micro-opts #197

Merged

Roadmap: structured-zstd feature parity with C zstd #28

Open

Conversation

polaz commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Decode speed

Notable changes

Tests

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

polaz commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading

codecov Bot commented May 19, 2026 •

edited

Loading