Skip to content

perf(decode): fused sequence executor + SIMD/FSE hot-path cleanup + DoS-safe rollback#194

Merged
polaz merged 39 commits into
mainfrom
feat/#178-decode-speed
May 19, 2026
Merged

perf(decode): fused sequence executor + SIMD/FSE hot-path cleanup + DoS-safe rollback#194
polaz merged 39 commits into
mainfrom
feat/#178-decode-speed

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 19, 2026

Summary

Continuation of decode-speed work that was on PR #190 (closed because that PR's history accumulated a panic-guard variant of the malformed-input fix that a libFuzzer artifact (crash-bfb3bc55…) then turned into a DoS surface). This PR ships the same decode-speed improvements plus the graceful-rollback fix that closes that surface.

Closes #178

Decode speed

Baseline (z000033, Level 3 dfast, M1): 1,814,570 ns/iter → best 1,353,635 ns / median ~1,413,000 ns (≈ −22%). Throughput 553 → 739 MB/s. FFI gap 3.89× → 2.88×.

Notable changes

  • Fused sequence executor (decoding/sequence_section_decoder.rs): collapses decode_sequences + execute_sequences into a single loop, eliminating the Vec<Sequence> roundtrip on the hot path.
  • Transactional rollback for the fused executor via DecodeBuffer::checkpoint / try_restore_checkpoint. The rollback primitive snapshots the RingBuffer cap alongside tail so an intervening reallocation (only reachable on malformed input that decodes past MAX_BLOCK_SIZE) is detected and the rollback is gracefully skipped, not turned into a panic. The caller still surfaces a normal decode Err.
  • SIMD copy dispatcher collapsed; single 16-byte fast path; exact-tail path widened to 32 B (4× u64); AVX2 / AVX-512 kernels target-feature-gated for no_std.
  • BitReaderReversed::refill split into hot inline + #[cold] #[inline(never)] slow path; mask_lower_bits computed via shift instead of BIT_MASK[n] LUT.
  • RingBuffer head + start == cap now routes through the contiguous-second-slice path (was misrouted into split-source, producing a zero-length copy_bytes_overshooting against the slack region).
  • do_offset_history fast path for offset_value >= 4, matching donor ZSTD_updateRep.
  • no-std CI gate + clippy on --no-default-features and --no-default-features --features hash.

Tests

Related

Summary by CodeRabbit

  • New Features

    • Added CI checks for no-standard-library builds to catch regressions on minimal targets.
  • Bug Fixes

    • Safer rollback/checkpoint restore to avoid panics on malformed input and prevent buffer clobbering.
    • Improved edge-case decoding robustness and validated offset/history handling.
  • Tests

    • Added regression tests for malformed frames and capacity-edge copy behavior.
  • Chores

    • Performance and SIMD-backed copy-path optimizations for decompression.

Review Change Stack

Copilot AI review requested due to automatic review settings May 19, 2026 14:12
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 367385f8-2384-4e71-bd18-25277e718169

📥 Commits

Reviewing files that changed from the base of the PR and between 83c52dd and 5478b51.

📒 Files selected for processing (3)
  • zstd/src/decoding/ringbuffer.rs
  • zstd/src/decoding/sequence_section_decoder.rs
  • zstd/src/fse/fse_decoder.rs

📝 Walkthrough

Walkthrough

This PR fuses sequence decoding and execution into an inline decode+execute pipeline, adds DecodeBuffer checkpoint/rollback, allocates 16-byte ringbuffer slack for safe overshooting SIMD copies, rewrites simd_copy dispatch, adjusts bitreader and FSE hot-paths, defers benchmark compression, and adds a no-std clippy CI job plus a malformed-block regression test.

Changes

Fused sequence decode+execute refactor

Layer / File(s) Summary
Decode buffer checkpoint and rollback
zstd/src/decoding/decode_buffer.rs
DecodeBufferCheckpoint captures ringbuffer tail, capacity, and output counter. checkpoint() and try_restore_checkpoint() enable conditional rollback on capacity match; push annotated #[inline]. Tests verify rollback and reallocation refusal.
RingBuffer wildcopy slack allocation and management
zstd/src/decoding/ringbuffer.rs
Adds WILDCOPY_OVERLENGTH = 16, switches to alloc_zeroed for allocations with trailing slack, exposes cap(), tail(), and unsafe set_tail(), and updates deallocation layout to include slack.
RingBuffer extend with conditional wildcopy slack inflation
zstd/src/decoding/ringbuffer.rs
Routes extend and extend_from_within_unchecked through simd_copy::copy_bytes_overshooting, adding a flat fast-path for non-wrapped copies and conditional slack inflation to avoid clobbering readable bytes; tightens branch conditions and adds a wrapped-layout regression test.
SIMD copy multi-path dispatcher
zstd/src/decoding/simd_copy.rs
Rewrites copy_bytes_overshooting with dedicated fast paths: single-op ≤16-byte copy, exact-length ≤32 tail copy, macro-dispatched chunked SIMD for larger sizes, and scalar fallback choosing exact copy_from_nonoverlapping when overshoot doesn't fit. Adds single_op_copy_16 and unit tests; removes copy_strategy.
Field-based sequence execution extractor
zstd/src/decoding/sequence_execution.rs
Adds execute_sequences_fields operating on caller-provided DecodeBuffer, literals_buffer, offset_hist, and sequences; exposes do_offset_history as pub(crate) and removes prefetch helper.
Fused sequence decode+execute pipeline
zstd/src/decoding/sequence_section_decoder.rs
Adds decode_and_execute_sequences that decodes LL/ML/OF inline from the reversed bitstream and executes each sequence immediately into DecodeBuffer and offset_hist, using checkpoint/rollback on post-loop validation and appending tail literals; retains RLE fallback and replaces match-arm lookups with indexed tables.
Block decoder fused pipeline wiring
zstd/src/decoding/block_decoder.rs
Updates block decoder imports and replaces the decode-then-execute pipeline with a single decode_and_execute_sequences call passing field borrows from workspace.
Bit reader refill hot/cold path split
zstd/src/bit_io/bit_reader_reverse.rs
mask_lower_bits non-BMI2 path now computes mask with u64::MAX.checked_shr(...), and BitReaderReversed::refill is split into an #[inline(always)] hot path and a #[cold] refill_slow for edge cases; debug assertions relocated to the hot path.
FSE decoder unchecked lookups and endian-aware init
zstd/src/fse/fse_decoder.rs
update_state_fast uses get_unchecked with safety doc; build_decoding_table now reserves and has copy_symbols_into_decode perform set_len + raw ptr::write_unaligned initialization on little-endian targets to fully initialize Entry storage.

Benchmark and regression testing updates

Layer / File(s) Summary
Lazy compression materialization in benchmarks
zstd/benches/compare_ffi.rs
bench_decompress defers compressed input creation: it computes expected_len per scenario and uses a OnceCell<Vec<u8>> + materialize closure to lazily compress and validate the compressed bytes once per invocation, shared by Rust and C-FFI paths.
Regression test and CI coverage
zstd/src/tests/fuzz_regressions.rs, .github/workflows/ci.yml
Adds malformed_block_does_not_panic_via_restore_checkpoint test asserting malformed input returns Err without panic via restore_checkpoint. Adds no-std CI job that runs cargo clippy --no-default-features and --features hash under zstd/ with warnings as errors.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • #178: perf(decode + encode-greedy)… — This PR implements Phase 1/2 primitives (fused sequence execution, simd_copy, ringbuffer slack, bitreader/FSE tweaks) relevant to closing the decode performance gap described in the issue.

Possibly related PRs

Poem

A rabbit hops where bits unwind,
Fused streams dance and slack’s aligned,
Checkpoint saved, no panic sprung,
SIMD feet beat, a tiny tongue —
Decode and execute, swift and kind. 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'perf(decode): fused sequence executor + SIMD/FSE hot-path cleanup + DoS-safe rollback' clearly summarizes the main changes: fused sequence executor, SIMD/FSE optimizations, and rollback for DoS safety.
Linked Issues check ✅ Passed The PR successfully implements Phase 2 objectives from issue #178: fused sequence executor (eliminate Vec roundtrip), SIMD/wildcopy optimization, bitstream/FSE hot-path tweaks, RingBuffer fixes, and no_std CI gates. All core decode performance improvements are present.
Out of Scope Changes check ✅ Passed All changes align with Phase 2 decode optimization scope: sequence executor fusion, SIMD dispatch simplification, FSE hot-path refinement, RingBuffer slack allocation, bitstream reader refactoring, and DoS-safe rollback mechanism. No unrelated refactoring or feature additions detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/#178-decode-speed

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Continuation of decode-speed work from the closed PR #190, fusing the sequence decoder/executor onto one loop and tightening hot SIMD/FSE paths. The fused executor now records a transactional rollback point through DecodeBuffer::checkpoint/try_restore_checkpoint, so a malformed block that forces a RingBuffer reallocation between checkpoint and validation surfaces as a normal decode Err instead of panicking (closes the libFuzzer DoS surface from #178).

Changes:

  • New fused decode_and_execute_sequences plus rollback primitives on DecodeBuffer/RingBuffer, with a regression test for the libFuzzer crash.
  • simd_copy collapses the function-pointer dispatcher, adds a single-op 16-byte fast path and a 32-byte exact-tail path; RingBuffer routes pushes through it with explicit WILDCOPY_OVERLENGTH slack accounting.
  • Hot-path tightening in BitReaderReversed::refill (split hot/cold) and FSEDecoder::decode_symbol / FSETable build (raw set_len + unaligned writes); offset-history fast path; no-std CI gate.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
zstd/src/decoding/sequence_section_decoder.rs Adds fused decode_and_execute_sequences with transactional rollback, base-table lookups, drops some validation checks.
zstd/src/decoding/sequence_execution.rs Renames/recasts executor as execute_sequences_fields; adds do_offset_history fast path; removes prefetch helper.
zstd/src/decoding/decode_buffer.rs New checkpoint/try_restore_checkpoint rollback primitives and tests.
zstd/src/decoding/ringbuffer.rs WILDCOPY_OVERLENGTH slack in allocations, flat-extend fast path, slack accounting per call site, new tests.
zstd/src/decoding/simd_copy.rs Collapses dispatcher; new single-op-16 and 32-byte exact-tail paths; active_chunk_size_for_tests rewritten.
zstd/src/decoding/block_decoder.rs Switches to fused decode_and_execute_sequences, drops the explicit execute_sequences call.
zstd/src/bit_io/bit_reader_reverse.rs Splits refill into hot/cold halves; computes mask_lower_bits via shift instead of LUT.
zstd/src/fse/fse_decoder.rs get_unchecked in decode_symbol; reserve+set_len instead of resize to skip zero-fill; unaligned u32 write for trailing odd entry.
zstd/src/tests/fuzz_regressions.rs Inline regression test for the libFuzzer DoS artifact.
zstd/benches/compare_ffi.rs Lazy OnceCell materialization of compressed inputs for profiling.
.github/workflows/ci.yml New no-std clippy job covering --no-default-features and --no-default-features --features hash.
Comments suppressed due to low confidence (1)

zstd/src/decoding/sequence_section_decoder.rs:1

  • The runtime of_code > MAX_OFFSET_CODE and offset == 0 checks have been dropped from both decode_one_sequence_inline and decode_sequences_with_rle, replaced by debug_assert! only. The comment asserts these invariants are guaranteed by build_decoding_table, but of_code is consumed two lines later as 1u32 << of_code: if any decoder path (FSE table construction, predefined-distribution import, or a future code change) ever yields of_code >= 32, this is a left-shift overflow — UB on untrusted input in release builds, and the previous explicit error return was the only thing keeping a corrupt FSE table from reaching it. Please either retain the runtime check on of_code (and on offset != 0) on this hot path or add a tight, audited invariant assertion at the table-construction site that is explicitly cross-referenced here so the chain of trust is greppable. Removing a defense-in-depth check on parsed bytes from an untrusted frame deserves an explicit decision rather than a debug-only assert.
use super::super::blocks::sequence_section::ModeType;

Comment thread zstd/src/tests/fuzz_regressions.rs Outdated
Comment thread zstd/src/fse/fse_decoder.rs Outdated
Comment thread zstd/src/decoding/ringbuffer.rs
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/decode_buffer.rs Outdated
Comment thread zstd/src/bit_io/bit_reader_reverse.rs
Comment thread zstd/src/decoding/simd_copy.rs
polaz added 27 commits May 19, 2026 17:24
Previously bench_decompress ran compress_slice_to_vec + ffi_encode_to_vec
upfront for every scenario × level × source triple, even when criterion's
filter selected a single benchmark. cargo bench --profile-time <filter>
then spent its entire window in encode setup instead of the decode hot
path it was supposed to profile.

Wrap the compressed-byte production in a OnceCell inside
bench_decompress_source. Materialization now happens only when the
bench_function closure runs (i.e., when criterion's filter matches),
keeping samply / cargo flamegraph traces concentrated on the decode
side. CI timings are unaffected — the timing loop already excluded
setup; this change just narrows the work performed during ad-hoc
filtered runs.

Refs #189
The crate declares #![no_std] with std as an opt-out feature, but the
existing test/clippy matrix only exercises --features hash,std,
dict_builder. A regression that pulls std::* into a hot decoder path
slips through unnoticed.

Add a new no-std job covering the two realistic deployment shapes:
- alloc-only (no features)
- alloc + xxhash content checksum (--features hash)

Both run cargo check and cargo clippy with -D warnings so a stray
`use std::...` lights up immediately.

Refs #189
Convert the lookup_ll_code / lookup_ml_code helpers from staircase
match arms to const arrays mirroring donor zstd's LL_base / LL_bits
/ ML_base / ML_bits tables. The bounds-check assert preserves the
original panic-on-illegal-code semantics; LLVM already lowered the
match to a jump table so perf is unchanged on this corpus, but the
LUT form is the donor-parity baseline future work can build on.

Refs #189
FSEDecoder::update_state_fast indexes self.table.decode[next_state]
on every sequence × 3 state advances. The bounds check fires every
iteration because LLVM cannot prove the invariant that spans
table-build and decode call sites:

calc_baseline_and_numbits pairs new_state with num_bits such that
new_state + (2^num_bits - 1) < table_size = self.table.decode.len();
get_bits_unchecked guarantees add < 2^num_bits; therefore
next_state < self.table.decode.len() always holds for any valid
FSE table — but the proof is cross-function and the bounds check
stays.

Replace the indexed read with get_unchecked and a SAFETY comment
citing the invariant. No new unsafe entry point — the function is
already unsafe-ish in the sense that the caller must call ensure_bits
before invoking it (otherwise get_bits_unchecked reads stale bits).

Refs #189
…st path

Replace the CopyStrategy struct + CopyFn function-pointer indirection
with an inlined arch-aware decision tree. Two structural wins:

1. The per-call indirect dispatch through `(strategy.copy)(...)` is
   gone — kernels are direct-called from the dispatcher, so on aarch64
   (where NEON is the ABI baseline) the kernel inlines into
   copy_bytes_overshooting entirely.

2. New single_op_copy_16 fast path covers copy_at_least in 1..=16 with
   one SIMD store (or two overlapping u64 stores on archs without
   128-bit vectors). Match-copy execution with offset 8..15 funneled
   into the previous scalar-strategy loop on every iteration — the
   profile showed simd_copy::copy_scalar at 273 ms (5.5%) vs copy_neon
   at 80 ms (1.6%) on `decompress/level_-1_fast/decodecorpus-z000033/
   c_stream/matrix/pure_rust`; the new fast path collapses those 8..15
   byte calls to one vector op.

no_std semantics preserved: x86 runtime detection stays std-only, no_std
x86 uses compile-time target_feature cfg, aarch64+neon is unconditional
in both modes. CI no-std + no-std-hash jobs cover the build matrix.

Local aarch64 (M1, decodecorpus-z000033 c_stream level_-1_fast):
  baseline 1,814,570 ns/iter
  this PR  1,737,827 ns/iter (median of 3, best 1,683,119)
  -> -4.2% median (-7.2% best); 553 → 575 MB/s

All 86 simd_copy / ringbuffer / decode_buffer / cross_validation /
roundtrip tests green. cargo clippy --no-default-features {-, +hash}
clean.

Refs #189
simd_copy::copy_bytes_overshooting falls back to copy_from_nonoverlapping
(libc memmove) whenever `min(src.1, dst.1) < copy_at_least` rounded up
to the active SIMD chunk. The fallback dominated the decode profile —
_platform_memmove sat at 21.5% of CPU on decompress/level_-1_fast/
decodecorpus-z000033/c_stream/matrix/pure_rust because match copies near
ring-buffer wraparound boundaries had tight capacities and tripped the
fallback every iteration. The new single_op_copy_16 fast path also
requires min_buffer_size >= 16, so any 8..15-byte chunk emitted by
DecodeBuffer::repeat_in_chunks for offsets 8..15 skipped it for the
same reason.

Mirror donor zstd's WILDCOPY_OVERLENGTH = 16 contract by:

1. Allocating `cap + WILDCOPY_OVERLENGTH` bytes via alloc_zeroed. The
   slack region is physically present past `cap` but never indexed via
   `head`/`tail` (those still wrap on `cap`). alloc_zeroed instead of
   alloc so wildcopy reads observe defined values rather than UB on
   uninitialized memory.
2. Updating Drop and `reserve_amortized`'s `current_layout` to match.
3. Inflating the `src.1`/`dst.1` capacities passed to
   `copy_bytes_overshooting` at every call site where the inflated end
   stays inside the allocation:
   - src.1: every site (source pointer + WILDCOPY_OVERLENGTH lands in
     the slack region at worst).
   - dst.1: only the case-A first-copy site where dst ends at `cap`
     (slack absorbs overshoot). All other dst spans end at `head` and
     cannot be inflated — overshoot would clobber readable data.

Decode time on the profiled scenario:
  baseline   1,814,570 ns/iter
  prev PR    1,737,827 ns/iter (single_op + dispatcher collapse)
  this commit 1,668,752 ns/iter (median of 4, range 1.667-1.678)
  -> -8.0% from baseline, -4.0% incremental
  -> FFI gap 3.89x -> 3.57x; 553 MB/s -> 600 MB/s

All 556 nextest tests pass. no_std + clippy clean.

Refs #189
- Pin the `clippy` component on the toolchain step (matches the
  existing `lint` job's explicit components: rustfmt, clippy spec —
  deterministic install, no auto-fetch on every run).
- Drop the standalone `cargo check` steps: `cargo clippy` already
  runs the full compile + analysis pipeline, so a separate check
  was duplicated work for no extra signal.

Refs #189
The Entry { new_state: 0, symbol: 0, num_bits: 0 } pre-fill from
self.decode.resize() was redundant on little-endian targets:
copy_symbols_into_decode already writes a full 4-byte Entry (with
new_state = 0 and num_bits = 0 implicit in the packed u32/u64 layout)
to every slot via raw ptr::write_unaligned calls.

- Replace the resize with reserve + set_len on LE (BE keeps resize so
  the iter_mut() field-assignment path stays safe — no production BE
  target ships against this crate).
- Extend copy_symbols_into_decode's trailing odd-entry path to write
  a full Entry via a u32 store; the previous field assignment relied
  on the resize-side zero-fill to initialize new_state and num_bits.

Measured impact on the L-1 fast decodecorpus-z000033 c_stream path is
~0% — FSETable::build_decoding_table self-time is 171 ms / 3.4% of
decode CPU and the resize zero-fill was only ~half of that. The win
is real but small relative to the ring-buffer / simd_copy wins
upstream of it. Keeping the change because the pattern (skip
zero-fill when subsequent writes cover every byte) is a clean donor
parity point and stops future regressions where a slower iter_mut
loop would re-introduce the redundant pass.

Refs #189
…tend through it

Two coordinated changes that together cut _platform_memmove from 25.9%
self-time (1264 ms) to 7.1% (343 ms) on decompress/level_-1_fast/
decodecorpus-z000033/c_stream/matrix/pure_rust.

1. simd_copy: add an exact-length tail path for copy_at_least <= 16 that
   fires when the single_op_copy_16 fast path can't (no WILDCOPY slack
   on the destination). Two branches:
   - 1..=8 bytes: byte-by-byte loop. Unrolls to ≤8 immediate-offset
     load/store pairs.
   - 9..=16 bytes: two overlapping unaligned u64 ops, writing exactly
     copy_at_least bytes with no overshoot past dst.0 + copy_at_least.
   Replaces the libc memmove call that the dispatcher used to fall back
   to for short copies near RingBuffer head boundaries (where head
   bounds dst.1 and there is no slack to inflate).

2. RingBuffer::extend: route both free-slice writes through
   simd_copy::copy_bytes_overshooting instead of raw
   copy_from_nonoverlapping. Profile showed 943 ms of memmove was
   coming from `execute_sequences → DecodeBuffer::push → extend`
   (literal-byte copies during sequence execution). The previous direct
   copy_from_nonoverlapping always lowered to a libc call; routing
   through simd_copy lets short literal pushes (the common case at
   negative levels) use the inline byte / overlapping-u64 paths above.
   dst.1 for the first free slice is inflated by WILDCOPY_OVERLENGTH
   because f1_ptr ends at `cap` (slack region is writable); the second
   free slice ends at `head` so its dst.1 stays exact.

Decode time on the profiled scenario:
  baseline           1,814,570 ns/iter
  prior PR state     1,668,752 ns/iter
  this commit        1,653,976 ns/iter (median of 5)
  -> -8.9% from baseline, -0.9% incremental

The bench-time delta is smaller than the profile-attribution shift
suggests because the new inline path does roughly the same byte-level
work memmove was doing — the win is in killing the libc call overhead
on short copies. Larger copies (>16 bytes that previously hit memmove)
still hit it; chasing those further has diminishing returns.

All 86 simd_copy / ringbuffer / decode_buffer / cross_validation /
roundtrip tests pass. no_std + clippy clean.

Refs #189
…ng check

decode_sequences_without_rle previously paid two per-iteration
overheads on the hot path:

1. `if target.len() < num_sequences { state advance }` — used to skip
   the trailing state update on the final iteration. Materializing
   `target.len()` every iteration and branching cost ~2-3 ns per
   sequence on top of the FSE work.
2. `if br.bits_remaining() < 0 { return Err(NotEnoughBytes) }` — an
   isize multiply + 3 subtractions every iteration. Donor zstd only
   validates bitstream exhaustion once after the sequence loop.

Refactor splits the loop into a hot (num_sequences - 1) body that
always advances state, and a tail iteration that skips the advance.
The bits_remaining check moves out of the hot path into a single
post-loop validation that catches the same underflow case via the
existing >= 0 comparison.

Decode time on the profiled scenario:
  baseline           1,814,570 ns/iter
  prior PR state     1,653,976 ns/iter
  this commit        1,632,669 ns/iter (median of 5)
  -> -10.0% from baseline, -1.3% incremental
  -> FFI gap 3.89x → 3.50x; 553 MB/s → 612 MB/s

Same change deferred for decode_sequences_with_rle — that path only
fires on blocks that mix RLE-mode codes with FSE-encoded streams,
which decodecorpus-z000033 (and most real corpora at fast levels)
do not produce. Splitting it would add complexity for zero measurable
gain on the targeted scenario.

Refs #189
CodeRabbit flagged that extend_from_within_unchecked_branchless passes
exact-fit (ptr, len) capacities through to copy_with_nobranch_check ->
simd_copy::copy_bytes_overshooting, so it cannot fire the SIMD fast
paths that require min(src.1, dst.1) >= 16 — short copies on that path
always take the inline byte / overlapping-u64 fallback instead of
single_op_copy_16.

Inflating the capacities the way `extend_from_within_unchecked` does
would require threading per-pointer head/tail context into
`copy_with_nobranch_check`, which currently sees only raw pointers and
chunk lengths. Defer that refactor: the branchless path is x86-gated
via `decode_buffer::use_branchless_wildcopy`, and my profiling rig is
aarch64 — there is no local x86 measurement to justify the extra
plumbing right now.

Document the trade-off in a doc-comment on the function so a future
maintainer doesn't accidentally assume WILDCOPY semantics hold on
this path. On aarch64, the unconditional `extend_from_within_unchecked`
is used, so the slack contract is exercised end-to-end there.

Refs #189
The previous tail path covered copies of 1..=16 bytes inline; copies of
17..=32 bytes fell through to the chunked SIMD dispatch, which then
bailed (`min_buffer_size < copy_at_least.next_multiple_of(chunk)`) and
ended up in libc memmove for buffers without WILDCOPY_OVERLENGTH slack
on the destination side.

Extend the inline path to 17..=32 by emitting four unaligned u64
ops: two adjacent stores for the first 16 bytes, then two overlapping
stores for the trailing 1..=16 bytes. Eight loads + eight stores
total, all branch-free, exactly `copy_at_least` bytes written without
overshoot. This catches the literal-push size range (10..=24 bytes is
typical on the profiled corpus) that the previous ≤16-byte cap missed.

Decode time on `decompress/level_-1_fast/decodecorpus-z000033/
c_stream/matrix/pure_rust`:
  baseline           1,814,570 ns/iter
  prior PR state     1,632,669 ns/iter
  this commit        1,607,680 ns/iter (median of 5)
  -> -11.4% from baseline, -1.5% incremental
  -> FFI gap 3.89x → 3.40x; 553 MB/s → 629 MB/s

All 86 simd_copy / ringbuffer / decode_buffer / cross_validation /
roundtrip tests pass.

Refs #189
…_len window

RingBuffer::extend:
The first-free-slice destination capacity passed to
`simd_copy::copy_bytes_overshooting` was unconditionally inflated by
WILDCOPY_OVERLENGTH. That is only safe when `tail >= head` — in the
wrapped layout (`tail < head`) `free_slice_parts` returns the inner
`(buf+tail, head-tail)` gap, and inflating into the WILDCOPY region
would let wildcopy overshoot past `head` into still-readable buffer
output. Today no overshoot path actually fires because `src.1` is
exact (`in_f1 == copy_at_least`) and caps `min_buffer_size`, so the
inflation never changed dispatch — but the latent UB hazard if
anyone ever adds a `dst.1`-only overshoot fast path is real.

Gate the inflation on `tail >= head` and rewrite the comment to make
both reasoning paths explicit (slack-reachability AND src.1 cap).
Add `extend_wrapped_layout_preserves_bytes_past_head`: a regression
test that builds a wrapped layout via a new `build_tail_before_head`
helper, fills the inner gap, and asserts the 16-byte window past
`head` is untouched.

FSETable::build_decoding_table:
`self.decode.set_len(table_size)` ran immediately after `reserve`,
leaving Entry slots uninitialized through the entire `table_symbols`
resize + symbol-spreading loop. A panic in that window (allocator
failure on `table_symbols.resize`) would unwind with reachable
uninitialized Entry slots — currently UB-free because `Entry: Copy`
has trivial drop, but a fragile contract for future maintainers.

Move `set_len` to immediately precede `copy_symbols_into_decode`,
after all panic-able allocations have finished. The uninitialized-
slot window is now zero panic-able code wide; an unwind from the
spreading loop leaves `self.decode.len() == 0` instead.

All 557 nextest tests pass. Decode perf unchanged on the profiled
scenario (median 1,594,500 ns/iter — the inflation never affected
dispatch outcome under the current simd_copy contract).

Refs #189
… test

Clippy 1.95 flags `for i in 0..16 { sentinel[i] = ... }` as
`needless_range_loop`. Rewrite the prefill-sampling loop in
extend_wrapped_layout_preserves_bytes_past_head to use
`sentinel.iter_mut().enumerate()`; the verification loop keeps the
plain `for i in 0..16usize` form because its body indexes the raw
ring-buffer pointer (not the loop's own slice), which clippy permits.

Refs #189
The previous refill() carried a blanket #[cold] annotation. Donor analysis
(libzstd BIT_reloadDStream_internal at bitstream.h:370-444) plus the
sequence-decode profile flagged this as a regression: refill fires
roughly every 2 sequences during sequence decode — it is the dominant
mid-stream path, not a cold case. The #[cold] hint pessimised the
branch predictor and forced LLVM to spill the hot 3-instruction body
out of every caller's straight-line code.

Reshape the function so:
- Outer refill() is #[inline(always)]. Only the mid-stream hot path
  (`self.index >= bytes_consumed`: subtract index, mask bit count,
  load 8 bytes) lives here, folding into every get_bits / ensure_bits
  call site.
- The three end-of-stream branches (`self.index > 0`, exhausted but
  bit_consumed < 64, fully exhausted) move into refill_slow(),
  marked #[cold] #[inline(never)]. Same semantics, just no longer
  parked on the hot path.

Local aarch64 measurements are dominated by thermal / background-process
noise right now (range 1.68 - 13.1 ms across consecutive runs of the
same bench, vs the prior commit's stable 1.65 ms median); the lowest
sample matches prior best so there is no regression. Keeping the
change per the theoretically-correct-opts-stay-in-tree rule — final
profile pass will revisit if the noise floor settles.

All 120 bit_io / fse / cross_validation / roundtrip tests pass.

Refs #189
Both per-iter validations in decode_sequences (with_rle main, without_rle
main, without_rle tail — all three loop bodies) were mathematically
unreachable under any successfully-constructed FSE table:

- `of_code > MAX_OFFSET_CODE` (31): `AlignedFSETable::new(MAX_OFFSET_CODE)`
  caps `max_symbol` at table construction, and `build_decoding_table`'s
  `TooManySymbols` guard rejects any probabilities array exceeding
  `max_symbol + 1` entries. The symbols stored in the FSE state table
  index that probabilities array, so emitted symbols are guaranteed
  bounded. For RLE-mode tables, `maybe_update_fse_tables` validates the
  singleton byte against MAX_OFFSET_CODE at scratch setup.
- `offset == 0`: `offset = obits + (1u32 << of_code)` and `1u32 << of_code`
  is ≥ 1 for any `of_code` in 0..=31 — i.e. the addend is at least 1.

Replace both branches with `debug_assert!` so a future invariant break
trips in tests without paying for the comparison on the hot path in
release builds. Same change applied to the with_rle path, the without_rle
main loop body, and the without_rle tail iteration (3 sites total) so
the four branches × N sequences disappear from every FSE decode
iteration.

Local aarch64 measurements: median 1,611 ms / best 1,585 ms across 5
runs — matches prior best (1,586 ms after the exact-len-32 commit).
Range is much tighter than the prior commit's thermal-noise window
(5 - 107 us stddev here vs 5,628 us there). The theoretical win is
~1-3% per the donor analysis (4 branches × ~100k sequences on the
profiled frame); on this run it tucked into the noise floor, but the
change is kept per the project rule that theoretically-correct
optimizations stay in tree pending a final clean-room profile.

All 67 sequence_section_decoder / decode_sequences / cross_validation /
roundtrip tests pass.

Refs #189
…e helper

The main loop and tail iteration of decode_sequences_without_rle carried
two near-identical copies of the per-sequence decode body. Pull the
shared work (FSE symbol fetch, LL/ML LUT, get_bits_triple, offset
assembly, debug_asserts) into a private #[inline(always)] helper
returning Sequence, called from both sites. LLVM still specializes each
call site after inlining; the source-level duplication and divergence
risk are gone.

No measurable perf change on the profiled scenario (median 1,605 ms /
range 1,600 - 1,627 ms across 5 runs, vs prior 1,611 ms median —
within stddev). All 557 nextest tests pass.

Refs #189
The hot loop in execute_sequences indexes `scratch.sequences[idx]`
on every iteration to copy the Sequence struct into a local. LLVM
cannot prove the access is in-bounds because `scratch` is a
`&mut DecoderScratch` and the loop body calls
`scratch.buffer.reserve(...)` / `scratch.buffer.push(...)` /
`scratch.buffer.repeat(...)`, which the borrow checker treats as
opaque mutations of the wider struct.

Capture `scratch.sequences.len()` into a `sequences_len` local before
the loop (so the iteration range is stable) and switch the per-iter
read to `get_unchecked(idx)` with a SAFETY comment citing the idx
< sequences_len invariant. Same Sequence value extracted, no
behavioural change.

Local aarch64 measurements:
  prior commit  median 1,605,658 ns / range 1,600 - 1,627 ns
  this commit   median 1,588,630 ns / range 1,582 - 1,594 ns
  -> -1.1% incremental, -12.5% cumulative from baseline 1,814,570 ns
  -> FFI gap 3.89x -> 3.36x; 553 MB/s -> 644 MB/s

All 81 sequence / execute / roundtrip / cross_validation tests pass.

Refs #189
Both `if seq.ll > 0 { push(literals) }` and `if seq.ml > 0 { repeat(...) }`
in execute_sequences were guarding downstream calls that already handle
the zero-length case:

- `RingBuffer::extend` (called via `DecodeBuffer::push`) early-returns
  when the input slice is empty.
- `DecodeBuffer::repeat` early-returns when match_length == 0.

Drop the guards — when seq.ll/ml is zero the empty slice / zero match
length flows through naturally without any byte-copy work. Saves two
predictable branches per sequence on a hot path that runs once per
emitted sequence.

The literal-length bounds check (`high > literals_buffer.len()`) stays
on the always-execute path, which is mathematically equivalent to the
old check inside the `seq.ll > 0` branch — when ll == 0,
`high == literals_copy_counter`, so the check resolves identically.

Local aarch64 measurements: median 1,607 ns / range 1,582 - 1,620 ns
across 5 runs, vs prior commit 1,589 ns median. Within run-to-run
stddev (best 1,582 ns matches prior best 1,583 ns). Theoretically
correct (~0.5-1% per branch elimination); kept per the rule that
correct opts stay in tree pending a final clean-room profile.

All 557 nextest tests pass.

Refs #189
In --no-default-features builds on x86/x86_64 the dispatcher cfg-gates
out every call site for `copy_avx2` and `copy_avx512`:

    #[cfg(target_feature = "avx512f")] try_chunk_kernel!(64, copy_avx512);
    #[cfg(target_feature = "avx2")]    try_chunk_kernel!(32, copy_avx2);

The default x86_64 Rust target does NOT enable +avx2 / +avx512f (those
require explicit RUSTFLAGS), so on no-std builds without those flags
both call sites cfg-out. Std builds use runtime detection through
`detect_x86_caps`, which rustc cannot see through statically.

CI no-std job tripped on `-D dead_code`. Tag both functions
`#[allow(dead_code)]` with a doc comment explaining when they go live
(std builds OR `target_feature` enabled).

All three build modes verified locally: --no-default-features,
--no-default-features --features hash, and --features hash,std,
dict_builder all clippy-clean with -D warnings.
…cute_sequences

Two small loop-body changes:

1. `literals_buffer.len()` was reread on every iteration to compute the
   bounds-check threshold. The buffer is not mutated by anything called
   inside the loop (push writes to `scratch.buffer`, not
   `scratch.literals_buffer`), so the length is loop-invariant. Capture
   it before the loop as `literals_buffer_len`.

2. `&scratch.literals_buffer[lit_cur..high]` triggers a bounds check
   even though we just validated `high <= literals_buffer_len` one
   line above. Switch to `get_unchecked(lit_cur..high)` with a SAFETY
   comment citing the monotone advance and the validation above.

Local aarch64 measurements: range 1,622 - 1,661 ns across 5 runs (one
2,230 outlier ignored as cargo-clean cache miss / thermal — first run
after full deps rebuild), best 1,622 ns matches prior best window.
Stays in the noise floor relative to the prior commit's 1,587 ns
median; theoretically saves one length read + one slice bounds check
per sequence so kept per the project rule on noise-floor opts.

All 557 nextest tests pass.

Refs #189
Profile (decompress L-1 fast decodecorpus-z000033 c_stream) showed
RingBuffer::extend self-time grew to 33.6% after the simd_copy routing
refactor — that earlier change moved the byte-level work into extend's
caller frame, but kept the 4-case free_slice_parts dispatch on the
slow path for workloads that never wrap.

Add a flat fast path at the top of extend that fires whenever the
write is one contiguous copy from data to buf+tail:

  if head <= tail && tail + len <= cap {
      simd_copy(src, dst with WILDCOPY slack, len);
      tail += len; if tail == cap { tail = 0; }
      return;
  }

For frames whose total decoded size fits in window_size (the common
case on this bench and on most real-world streams), this path fires
on every literal push. The slow path still handles wraparound.

Also fold the CodeRabbit style nit on active_chunk_size_for_tests:
explicit any(target_arch = x86, x86_64) guard added to all three
no-std target_feature cfgs for symmetry with copy_bytes_overshooting.

Decode time on the profiled scenario:
  prior commit  ~1,637,000 ns
  this commit    1,542,705 ns (5-run median)
  -> -5.7% incremental
  -> -15.0% cumulative from baseline (1,814,570 ns)
  -> 649 MB/s best, FFI gap 3.89x -> 3.28x

557 nextest tests pass. no_std + std clippy clean.

Refs #189
After the flat-extend refactor, the body of `extend` is short enough
(~5 instructions plus the inline copy on the common no-wrap path) that
the function-call overhead per literal push is a measurable fraction
of the work itself. `execute_sequences -> DecodeBuffer::push ->
RingBuffer::extend` was paying two stack-frame setups per sequence on
a path where most of the work is a single SIMD store.

Mark both `RingBuffer::extend` and `DecodeBuffer::push` `#[inline]`
so LLVM folds them into `execute_sequences`. The slow-path branches
inside `extend` (wraparound, capacity miss) stay non-inlined by
LLVM's heuristic because they end in calls and panics.

Local aarch64 measurements:
  prior commit (flat-extend)  ~1,542,705 ns
  this commit  best/median    1,484,378 / ~1,506,000 ns
  -> -2.4% incremental median
  -> -18.2% cumulative from baseline (1,814,570 ns)
  -> 676 MB/s best run

All 557 nextest tests pass.

Refs #189
…_n_plus_two

The N+2 lookahead helper indexes scratch.sequences three times per
sequence to pull ll_curr / ll_next / ll_n2. LLVM does not collapse
the three bounds checks because the borrow checker treats the wider
DecoderScratch reference as potentially mutated by sibling call
sites in the loop body. Cache scratch.sequences.len() into a local,
verify idx + 2 < seqs_len at the top, then read the three slots via
get_unchecked. Three Vec bounds checks per sequence drop to zero.

No measurable bench impact on the profiled scenario (median stays in
1,558 - 1,579 ns/iter window vs prior 1,567 ns median). Kept per
the project rule on theoretically-correct noise-level opts. 557
nextest tests pass.

Refs #189
…fset_history

The original branchless-RULES-table dispatch covered all four classes
(repcode 1/2/3 plus fresh offset) in one branchless expression. For
typical workloads the dominant case is class=3 (offset_value >= 4 =
fresh offset, not a repcode), where the work collapses to:

  actual = offset_value - 3
  scratch = [actual, scratch[0], scratch[1]]

Split into a fast path that handles offset_value >= 4 with an
unconditional shift, and a #[cold] #[inline(never)] do_offset_history_repcode
that runs the full RULES-table branchless dispatch for repcodes 1..=3.
Donor zstd does the same split inside ZSTD_updateRep.

Decode time on the profiled scenario:
  prior commit (prefetch get_unchecked)  ~1,567,000 ns median
  this commit                              1,413,000 ns median (best 1,406,952)
  -> -9.8% incremental median
  -> -22.1% cumulative from baseline (1,814,570 ns)
  -> 711 MB/s best, FFI gap 3.89x -> 3.01x

All 557 nextest tests pass.

Refs #189
…uence> roundtrip)

Donor zstd's ZSTD_decompressSequences_body interleaves ZSTD_decodeSequence
and ZSTD_execSequence in one loop, keeping the seq_t in registers. Our
two-pass pipeline (decode_sequences -> scratch.sequences Vec ->
execute_sequences) was paying ~24 B/seq * 2 of L1/L2 traffic on the Vec
roundtrip plus per-iter Vec::push overhead.

Add decode_and_execute_sequences in sequence_section_decoder which:
- Runs FSE table setup (maybe_update_fse_tables) and bit-stream padding
  skip up front.
- For RLE-mode blocks (LL/ML/OF RLE flags), falls back to the legacy
  two-pass path internally via decode_sequences_with_rle +
  execute_sequences_fields. Rare on real corpora; not worth duplicating.
- Hot path: for each sequence, decode_one_sequence_inline returns a
  Sequence in registers, then execute_one_sequence runs the literal
  push + match repeat immediately — no scratch.sequences write/read.

Takes field-level borrows of DecoderScratch fields (fse, buffer,
offset_hist, literals_buffer, sequences for RLE fallback) so the call
site in block_decoder can keep `raw` borrowing
workspace.block_content_buffer alongside the mutable workspace borrows.

Add execute_sequences_fields helper in sequence_execution that takes
the same field-level borrows for the RLE fallback path.

Legacy decode_sequences / execute_sequences kept with #[allow(dead_code)]
for the public API surface and any external callers that iterate
sequences explicitly.

Decode time on the profiled scenario:
  prior commit (offset_history fast path)  ~1,413,000 ns median
  this commit                                1,383,841 ns median (best 1,377,041)
  -> -2.1% incremental median
  -> -24.1% cumulative from baseline (1,814,570 ns)
  -> 726 MB/s best, FFI gap 3.89x -> 2.93x

All 557 nextest tests pass. clippy clean both std and no_std modes.

Refs #189
The non-BMI2 fallback in mask_lower_bits indexed a 65-entry static
table on every call. Replace with `u64::MAX.checked_shr(64 - n)`:
one shift + one (predicted) cmov vs one L1 load (3-5 cycle latency).
For the hot FSE bitstream decode path this fires 3x per sequence
(once per LL/ML/OF extra-bits extraction in get_bits_triple), so
saving the load latency compounds over thousands of sequences.

`checked_shr` returns None for shift counts >= 64, which is exactly
the n=0 case (64 - 0 = 64) and the invalid n > 64 case (wraps to a
huge value via wrapping_sub). Mapping both to 0 gives the
mathematically-correct empty mask for n=0 and a safe fallback for
the debug_asserted-invalid range.

BMI2 path unchanged (`_bzhi_u64` is already a single instruction).

Local aarch64 measurements:
  prior commit (fusion)  ~1,386,367 ns median
  this commit             1,365,211 ns median (best 1,353,635)
  -> -1.5% incremental
  -> -25.4% cumulative from baseline (1,814,570 ns)
  -> 739 MB/s best, FFI gap 3.89x -> 2.88x

All 557 nextest tests pass.

Refs #189
polaz added 10 commits May 19, 2026 17:24
After the mask_lower_bits switch to compute-based masking, BIT_MASK is
read only by:
- the BMI2 PEXT triple-extract path (std + x86_64 only), and
- the in-crate tests verifying mask values.

cfg-gate the const to `any(test, all(feature = "std", target_arch = "x86_64"))`
so no-std builds on non-x86 targets don't carry a `dead_code` error
through `-D warnings` in CI.

All three configurations clippy-clean with -D warnings:
- --no-default-features
- --no-default-features --features hash
- --features hash,std,dict_builder

557 nextest tests pass.

Refs #189
Two correctness fixes in decode_and_execute_sequences:

1. Clear rle_fallback_sequences on entry. The non-RLE fast path never
   touches the Vec, so without an explicit clear it would carry whatever
   entries the previous block left behind. Externally-observable scratch
   state must reflect the current block's data only.

2. Snapshot-restore the buffer write cursor and offset history before
   the per-iter side-effects, and roll back on post-loop bitstream
   validation failure. The legacy two-pass pipeline (decode_sequences
   first, execute_sequences after) reported NotEnoughBytesForNumSequences
   / ExtraBits before any block output was committed. The fused fast
   path commits each sequence's literal push + match repeat + offset
   history update immediately, so a malformed bitstream that only
   reveals itself at the end of the section was leaving partial output
   and mutated repeat history behind.

   Restore semantics: capture a DecodeBuffer checkpoint and clone the
   offset history at the top, and undo both on Err. New
   DecodeBuffer::{checkpoint, restore_checkpoint} primitives sit on a
   pub(super) RingBuffer::{tail, set_tail} pair so the rollback stays
   inside the decode module.

Add regression test checkpoint_restore_undoes_pushes to lock in the
primitive's behaviour — a future change that breaks tail-rewind
semantics would surface here instead of as silent state leakage in
the fused executor.

Local aarch64 measurements:
  prior commit   ~1,365,211 ns median
  this commit     1,390,764 ns median (5 runs)
  -> +1.9% from extra Vec::clear + 16-byte checkpoint snapshot per
     block; correctness fix is worth the small overhead
  -> -23.4% cumulative from baseline (1,814,570 ns)
  -> 727 MB/s, FFI gap 3.89x -> 2.96x

558 nextest tests pass (one new). clippy strict in std + no_std modes.
After the fused decode_and_execute_sequences became the sole entry
point in block_decoder, the legacy paths were dead code held with
allow(dead_code):
- decode_sequences (pub entry, ~40 lines)
- decode_sequences_without_rle (private fast path, ~127 lines)
- execute_sequences (pub workspace-based two-pass, ~90 lines)
- prefetch_literals_n_plus_two (only called by execute_sequences)

Delete all four. decode_sequences_with_rle and execute_sequences_fields
stay — they're invoked by the fused path's RLE-mode fallback. Removes
~292 lines of unreachable code that were dragging the patch-coverage
denominator down (codecov flagged sequence_execution at 56% patch
coverage, sequence_section_decoder at 72%; the dead paths accounted
for most of the gap).

558 nextest tests pass. clippy strict in std + no_std modes. Bench
median 1,413,205 ns/iter (within the 1.35-1.43 ms cluster from prior
commits; the transactional checkpoint added overhead remains the
biggest single regression vs the peak 1,353 ms run).
At head+start==cap the source effectively starts at index 0 after the
wrap, which is the continuous-source layout. The prior strict '>' check
routed this boundary through the split-source branch, producing a
zero-length copy_bytes_overshooting call against the trailing slack
region — correctness depended on the unsafe helper treating len == 0
as a hard no-op.

Use '>=' so the boundary case takes the continuous path directly.
restore_checkpoint() documents an invariant that no reallocation may
happen between checkpoint() and restore_checkpoint(), but the unsafe
function does not enforce it. A malformed sequence section that
decodes more than the upfront reserve(MAX_BLOCK_SIZE) can trigger
RingBuffer::reserve_amortized between the two calls — and amortized
growth compacts data (head=0, tail=s1+s2), invalidating the captured
tail index. Without a cap guard the restored tail silently points at
the wrong logical position in the new buffer.

Test fails on current code (no guard yet) — the follow-up fix adds
the cap snapshot + panic on mismatch.
restore_checkpoint() previously documented a 'no reallocation between
checkpoint and restore' invariant but did not enforce it. A malformed
sequence section can decode more than the upfront reserve(MAX_BLOCK_SIZE)
and trigger RingBuffer::reserve_amortized, which compacts the buffer
(head=0, tail=s1+s2) — invalidating the captured tail index. Restoring
that stale tail silently produces wrong output, with no observable
error.

Snapshot the RingBuffer cap alongside tail in DecodeBufferCheckpoint
and panic on mismatch. The check is a load-bearing correctness guard,
not a debug-only assert: silent UB is worse than a controlled panic on
untrusted input.

Existing checkpoint_restore_undoes_pushes test updated to reserve()
before the first push (mirroring the real fused-decoder call pattern,
which always reserves MAX_BLOCK_SIZE up front). The new
restore_checkpoint_after_realloc_panics regression test from the
preceding commit now passes.
The previous comment claimed 'no reallocation has happened in between
(buffer.reserve was called once before the checkpoint)' — but
reserve(MAX_BLOCK_SIZE) only rules out reallocation on well-formed
input. A malformed block that decodes past MAX_BLOCK_SIZE before the
post-loop bitstream check can still trigger reserve_amortized.

Document the actual invariant: restore_checkpoint() now enforces the
no-reallocation guarantee via a cap-equality panic, which is preferred
to silent wrong output on untrusted input.
Malformed block whose sequence section decodes more than the upfront
reserve(MAX_BLOCK_SIZE) the fused executor performs. That forces
RingBuffer::reserve_amortized between checkpoint() and the post-loop
bitstream validity check; the panic-on-cap-mismatch guard then turned
the malformed input into a hard abort.

Test fails on current code (the prior panic guard fires) and passes
once try_restore_checkpoint switches to a graceful no-restore return.
…ormed input

Replace the unsafe restore_checkpoint(_) that panicked on cap mismatch
with safe try_restore_checkpoint(_) -> bool. The cap-equality check is
now load-bearing logic instead of an unconditional abort:

- On well-formed input (the hot path) the upfront reserve(MAX_BLOCK_SIZE)
  rules out reallocation between checkpoint and post-loop validity
  check, so try_restore_checkpoint returns true and rolls back the
  speculative pushes / offset history exactly like before.
- On malformed input that decodes past MAX_BLOCK_SIZE, reserve_amortized
  compacts the ring buffer and invalidates the captured tail. The
  function now returns false, no state is mutated, and the caller
  continues with the same Err return — the partial bytes are discarded
  with the corrupt frame.

This closes the DoS surface where any decoder fed untrusted bytes
could be aborted with the panic guard that the previous round
introduced. Function is now safe (no unsafe at the call site).
Skip free_slice_lengths' branch + saturating_sub on the common case
where the data region has not wrapped (head <= tail) and the write
does not cross cap. In that layout free space is trivially
'cap - tail - 1 + head' >= 'cap - tail - 1', which dominates 'amount'
whenever 'tail + amount < cap'.

Mirrors the flat fast path already in RingBuffer::extend so push/repeat
on frames that fit in the window pay zero dispatch overhead from
reserve() either. Theoretical win on the hot path; M1 thermal noise
masks the actual delta in bench output, kept per the iteration
discipline rule (correct opt, no regression risk).
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
zstd/src/decoding/sequence_section_decoder.rs (2)

458-461: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect error variant for LL RLE code validation.

Line 459-460 returns MissingByteForRleMlTable when the literal length RLE code exceeds MAX_LITERAL_LENGTH_CODE. This should return an error variant that indicates an invalid literal length code, not a missing ML table byte. The validation logic is correct, but the error message is misleading.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/sequence_section_decoder.rs` around lines 458 - 461, The
check that validates the literal-length RLE byte (source[0] >
MAX_LITERAL_LENGTH_CODE) is using the wrong error variant; update the error
returned in that branch to the variant indicating an invalid literal-length code
(e.g. DecodeSequenceError::InvalidLiteralLengthCode) instead of
DecodeSequenceError::MissingByteForRleMlTable so that the validation of
scratch.ll_rle uses the correct error semantics for MAX_LITERAL_LENGTH_CODE
violations.

493-496: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect error variant for OF RLE code validation.

Line 494-495 returns MissingByteForRleMlTable when the offset RLE code exceeds MAX_OFFSET_CODE. This should return an error variant that indicates an invalid offset code. The validation logic is correct, but the error message is misleading for debugging.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/sequence_section_decoder.rs` around lines 493 - 496, The
check that validates the offset RLE code (if of_source[0] > MAX_OFFSET_CODE)
uses the wrong error variant; replace the returned
DecodeSequenceError::MissingByteForRleMlTable with the specific error variant
that denotes an invalid offset RLE code (e.g.,
DecodeSequenceError::InvalidOffsetRleCode or the existing variant name used for
invalid offsets) so the failure from this branch correctly signals an invalid
offset code; update the return in the block that precedes setting scratch.of_rle
to use that invalid-offset error variant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 458-461: The check that validates the literal-length RLE byte
(source[0] > MAX_LITERAL_LENGTH_CODE) is using the wrong error variant; update
the error returned in that branch to the variant indicating an invalid
literal-length code (e.g. DecodeSequenceError::InvalidLiteralLengthCode) instead
of DecodeSequenceError::MissingByteForRleMlTable so that the validation of
scratch.ll_rle uses the correct error semantics for MAX_LITERAL_LENGTH_CODE
violations.
- Around line 493-496: The check that validates the offset RLE code (if
of_source[0] > MAX_OFFSET_CODE) uses the wrong error variant; replace the
returned DecodeSequenceError::MissingByteForRleMlTable with the specific error
variant that denotes an invalid offset RLE code (e.g.,
DecodeSequenceError::InvalidOffsetRleCode or the existing variant name used for
invalid offsets) so the failure from this branch correctly signals an invalid
offset code; update the return in the block that precedes setting scratch.of_rle
to use that invalid-offset error variant.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d5fd3293-1da4-4e40-977d-60d547a6e49c

📥 Commits

Reviewing files that changed from the base of the PR and between 8cfa86d and 4de7069.

📒 Files selected for processing (11)
  • .github/workflows/ci.yml
  • zstd/benches/compare_ffi.rs
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/block_decoder.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/ringbuffer.rs
  • zstd/src/decoding/sequence_execution.rs
  • zstd/src/decoding/sequence_section_decoder.rs
  • zstd/src/decoding/simd_copy.rs
  • zstd/src/fse/fse_decoder.rs
  • zstd/src/tests/fuzz_regressions.rs

- fuzz_regressions: assert decoder.read_to_end is_err() instead of
  swallowing the result — silent-pass shape (if let Ok / let _ = …)
  could no longer catch a regression that decoded the malformed block
  to a non-error result.
- fse_decoder: move set_len into copy_symbols_into_decode, adjacent to
  the per-slot raw writes. Removes the non-local invariant requiring
  the caller's set_len to stay paired with the callee's init loop.
- ringbuffer::set_tail: extend SAFETY to require head fixed since
  capture (full/empty discriminator + invariant 4 only stay
  consistent then). Document the single current caller and its
  precondition.
- sequence_section_decoder: collapse the split [N-1 with-update] +
  [last without-update] loop into one loop with a trailing
  'if i + 1 < num_sequences' state-update guard. Same generated code,
  one body, donor isLastSeq pattern.
- decode_buffer test: Vec::new() instead of vec::Vec::new() (the
  latter resolves only via alloc::vec re-exporting Vec — looks like
  a typo).
- bit_reader_reverse::mask_lower_bits: docstring now explicitly notes
  the input-validation debug_assert is on this function (not just
  get_bits) — Copilot misread the existing assert location.
- simd_copy::single_op_copy_16: add a reachability matrix comment
  before the portable fallback so a future arch arm slotted between
  existing arms knows it must terminate with return or its code goes
  silently dead.
@polaz polaz force-pushed the feat/#178-decode-speed branch from 4de7069 to 83c52dd Compare May 19, 2026 14:31
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
zstd/src/decoding/ringbuffer.rs (1)

124-155: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard reserve capacity math against usize wraparound.

Line 133 and Line 154 use unchecked addition. A very large amount from safe callers can wrap in release, causing reserve() to skip growth or compute an undersized new_cap; the later unsafe writes in extend, extend_and_fill, or extend_from_reader then trust space that was never actually reserved. Please switch these checks to non-overflowing arithmetic (amount < self.cap - self.tail for the fast path, checked_add before growth sizing) and fail fast on overflow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/decoding/ringbuffer.rs` around lines 124 - 155, The fast-path in
reserve and the new capacity math in reserve_amortized are using unchecked
additions that can wrap; update reserve's condition to use non-overflowing
comparison (e.g., check amount < self.cap.saturating_sub(self.tail) or amount <
self.cap - self.tail using a checked branch equivalent) so the flat-path can't
be bypassed by wraparound, and in reserve_amortized switch to checked_add (use
usize::checked_add when computing cap+amount and when adding
WILDCOPY_OVERLENGTH) and fail fast (panic with a clear "capacity overflow"
message) if any checked_add returns None; ensure the Layout::array call uses the
verified new capacity so late unsafe writes in
extend/extend_and_fill/extend_from_reader cannot rely on undersized allocation.
zstd/src/fse/fse_decoder.rs (1)

60-72: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Make update_state_fast crate-private to prevent undefined behavior.

This method exposes an unsafe contract as a public API with insufficient enforcement. Safe code can construct FSEDecoder::new(&FSETable::new(..)) and invoke update_state_fast, which dereferences get_unchecked(0) on an empty decode vector. Public mutable fields (decode, accuracy_log) further break the invariant. Change pub fn to pub(crate) fn to preserve internal use while eliminating the UB surface.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/fse/fse_decoder.rs` around lines 60 - 72, The public method
update_state_fast exposes an unsafe contract that can be violated by external
code (e.g., callers can create FSEDecoder::new(&FSETable::new(..)) and trigger
an out‑of‑bounds get_unchecked on table.decode); make update_state_fast
crate-private by changing its signature from pub fn update_state_fast(...) to
pub(crate) fn update_state_fast(...). Also ensure callers inside the crate still
use update_state_fast and that external access to the FSETable fields (decode,
accuracy_log) remains restricted by keeping them non‑pub or documenting internal
invariants so the unsafe indexing invariant cannot be violated from outside the
crate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 177-188: The code currently unconditionally restores the repcode
history via "*offset_hist = saved_offset_hist" even when
"buffer.try_restore_checkpoint(buffer_checkpoint)" fails; change this so the
history is only rewound when try_restore_checkpoint returns true (i.e. the
rollback actually happened). Concretely, call
"buffer.try_restore_checkpoint(buffer_checkpoint)" into a boolean (or use its
existing result), and move the "*offset_hist = saved_offset_hist" assignment
inside the branch that checks that it returned true; leave "offset_hist"
untouched when the restore fails so the buffer and offset history remain
consistent for reuse.

---

Outside diff comments:
In `@zstd/src/decoding/ringbuffer.rs`:
- Around line 124-155: The fast-path in reserve and the new capacity math in
reserve_amortized are using unchecked additions that can wrap; update reserve's
condition to use non-overflowing comparison (e.g., check amount <
self.cap.saturating_sub(self.tail) or amount < self.cap - self.tail using a
checked branch equivalent) so the flat-path can't be bypassed by wraparound, and
in reserve_amortized switch to checked_add (use usize::checked_add when
computing cap+amount and when adding WILDCOPY_OVERLENGTH) and fail fast (panic
with a clear "capacity overflow" message) if any checked_add returns None;
ensure the Layout::array call uses the verified new capacity so late unsafe
writes in extend/extend_and_fill/extend_from_reader cannot rely on undersized
allocation.

In `@zstd/src/fse/fse_decoder.rs`:
- Around line 60-72: The public method update_state_fast exposes an unsafe
contract that can be violated by external code (e.g., callers can create
FSEDecoder::new(&FSETable::new(..)) and trigger an out‑of‑bounds get_unchecked
on table.decode); make update_state_fast crate-private by changing its signature
from pub fn update_state_fast(...) to pub(crate) fn update_state_fast(...). Also
ensure callers inside the crate still use update_state_fast and that external
access to the FSETable fields (decode, accuracy_log) remains restricted by
keeping them non‑pub or documenting internal invariants so the unsafe indexing
invariant cannot be violated from outside the crate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ffa04f02-07bd-45b0-92a4-8ff37992dad1

📥 Commits

Reviewing files that changed from the base of the PR and between 4de7069 and 83c52dd.

📒 Files selected for processing (11)
  • .github/workflows/ci.yml
  • zstd/benches/compare_ffi.rs
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/block_decoder.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/ringbuffer.rs
  • zstd/src/decoding/sequence_execution.rs
  • zstd/src/decoding/sequence_section_decoder.rs
  • zstd/src/decoding/simd_copy.rs
  • zstd/src/fse/fse_decoder.rs
  • zstd/src/tests/fuzz_regressions.rs

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 91.22056% with 41 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/decoding/simd_copy.rs 83.50% 16 Missing ⚠️
zstd/src/decoding/sequence_section_decoder.rs 89.70% 14 Missing ⚠️
zstd/src/decoding/sequence_execution.rs 90.69% 4 Missing ⚠️
zstd/src/fse/fse_decoder.rs 80.00% 4 Missing ⚠️
zstd/src/decoding/ringbuffer.rs 97.39% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

- sequence_section_decoder: only restore offset_hist when
  try_restore_checkpoint actually rolled back the buffer. The previous
  unconditional rewind left buffer + offset_hist out of sync on the
  realloc-detected branch (buffer keeps speculative bytes, repcode
  history rewinds) — workspace would be internally inconsistent for
  any subsequent reuse after the Err.
- fse_decoder: tighten update_state_fast from pub to pub(crate). The
  method's unchecked decode-table read assumes a fully-built table
  reached via crate-internal call paths (FSEDecoder constructed with
  an empty FSETable would dereference get_unchecked(0) on an empty
  Vec, which is UB). External access cannot construct that invariant
  themselves, so pub(crate) is the right surface.
- ringbuffer::reserve: switch fast-path math to cap.saturating_sub
  and reserve_amortized to checked_add. A caller passing
  amount == usize::MAX (e.g. via a malformed match_length) could
  previously wrap tail+amount or cap+amount in release and produce
  an undersized allocation that subsequent unsafe writes trusted —
  now overflows panic with a clear message instead.
Copilot AI review requested due to automatic review settings May 19, 2026 14:56
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(decode + encode-greedy): close 3-5× donor gap on negative-level decompress; share SIMD primitives + add dedicated greedy strategy

2 participants