perf(decode): fused sequence executor + SIMD/FSE hot-path cleanup + DoS-safe rollback#194
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR fuses sequence decoding and execution into an inline decode+execute pipeline, adds DecodeBuffer checkpoint/rollback, allocates 16-byte ringbuffer slack for safe overshooting SIMD copies, rewrites simd_copy dispatch, adjusts bitreader and FSE hot-paths, defers benchmark compression, and adds a no-std clippy CI job plus a malformed-block regression test. ChangesFused sequence decode+execute refactor
Benchmark and regression testing updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Continuation of decode-speed work from the closed PR #190, fusing the sequence decoder/executor onto one loop and tightening hot SIMD/FSE paths. The fused executor now records a transactional rollback point through DecodeBuffer::checkpoint/try_restore_checkpoint, so a malformed block that forces a RingBuffer reallocation between checkpoint and validation surfaces as a normal decode Err instead of panicking (closes the libFuzzer DoS surface from #178).
Changes:
- New fused
decode_and_execute_sequencesplus rollback primitives onDecodeBuffer/RingBuffer, with a regression test for the libFuzzer crash. simd_copycollapses the function-pointer dispatcher, adds a single-op 16-byte fast path and a 32-byte exact-tail path;RingBufferroutes pushes through it with explicitWILDCOPY_OVERLENGTHslack accounting.- Hot-path tightening in
BitReaderReversed::refill(split hot/cold) andFSEDecoder::decode_symbol/FSETablebuild (rawset_len+ unaligned writes); offset-history fast path; no-std CI gate.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| zstd/src/decoding/sequence_section_decoder.rs | Adds fused decode_and_execute_sequences with transactional rollback, base-table lookups, drops some validation checks. |
| zstd/src/decoding/sequence_execution.rs | Renames/recasts executor as execute_sequences_fields; adds do_offset_history fast path; removes prefetch helper. |
| zstd/src/decoding/decode_buffer.rs | New checkpoint/try_restore_checkpoint rollback primitives and tests. |
| zstd/src/decoding/ringbuffer.rs | WILDCOPY_OVERLENGTH slack in allocations, flat-extend fast path, slack accounting per call site, new tests. |
| zstd/src/decoding/simd_copy.rs | Collapses dispatcher; new single-op-16 and 32-byte exact-tail paths; active_chunk_size_for_tests rewritten. |
| zstd/src/decoding/block_decoder.rs | Switches to fused decode_and_execute_sequences, drops the explicit execute_sequences call. |
| zstd/src/bit_io/bit_reader_reverse.rs | Splits refill into hot/cold halves; computes mask_lower_bits via shift instead of LUT. |
| zstd/src/fse/fse_decoder.rs | get_unchecked in decode_symbol; reserve+set_len instead of resize to skip zero-fill; unaligned u32 write for trailing odd entry. |
| zstd/src/tests/fuzz_regressions.rs | Inline regression test for the libFuzzer DoS artifact. |
| zstd/benches/compare_ffi.rs | Lazy OnceCell materialization of compressed inputs for profiling. |
| .github/workflows/ci.yml | New no-std clippy job covering --no-default-features and --no-default-features --features hash. |
Comments suppressed due to low confidence (1)
zstd/src/decoding/sequence_section_decoder.rs:1
- The runtime
of_code > MAX_OFFSET_CODEandoffset == 0checks have been dropped from bothdecode_one_sequence_inlineanddecode_sequences_with_rle, replaced bydebug_assert!only. The comment asserts these invariants are guaranteed bybuild_decoding_table, butof_codeis consumed two lines later as1u32 << of_code: if any decoder path (FSE table construction, predefined-distribution import, or a future code change) ever yieldsof_code >= 32, this is a left-shift overflow — UB on untrusted input in release builds, and the previous explicit error return was the only thing keeping a corrupt FSE table from reaching it. Please either retain the runtime check onof_code(and onoffset != 0) on this hot path or add a tight, audited invariant assertion at the table-construction site that is explicitly cross-referenced here so the chain of trust is greppable. Removing a defense-in-depth check on parsed bytes from an untrusted frame deserves an explicit decision rather than a debug-only assert.
use super::super::blocks::sequence_section::ModeType;
Previously bench_decompress ran compress_slice_to_vec + ffi_encode_to_vec upfront for every scenario × level × source triple, even when criterion's filter selected a single benchmark. cargo bench --profile-time <filter> then spent its entire window in encode setup instead of the decode hot path it was supposed to profile. Wrap the compressed-byte production in a OnceCell inside bench_decompress_source. Materialization now happens only when the bench_function closure runs (i.e., when criterion's filter matches), keeping samply / cargo flamegraph traces concentrated on the decode side. CI timings are unaffected — the timing loop already excluded setup; this change just narrows the work performed during ad-hoc filtered runs. Refs #189
The crate declares #![no_std] with std as an opt-out feature, but the existing test/clippy matrix only exercises --features hash,std, dict_builder. A regression that pulls std::* into a hot decoder path slips through unnoticed. Add a new no-std job covering the two realistic deployment shapes: - alloc-only (no features) - alloc + xxhash content checksum (--features hash) Both run cargo check and cargo clippy with -D warnings so a stray `use std::...` lights up immediately. Refs #189
Convert the lookup_ll_code / lookup_ml_code helpers from staircase match arms to const arrays mirroring donor zstd's LL_base / LL_bits / ML_base / ML_bits tables. The bounds-check assert preserves the original panic-on-illegal-code semantics; LLVM already lowered the match to a jump table so perf is unchanged on this corpus, but the LUT form is the donor-parity baseline future work can build on. Refs #189
FSEDecoder::update_state_fast indexes self.table.decode[next_state] on every sequence × 3 state advances. The bounds check fires every iteration because LLVM cannot prove the invariant that spans table-build and decode call sites: calc_baseline_and_numbits pairs new_state with num_bits such that new_state + (2^num_bits - 1) < table_size = self.table.decode.len(); get_bits_unchecked guarantees add < 2^num_bits; therefore next_state < self.table.decode.len() always holds for any valid FSE table — but the proof is cross-function and the bounds check stays. Replace the indexed read with get_unchecked and a SAFETY comment citing the invariant. No new unsafe entry point — the function is already unsafe-ish in the sense that the caller must call ensure_bits before invoking it (otherwise get_bits_unchecked reads stale bits). Refs #189
…st path
Replace the CopyStrategy struct + CopyFn function-pointer indirection
with an inlined arch-aware decision tree. Two structural wins:
1. The per-call indirect dispatch through `(strategy.copy)(...)` is
gone — kernels are direct-called from the dispatcher, so on aarch64
(where NEON is the ABI baseline) the kernel inlines into
copy_bytes_overshooting entirely.
2. New single_op_copy_16 fast path covers copy_at_least in 1..=16 with
one SIMD store (or two overlapping u64 stores on archs without
128-bit vectors). Match-copy execution with offset 8..15 funneled
into the previous scalar-strategy loop on every iteration — the
profile showed simd_copy::copy_scalar at 273 ms (5.5%) vs copy_neon
at 80 ms (1.6%) on `decompress/level_-1_fast/decodecorpus-z000033/
c_stream/matrix/pure_rust`; the new fast path collapses those 8..15
byte calls to one vector op.
no_std semantics preserved: x86 runtime detection stays std-only, no_std
x86 uses compile-time target_feature cfg, aarch64+neon is unconditional
in both modes. CI no-std + no-std-hash jobs cover the build matrix.
Local aarch64 (M1, decodecorpus-z000033 c_stream level_-1_fast):
baseline 1,814,570 ns/iter
this PR 1,737,827 ns/iter (median of 3, best 1,683,119)
-> -4.2% median (-7.2% best); 553 → 575 MB/s
All 86 simd_copy / ringbuffer / decode_buffer / cross_validation /
roundtrip tests green. cargo clippy --no-default-features {-, +hash}
clean.
Refs #189
simd_copy::copy_bytes_overshooting falls back to copy_from_nonoverlapping
(libc memmove) whenever `min(src.1, dst.1) < copy_at_least` rounded up
to the active SIMD chunk. The fallback dominated the decode profile —
_platform_memmove sat at 21.5% of CPU on decompress/level_-1_fast/
decodecorpus-z000033/c_stream/matrix/pure_rust because match copies near
ring-buffer wraparound boundaries had tight capacities and tripped the
fallback every iteration. The new single_op_copy_16 fast path also
requires min_buffer_size >= 16, so any 8..15-byte chunk emitted by
DecodeBuffer::repeat_in_chunks for offsets 8..15 skipped it for the
same reason.
Mirror donor zstd's WILDCOPY_OVERLENGTH = 16 contract by:
1. Allocating `cap + WILDCOPY_OVERLENGTH` bytes via alloc_zeroed. The
slack region is physically present past `cap` but never indexed via
`head`/`tail` (those still wrap on `cap`). alloc_zeroed instead of
alloc so wildcopy reads observe defined values rather than UB on
uninitialized memory.
2. Updating Drop and `reserve_amortized`'s `current_layout` to match.
3. Inflating the `src.1`/`dst.1` capacities passed to
`copy_bytes_overshooting` at every call site where the inflated end
stays inside the allocation:
- src.1: every site (source pointer + WILDCOPY_OVERLENGTH lands in
the slack region at worst).
- dst.1: only the case-A first-copy site where dst ends at `cap`
(slack absorbs overshoot). All other dst spans end at `head` and
cannot be inflated — overshoot would clobber readable data.
Decode time on the profiled scenario:
baseline 1,814,570 ns/iter
prev PR 1,737,827 ns/iter (single_op + dispatcher collapse)
this commit 1,668,752 ns/iter (median of 4, range 1.667-1.678)
-> -8.0% from baseline, -4.0% incremental
-> FFI gap 3.89x -> 3.57x; 553 MB/s -> 600 MB/s
All 556 nextest tests pass. no_std + clippy clean.
Refs #189
- Pin the `clippy` component on the toolchain step (matches the existing `lint` job's explicit components: rustfmt, clippy spec — deterministic install, no auto-fetch on every run). - Drop the standalone `cargo check` steps: `cargo clippy` already runs the full compile + analysis pipeline, so a separate check was duplicated work for no extra signal. Refs #189
The Entry { new_state: 0, symbol: 0, num_bits: 0 } pre-fill from
self.decode.resize() was redundant on little-endian targets:
copy_symbols_into_decode already writes a full 4-byte Entry (with
new_state = 0 and num_bits = 0 implicit in the packed u32/u64 layout)
to every slot via raw ptr::write_unaligned calls.
- Replace the resize with reserve + set_len on LE (BE keeps resize so
the iter_mut() field-assignment path stays safe — no production BE
target ships against this crate).
- Extend copy_symbols_into_decode's trailing odd-entry path to write
a full Entry via a u32 store; the previous field assignment relied
on the resize-side zero-fill to initialize new_state and num_bits.
Measured impact on the L-1 fast decodecorpus-z000033 c_stream path is
~0% — FSETable::build_decoding_table self-time is 171 ms / 3.4% of
decode CPU and the resize zero-fill was only ~half of that. The win
is real but small relative to the ring-buffer / simd_copy wins
upstream of it. Keeping the change because the pattern (skip
zero-fill when subsequent writes cover every byte) is a clean donor
parity point and stops future regressions where a slower iter_mut
loop would re-introduce the redundant pass.
Refs #189
…tend through it
Two coordinated changes that together cut _platform_memmove from 25.9%
self-time (1264 ms) to 7.1% (343 ms) on decompress/level_-1_fast/
decodecorpus-z000033/c_stream/matrix/pure_rust.
1. simd_copy: add an exact-length tail path for copy_at_least <= 16 that
fires when the single_op_copy_16 fast path can't (no WILDCOPY slack
on the destination). Two branches:
- 1..=8 bytes: byte-by-byte loop. Unrolls to ≤8 immediate-offset
load/store pairs.
- 9..=16 bytes: two overlapping unaligned u64 ops, writing exactly
copy_at_least bytes with no overshoot past dst.0 + copy_at_least.
Replaces the libc memmove call that the dispatcher used to fall back
to for short copies near RingBuffer head boundaries (where head
bounds dst.1 and there is no slack to inflate).
2. RingBuffer::extend: route both free-slice writes through
simd_copy::copy_bytes_overshooting instead of raw
copy_from_nonoverlapping. Profile showed 943 ms of memmove was
coming from `execute_sequences → DecodeBuffer::push → extend`
(literal-byte copies during sequence execution). The previous direct
copy_from_nonoverlapping always lowered to a libc call; routing
through simd_copy lets short literal pushes (the common case at
negative levels) use the inline byte / overlapping-u64 paths above.
dst.1 for the first free slice is inflated by WILDCOPY_OVERLENGTH
because f1_ptr ends at `cap` (slack region is writable); the second
free slice ends at `head` so its dst.1 stays exact.
Decode time on the profiled scenario:
baseline 1,814,570 ns/iter
prior PR state 1,668,752 ns/iter
this commit 1,653,976 ns/iter (median of 5)
-> -8.9% from baseline, -0.9% incremental
The bench-time delta is smaller than the profile-attribution shift
suggests because the new inline path does roughly the same byte-level
work memmove was doing — the win is in killing the libc call overhead
on short copies. Larger copies (>16 bytes that previously hit memmove)
still hit it; chasing those further has diminishing returns.
All 86 simd_copy / ringbuffer / decode_buffer / cross_validation /
roundtrip tests pass. no_std + clippy clean.
Refs #189
…ng check
decode_sequences_without_rle previously paid two per-iteration
overheads on the hot path:
1. `if target.len() < num_sequences { state advance }` — used to skip
the trailing state update on the final iteration. Materializing
`target.len()` every iteration and branching cost ~2-3 ns per
sequence on top of the FSE work.
2. `if br.bits_remaining() < 0 { return Err(NotEnoughBytes) }` — an
isize multiply + 3 subtractions every iteration. Donor zstd only
validates bitstream exhaustion once after the sequence loop.
Refactor splits the loop into a hot (num_sequences - 1) body that
always advances state, and a tail iteration that skips the advance.
The bits_remaining check moves out of the hot path into a single
post-loop validation that catches the same underflow case via the
existing >= 0 comparison.
Decode time on the profiled scenario:
baseline 1,814,570 ns/iter
prior PR state 1,653,976 ns/iter
this commit 1,632,669 ns/iter (median of 5)
-> -10.0% from baseline, -1.3% incremental
-> FFI gap 3.89x → 3.50x; 553 MB/s → 612 MB/s
Same change deferred for decode_sequences_with_rle — that path only
fires on blocks that mix RLE-mode codes with FSE-encoded streams,
which decodecorpus-z000033 (and most real corpora at fast levels)
do not produce. Splitting it would add complexity for zero measurable
gain on the targeted scenario.
Refs #189
CodeRabbit flagged that extend_from_within_unchecked_branchless passes exact-fit (ptr, len) capacities through to copy_with_nobranch_check -> simd_copy::copy_bytes_overshooting, so it cannot fire the SIMD fast paths that require min(src.1, dst.1) >= 16 — short copies on that path always take the inline byte / overlapping-u64 fallback instead of single_op_copy_16. Inflating the capacities the way `extend_from_within_unchecked` does would require threading per-pointer head/tail context into `copy_with_nobranch_check`, which currently sees only raw pointers and chunk lengths. Defer that refactor: the branchless path is x86-gated via `decode_buffer::use_branchless_wildcopy`, and my profiling rig is aarch64 — there is no local x86 measurement to justify the extra plumbing right now. Document the trade-off in a doc-comment on the function so a future maintainer doesn't accidentally assume WILDCOPY semantics hold on this path. On aarch64, the unconditional `extend_from_within_unchecked` is used, so the slack contract is exercised end-to-end there. Refs #189
The previous tail path covered copies of 1..=16 bytes inline; copies of 17..=32 bytes fell through to the chunked SIMD dispatch, which then bailed (`min_buffer_size < copy_at_least.next_multiple_of(chunk)`) and ended up in libc memmove for buffers without WILDCOPY_OVERLENGTH slack on the destination side. Extend the inline path to 17..=32 by emitting four unaligned u64 ops: two adjacent stores for the first 16 bytes, then two overlapping stores for the trailing 1..=16 bytes. Eight loads + eight stores total, all branch-free, exactly `copy_at_least` bytes written without overshoot. This catches the literal-push size range (10..=24 bytes is typical on the profiled corpus) that the previous ≤16-byte cap missed. Decode time on `decompress/level_-1_fast/decodecorpus-z000033/ c_stream/matrix/pure_rust`: baseline 1,814,570 ns/iter prior PR state 1,632,669 ns/iter this commit 1,607,680 ns/iter (median of 5) -> -11.4% from baseline, -1.5% incremental -> FFI gap 3.89x → 3.40x; 553 MB/s → 629 MB/s All 86 simd_copy / ringbuffer / decode_buffer / cross_validation / roundtrip tests pass. Refs #189
…_len window RingBuffer::extend: The first-free-slice destination capacity passed to `simd_copy::copy_bytes_overshooting` was unconditionally inflated by WILDCOPY_OVERLENGTH. That is only safe when `tail >= head` — in the wrapped layout (`tail < head`) `free_slice_parts` returns the inner `(buf+tail, head-tail)` gap, and inflating into the WILDCOPY region would let wildcopy overshoot past `head` into still-readable buffer output. Today no overshoot path actually fires because `src.1` is exact (`in_f1 == copy_at_least`) and caps `min_buffer_size`, so the inflation never changed dispatch — but the latent UB hazard if anyone ever adds a `dst.1`-only overshoot fast path is real. Gate the inflation on `tail >= head` and rewrite the comment to make both reasoning paths explicit (slack-reachability AND src.1 cap). Add `extend_wrapped_layout_preserves_bytes_past_head`: a regression test that builds a wrapped layout via a new `build_tail_before_head` helper, fills the inner gap, and asserts the 16-byte window past `head` is untouched. FSETable::build_decoding_table: `self.decode.set_len(table_size)` ran immediately after `reserve`, leaving Entry slots uninitialized through the entire `table_symbols` resize + symbol-spreading loop. A panic in that window (allocator failure on `table_symbols.resize`) would unwind with reachable uninitialized Entry slots — currently UB-free because `Entry: Copy` has trivial drop, but a fragile contract for future maintainers. Move `set_len` to immediately precede `copy_symbols_into_decode`, after all panic-able allocations have finished. The uninitialized- slot window is now zero panic-able code wide; an unwind from the spreading loop leaves `self.decode.len() == 0` instead. All 557 nextest tests pass. Decode perf unchanged on the profiled scenario (median 1,594,500 ns/iter — the inflation never affected dispatch outcome under the current simd_copy contract). Refs #189
… test
Clippy 1.95 flags `for i in 0..16 { sentinel[i] = ... }` as
`needless_range_loop`. Rewrite the prefill-sampling loop in
extend_wrapped_layout_preserves_bytes_past_head to use
`sentinel.iter_mut().enumerate()`; the verification loop keeps the
plain `for i in 0..16usize` form because its body indexes the raw
ring-buffer pointer (not the loop's own slice), which clippy permits.
Refs #189
The previous refill() carried a blanket #[cold] annotation. Donor analysis (libzstd BIT_reloadDStream_internal at bitstream.h:370-444) plus the sequence-decode profile flagged this as a regression: refill fires roughly every 2 sequences during sequence decode — it is the dominant mid-stream path, not a cold case. The #[cold] hint pessimised the branch predictor and forced LLVM to spill the hot 3-instruction body out of every caller's straight-line code. Reshape the function so: - Outer refill() is #[inline(always)]. Only the mid-stream hot path (`self.index >= bytes_consumed`: subtract index, mask bit count, load 8 bytes) lives here, folding into every get_bits / ensure_bits call site. - The three end-of-stream branches (`self.index > 0`, exhausted but bit_consumed < 64, fully exhausted) move into refill_slow(), marked #[cold] #[inline(never)]. Same semantics, just no longer parked on the hot path. Local aarch64 measurements are dominated by thermal / background-process noise right now (range 1.68 - 13.1 ms across consecutive runs of the same bench, vs the prior commit's stable 1.65 ms median); the lowest sample matches prior best so there is no regression. Keeping the change per the theoretically-correct-opts-stay-in-tree rule — final profile pass will revisit if the noise floor settles. All 120 bit_io / fse / cross_validation / roundtrip tests pass. Refs #189
Both per-iter validations in decode_sequences (with_rle main, without_rle main, without_rle tail — all three loop bodies) were mathematically unreachable under any successfully-constructed FSE table: - `of_code > MAX_OFFSET_CODE` (31): `AlignedFSETable::new(MAX_OFFSET_CODE)` caps `max_symbol` at table construction, and `build_decoding_table`'s `TooManySymbols` guard rejects any probabilities array exceeding `max_symbol + 1` entries. The symbols stored in the FSE state table index that probabilities array, so emitted symbols are guaranteed bounded. For RLE-mode tables, `maybe_update_fse_tables` validates the singleton byte against MAX_OFFSET_CODE at scratch setup. - `offset == 0`: `offset = obits + (1u32 << of_code)` and `1u32 << of_code` is ≥ 1 for any `of_code` in 0..=31 — i.e. the addend is at least 1. Replace both branches with `debug_assert!` so a future invariant break trips in tests without paying for the comparison on the hot path in release builds. Same change applied to the with_rle path, the without_rle main loop body, and the without_rle tail iteration (3 sites total) so the four branches × N sequences disappear from every FSE decode iteration. Local aarch64 measurements: median 1,611 ms / best 1,585 ms across 5 runs — matches prior best (1,586 ms after the exact-len-32 commit). Range is much tighter than the prior commit's thermal-noise window (5 - 107 us stddev here vs 5,628 us there). The theoretical win is ~1-3% per the donor analysis (4 branches × ~100k sequences on the profiled frame); on this run it tucked into the noise floor, but the change is kept per the project rule that theoretically-correct optimizations stay in tree pending a final clean-room profile. All 67 sequence_section_decoder / decode_sequences / cross_validation / roundtrip tests pass. Refs #189
…e helper The main loop and tail iteration of decode_sequences_without_rle carried two near-identical copies of the per-sequence decode body. Pull the shared work (FSE symbol fetch, LL/ML LUT, get_bits_triple, offset assembly, debug_asserts) into a private #[inline(always)] helper returning Sequence, called from both sites. LLVM still specializes each call site after inlining; the source-level duplication and divergence risk are gone. No measurable perf change on the profiled scenario (median 1,605 ms / range 1,600 - 1,627 ms across 5 runs, vs prior 1,611 ms median — within stddev). All 557 nextest tests pass. Refs #189
The hot loop in execute_sequences indexes `scratch.sequences[idx]` on every iteration to copy the Sequence struct into a local. LLVM cannot prove the access is in-bounds because `scratch` is a `&mut DecoderScratch` and the loop body calls `scratch.buffer.reserve(...)` / `scratch.buffer.push(...)` / `scratch.buffer.repeat(...)`, which the borrow checker treats as opaque mutations of the wider struct. Capture `scratch.sequences.len()` into a `sequences_len` local before the loop (so the iteration range is stable) and switch the per-iter read to `get_unchecked(idx)` with a SAFETY comment citing the idx < sequences_len invariant. Same Sequence value extracted, no behavioural change. Local aarch64 measurements: prior commit median 1,605,658 ns / range 1,600 - 1,627 ns this commit median 1,588,630 ns / range 1,582 - 1,594 ns -> -1.1% incremental, -12.5% cumulative from baseline 1,814,570 ns -> FFI gap 3.89x -> 3.36x; 553 MB/s -> 644 MB/s All 81 sequence / execute / roundtrip / cross_validation tests pass. Refs #189
Both `if seq.ll > 0 { push(literals) }` and `if seq.ml > 0 { repeat(...) }`
in execute_sequences were guarding downstream calls that already handle
the zero-length case:
- `RingBuffer::extend` (called via `DecodeBuffer::push`) early-returns
when the input slice is empty.
- `DecodeBuffer::repeat` early-returns when match_length == 0.
Drop the guards — when seq.ll/ml is zero the empty slice / zero match
length flows through naturally without any byte-copy work. Saves two
predictable branches per sequence on a hot path that runs once per
emitted sequence.
The literal-length bounds check (`high > literals_buffer.len()`) stays
on the always-execute path, which is mathematically equivalent to the
old check inside the `seq.ll > 0` branch — when ll == 0,
`high == literals_copy_counter`, so the check resolves identically.
Local aarch64 measurements: median 1,607 ns / range 1,582 - 1,620 ns
across 5 runs, vs prior commit 1,589 ns median. Within run-to-run
stddev (best 1,582 ns matches prior best 1,583 ns). Theoretically
correct (~0.5-1% per branch elimination); kept per the rule that
correct opts stay in tree pending a final clean-room profile.
All 557 nextest tests pass.
Refs #189
In --no-default-features builds on x86/x86_64 the dispatcher cfg-gates
out every call site for `copy_avx2` and `copy_avx512`:
#[cfg(target_feature = "avx512f")] try_chunk_kernel!(64, copy_avx512);
#[cfg(target_feature = "avx2")] try_chunk_kernel!(32, copy_avx2);
The default x86_64 Rust target does NOT enable +avx2 / +avx512f (those
require explicit RUSTFLAGS), so on no-std builds without those flags
both call sites cfg-out. Std builds use runtime detection through
`detect_x86_caps`, which rustc cannot see through statically.
CI no-std job tripped on `-D dead_code`. Tag both functions
`#[allow(dead_code)]` with a doc comment explaining when they go live
(std builds OR `target_feature` enabled).
All three build modes verified locally: --no-default-features,
--no-default-features --features hash, and --features hash,std,
dict_builder all clippy-clean with -D warnings.
…cute_sequences Two small loop-body changes: 1. `literals_buffer.len()` was reread on every iteration to compute the bounds-check threshold. The buffer is not mutated by anything called inside the loop (push writes to `scratch.buffer`, not `scratch.literals_buffer`), so the length is loop-invariant. Capture it before the loop as `literals_buffer_len`. 2. `&scratch.literals_buffer[lit_cur..high]` triggers a bounds check even though we just validated `high <= literals_buffer_len` one line above. Switch to `get_unchecked(lit_cur..high)` with a SAFETY comment citing the monotone advance and the validation above. Local aarch64 measurements: range 1,622 - 1,661 ns across 5 runs (one 2,230 outlier ignored as cargo-clean cache miss / thermal — first run after full deps rebuild), best 1,622 ns matches prior best window. Stays in the noise floor relative to the prior commit's 1,587 ns median; theoretically saves one length read + one slice bounds check per sequence so kept per the project rule on noise-floor opts. All 557 nextest tests pass. Refs #189
Profile (decompress L-1 fast decodecorpus-z000033 c_stream) showed
RingBuffer::extend self-time grew to 33.6% after the simd_copy routing
refactor — that earlier change moved the byte-level work into extend's
caller frame, but kept the 4-case free_slice_parts dispatch on the
slow path for workloads that never wrap.
Add a flat fast path at the top of extend that fires whenever the
write is one contiguous copy from data to buf+tail:
if head <= tail && tail + len <= cap {
simd_copy(src, dst with WILDCOPY slack, len);
tail += len; if tail == cap { tail = 0; }
return;
}
For frames whose total decoded size fits in window_size (the common
case on this bench and on most real-world streams), this path fires
on every literal push. The slow path still handles wraparound.
Also fold the CodeRabbit style nit on active_chunk_size_for_tests:
explicit any(target_arch = x86, x86_64) guard added to all three
no-std target_feature cfgs for symmetry with copy_bytes_overshooting.
Decode time on the profiled scenario:
prior commit ~1,637,000 ns
this commit 1,542,705 ns (5-run median)
-> -5.7% incremental
-> -15.0% cumulative from baseline (1,814,570 ns)
-> 649 MB/s best, FFI gap 3.89x -> 3.28x
557 nextest tests pass. no_std + std clippy clean.
Refs #189
After the flat-extend refactor, the body of `extend` is short enough (~5 instructions plus the inline copy on the common no-wrap path) that the function-call overhead per literal push is a measurable fraction of the work itself. `execute_sequences -> DecodeBuffer::push -> RingBuffer::extend` was paying two stack-frame setups per sequence on a path where most of the work is a single SIMD store. Mark both `RingBuffer::extend` and `DecodeBuffer::push` `#[inline]` so LLVM folds them into `execute_sequences`. The slow-path branches inside `extend` (wraparound, capacity miss) stay non-inlined by LLVM's heuristic because they end in calls and panics. Local aarch64 measurements: prior commit (flat-extend) ~1,542,705 ns this commit best/median 1,484,378 / ~1,506,000 ns -> -2.4% incremental median -> -18.2% cumulative from baseline (1,814,570 ns) -> 676 MB/s best run All 557 nextest tests pass. Refs #189
…_n_plus_two The N+2 lookahead helper indexes scratch.sequences three times per sequence to pull ll_curr / ll_next / ll_n2. LLVM does not collapse the three bounds checks because the borrow checker treats the wider DecoderScratch reference as potentially mutated by sibling call sites in the loop body. Cache scratch.sequences.len() into a local, verify idx + 2 < seqs_len at the top, then read the three slots via get_unchecked. Three Vec bounds checks per sequence drop to zero. No measurable bench impact on the profiled scenario (median stays in 1,558 - 1,579 ns/iter window vs prior 1,567 ns median). Kept per the project rule on theoretically-correct noise-level opts. 557 nextest tests pass. Refs #189
…fset_history The original branchless-RULES-table dispatch covered all four classes (repcode 1/2/3 plus fresh offset) in one branchless expression. For typical workloads the dominant case is class=3 (offset_value >= 4 = fresh offset, not a repcode), where the work collapses to: actual = offset_value - 3 scratch = [actual, scratch[0], scratch[1]] Split into a fast path that handles offset_value >= 4 with an unconditional shift, and a #[cold] #[inline(never)] do_offset_history_repcode that runs the full RULES-table branchless dispatch for repcodes 1..=3. Donor zstd does the same split inside ZSTD_updateRep. Decode time on the profiled scenario: prior commit (prefetch get_unchecked) ~1,567,000 ns median this commit 1,413,000 ns median (best 1,406,952) -> -9.8% incremental median -> -22.1% cumulative from baseline (1,814,570 ns) -> 711 MB/s best, FFI gap 3.89x -> 3.01x All 557 nextest tests pass. Refs #189
…uence> roundtrip) Donor zstd's ZSTD_decompressSequences_body interleaves ZSTD_decodeSequence and ZSTD_execSequence in one loop, keeping the seq_t in registers. Our two-pass pipeline (decode_sequences -> scratch.sequences Vec -> execute_sequences) was paying ~24 B/seq * 2 of L1/L2 traffic on the Vec roundtrip plus per-iter Vec::push overhead. Add decode_and_execute_sequences in sequence_section_decoder which: - Runs FSE table setup (maybe_update_fse_tables) and bit-stream padding skip up front. - For RLE-mode blocks (LL/ML/OF RLE flags), falls back to the legacy two-pass path internally via decode_sequences_with_rle + execute_sequences_fields. Rare on real corpora; not worth duplicating. - Hot path: for each sequence, decode_one_sequence_inline returns a Sequence in registers, then execute_one_sequence runs the literal push + match repeat immediately — no scratch.sequences write/read. Takes field-level borrows of DecoderScratch fields (fse, buffer, offset_hist, literals_buffer, sequences for RLE fallback) so the call site in block_decoder can keep `raw` borrowing workspace.block_content_buffer alongside the mutable workspace borrows. Add execute_sequences_fields helper in sequence_execution that takes the same field-level borrows for the RLE fallback path. Legacy decode_sequences / execute_sequences kept with #[allow(dead_code)] for the public API surface and any external callers that iterate sequences explicitly. Decode time on the profiled scenario: prior commit (offset_history fast path) ~1,413,000 ns median this commit 1,383,841 ns median (best 1,377,041) -> -2.1% incremental median -> -24.1% cumulative from baseline (1,814,570 ns) -> 726 MB/s best, FFI gap 3.89x -> 2.93x All 557 nextest tests pass. clippy clean both std and no_std modes. Refs #189
The non-BMI2 fallback in mask_lower_bits indexed a 65-entry static table on every call. Replace with `u64::MAX.checked_shr(64 - n)`: one shift + one (predicted) cmov vs one L1 load (3-5 cycle latency). For the hot FSE bitstream decode path this fires 3x per sequence (once per LL/ML/OF extra-bits extraction in get_bits_triple), so saving the load latency compounds over thousands of sequences. `checked_shr` returns None for shift counts >= 64, which is exactly the n=0 case (64 - 0 = 64) and the invalid n > 64 case (wraps to a huge value via wrapping_sub). Mapping both to 0 gives the mathematically-correct empty mask for n=0 and a safe fallback for the debug_asserted-invalid range. BMI2 path unchanged (`_bzhi_u64` is already a single instruction). Local aarch64 measurements: prior commit (fusion) ~1,386,367 ns median this commit 1,365,211 ns median (best 1,353,635) -> -1.5% incremental -> -25.4% cumulative from baseline (1,814,570 ns) -> 739 MB/s best, FFI gap 3.89x -> 2.88x All 557 nextest tests pass. Refs #189
After the mask_lower_bits switch to compute-based masking, BIT_MASK is read only by: - the BMI2 PEXT triple-extract path (std + x86_64 only), and - the in-crate tests verifying mask values. cfg-gate the const to `any(test, all(feature = "std", target_arch = "x86_64"))` so no-std builds on non-x86 targets don't carry a `dead_code` error through `-D warnings` in CI. All three configurations clippy-clean with -D warnings: - --no-default-features - --no-default-features --features hash - --features hash,std,dict_builder 557 nextest tests pass. Refs #189
Two correctness fixes in decode_and_execute_sequences:
1. Clear rle_fallback_sequences on entry. The non-RLE fast path never
touches the Vec, so without an explicit clear it would carry whatever
entries the previous block left behind. Externally-observable scratch
state must reflect the current block's data only.
2. Snapshot-restore the buffer write cursor and offset history before
the per-iter side-effects, and roll back on post-loop bitstream
validation failure. The legacy two-pass pipeline (decode_sequences
first, execute_sequences after) reported NotEnoughBytesForNumSequences
/ ExtraBits before any block output was committed. The fused fast
path commits each sequence's literal push + match repeat + offset
history update immediately, so a malformed bitstream that only
reveals itself at the end of the section was leaving partial output
and mutated repeat history behind.
Restore semantics: capture a DecodeBuffer checkpoint and clone the
offset history at the top, and undo both on Err. New
DecodeBuffer::{checkpoint, restore_checkpoint} primitives sit on a
pub(super) RingBuffer::{tail, set_tail} pair so the rollback stays
inside the decode module.
Add regression test checkpoint_restore_undoes_pushes to lock in the
primitive's behaviour — a future change that breaks tail-rewind
semantics would surface here instead of as silent state leakage in
the fused executor.
Local aarch64 measurements:
prior commit ~1,365,211 ns median
this commit 1,390,764 ns median (5 runs)
-> +1.9% from extra Vec::clear + 16-byte checkpoint snapshot per
block; correctness fix is worth the small overhead
-> -23.4% cumulative from baseline (1,814,570 ns)
-> 727 MB/s, FFI gap 3.89x -> 2.96x
558 nextest tests pass (one new). clippy strict in std + no_std modes.
After the fused decode_and_execute_sequences became the sole entry point in block_decoder, the legacy paths were dead code held with allow(dead_code): - decode_sequences (pub entry, ~40 lines) - decode_sequences_without_rle (private fast path, ~127 lines) - execute_sequences (pub workspace-based two-pass, ~90 lines) - prefetch_literals_n_plus_two (only called by execute_sequences) Delete all four. decode_sequences_with_rle and execute_sequences_fields stay — they're invoked by the fused path's RLE-mode fallback. Removes ~292 lines of unreachable code that were dragging the patch-coverage denominator down (codecov flagged sequence_execution at 56% patch coverage, sequence_section_decoder at 72%; the dead paths accounted for most of the gap). 558 nextest tests pass. clippy strict in std + no_std modes. Bench median 1,413,205 ns/iter (within the 1.35-1.43 ms cluster from prior commits; the transactional checkpoint added overhead remains the biggest single regression vs the peak 1,353 ms run).
At head+start==cap the source effectively starts at index 0 after the wrap, which is the continuous-source layout. The prior strict '>' check routed this boundary through the split-source branch, producing a zero-length copy_bytes_overshooting call against the trailing slack region — correctness depended on the unsafe helper treating len == 0 as a hard no-op. Use '>=' so the boundary case takes the continuous path directly.
restore_checkpoint() documents an invariant that no reallocation may happen between checkpoint() and restore_checkpoint(), but the unsafe function does not enforce it. A malformed sequence section that decodes more than the upfront reserve(MAX_BLOCK_SIZE) can trigger RingBuffer::reserve_amortized between the two calls — and amortized growth compacts data (head=0, tail=s1+s2), invalidating the captured tail index. Without a cap guard the restored tail silently points at the wrong logical position in the new buffer. Test fails on current code (no guard yet) — the follow-up fix adds the cap snapshot + panic on mismatch.
restore_checkpoint() previously documented a 'no reallocation between checkpoint and restore' invariant but did not enforce it. A malformed sequence section can decode more than the upfront reserve(MAX_BLOCK_SIZE) and trigger RingBuffer::reserve_amortized, which compacts the buffer (head=0, tail=s1+s2) — invalidating the captured tail index. Restoring that stale tail silently produces wrong output, with no observable error. Snapshot the RingBuffer cap alongside tail in DecodeBufferCheckpoint and panic on mismatch. The check is a load-bearing correctness guard, not a debug-only assert: silent UB is worse than a controlled panic on untrusted input. Existing checkpoint_restore_undoes_pushes test updated to reserve() before the first push (mirroring the real fused-decoder call pattern, which always reserves MAX_BLOCK_SIZE up front). The new restore_checkpoint_after_realloc_panics regression test from the preceding commit now passes.
The previous comment claimed 'no reallocation has happened in between (buffer.reserve was called once before the checkpoint)' — but reserve(MAX_BLOCK_SIZE) only rules out reallocation on well-formed input. A malformed block that decodes past MAX_BLOCK_SIZE before the post-loop bitstream check can still trigger reserve_amortized. Document the actual invariant: restore_checkpoint() now enforces the no-reallocation guarantee via a cap-equality panic, which is preferred to silent wrong output on untrusted input.
Malformed block whose sequence section decodes more than the upfront reserve(MAX_BLOCK_SIZE) the fused executor performs. That forces RingBuffer::reserve_amortized between checkpoint() and the post-loop bitstream validity check; the panic-on-cap-mismatch guard then turned the malformed input into a hard abort. Test fails on current code (the prior panic guard fires) and passes once try_restore_checkpoint switches to a graceful no-restore return.
…ormed input Replace the unsafe restore_checkpoint(_) that panicked on cap mismatch with safe try_restore_checkpoint(_) -> bool. The cap-equality check is now load-bearing logic instead of an unconditional abort: - On well-formed input (the hot path) the upfront reserve(MAX_BLOCK_SIZE) rules out reallocation between checkpoint and post-loop validity check, so try_restore_checkpoint returns true and rolls back the speculative pushes / offset history exactly like before. - On malformed input that decodes past MAX_BLOCK_SIZE, reserve_amortized compacts the ring buffer and invalidates the captured tail. The function now returns false, no state is mutated, and the caller continues with the same Err return — the partial bytes are discarded with the corrupt frame. This closes the DoS surface where any decoder fed untrusted bytes could be aborted with the panic guard that the previous round introduced. Function is now safe (no unsafe at the call site).
Skip free_slice_lengths' branch + saturating_sub on the common case where the data region has not wrapped (head <= tail) and the write does not cross cap. In that layout free space is trivially 'cap - tail - 1 + head' >= 'cap - tail - 1', which dominates 'amount' whenever 'tail + amount < cap'. Mirrors the flat fast path already in RingBuffer::extend so push/repeat on frames that fit in the window pay zero dispatch overhead from reserve() either. Theoretical win on the hot path; M1 thermal noise masks the actual delta in bench output, kept per the iteration discipline rule (correct opt, no regression risk).
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
zstd/src/decoding/sequence_section_decoder.rs (2)
458-461:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winIncorrect error variant for LL RLE code validation.
Line 459-460 returns
MissingByteForRleMlTablewhen the literal length RLE code exceedsMAX_LITERAL_LENGTH_CODE. This should return an error variant that indicates an invalid literal length code, not a missing ML table byte. The validation logic is correct, but the error message is misleading.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/decoding/sequence_section_decoder.rs` around lines 458 - 461, The check that validates the literal-length RLE byte (source[0] > MAX_LITERAL_LENGTH_CODE) is using the wrong error variant; update the error returned in that branch to the variant indicating an invalid literal-length code (e.g. DecodeSequenceError::InvalidLiteralLengthCode) instead of DecodeSequenceError::MissingByteForRleMlTable so that the validation of scratch.ll_rle uses the correct error semantics for MAX_LITERAL_LENGTH_CODE violations.
493-496:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winIncorrect error variant for OF RLE code validation.
Line 494-495 returns
MissingByteForRleMlTablewhen the offset RLE code exceedsMAX_OFFSET_CODE. This should return an error variant that indicates an invalid offset code. The validation logic is correct, but the error message is misleading for debugging.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/decoding/sequence_section_decoder.rs` around lines 493 - 496, The check that validates the offset RLE code (if of_source[0] > MAX_OFFSET_CODE) uses the wrong error variant; replace the returned DecodeSequenceError::MissingByteForRleMlTable with the specific error variant that denotes an invalid offset RLE code (e.g., DecodeSequenceError::InvalidOffsetRleCode or the existing variant name used for invalid offsets) so the failure from this branch correctly signals an invalid offset code; update the return in the block that precedes setting scratch.of_rle to use that invalid-offset error variant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 458-461: The check that validates the literal-length RLE byte
(source[0] > MAX_LITERAL_LENGTH_CODE) is using the wrong error variant; update
the error returned in that branch to the variant indicating an invalid
literal-length code (e.g. DecodeSequenceError::InvalidLiteralLengthCode) instead
of DecodeSequenceError::MissingByteForRleMlTable so that the validation of
scratch.ll_rle uses the correct error semantics for MAX_LITERAL_LENGTH_CODE
violations.
- Around line 493-496: The check that validates the offset RLE code (if
of_source[0] > MAX_OFFSET_CODE) uses the wrong error variant; replace the
returned DecodeSequenceError::MissingByteForRleMlTable with the specific error
variant that denotes an invalid offset RLE code (e.g.,
DecodeSequenceError::InvalidOffsetRleCode or the existing variant name used for
invalid offsets) so the failure from this branch correctly signals an invalid
offset code; update the return in the block that precedes setting scratch.of_rle
to use that invalid-offset error variant.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d5fd3293-1da4-4e40-977d-60d547a6e49c
📒 Files selected for processing (11)
.github/workflows/ci.ymlzstd/benches/compare_ffi.rszstd/src/bit_io/bit_reader_reverse.rszstd/src/decoding/block_decoder.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/ringbuffer.rszstd/src/decoding/sequence_execution.rszstd/src/decoding/sequence_section_decoder.rszstd/src/decoding/simd_copy.rszstd/src/fse/fse_decoder.rszstd/src/tests/fuzz_regressions.rs
- fuzz_regressions: assert decoder.read_to_end is_err() instead of swallowing the result — silent-pass shape (if let Ok / let _ = …) could no longer catch a regression that decoded the malformed block to a non-error result. - fse_decoder: move set_len into copy_symbols_into_decode, adjacent to the per-slot raw writes. Removes the non-local invariant requiring the caller's set_len to stay paired with the callee's init loop. - ringbuffer::set_tail: extend SAFETY to require head fixed since capture (full/empty discriminator + invariant 4 only stay consistent then). Document the single current caller and its precondition. - sequence_section_decoder: collapse the split [N-1 with-update] + [last without-update] loop into one loop with a trailing 'if i + 1 < num_sequences' state-update guard. Same generated code, one body, donor isLastSeq pattern. - decode_buffer test: Vec::new() instead of vec::Vec::new() (the latter resolves only via alloc::vec re-exporting Vec — looks like a typo). - bit_reader_reverse::mask_lower_bits: docstring now explicitly notes the input-validation debug_assert is on this function (not just get_bits) — Copilot misread the existing assert location. - simd_copy::single_op_copy_16: add a reachability matrix comment before the portable fallback so a future arch arm slotted between existing arms knows it must terminate with return or its code goes silently dead.
4de7069 to
83c52dd
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
zstd/src/decoding/ringbuffer.rs (1)
124-155:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winGuard
reservecapacity math againstusizewraparound.Line 133 and Line 154 use unchecked addition. A very large
amountfrom safe callers can wrap in release, causingreserve()to skip growth or compute an undersizednew_cap; the later unsafe writes inextend,extend_and_fill, orextend_from_readerthen trust space that was never actually reserved. Please switch these checks to non-overflowing arithmetic (amount < self.cap - self.tailfor the fast path,checked_addbefore growth sizing) and fail fast on overflow.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/decoding/ringbuffer.rs` around lines 124 - 155, The fast-path in reserve and the new capacity math in reserve_amortized are using unchecked additions that can wrap; update reserve's condition to use non-overflowing comparison (e.g., check amount < self.cap.saturating_sub(self.tail) or amount < self.cap - self.tail using a checked branch equivalent) so the flat-path can't be bypassed by wraparound, and in reserve_amortized switch to checked_add (use usize::checked_add when computing cap+amount and when adding WILDCOPY_OVERLENGTH) and fail fast (panic with a clear "capacity overflow" message) if any checked_add returns None; ensure the Layout::array call uses the verified new capacity so late unsafe writes in extend/extend_and_fill/extend_from_reader cannot rely on undersized allocation.zstd/src/fse/fse_decoder.rs (1)
60-72:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winMake
update_state_fastcrate-private to prevent undefined behavior.This method exposes an unsafe contract as a public API with insufficient enforcement. Safe code can construct
FSEDecoder::new(&FSETable::new(..))and invokeupdate_state_fast, which dereferencesget_unchecked(0)on an emptydecodevector. Public mutable fields (decode,accuracy_log) further break the invariant. Changepub fntopub(crate) fnto preserve internal use while eliminating the UB surface.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/fse/fse_decoder.rs` around lines 60 - 72, The public method update_state_fast exposes an unsafe contract that can be violated by external code (e.g., callers can create FSEDecoder::new(&FSETable::new(..)) and trigger an out‑of‑bounds get_unchecked on table.decode); make update_state_fast crate-private by changing its signature from pub fn update_state_fast(...) to pub(crate) fn update_state_fast(...). Also ensure callers inside the crate still use update_state_fast and that external access to the FSETable fields (decode, accuracy_log) remains restricted by keeping them non‑pub or documenting internal invariants so the unsafe indexing invariant cannot be violated from outside the crate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 177-188: The code currently unconditionally restores the repcode
history via "*offset_hist = saved_offset_hist" even when
"buffer.try_restore_checkpoint(buffer_checkpoint)" fails; change this so the
history is only rewound when try_restore_checkpoint returns true (i.e. the
rollback actually happened). Concretely, call
"buffer.try_restore_checkpoint(buffer_checkpoint)" into a boolean (or use its
existing result), and move the "*offset_hist = saved_offset_hist" assignment
inside the branch that checks that it returned true; leave "offset_hist"
untouched when the restore fails so the buffer and offset history remain
consistent for reuse.
---
Outside diff comments:
In `@zstd/src/decoding/ringbuffer.rs`:
- Around line 124-155: The fast-path in reserve and the new capacity math in
reserve_amortized are using unchecked additions that can wrap; update reserve's
condition to use non-overflowing comparison (e.g., check amount <
self.cap.saturating_sub(self.tail) or amount < self.cap - self.tail using a
checked branch equivalent) so the flat-path can't be bypassed by wraparound, and
in reserve_amortized switch to checked_add (use usize::checked_add when
computing cap+amount and when adding WILDCOPY_OVERLENGTH) and fail fast (panic
with a clear "capacity overflow" message) if any checked_add returns None;
ensure the Layout::array call uses the verified new capacity so late unsafe
writes in extend/extend_and_fill/extend_from_reader cannot rely on undersized
allocation.
In `@zstd/src/fse/fse_decoder.rs`:
- Around line 60-72: The public method update_state_fast exposes an unsafe
contract that can be violated by external code (e.g., callers can create
FSEDecoder::new(&FSETable::new(..)) and trigger an out‑of‑bounds get_unchecked
on table.decode); make update_state_fast crate-private by changing its signature
from pub fn update_state_fast(...) to pub(crate) fn update_state_fast(...). Also
ensure callers inside the crate still use update_state_fast and that external
access to the FSETable fields (decode, accuracy_log) remains restricted by
keeping them non‑pub or documenting internal invariants so the unsafe indexing
invariant cannot be violated from outside the crate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ffa04f02-07bd-45b0-92a4-8ff37992dad1
📒 Files selected for processing (11)
.github/workflows/ci.ymlzstd/benches/compare_ffi.rszstd/src/bit_io/bit_reader_reverse.rszstd/src/decoding/block_decoder.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/ringbuffer.rszstd/src/decoding/sequence_execution.rszstd/src/decoding/sequence_section_decoder.rszstd/src/decoding/simd_copy.rszstd/src/fse/fse_decoder.rszstd/src/tests/fuzz_regressions.rs
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
- sequence_section_decoder: only restore offset_hist when try_restore_checkpoint actually rolled back the buffer. The previous unconditional rewind left buffer + offset_hist out of sync on the realloc-detected branch (buffer keeps speculative bytes, repcode history rewinds) — workspace would be internally inconsistent for any subsequent reuse after the Err. - fse_decoder: tighten update_state_fast from pub to pub(crate). The method's unchecked decode-table read assumes a fully-built table reached via crate-internal call paths (FSEDecoder constructed with an empty FSETable would dereference get_unchecked(0) on an empty Vec, which is UB). External access cannot construct that invariant themselves, so pub(crate) is the right surface. - ringbuffer::reserve: switch fast-path math to cap.saturating_sub and reserve_amortized to checked_add. A caller passing amount == usize::MAX (e.g. via a malformed match_length) could previously wrap tail+amount or cap+amount in release and produce an undersized allocation that subsequent unsafe writes trusted — now overflows panic with a clear message instead.
Summary
Continuation of decode-speed work that was on PR #190 (closed because that PR's history accumulated a panic-guard variant of the malformed-input fix that a libFuzzer artifact (
crash-bfb3bc55…) then turned into a DoS surface). This PR ships the same decode-speed improvements plus the graceful-rollback fix that closes that surface.Closes #178
Decode speed
Baseline (z000033, Level 3 dfast, M1): 1,814,570 ns/iter → best 1,353,635 ns / median ~1,413,000 ns (≈ −22%). Throughput 553 → 739 MB/s. FFI gap 3.89× → 2.88×.
Notable changes
decoding/sequence_section_decoder.rs): collapsesdecode_sequences+execute_sequencesinto a single loop, eliminating theVec<Sequence>roundtrip on the hot path.DecodeBuffer::checkpoint/try_restore_checkpoint. The rollback primitive snapshots theRingBuffercapalongsidetailso an intervening reallocation (only reachable on malformed input that decodes pastMAX_BLOCK_SIZE) is detected and the rollback is gracefully skipped, not turned into a panic. The caller still surfaces a normal decodeErr.u64); AVX2 / AVX-512 kernels target-feature-gated forno_std.#[cold] #[inline(never)]slow path;mask_lower_bitscomputed via shift instead ofBIT_MASK[n]LUT.head + start == capnow routes through the contiguous-second-slice path (was misrouted into split-source, producing a zero-lengthcopy_bytes_overshootingagainst the slack region).do_offset_historyfast path foroffset_value >= 4, matching donorZSTD_updateRep.--no-default-featuresand--no-default-features --features hash.Tests
malformed_block_does_not_panic_via_restore_checkpointfuzz regression).crash-bfb3bc55…(the input that took down PR perf(decode): collapse simd_copy dispatcher + FSE bounds-check elision + no_std CI gate #190's CI) is now an inline regression test inzstd/src/tests/fuzz_regressions.rs.--all-targets,--no-default-features,--no-default-features --features hash) clean.Related
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores