perf(dfast): close level 3 parity gap — compress 11.0× → 5.5× of donor #145
polaz wants to merge 30 commits into
Conversation
…r parity

clevels.h sizes the dfast tables independently — `hashLog` for the long (8-byte) hash and `chainLog` for the short (4-byte) hash, with `chainLog = hashLog - 1` at every dfast level (level 2: hashLog=16, chainLog=15; level 3: hashLog=17, chainLog=16). Our previous code forced a single shared `hash_bits` for both tables, so the short hash sat one bit larger than donor at every dfast level.

Split the matcher state into `long_hash_bits` + `short_hash_bits`, derive the short-hash width from the long-hash width via the new `DFAST_SHORT_HASH_BITS_DELTA = 1` constant, and size the two tables independently in `ensure_hash_tables`. `set_hash_bits(bits)` takes the long-hash bit count (donor `cParams.hashLog`) and derives the short-hash width; both clamps stay above `MIN_WINDOW_LOG` so tiny windows don't underflow on the source-size-hint shrink path.

`compare_ffi_memory` smoke on `decodecorpus-z000033 level_3_dfast`:
- Rust peak: 9,953,841 -> 8,905,333 bytes (-10.5 percent)
- FFI peak: 3,286,795 unchanged
- Cross-side mem ratio: 3.0x -> 2.7x
- Compression ratio unchanged (Rust 0.5350 vs FFI 0.5158 on the same fixture)

533/533 lib tests green; `level22_sequences_match_donor_on_corpus_proxy` + `default_level_stays_within_twenty_five_percent_of_ffi_level3_on_corpus_proxy` both pass.

`DFAST_SEARCH_DEPTH = 4` (4-slot bucket) is intentionally kept in this chunk — flipping it to 1 (donor parity) would close the remaining 2.7x gap to roughly donor's 768 KiB, but requires porting donor's `_search_next_long` retry + complementary insertion (curr+2, ip-2, ip-1) to preserve ratio; a bench showed a 32.7 percent ratio regression on naive SEARCH_DEPTH=1 without those compensating match-extension passes. That structural rewrite is the next 7c sub-chunk.
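A minimal sketch of the split sizing, assuming an illustrative `MIN_WINDOW_LOG` value; the delta constant and the long-to-short derivation come from the commit above, while the function name and exact clamp shape are assumptions:

```rust
const DFAST_SHORT_HASH_BITS_DELTA: u32 = 1;
const MIN_WINDOW_LOG: u32 = 10; // illustrative floor, not the crate's real value

/// Takes the long-hash bit count (donor `cParams.hashLog`) and derives the
/// short-hash bit count one bit below it; both are clamped at the floor so
/// the source-size-hint shrink path can't underflow on tiny windows.
fn derive_hash_bits(long_bits: u32) -> (u32, u32) {
    let long = long_bits.max(MIN_WINDOW_LOG);
    let short = (long - DFAST_SHORT_HASH_BITS_DELTA).max(MIN_WINDOW_LOG);
    (long, short)
}
```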
…se insertion
Donor `zstd_double_fast.c` stores ONE u32 per bucket and recovers
compression quality via two structural choices the previous Rust port
was missing:
1. **Single-slot hash tables**: `[u32; DFAST_SEARCH_DEPTH]` -> `u32` per
bucket in both `short_hash` and `long_hash`. Donor parity for the
`hashTable` / `chainTable` pair in `ZSTD_MatchState_t`. Quadruples
the bucket count we can fit at the same memory budget and removes
the per-bucket array shift on every insert.
2. **Sparse complementary insertion**: donor inserts at exactly three
positions after each match (`curr+2`, `ip-2`, `ip-1`) and at exactly
one position per inner-loop iteration (`ip`). Our previous fast loop
inserted at four positions per miss (`ip0`/`ip1`/`ip2`/`ip3`) and the
rep-extension helper inserted across the entire match range — both
patterns made sense with the 4-slot bucket but actively destroy
useful candidates under single-slot storage (every insert just
overwrites the previous). Replaced the fan-out with donor-style
sparse inserts.
3. **`_search_next_long` retry** in `hash_candidate`: when the
short-hash hits but no long match yet reaches `DFAST_TARGET_LEN`,
peek `hashLong[hl1]` at `abs_pos + 1` and pick the longer of the
two matches. Donor `zstd_double_fast.c:213-228`.
Combined effect on `decodecorpus-z000033 level_3_dfast`:
- Memory: 9.95 MB -> 6.56 MB (-34%); cross-side ratio 3.0x -> 2.0x
- Compress: 57 ms -> 36 ms (-37%); vs FFI 20x -> 12.7x slower
- Ratio: 0.5349 -> 0.5449 (rust output 5.6% larger than FFI, was
3.7%; within the dashboard's +-5% band)
- 533/533 lib tests; ratio gates green (level22 + default-vs-fastest +
default-within-25pct-of-FFI-level3 all pass)
`DFAST_SHORT_HASH_BITS_DELTA = 1` from step 1 retained.
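A hedged sketch of the donor-style sparse inserts from point 2; `insert_position` is passed in as a closure here because the real method lives on the matcher, and the wrapper itself is illustrative:

```rust
/// After a committed match: exactly three complementary inserts, where
/// `curr` is the absolute start of the match and `ip` is the continuation
/// point just past it. The inner scan loop additionally inserts once per
/// iteration at the scanned position.
fn insert_after_match(mut insert_position: impl FnMut(usize), curr: usize, ip: usize) {
    insert_position(curr + 2);
    insert_position(ip - 2);
    insert_position(ip - 1);
}
```

Under single-slot storage every extra insert overwrites a previous candidate, which is why this sparse pattern preserves ratio where the old four-per-miss fan-out destroyed it.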
…lication

Previous storage: `window: VecDeque<Vec<u8>>` retained every block's raw bytes alongside the contiguous `history: Vec<u8>`. Every byte of input was held twice on the dfast hot path — roughly +1 MiB peak on the `decodecorpus-z000033` 1 MiB fixture.

Donor parity (`zstd_double_fast.c` / `ZSTD_compress_usingDict`): the caller owns the input buffer; libzstd's `ZSTD_MatchState_t` doesn't hold a private copy. Mirroring that:
- `window: VecDeque<Vec<u8>>` -> `window_blocks: VecDeque<usize>`. Stores per-block length only, so eviction can still walk block-by-block and advance `history_start` correctly.
- `add_data` extends `history` and returns the input Vec to the caller's pool eagerly via `reuse_space(data)` instead of retaining it until eviction.
- `trim_to_window` / `reset` no longer pass Vecs through their callback closures (nothing to recycle). The driver counts evicted bytes via the `window_size` delta around the `trim_to_window` call.
- Internal callers (`emit_candidate`, `emit_trailing_literals`, `skip_matching*`, `start_matching*`) read the current block via `get_last_space()`, which slices the trailing `last_block_len` bytes of `history`.

Test `dfast_trim_to_window_callback_reports_evicted_len_not_capacity` renamed + reshaped to assert the new semantics (callback never fires; state is observable via `window_size` and `window_blocks`).

`decodecorpus-z000033 level_3_dfast` memory: 6.56 MB -> 5.64 MB (-14% vs step 2, -43% cumulative vs baseline). Cross-side ratio 2.0x -> 1.71x. Compress timing unchanged (35.6 ms; the -37% vs baseline is from step 2). Ratio still 0.5449 (within band). 533/533 lib tests; ratio gates green.
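A sketch of the length-only eviction walk this change enables; names follow the commit text, the window-size policy is simplified, and the history compaction added later in this PR is elided:

```rust
use std::collections::VecDeque;

fn trim_to_window(
    window_blocks: &mut VecDeque<usize>,
    window_size: &mut usize,
    history_start: &mut usize,
    max_window: usize,
) {
    // Per-block lengths are enough to walk eviction block-by-block and
    // advance the history cursor; there is no retained Vec<u8> to recycle.
    while *window_size > max_window {
        let Some(removed_len) = window_blocks.pop_front() else { break };
        *window_size -= removed_len;
        *history_start += removed_len;
    }
}
```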
…e_to_vec

`compress_to_vec` had two harness-side memory artefacts the FFI donor never pays:
1. `let mut input = Vec::new(); source.read_to_end(&mut input)` — duplicates the entire input into an owned `Vec` even when the caller already has a `&[u8]`. Donor's `ZSTD_compress2` reads its input by reference.
2. `let mut vec = Vec::new()` for the output — grows via pow2 doubling inside the measured window, pinning ~2× the final compressed size at the last realloc. Donor's caller passes a buffer pre-sized to `ZSTD_compressBound(srcSize)`; the encoder never reallocates.

Fix:
- Add `compress_bound(usize)` mirroring `ZSTD_COMPRESSBOUND`.
- Pre-size the output `Vec` to `compress_bound(input.len())` inside `compress_to_vec` so the realloc-doubling spike disappears.
- Add `compress_slice_to_vec(&[u8], level)` for callers that already have a contiguous slice — skips the `read_to_end` copy.

Switch both `compare_ffi` and `compare_ffi_memory` benches to the new entry point so the measured peak no longer includes a phantom ~1 MiB input duplicate on 1 MiB scenarios.

Memory bench delta on `level_3_dfast` / `decodecorpus-z000033`:
- rust_peak_alloc_bytes: 5_640_000 -> 4_698_391 (-17%)
- gap to FFI: 1.71x -> 1.43x
- small-10k-random: 0.92x (now inside band)
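For reference, a sketch of the bound mirroring `ZSTD_COMPRESSBOUND`'s published formula; the crate's actual `compress_bound` margins may differ slightly:

```rust
fn compress_bound(src_size: usize) -> usize {
    const LOW_LIMIT: usize = 128 << 10; // 128 KiB
    // Worst case is src + src/256, plus extra slack on tiny inputs where
    // raw/RLE block headers dominate the expansion.
    let small_input_margin = if src_size < LOW_LIMIT {
        (LOW_LIMIT - src_size) >> 11
    } else {
        0
    };
    src_size + (src_size >> 8) + small_input_margin
}
```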
…put block

`compress_slice_to_vec` was pre-allocating `compress_bound(src_len)` for the destination `Vec`. On a 100 MiB `large-log-stream` scenario that pinned ~100.4 MiB of peak even though the actual compressed output was a few hundred KiB, dominating the encoder's measured footprint across every level.

Donor's streaming caller allocates a single output block (`ZSTD_CStreamOutSize()` ≈ `ZSTD_BLOCKSIZE_MAX` + headroom) and appends compressed bytes block by block; the destination Vec stays proportional to the actual compressed size, not the worst-case expansion bound. Mirror that shape: start at 128 KiB (one block) and let amortized doubling absorb growth. For small inputs `compress_bound(src) < 128 KiB`, so the tighter bound is used and no realloc occurs.

Memory bench delta on `large-log-stream` (compress stage):
- level_-7_fast: 108.1 MiB -> 2.98 MiB (72x -> 2.0x vs FFI)
- level_1_fast: 108.1 MiB -> 2.98 MiB (72x -> 2.0x)
- level_3_dfast: 115.0 MiB -> 9.92 MiB (30x -> 2.6x)
- level_5_lazy: 120.0 MiB -> 14.9 MiB (21x -> 2.6x)
- level_16_btopt: 357.7 MiB -> 252.6 MiB (9.2x -> 6.5x)
- level_22_btultra2: 1016.8 MiB -> 911.6 MiB (1.2x -> 1.08x, within band)

Side effects on other scenarios (now within FFI band):
- low-entropy-1m / level_3_dfast: 2.58 MiB vs FFI 2.75 MiB (0.94x)
- low-entropy-1m / level_5_lazy: 4.41 MiB vs FFI 4.58 MiB (0.96x)
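The resulting seed choice, sketched under the same assumptions (`compress_bound` as sketched earlier; the cap constant name follows a later review round):

```rust
const OUTPUT_BLOCK_CAP: usize = 128 << 10; // one 128 KiB output block

/// Small inputs get the exact bound (no realloc at all); large inputs start
/// at one block and let amortized doubling absorb growth.
fn output_seed(src_len: usize) -> usize {
    compress_bound(src_len).min(OUTPUT_BLOCK_CAP)
}
```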
`insert_positions(start, end)` was a thin `for pos in start..end {
self.insert_position(pos); }` wrapper. Under `&mut self` the
compiler conservatively re-loaded every field touched by
`insert_position` on every iteration: `history.len()`,
`history_start`, `position_base`, `history_abs_start`,
`{short,long}_hash_bits`, the `hash_kernel` enum, plus pointer
recovery for the two hash tables. With 1 input byte per call on the
dfast hot path (a 128 KiB block triggers >100k calls) that load
shape dominated CPU.
Manually hoist the loop invariants:
- Call `ensure_room_for(end - 1)` once before the loop so per-call
rebase guards collapse out.
- Snapshot history/position cursors, table bit-widths, and
`hash_kernel` into locals.
- Resolve `history.as_ptr()`, `short_hash.as_mut_ptr()`,
`long_hash.as_mut_ptr()` once.
- Compute the two cutoffs `long_safe_end` / `short_safe_end` from
`concat_len` so the per-iteration bounds check (`idx+8<=concat_len`
/ `idx+4<=concat_len`) collapses into two contiguous sub-loops.
- Hot sub-loop: single 8-byte unaligned load feeds both hashes (short
reads the low 4 bytes from the same `u64`).
Per-call work drops from ~20 instructions of state reload + two
re-slice + two bounds checks to one 8-byte load + two mixes + two
stores.
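A condensed, hypothetical sketch of the hoisted hot sub-loop: invariants snapshotted once by the caller, one 8-byte unaligned load feeding both hashes. The short-hash mix is simplified to the same multiply and all names are illustrative:

```rust
const PRIME8BYTES: u64 = 0xCF1B_BCDC_B7A5_6463;

/// SAFETY: caller guarantees `start + count + 8 <= history_len` (the
/// `long_safe_end` cutoff) and that both table index spaces fit the shifts.
unsafe fn insert_run(
    history: *const u8,   // resolved once, outside the loop
    long_table: *mut u32,
    short_table: *mut u32,
    long_shift: u32,      // 64 - long_hash_bits, snapshotted
    short_shift: u32,     // 64 - short_hash_bits, snapshotted (simplified mix)
    packed_base: u32,     // position_base-relative packing, snapshotted
    start: usize,
    count: usize,
) {
    for i in 0..count {
        let idx = start + i;
        // Single 8-byte load; the short hash reuses the low 4 bytes.
        let v8 = (history.add(idx) as *const u64).read_unaligned();
        let v4 = v8 & 0xFFFF_FFFF;
        *long_table.add((v8.wrapping_mul(PRIME8BYTES) >> long_shift) as usize) =
            packed_base + idx as u32;
        *short_table.add((v4.wrapping_mul(PRIME8BYTES) >> short_shift) as usize) =
            packed_base + idx as u32;
    }
}
```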
Compress timing on `decodecorpus-z000033` / `level_3_dfast`:
before: 35.6 ms
after: 32.075 ms (-9.6% per criterion's regression-note check)
FFI: 2.897 ms
Profile pre-change had `insert_position` cluster at ~65% self time
on the Rust hot path. Re-profile after this change to identify the
next bottleneck.
Donor `zstd_double_fast.c:190` checks the repcode at `ip+1` once per inner-loop iteration, not at `ip+2`. The `ip+2` form (kept from an earlier 4-slot bucket era) deferred the 1-literal-prefix repcode case to the *next* iteration's `best_match` at `ip+1` — a strictly weaker fallback than donor's unconditional probe, because the fallback only fires when no hash match was found at `ip+0`. Folding the peek back to `ip+1` matches donor's exact match-shape preference.

Measured on `decodecorpus-z000033` / `level_3_dfast` (1 MiB corpus):
- rust_bytes: 556860 -> 556121 (-0.13% size, ~739 bytes saved)
- rust_ratio: 0.5449 -> 0.5441
- ffi_ratio: 0.5158 (unchanged)
- delta: +5.65% -> +5.49% (small, but a consistent improvement)

The remaining ratio gap is dominated by other hot-loop divergences from donor (two-position lookahead structure, branchless selectAddr, PREFETCH_L1) — those carry larger correctness/speed implications and need their own change. All 514 lib tests pass.
Donor `clevels.h:31` at the default profile (srcSize > 256 KiB) gives level 3 dfast `minMatch = 5`. Our main fast-loop gate was 6, which silently discarded every 5-byte match the donor accepts. On mixed-content inputs (decodecorpus-z000033) this dropped a measurable fraction of the match population, leaving us with longer literal runs and an entropy tail.

Effect on compress/level_3_dfast/decodecorpus-z000033 (1022035 B input):
- Before: rust=556121 B ffi=527148 B gap = +5.5%
- After: rust=527040 B ffi=527148 B gap = -0.02% (we now beat donor by 108 B)

Other scenarios at the same level (no regressions; small-4k-log-lines and low/high-entropy unchanged or a 1 B improvement):
- small-4k-log-lines: 151 → 150 B
- large-log-stream: 8947 → 8947 B
- low-entropy-1m: 150 → 150 B

Compress time on the same matrix moves from 32.07 ms to 35.20 ms (+8.3%): the matcher now emits the previously-skipped 5-byte sequences and pays their emit/extension cost. Speed will be addressed in subsequent commits on the SIMD / inlining axis; ratio parity was the gating goal here.

All 514 lib tests pass, including:
- level22_sequences_match_donor_on_corpus_proxy (level 22 stays parity)
- roundtrip_compressible_data_1000_iterations
- roundtrip_random_data_1000_iterations
- roundtrip_streaming_api_1000_iterations
- roundtrip_best_level_large_window

Stale comments in `dfast/mod.rs` that referenced the old `= 6` floor are updated; the rep-extension helper's `DONOR_REP_MIN_MATCH_LEN = 4` is unchanged (donor `MINMATCH = 4` for rep chains).
…atch

Donor `zstd_compress_internal.h:923-924` (`ZSTD_hash8`) defines its hash as a single 64-bit multiply by `prime8bytes = 0xCF1BBCDCB7A56463` followed by `>> (64 - bits)`. We were routing every dfast hash through the generic `hash_mix_u64_with_kernel`, which on aarch64 expands to a CRC32d + rotate + multiply pipeline (3-4 instructions). For the dfast per-byte insert path the donor's scalar multiply produces an equivalent distribution one instruction shorter, and removes the indirect kernel dispatch hop entirely.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before: 35.20 ms (post MIN_MATCH=5 fix)
- After: 31.09 ms (criterion change: −13.3%, p=0.02)

Compression ratio sees a 123 B drift from 527040 → 527163 B against the 1 022 035 B corpus — a +0.003% change, still within parity (donor produces 527148 B, so we now sit 15 B above donor instead of 108 B below; both are sub-permille noise on this corpus).

Removed the `hash_kernel: FastpathKernel` field from `DfastMatchGenerator` (now dead, since the only consumer was `hash_index_with_bits`). The hot loop in `insert_positions` inlines the multiply directly; `hash_index_with_bits` keeps the same shape so the `hash_candidate` / `probe_slot_match` paths produce identical bucket indices to the insert side.

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes and `level22_sequences_match_donor_on_corpus_proxy`.
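The quoted donor shape as a standalone helper; the prime and shift come straight from the commit text:

```rust
const PRIME8BYTES: u64 = 0xCF1B_BCDC_B7A5_6463;

#[inline(always)]
fn hash8(value: u64, bits: u32) -> usize {
    // Donor ZSTD_hash8: one 64-bit multiply, then keep the top `bits` bits.
    (value.wrapping_mul(PRIME8BYTES) >> (64 - bits)) as usize
}
```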
…ot_match chain

The dfast hot loop calls `best_match` per scanned byte; `best_match` then calls `repcode_candidate` and `hash_candidate`, the latter calls `probe_slot_match` up to three times (long, short, `_search_next_long` retry), and each probe ends in `extend_backwards` + `common_prefix_len`. Each link in the chain takes `&self`, and without `inline(always)` LLVM keeps them as real function call boundaries — every `&self` field load inside the callees can't be hoisted across the call boundary because the optimiser can't prove the call hasn't aliased the same fields.

Adding `#[inline(always)]` to:
- best_match
- repcode_candidate
- hash_candidate
- probe_slot_match
- extend_backwards
- short_hash_index, long_hash_index
- hash_index_with_bits

collapses the chain into the fast loop's body, so `self.history_start`, `self.history_abs_start`, `self.position_base`, the hash table slice bases, etc. become register-resident locals across the per-byte iteration instead of being reloaded inside each callee.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before: 31.09 ms
- After: 29.86 ms (criterion change: −4.0%, p<0.01)

Ratio unchanged: 527163 B identical (no algorithmic change). All 514 lib tests pass; 70 dfast/roundtrip-focused tests verified; clippy clean.
The default Cargo `profile.release` uses codegen-units=16 with no link-time optimization, leaving every cross-crate `#[inline]` hint at the mercy of per-CGU scope. Performance-critical libs (zstd-sys, libpng-rs, ring, etc.) all set `lto = "fat"` and `codegen-units = 1` so the linker can inline across the whole compilation unit and re-run LLVM passes (DCE, GVN, loop-invariant motion) over the full call graph.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before LTO: 29.86 ms
- After LTO: 29.26 ms (criterion change: −1.8%, p=0.02)

No code changes — just build configuration. Compile time goes up (~67 s vs ~9 s incremental rebuild of the bench bin), which is the expected trade for cross-crate inlining quality. CI matrix shards already gate on bench output, so the longer build is acceptable. The bench profile inherits from release, so the same flags apply when `cargo bench` is invoked.
Donor `zstd_double_fast.c:224-228` grows `step` every `kStepIncr=256` positions traveled — independent of whether the previous iterations hit or missed. Our flat loop instead gated the step bump on `miss_run >= DFAST_LOCAL_SKIP_TRIGGER` AND `pos >= next_skip_growth_pos`, which kept the step pinned at 1 whenever matches kept happening — exactly the case where donor would already be ramping the scan rate up.

Drop the miss-run gate. Step now grows purely on distance traveled since the last match-reset; match-found branches still reset `skip_step = 1` and re-anchor `next_skip_growth_pos`, mirroring donor's per-`while(1)`-iteration `step = 1; nextStep = ip + kStepIncr` re-initialisation.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before: 29.26 ms
- After: 26.29 ms (criterion change: -10.1%, p<0.01)
- Ratio: 527163 → 528741 B (+1593 B drift, +0.30% vs donor 527148 B)

Donor's parity rule allows the deviation: we now do strictly less scan work AND run faster — both donor-deviation gates satisfied.

Removed `block_looks_incompressible_strict` (its only consumer was the dropped incompressible-block fast path). Removed the local `miss_run` counter and the `block_is_strict_incompressible` capture. All 514 lib tests pass, 77 dfast/roundtrip/level22 focused tests verified, clippy clean.
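A sketch of the distance-keyed ramp, assuming illustrative names; the reset-on-match mirrors donor's per-iteration re-initialisation:

```rust
const K_STEP_INCR: usize = 256; // donor kStepIncr

struct ScanStep {
    step: usize,
    next_growth: usize, // position at which step next grows
}

impl ScanStep {
    /// Donor's `step = 1; nextStep = ip + kStepIncr` at every match reset.
    fn reset_at(pos: usize) -> Self {
        ScanStep { step: 1, next_growth: pos + K_STEP_INCR }
    }

    /// Advance purely on distance traveled since the last reset; hit/miss
    /// history plays no part in the ramp.
    fn advance(&mut self, pos: usize) -> usize {
        let next = pos + self.step;
        if next >= self.next_growth {
            self.step += 1;
            self.next_growth += K_STEP_INCR;
        }
        next
    }
}
```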
Replaces our flat per-position loop with donor's outer/inner structure (`zstd_double_fast.c:167-322`): outer `while` per match, inner `loop` per scan position carrying `hl0`/`idxl0`/`hl1`/`idxl1` between iterations. Three structural improvements over the previous flat loop:

1. **Two-position lookahead (donor parity, line 197):** `hl1 = hash_long(ip1)` is precomputed once per inner iter. On miss the carry `hl0 = hl1; idxl0 = idxl1` reuses it — one long-hash compute per two positions scanned instead of one per position.
2. **Hash table updated at curr BEFORE match check (donor line 187):** the `hashLong[hl0] = hashSmall[hs0] = curr` writes happen at the top of each inner iter, so the current ip is immediately visible to any subsequent probe in the same outer iter.
3. **Short check = 4-byte gate only (donor line 220):** the inline short probe only does `MEM_read32(matchs0) == MEM_read32(ip)`; forward `common_prefix_len` runs ONLY on a hit, and the `_search_next_long` retry (long at ip1 with precomputed idxl1) gates on `match_len > short.match_len` before committing — donor's exact preferred-longer-match policy.

Probes are inlined into the hot loop body via raw pointer arithmetic; `probe_slot_match` (still used by other callers) is bypassed for the fast path. `repcode_candidate` is reused for the rep-at-ip+1 peek because it carries the rep-history-selection logic we still want.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix:
- Before: 26.29 ms, Rust 528741 B / FFI 527148 B (+0.30%)
- After: 20.97 ms, Rust 490340 B / FFI 527148 B (-6.98%)

Speed: −23.5% (p<0.01). Ratio: we now BEAT donor by 7% on this corpus, likely because our `repcode_candidate` checks all three rep offsets (rep1/rep2/rep3 with offset shift by lit_len), while donor's inline hot-path rep check only probes rep1.

Per-scenario sanity at level_3_dfast (no regressions):
- small-1k-random 1038 B vs FFI 1038 B (=)
- small-10k-random 10254 B vs FFI 10254 B (=)
- small-4k-log 150 B vs FFI 156 B (-4% better)
- decodecorpus 490340 B vs FFI 527148 B (-7% better)
- high-entropy-1m 1048616 B vs FFI 1048613 B (+3 B noise)
- low-entropy-1m 150 B vs FFI 159 B (-6% better)

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes, level22 sequence parity, and huff0 donor reader acceptance.

Closes #100 (two-position lookahead). Subsumes parts of #105/#106 (inline match probes, branchless candidate gating) inside this refactor.
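A self-contained toy of the lookahead carry from point 1: each position's long hash is computed once and carried forward, so the probe at the next position reuses work instead of recomputing. The real loop also probes the short table, peeks the rep offset, and runs the `_search_next_long` retry:

```rust
fn scan_with_lookahead(data: &[u8], is_hit: impl Fn(u64) -> bool) -> Option<usize> {
    if data.len() < 9 {
        return None;
    }
    let hash = |ip: usize| -> u64 {
        let mut w = [0u8; 8];
        w.copy_from_slice(&data[ip..ip + 8]);
        u64::from_le_bytes(w).wrapping_mul(0xCF1B_BCDC_B7A5_6463)
    };
    let mut ip0 = 0usize;
    let mut hl0 = hash(ip0);     // hash for the position being probed
    let mut hl1 = hash(ip0 + 1); // precomputed lookahead for ip0 + 1
    while ip0 + 9 <= data.len() {
        if is_hit(hl0) {
            return Some(ip0);
        }
        // Miss: the precomputed ip1 work becomes the next iteration's ip0.
        hl0 = hl1;
        ip0 += 1;
        if ip0 + 9 <= data.len() {
            hl1 = hash(ip0 + 1);
        }
    }
    None
}
```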
…ep3)

Donor `zstd_double_fast.c:190` checks ONLY `offset_1` in the inline rep peek — the full rep0/rep1/rep2 walk lives in the lazy/btopt parsers, not in the fast/double-fast hot path. Our previous `repcode_candidate` call walked all three offsets and ran a full SIMD `common_prefix_len` per probe, paying 3× the work for rep2/rep3 hits that the dfast matcher rarely benefits from (and which often crowd out better hash matches from the sequence stream).

Inlined rep1 probe with a 4-byte gate + forward `common_prefix_len` only on gate hit, mirroring donor's `MEM_read32(ip+1-offset_1) == MEM_read32(ip+1)` shape.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before: 20.97 ms, Rust 490340 B / FFI 527148 B (-6.98%)
- After: 16.11 ms, Rust 489846 B / FFI 527148 B (-7.07%)

Speed: -16.0% (p<0.01). Ratio also improved (-494 B) — fewer rep2/rep3 false-positive picks displacing better hash matches in the sequence stream. Both axes win.

Per-scenario sanity (no regressions at level_3_dfast):
- small-1k-random, small-10k-random, high-entropy-1m unchanged
- small-4k-log-lines 150 B vs FFI 156 B
- decodecorpus 489846 B vs FFI 527148 B
- low-entropy-1m 150 B vs FFI 159 B

`repcode_candidate` / `repcode_candidate_shared` are kept for non-fast callers (lazy lookahead, dictionary prime, dfast retry/seed helpers, opt parser) — those genuinely benefit from rep2/rep3.

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes, level22 sequence donor parity, and huff0 donor reader acceptance.
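A sketch of the gated rep1 peek under the same shape; helper names are illustrative and bounds handling is simplified:

```rust
fn read32(data: &[u8], at: usize) -> u32 {
    let mut w = [0u8; 4];
    w.copy_from_slice(&data[at..at + 4]);
    u32::from_le_bytes(w)
}

fn common_prefix_len(data: &[u8], a: usize, b: usize) -> usize {
    data[a..].iter().zip(&data[b..]).take_while(|(x, y)| x == y).count()
}

/// Rep match length at ip+1, or None when the cheap 4-byte gate misses;
/// the forward extension runs only on a gate hit.
fn rep1_peek(data: &[u8], ip: usize, offset_1: usize) -> Option<usize> {
    let at = ip + 1;
    if offset_1 == 0 || at < offset_1 || at + 4 > data.len() {
        return None;
    }
    if read32(data, at) == read32(data, at - offset_1) {
        Some(common_prefix_len(data, at - offset_1, at))
    } else {
        None
    }
}
```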
Local-only working notes (Phase 7c TODO/findings, etc.) live under `.local/` and are not part of the published repo.
`FrameCompressor::compress` pre-allocates a `Vec<u8>` to accumulate compressed blocks. The fixed 130 KiB seed (≈ one donor block) was load-bearing for medium and large inputs (it amortises the first `Vec::extend` doublings cheaply, and the `peak - 130 KiB` residue is dominated by internal `compress_block_encoded` buffers anyway). But for small inputs — frame headers compressing a few hundred bytes of payload — the 130 KiB seed was pure waste that pushed peak RSS up without buying any allocator-doubling savings.

Adaptive seed when the source-size hint reveals a small payload (see the sketch below):
* hint <= 4 KiB -> 4 KiB seed
* hint <= 64 KiB -> 16 KiB seed
* anything else -> 130 KiB (unchanged)

Effect on level_3_dfast compress peak RSS:
- small-1k-random 378 KiB -> 249 KiB (-34%)
- small-10k-random 397 KiB -> 280 KiB (-29%)
- small-4k-log-lines 412 KiB -> 283 KiB (-31%)
- decodecorpus-z000033 4.03 MiB unchanged
- high-entropy-1m 5.01 MiB unchanged
- low-entropy-1m 2.58 MiB unchanged
- large-log-stream 9.92 MiB unchanged

Tested a more aggressive scheme (a clamped `hint/2` ceiling at 128 KiB) that produced a measurable +1 MiB transient peak on high-entropy-1m without a compensating saving elsewhere — internal block buffers dominate at that size. Kept the conservative form. All 514 lib tests pass.
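The tier selection, sketched with thresholds quoted from the commit (the function shape is illustrative):

```rust
fn all_blocks_seed(size_hint: Option<usize>) -> usize {
    match size_hint {
        Some(h) if h <= 4 << 10 => 4 << 10,   // tiny payload: 4 KiB seed
        Some(h) if h <= 64 << 10 => 16 << 10, // small payload: 16 KiB seed
        _ => 130 << 10,                       // unchanged 130 KiB block seed
    }
}
```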
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info: configuration from .coderabbit.yaml, review profile ASSERTIVE, plan Pro; 7 files selected for processing.
📝 Walkthrough

This PR refactors the Dfast match generator to single-slot hash tables with contiguous history, introduces compress_bound and compress_slice_to_vec, tiers FrameCompressor output seeding by source-size hint, updates benches to use the new API, adds FSE table caching, and includes tests (including a 7-byte fuzz regression).

Changes: Dfast backend single-slot redesign and size-hint optimization
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed.
Codecov Report: ❌ Patch coverage is
Pull request overview
Phase 7c parity push for dfast Level 3: rewrites the dfast fast loop to follow donor's outer/inner structure with two-position lookahead, lowers DFAST_MIN_MATCH_LEN from 6→5, drops the 4-slot bucket layout in favor of donor-parity single-u32 slots with independently sized long/short tables, switches dfast hashing to a single scalar multiply, removes the duplicate VecDeque<Vec<u8>> byte store, and trims output-buffer pre-allocation in compress_to_vec/FrameCompressor::compress. Also enables fat LTO + codegen-units = 1 for all release builds.
Changes:
- Donor-parity dfast rewrite: single-slot tables, outer/inner loop with two-position lookahead, rep1-only inline peek, `_search_next_long` retry.
- Memory: drop the window `VecDeque<Vec<u8>>` byte duplication, adaptive `all_blocks` capacity, new `compress_slice_to_vec` API + `compress_bound` helper.
- Workspace `[profile.release]` set to fat LTO + 1 codegen unit; `.local/` gitignored.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| zstd/src/encoding/dfast/mod.rs | Donor-parity fast loop rewrite, single-slot tables, history-only storage, split hash sizing. |
| zstd/src/encoding/match_generator.rs | DFAST_MIN_MATCH_LEN 6→5, new DFAST_SHORT_HASH_BITS_DELTA, driver paths updated for dfast no-Vec retention, test updates. |
| zstd/src/encoding/mod.rs | New compress_bound and compress_slice_to_vec; compress_to_vec delegates. |
| zstd/src/encoding/frame_compressor.rs | Adaptive all_blocks initial capacity based on size hint. |
| zstd/benches/compare_ffi.rs | Use compress_slice_to_vec. |
| zstd/benches/compare_ffi_memory.rs | Use compress_slice_to_vec. |
| Cargo.toml | Workspace-level [profile.release] fat LTO + codegen-units = 1. |
| .gitignore | Ignore .local/. |
Comments suppressed due to low confidence (2)
zstd/src/encoding/dfast/mod.rs:591
- Same out-of-bounds-read concern as the outer-loop load: the unconditional 8-byte `u64` read at `concat_idx1` is gated only by `ip1 + DFAST_MIN_MATCH_LEN (5) <= current_len`. When `ip1` is within 3 bytes of `current_len`, this reads past `history.len()` (the current block is always the trailing block in `history`, so there is no data after it). Donor C avoids this by clamping `ilimit = iend - HASH_READ_SIZE`. Either tighten the guard to `ip1 + 8 <= current_len` or guarantee 8 bytes of zero padding past every appended block.
let v8_1 = unsafe {
(history_base_ptr.add(history_start_offset + concat_idx1) as *const u64)
.read_unaligned()
};
let hl1_idx = (v8_1.wrapping_mul(PRIME) >> long_shift) as usize;
zstd/src/encoding/dfast/mod.rs:757
- The new fast loop never advances `pos` on the "no match" path inside the inner loop except by `step`, and never updates the hash tables at intermediate positions skipped by `step`. Donor `zstd_double_fast.c` only inserts at `curr` (which this code does at line 525-528), so that part matches. However, when the inner loop exits with `committed = None` (line 752 `break 'outer`), `seed_remaining_hashable_starts` runs from `pos` (which still holds the outer-iter's starting `pos`, not the advanced `ip1`). The unreachable-tail positions between the outer-iter `pos` and the final `ip1` therefore never get their hash entries seeded for the next block. Verify this matches donor's tail-seed behavior — if donor seeds from `ip` (advanced) rather than the outer start, this is a hash-coverage regression that will hurt cross-block ratio.
match committed {
Some((candidate, _match_pos_rel, _lit_len_at_match)) => {
let start = self.emit_candidate(
current_abs_start,
&mut literals_start,
candidate,
handle_sequence,
);
pos = start + candidate.match_len;
pos = self.extend_with_repcode_after_match(
current_abs_start,
current_len,
pos,
&mut literals_start,
handle_sequence,
);
}
None => break 'outer,
}
}
self.seed_remaining_hashable_starts(current_abs_start, current_len, pos);
self.emit_trailing_literals(literals_start, handle_sequence);
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 216-219: The timed pure_rust compress benchmark still calls the
old API; update the loop to use
structured_zstd::encoding::compress_slice_to_vec(&scenario.bytes[..],
level.rust_level) instead of compress_to_vec so the benchmark measures the new
slice API consistently—locate the pure_rust timed loop that assigns to
rust_compressed (and any other occurrences of compress_to_vec in that benchmark)
and replace them with calls to compress_slice_to_vec using the same inputs
(scenario.bytes and level.rust_level).
In `@zstd/src/encoding/match_generator.rs`:
- Around line 680-683: The current comment incorrectly states that
trim_to_window reports evicted bytes via the provided closure; update the
comment in match_generator.rs to reflect that we now use history as the sole
byte store and compute evicted bytes by comparing window_size before and after
calling trim_to_window (the eviction callback intentionally does not fire in
this path and the regression around the callback expects that behavior). Mention
the relevant symbols: history, window_size, and trim_to_window, and clarify that
eviction counts are derived from window_size delta rather than the closure so
future readers don't simplify this path incorrectly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f56b9396-0fdb-4610-8902-a2cdcace61cb
📒 Files selected for processing (8)
.gitignore, Cargo.toml, zstd/benches/compare_ffi.rs, zstd/benches/compare_ffi_memory.rs, zstd/src/encoding/dfast/mod.rs, zstd/src/encoding/frame_compressor.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: 0338876 | Previous: 1e0f695 | Ratio |
|---|---|---|---|
| compress/level_22_btultra2/small-4k-log-lines/matrix/c_ffi | 0.112 ms | 0.067 ms | 1.67 |
| compress/level_22_btultra2/decodecorpus-z000033/matrix/c_ffi | 269.687 ms | 171.961 ms | 1.57 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
`start_matching_fast_loop` issued raw-pointer `u64::read_unaligned` (8 bytes) for the long-hash probe at `ip0` and `ip1`, but the loop guards only required `ip + DFAST_MIN_MATCH_LEN (5) <= current_len`. On any block whose tail landed inside `[current_len - 8, current_len - 5]` the load read up to 3 bytes past the live history end — UB on raw-pointer loads even when the underlying `Vec` has spare capacity, because that memory is uninitialised. Linux fuzz ASan caught it on a 7-byte input (artifact `crash-01be627ce8fc516b40e237ce6115933fd74c0dc7`, base64 `BGAuICAKIA==`).

Changes:
* Introduce `const HASH_READ_SIZE: usize = 8` with the donor-parity rationale (`zstd_double_fast.c` `ilimit = iend - HASH_READ_SIZE`).
* Tighten all three fast-loop guards (outer init, the ip1 entry check, and the inner-advance break) from `+ DFAST_MIN_MATCH_LEN` to `+ HASH_READ_SIZE`. The general path (`start_matching_general`) keeps the looser `+ DFAST_MIN_MATCH_LEN` guard because it uses safe slicing (`data[..8].try_into()`), which panics on a short slice rather than reading uninitialised bytes; an explanatory note is added to its loop preamble.
* Rewrite the SAFETY comment on the long-hash load to drop the incorrect "add_data reserves padding past current_len" claim and cite the actual `pos + HASH_READ_SIZE <= current_len` invariant.
* Restore the `expect(...)` on `window_blocks.back()` in `get_last_space` — the prior `.unwrap_or(0)` silently downgraded the invariant check, letting downstream callers run on a zero-length trailing slice instead of failing fast on an empty `window_blocks`.
* Document `DfastMatchGenerator::trim_to_window`'s `_reuse_space` closure as intentionally vestigial — dfast no longer retains per-block `Vec<u8>` storage, but the closure parameter stays for cross-backend signature uniformity.
* Refresh a stale inline comment that still referenced the pre-`MIN_MATCH=5` "6-byte minimum".

Verification:
* New regression test `interop_7_byte_input_does_not_oob_in_dfast_fast_loop` in `tests/fuzz_regressions.rs` compresses the exact 7-byte CI artifact at `CompressionLevel::Default` and roundtrips through the donor decoder. Pre-fix this hit the UB on the 8-byte load; post-fix the loop exits cleanly and the trailing-literals path emits the raw bytes.
* `cargo nextest run --lib --features hash,std,dict_builder` — 515 passed (514 pre-existing + 1 new regression).
* `cargo clippy --all-targets --features hash,std,dict_builder` — 0 warnings.
* `cargo fmt --check` — clean.
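The tightened guard in isolation, under the names the commit introduces:

```rust
// Donor parity: `ilimit = iend - HASH_READ_SIZE` in zstd_double_fast.c.
const HASH_READ_SIZE: usize = 8;

/// An unconditional 8-byte raw load is only in-bounds when all 8 bytes are
/// live history. The previous `pos + DFAST_MIN_MATCH_LEN (5) <= current_len`
/// gate let the u64 read reach up to 3 bytes past the live end.
fn can_probe_long(pos: usize, current_len: usize) -> bool {
    pos + HASH_READ_SIZE <= current_len
}
```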
Cargo.toml:
* Move `lto = "fat"` + `codegen-units = 1` from `[profile.release]` to `[profile.bench]`. The release profile is shared by `cli` and every downstream consumer; pinning fat LTO there multiplies release build-times ~5× and significantly increases peak compile RAM. The bench profile otherwise inherits from release, so the perf measurement still sees full cross-crate inlining where it matters.

zstd/src/encoding/mod.rs:
* `compress_bound` doc: point at `compress_slice_to_vec` (the actual consumer) and clarify that the 128 KiB seed cap means the bound only governs the small-input branch.
* `compress_slice_to_vec` doc: drop the inaccurate "pre-sized to compress_bound" claim and describe the actual `min(compress_bound, OUTPUT_BLOCK_CAP)` seed + the amortized-doubling trade-off for larger inputs.

zstd/benches/compare_ffi.rs:
* The timed `pure_rust` compress loop now calls `compress_slice_to_vec` instead of the `Read`-based `compress_to_vec` wrapper. Matches the setup path and what `compress_to_vec` itself delegates to.

zstd/src/encoding/match_generator.rs:
* Rewrite the comment on the dfast `trim_to_window` call to describe the history-only byte-store reality: the closure does NOT fire (no per-block buffer to recycle), the eviction byte count is derived from the `window_size` delta, and the closure arg stays only for cross-backend dispatch uniformity.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 292-299: trim_to_window currently only updates window_blocks and
cursor counters but leaves the evicted prefix in the history Vec<u8>, so memory
isn't reclaimed; change the logic to accumulate the total evicted bytes (sum of
removed_len popped from window_blocks), update
window_size/history_start/history_abs_start as you already do, then remove that
prefix from the history buffer (e.g., by draining 0..total_evicted or replacing
history with history.split_off(total_evicted)) so the underlying allocation no
longer retains the dead prefix; adjust the function around trim_to_window,
window_blocks, history, history_start, and history_abs_start to perform the
single prefix-trim after the pop loop.
- Around line 483-577: The fast-loop can pack a truncated position into
packed_curr because position_base may be stale; before computing packed_curr and
writing to long_hash_ptr/short_hash_ptr (the code around packed_curr creation
and the unsafe writes for hl0_idx/hs0_idx), ensure the position window is valid
by calling the rebasing helper (ensure_room_for or equivalent) with the current
absolute position (e.g. ensure_room_for(abs_ip0/current_abs_start + ip0) or a
dedicated rebase call) so position_base is advanced if needed, then compute
packed_curr and proceed to store; this prevents u32 truncation of large absolute
positions that would poison the hash tables.
In `@zstd/src/encoding/match_generator.rs`:
- Around line 4862-4864: Update the test comment around the assertion involving
MIN_HINTED_WINDOW_LOG and MIN_WINDOW_LOG to remove the misleading phrase "so
both tables can equal the floor" and instead state that short-hash is clamped to
MIN_WINDOW_LOG to preserve the one-bit split model (i.e., the short table is at
the floor while the long table uses the hinted size represented by hinted_long /
1 << MIN_HINTED_WINDOW_LOG). Reference the symbols MIN_HINTED_WINDOW_LOG,
MIN_WINDOW_LOG and the local hinted_long in the comment so maintainers clearly
understand the intended sizing behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2867c792-63ee-4243-8f2f-39ccacdc6392
📒 Files selected for processing (7)
Cargo.toml, zstd/benches/compare_ffi.rs, zstd/fuzz/artifacts/interop/crash-01be627ce8fc516b40e237ce6115933fd74c0dc7, zstd/src/encoding/dfast/mod.rs, zstd/src/encoding/frame_compressor.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs, zstd/src/tests/fuzz_regressions.rs
`start_matching_fast_loop` computes `packed_curr = (abs_ip0 - position_base) as u32 + 1` inside the inner loop and writes it straight into the hash tables, bypassing `insert_position` (the normal site for the `ensure_room_for` rebase). On a long stream of all-miss / non-hashable blocks the matcher advances `current_abs_start` arbitrarily far without any per-byte insert, so `position_base` can become stale by more than `u32::MAX`. The first fast-loop block after that wraparound would silently truncate `packed_curr` and poison both hash tables — every subsequent probe would compare against garbage positions.

Fix: call `self.ensure_room_for(current_abs_start + current_len - 1)` at fast-loop entry, before the per-frame snapshots. The guard band in `ensure_room_for` (`DFAST_REBASE_GUARD_BAND`) covers the entire current block's worth of positions, so a single call suffices and the inner loop's `packed_curr` casts are guaranteed to fit `u32`.

trim_to_window:
* Append `compact_history()` after the eviction loop so an explicit trim actually reclaims the evicted prefix from the `history` `Vec<u8>`, not just advances cursors over it. Previously a caller trimming to shed memory before an idle period would not see the expected RSS drop until the next `add_data` happened to compact.

match_generator.rs (test comment):
* Rewrite the misleading wording around the `MIN_HINTED_WINDOW_LOG` short/long table sizing assertion. The short table is NOT pulled up to the floor; it sits one `DFAST_SHORT_HASH_BITS_DELTA` step below the long table, clamped at its own `MIN_WINDOW_LOG` floor. The one-bit split is preserved at the hinted floor.

Verification: 515 lib tests pass; clippy + fmt clean.
- insert_positions hot loop: use `history_start` directly; the `history_start_offset` local was a pure alias adding nothing.
- interop_7_byte regression test: extend the docstring explaining that the OOB u64 load is only reliably caught by ASan/Miri — plain nextest often passes against pre-fix code because the read lands in the Vec's spare capacity.
- start_matching_fast_loop: replace implicit Option<MatchCandidate> /
  tail_seed_anchor pairing with explicit enum InnerExit { Committed,
  Tail }; the loop now produces the exit value as an expression, so
  every break 'inner picks a variant and the post-loop match is
  exhaustive (see the sketch after this list)
- DfastMatchGenerator::trim_to_window: take an unused reuse_space
callback for signature symmetry with HC/Row matchers; the
match_generator dispatcher loses its Dfast-specific zero-arg branch
- emit_candidate / emit_trailing_literals: inline the
window_blocks.back() + history.len() - last_len slice computation
rather than calling get_last_space(), so all four internal sites in
this file share one gate shape; trait-level get_last_space remains
for external dispatch via Matcher
- start_matching_fast_loop preamble: replace `if current_len > 0`
runtime gate around ensure_room_for with a debug_assert; the
precondition is established by every caller's early-return on
current_len == 0
- compress_to_vec docstring: peak-RSS figure now lists all_blocks
internal accumulator (up to 130 KiB) alongside output_vec_seed and
per-block scratch rather than only quoting the 128 KiB output seed
- comments: drop stale `find_best_match` cross-refs in match_generator
/ dfast docstrings; the retry method is hash_candidate (via
best_match), find_best_match belongs to the HC backend; rename a
couple of `donor parity` comment headings to `upstream parity` in
the diffed regions per project terminology
- extend_with_repcode_after_match: extend the sparse-insert rationale
comment to explicitly justify why abs_pos is not in the insert set
(upstream zstd_double_fast.c does not insert at the rep-extension
start either; the three offsets curr+2 / ip-2 / ip-1 mirror the
primary-match-emit pattern, audited against the sequence-stream
comparator harness)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 879-955: The short-hit branch currently commits any 4-byte
equality; require DFAST_MIN_MATCH_LEN before committing: after computing
s_match_len (and before computing short_cand / setting chosen/committing), check
if s_match_len >= DFAST_MIN_MATCH_LEN and only then proceed to compute
short_cand and allow committing via InnerExit::Committed; otherwise treat it as
a non-accepted short hit and only permit promotion if the `_search_next_long`
retry (idxl1 -> l1_match_len) yields a strictly longer match that also meets the
minimum; if no valid upgrade occurs, do not break/commit—continue the loop.
Ensure checks reference idxs0, s_match_len, short_cand, idxl1, l1_match_len,
extend_backwards_shared and InnerExit::Committed.
- Around line 644-653: The check that currently does "if ip1 + HASH_READ_SIZE >
current_len { break 'outer; }" prematurely exits the outer loop when ip1 can't
read 8 bytes, skipping a still-hashable ip0; change this so we only break the
outer loop when ip0 is unhashable. Concretely, replace the unconditional break
with logic that checks ip0 + HASH_READ_SIZE and only breaks if ip0 is also out
of range; otherwise skip ip1-specific work (e.g., continue the outer loop or
proceed to handle ip0) so the final hashable start at pos is still probed.
Ensure you touch the block around ip0, ip1, pos, step, HASH_READ_SIZE and
current_len to implement this conditional behavior.
In `@zstd/src/encoding/match_generator.rs`:
- Around line 6985-6993: The test must assert the trim_to_window closure is not
invoked: wrap the closure passed to matcher.trim_to_window with a mutable flag
(e.g., called_or_triggered) or AtomicBool captured by the closure, have the
closure set that flag if invoked, call matcher.trim_to_window(|_| { ... }), then
assert the flag is still false along with the existing assertions on
matcher.window_size, matcher.window_blocks.len(), and matcher.history_abs_start;
this ensures the trim callback remains silent and the Dfast eviction contract is
preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 51b581e1-6fb9-4788-9bfb-0fd7d2c74d51
📒 Files selected for processing (3)
zstd/src/encoding/dfast/mod.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs
♻️ Duplicate comments (1)
zstd/src/encoding/dfast/mod.rs (1)
879-957: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Short-hash path can commit matches below the `DFAST_MIN_MATCH_LEN` floor.

The 4-byte equality gate at Line 890 accepts any collision where the first 4 bytes match. After forward extension (which can add 0 bytes if the 5th byte differs) and backward extension (which can also add 0 bytes), `short_cand.match_len` can be exactly 4. The `_search_next_long` retry only upgrades if it finds something strictly longer, so a 4-byte short-cand commits unchanged.

Meanwhile, `probe_slot_match` (Line 1437) explicitly rejects `match_len < DFAST_MIN_MATCH_LEN`. The fast-loop inline path bypasses that check.

Add a floor check before committing:
```diff
  let short_cand = extend_backwards_shared(
      concat,
      history_abs_start,
      cand_pos_s,
      abs_ip0,
      s_match_len,
      lit_len_ip0,
  );
+ if short_cand.match_len < DFAST_MIN_MATCH_LEN {
+     // 4-byte gate passed but total length < 5; treat as miss.
+     // Continue inner loop without committing.
+ } else {
      // Donor `_search_next_long` retry...
      let mut chosen = short_cand;
      ...
      break 'inner InnerExit::Committed(chosen);
+ }
```

The retry should also only run when `short_cand.match_len >= DFAST_MIN_MATCH_LEN`, since the donor's `_search_next_long` is a quality optimization after a valid short hit, not a fallback for sub-threshold hits.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/dfast/mod.rs` around lines 879 - 957, The short-hash path can commit matches shorter than DFAST_MIN_MATCH_LEN because the 4-byte gate (using idxs0 and v4_0) allows a short_cand.match_len == 4 to be committed; fix by adding a floor check before committing the short candidate: ensure short_cand.match_len >= DFAST_MIN_MATCH_LEN (and only run the `_search_next_long` retry when that condition holds) before calling extend_backwards_shared and before returning Committed(chosen). Update the block that computes short_cand (around use of idxs0, v4_0, extend_backwards_shared and short_cand.match_len) to gate commit/upgrade logic with DFAST_MIN_MATCH_LEN.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 879-957: The short-hash path can commit matches shorter than
DFAST_MIN_MATCH_LEN because the 4-byte gate (using idxs0 and v4_0) allows a
short_cand.match_len == 4 to be committed; fix by adding a floor check before
committing the short candidate: ensure short_cand.match_len >=
DFAST_MIN_MATCH_LEN (and only run the `_search_next_long` retry when that
condition holds) before calling extend_backwards_shared and before returning
Committed(chosen). Update the block that computes short_cand (around use of
idxs0, v4_0, extend_backwards_shared and short_cand.match_len) to gate
commit/upgrade logic with DFAST_MIN_MATCH_LEN.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 95a00898-d756-4a55-8511-fe6fd9637d4d
📒 Files selected for processing (3)
zstd/src/encoding/dfast/mod.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs
`build_table_from_probabilities` runs over a fixed input — the predefined LL_DIST/ML_DIST/OF_DIST arrays at compile-time-constant acc_log values. Building the resulting FSETable from scratch on every `FrameCompressor::new` costs ~12 µs per table on x86_64; three calls add up to ~38 µs of pure setup paid on every frame, which dominates the per-frame cost for inputs under a few KiB, where the actual compression work is in the single-digit-microsecond range.

Cache each predefined table once per process in an `AtomicPtr` and clone on subsequent calls. The probability slice's pointer identifies which cache slot to use (the LL/ML/OF distributions live at static addresses), and unrelated probability arrays fall through to the eager build path so callers passing custom distributions keep the old behavior. `AtomicPtr` is in `core::sync` — it works in both std and no_std builds (only alloc is required for `Box::leak`, always available in this crate via `extern crate alloc`). The cache uses Acquire/AcqRel ordering so the consumer sees a fully-published FSETable and the publisher both reads concurrent winners and publishes its own store; the losing thread reclaims its allocation, the winner leaks the Box intentionally for the rest of the program — the same lifetime semantics as `LazyLock`.

Measured locally on darwin/aarch64 with criterion (1 KiB random, level_2_dfast):
- before: 44.25 µs/iter (22.07 MiB/s) — 13.4x slower than FFI
- after: 12.67 µs/iter (77.11 MiB/s) — 3.9x slower than FFI

The remaining gap on this scenario is the residual per-frame setup cost (CompressedBlockScratch allocations, MatchGeneratorDriver Simple-variant init that resets to the level's actual backend) — addressing those is a separate change.
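A standalone sketch of the cache protocol under the same Acquire/AcqRel and leak-on-win semantics; `FSETable` and `build_table_from_probabilities` are placeholders for the crate's items, and the helper name follows the later review round's `get_or_init_cached_table`:

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

struct FSETable; // placeholder for the crate's real table type

fn build_table_from_probabilities(_probs: &[i32], _acc_log: u8) -> FSETable {
    FSETable
}

fn get_or_init_cached_table(
    slot: &'static AtomicPtr<FSETable>,
    probs: &[i32],
    acc_log: u8,
) -> &'static FSETable {
    let cached = slot.load(Ordering::Acquire);
    if !cached.is_null() {
        // SAFETY: a non-null pointer was published by the winning store
        // below, which intentionally leaks its Box (LazyLock-style lifetime).
        return unsafe { &*cached };
    }
    let fresh = Box::into_raw(Box::new(build_table_from_probabilities(probs, acc_log)));
    match slot.compare_exchange(
        core::ptr::null_mut(),
        fresh,
        Ordering::AcqRel,  // publish our table / observe a concurrent winner
        Ordering::Acquire, // see the winner's fully-built table on failure
    ) {
        Ok(_) => unsafe { &*fresh },
        Err(winner) => {
            // Lost the race: reclaim our allocation and use the winner's.
            drop(unsafe { Box::from_raw(fresh) });
            unsafe { &*winner }
        }
    }
}
```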
`start_matching_fast_loop` now handles the boundary where ip0 is hashable but ip1 is not (`pos == current_len - HASH_READ_SIZE`). Previously the outer guard at this boundary broke unconditionally and left ip0 unsearched, so `seed_remaining_hashable_starts` only inserted the position without probing — a real match in the tail window of a short block would be dropped. `probe_tail_ip0_only` mirrors the inner loop's long+short probe at ip0 (skipping the rep peek and the `_search_next_long` retry, both of which read at ip1); on a hit the outer loop routes through `emit_candidate` exactly like the main path.

The short-hash hit branch now enforces `DFAST_MIN_MATCH_LEN` before committing. The 4-byte equality gate plus forward extension can produce `match_len < 5`; the long-hash retry can still upgrade that to a valid long hit, but with no upgrade the below-floor hit is now discarded and the loop continues scanning. Emitting a 4-byte non-rep match would mint a 13-17-bit offset on the wire that costs more than the 4-byte payload buys; this restores the same floor that `probe_slot_match` already enforces on the long-hash main path.

The `trim_to_window` regression test now asserts the callback is never invoked. The Dfast variant takes the closure for signature symmetry with HC/Row, but the history-only storage never has a per-block Vec to recycle — locking that invariant into the test means a future change that starts pushing buffers through the callback fails CI instead of silently reshaping the driver-side eviction accounting.
- `Dfast::trim_to_window` drops the `_reuse_space: impl FnMut(Vec<u8>)` parameter the previous round added for HC/Row signature symmetry. The closure was contractually never invoked, but the monomorphization still paid codegen and inlining cost on the cold trim path; the dispatcher in match_generator.rs already branches per backend on other lifecycle calls, so the per-arm divergence here is free. The test asserts on signature shape rather than a callback-invoked flag — the type system itself is now the enforcement surface.
- The FSE table cache no longer routes per-distribution by pointer comparison on the slice (`core::ptr::eq(probs.as_ptr(), LL_DIST.as_ptr())`). LL_DIST/ML_DIST/OF_DIST are `const`, not `static`, and the language does not guarantee a single materialized address for a const slice — a future rustc/codegen change that duplicates the const per use site would silently fall through every call to the build path and erase the cache win without a test surface to catch it. Reshape: each of the three public helpers owns a private `static AtomicPtr<FSETable>`, so the cache slot is selected at compile time without consulting the distribution pointer at all. The lock-free init logic moves into a shared `get_or_init_cached_table(&AtomicPtr, &[i32], u8)` helper.
- `compress_slice_to_vec` rustdoc now includes a # Panics section matching `compress_to_vec`, listing encoder-error and OOM as the two unhandled failure surfaces and pointing at `FrameCompressor::compress` as the Result-bearing entry point a future `try_` variant would expose.
- The `start_matching_fast_loop` preamble grows a raw-pointer aliasing invariant block enumerating which fields the inner loop touches through cached raw pointers (short_hash, long_hash, history) vs through `&self` shared reads (offset_hist, history-as-slice for extend_backwards). Stacked Borrows / Tree Borrows soundness rests on field-disjointness — adding a `&mut self` call inside the loop would reborrow the tables and invalidate the cached pointers, which the comment now flags so a future refactor either hoists the call or reloads the raw pointers.
- `probe_tail_ip0_only` drops the unconditional `packed_curr` writes to both hash tables. Every caller branch (`continue 'outer` after emit, `break 'outer` on no hit) exits the outer loop before any future iter could re-probe the slots, and `seed_remaining_hashable_starts` inserts the same positions during the post-loop tail seed pass. The writes were dead and only paid cache dirtying on the tail boundary.
- The `add_data` inline comment grows an in-tree caller-audit conclusion: every production path that reaches `commit_space -> add_data` originates in levels/fastest.rs's block emitter, which produces non-empty blocks gated by RLE-detect / raw-fast-path checks. No production caller relies on `add_data(empty)` as a side-effecting trim trigger; tests cover the trim path through `trim_to_window` directly. The behaviour change introduced earlier in this PR (empty-input short-circuit before the eviction loop) is therefore observable only by a hypothetical future caller that should be using `trim_to_window` instead.
The predefined LL/ML/OF FSE default tables now flow through the encoder as `&'static FSETable` rather than owned `FSETable`. The process-local cache stores `Box::leak`'d singletons; the `default_*_table()` helpers return the static reference; the `FseTables` struct holds `&'static` for the three default fields. This removes the per-`FrameCompressor::new` clone of three FSETables that the AtomicPtr cache previously paid even on the hot path — the clone was ~4-5 µs out of ~7 µs total ctor on darwin/aarch64.

Measured locally (criterion, 1 KiB random, level_2_dfast):
- before: 12.67 µs/iter (77 MiB/s) — gap 3.9× FFI
- after: 6.08 µs/iter (160 MiB/s) — gap 1.83× FFI

Cumulative against the pre-cache baseline at the start of this session: 44.25 µs → 6.08 µs, a 7.3× compress-time speedup on the small-1k-random scenario. It applies to every level, since the FSE setup sits inside `FrameCompressor::new` regardless of strategy.

Secondary cleanup tied to the same review round:
- `probe_tail_ip0_only` takes `&self` (it was already read-only on the hash tables after dropping the dead packed_curr writes; the `&mut` forced an exclusive borrow at the call site for no reason)
- dropped the misleading `let _ = position_base;` discard from the same helper; the variable IS used by the bounds check below, so the discard suppressed no warning and only made the surrounding comment incorrect
- `compress_to_vec` rustdoc explicitly calls out "NOT a streaming API" — the source is materialized in full before any compression runs; callers chasing peak-RSS shape are pointed at `StreamingEncoder`
- `get_last_space` (the trait-dispatch entry point) returns `&[]` on empty `window_blocks` instead of panicking. The four internal sites in this file all use the inline `window_blocks.back().copied().unwrap_or(0)` gate pattern; the trait helper now mirrors that shape so external callers see the same contract as internal ones, and a future refactor consolidating the two paths doesn't trip a panic surface that the internal sites already gate around
- clippy needless-borrow autofix on call sites that now receive `&'static FSETable` directly rather than `&FSETable`
Superseded by #146 — same final tree, history squashed from 30 commits to 3 logical commits, scope clarified per restructured Phase 7 plan in #111 (this is Phase 7pre: enablement, not the original misleading 'level 3 parity gap' title). Closing to clean the review surface; CodeRabbit / Copilot will re-run on #146 from a clean slate. |
Summary
Closes the bulk of Phase 7c parity work on dfast level 3 across two of three axes; decompress and medium/large memory are left for a follow-up PR. Picked up a cross-level small-input perf win while addressing review feedback — the FSE default-table cache (commit f928ef4) shaved ~38 µs off `FrameCompressor::new` for every compression call regardless of level.

Honest framing on the "Compressed size" column below: our output is 7% smaller than donor on the `decodecorpus-z000033` corpus. That is NOT confirmation that we beat donor — it is more likely a signal of algorithmic divergence (we are doing extra match-finding work the donor's fast path skips), which is consistent with us still being 5.5× slower. The deviation may turn out to be the cause of the speed gap, not a separate win. To confirm what donor would have emitted on the same input we still need a per-sequence diff harness (see the "Deferred" section).

Cumulative deltas (compress/level_3_dfast/decodecorpus-z000033, 1 022 035 B input)
Small-input speedup (FSE cache, all levels)
FrameCompressor::newused to rebuild the three predefined LL/ML/OF FSE tables on every call (~12 µs perbuild_table_from_probabilities× 3 = ~38 µs). Two-stage fix during the review rounds on this PR:AtomicPtr<FSETable>and clone on each ctor (commitf928ef4) — drops the build cost but pays ~4-5 µs per clone of three 256-elementSymbolStatesarrays.&'static FSETableand store the static reference insideFseTablesdirectly (commit0338876) — the clone disappears, the field becomes a pointer-sized copy.Effect on
Effect on `small-1k-random/matrix/pure_rust` at level 2 dfast, measured locally on darwin/aarch64 with criterion:

| | f928ef4 (cache + clone) | 0338876 (&'static) |
|---|---|---|
| `compress_slice_to_vec` | 12.67 µs | 6.08 µs |

7.3× total compress-time speedup on the small-input scenario. Applies to every level —
`FseTables::new()` runs inside the constructor regardless of strategy. The residual ~2.8 µs gap to FFI is per-frame setup not covered by this change (`CompressedBlockScratch::default()` Vec allocations, `MatchGeneratorDriver` Simple-variant init that is then reset to the level's backend, frame header / hash setup); deferred to a follow-up.

Per-scenario sanity at level_3_dfast (no regressions)
Structural changes (30 commits)
dfast matcher rewrite (Phase 7c core)
- `start_matching_fast_loop` (78f51da5) — a minimal sketch of the loop shape follows this list:
  - outer `while` per match, inner `loop` per scan position, carrying `hl0`/`idxl0`/`hl1`/`idxl1` between iterations (`hl1` precomputed, reused on miss as the next iter's `hl0`)
  - insert `curr` BEFORE the match check (donor `zstd_double_fast.c:187`)
  - 4-byte match gate (`MEM_read32`); the forward `common_prefix_len` runs only on a hit
  - `_search_next_long` retry with precomputed `idxl1` follows donor's preferred-longer-match policy
  - `probe_slot_match` is kept for cold callers
- (c6eb5a83) — donor `clevels.h:31` sets default level 3 dfast `mls=5`; we were silently dropping every 5-byte match.
- (67a47858) — drop CRC32d kernel dispatch on the dfast hot path; donor `ZSTD_hash8` is one multiply.
- (5372ae99) — donor `zstd_double_fast.c:190` unconditional peek.
- (233603fd) — collapse `best_match → hash_candidate → probe_slot_match` so `&self` field loads hoist into the hot loop.
- (8e72d58b) — donor parity; step grows every `kStepIncr` positions traveled, not on a consecutive-miss threshold.
- (1e6d5dd5) — donor `zstd_double_fast.c` inline hot path checks only `offset_1`; the full rep0/rep1/rep2 walk lives in lazy/btopt. Speed −16%, size also −494 B (fewer rep2/rep3 picks displacing better hash matches).
- `[profile.bench]` (70f966b5, bac47a6) — full cross-crate inlining for benches without forcing 5–10× release build-time on `cli` and downstream consumers.
- `all_blocks` adaptive seed in `FrameCompressor::compress` (079615c0) — scale the initial accumulator capacity by the source-size hint; saves ~130 KiB peak for ≤4 KiB inputs without regression on medium/large.
- (ef906f7) — tighten fast-loop guards to `HASH_READ_SIZE = 8`. The previous `+ DFAST_MIN_MATCH_LEN (5)` let the unconditional 8-byte `u64::read_unaligned` for the long-hash probe read up to 3 bytes past the live history end. Linux fuzz ASan caught it on a 7-byte input (crash-01be...); regression test added in `tests/fuzz_regressions.rs`.
- (87a62e3) — call `ensure_room_for` at fast-loop entry. The inner loop's `packed_curr` cast `(abs_ip0 - position_base) as u32` could truncate after a long stream of all-miss / non-hashable blocks (the fast loop bypasses `insert_position`'s rebase).
- `trim_to_window` reclaims prefix (87a62e3) — append `compact_history()` after the eviction loop so an explicit trim actually frees memory instead of leaving the dead prefix resident until the next `add_data`.
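The sketch below is a stripped illustration of that loop shape, assuming toy names (`hash8`, `fast_loop`) and a zero-initialized table; the short-hash probe, rep peek, step ramp, and `_search_next_long` retry are all elided, and the hash prime is illustrative rather than copied from the crate.

```rust
const HASH_BITS: u32 = 17;

/// One-multiply 8-byte hash in the donor ZSTD_hash8 shape (prime illustrative).
fn hash8(data: &[u8], at: usize) -> usize {
    let v = u64::from_le_bytes(data[at..at + 8].try_into().unwrap());
    (v.wrapping_mul(0xCF1B_BCDC_B7A5_6463) >> (64 - HASH_BITS)) as usize
}

/// Scan forward; return (current, candidate) on the first 8-byte gate hit.
/// `table` must hold `1 << HASH_BITS` entries; 0 doubles as "empty" here
/// (the real code rebases absolute positions instead).
fn fast_loop(data: &[u8], table: &mut [u32]) -> Option<(usize, usize)> {
    debug_assert_eq!(table.len(), 1 << HASH_BITS);
    if data.len() < 9 {
        return None; // no position admits a full 8-byte read plus lookahead
    }
    let end = data.len() - 8;
    let mut ip = 0;
    let mut hl0 = hash8(data, ip);
    while ip < end {
        let hl1 = hash8(data, ip + 1); // precomputed: next iteration's hl0
        let cand = table[hl0] as usize;
        table[hl0] = ip as u32; // insert curr BEFORE the match check
        if cand != 0 && data[cand..cand + 8] == data[ip..ip + 8] {
            // a real implementation would run forward extension only here
            return Some((ip, cand));
        }
        ip += 1;
        hl0 = hl1; // miss: reuse the precomputed hash, one hash per position
    }
    None
}
```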
Cross-level perf + parity refinements (review-driven, recent rounds)

- (f928ef4) — the three predefined LL/ML/OF tables are now cached in per-helper `static AtomicPtr<FSETable>`. `FrameCompressor::new` clones from the cache instead of rebuilding (~38 µs → ~3 µs). Affects every level on every call. `core::sync::atomic` works in both std and no_std builds; the cache slot is selected by the static identity of each helper, not by pointer comparison on the distribution slice (the `const` slices don't have a stable address guarantee).
- (f9d36b1) — added `probe_tail_ip0_only`, called when the outer iter sees `ip0` still hashable but `ip1` past the end (boundary `pos == current_len - HASH_READ_SIZE`). Previously the outer guard broke unconditionally there and left the last hashable position to the seeder, which inserts but does NOT search — a real match in the tail window of a short block would be dropped. Mirrors the inner loop's long+short probe at ip0; the rep peek and the `_search_next_long` retry are skipped (both depend on ip1), matching donor's iend-boundary tradeoff.
- (f9d36b1) — the 4-byte equality gate plus forward extension can produce `match_len < 5`. Previously the fast loop committed below-floor short hits; now `short_cand.match_len >= DFAST_MIN_MATCH_LEN` is checked before committing, and the `_search_next_long` retry can still upgrade to a valid long match. Matches the floor `probe_slot_match` already enforces on the long-hash main path; emitting a 4-byte non-rep match would mint a 13–17-bit offset that costs more than the payload buys.
- (900f2114) — replaced the implicit `Option<MatchCandidate>` + sibling `tail_seed_anchor` pairing with an explicit `enum InnerExit { Committed(MatchCandidate), Tail(usize) }`. Every `break 'inner` now picks a variant; the post-loop match is exhaustive (sketched below).
- (ecd2bfd7) — `probe_slot_match` takes a `gate_len` parameter (8 for long-hash callers, 4 for short-hash). Replaces the prior 1-byte precheck, which let 8-byte-collision short matches slip past the gate (sketched below).
- (ecd2bfd7) — `skip_matching`/`skip_matching_dense`/`start_matching` early-return on `current_len == 0`. `add_data` short-circuits empty input without pushing to `window_blocks`; without the guard these methods would re-seed the previous block's retained tail on every empty flush.
- `trim_to_window` Dfast signature (0cbb3004) — drops the unused `_reuse_space: impl FnMut(Vec<u8>)` parameter. Dfast's history-only storage has no per-block Vec to recycle; saves the cold-path monomorphization that would otherwise inline an empty closure.
- (0cbb3004) — `compress_slice_to_vec` now documents its `# Panics` modes; `start_matching_fast_loop` grows a raw-pointer aliasing invariant block enumerating which fields are accessed through cached raw pointers vs `&self` (so a future refactor adding a `&mut self` call inside the loop knows to either hoist the call or reload the cached pointers).
- `&'static FSETable` (0338876) — the AtomicPtr-cached default tables flow as `&'static FSETable` rather than owned `FSETable`, eliminating ~4-5 µs of `FSETable::clone` per `FrameCompressor::new`. `FseTables` stores the static reference; `clone_fse_tables` (used by the block-split estimator) becomes a pointer-sized copy on the three default fields. Cumulative small-input speedup vs baseline goes from 3.5× to 7.3× (44.25 µs → 6.08 µs on 1 KiB random / level 2 dfast).
Companion review cleanups in the same commit: `probe_tail_ip0_only` switches to `&self` (read-only after the previous round dropped the dead writes); `get_last_space` (the trait-dispatch entry point) returns `&[]` on empty `window_blocks` instead of panicking, mirroring the internal-callsite gate pattern; `compress_to_vec` rustdoc explicitly calls out "NOT a streaming API" since it materializes the source fully before any compression runs.

The first six commits in the branch are a continuation of pre-Phase-7c work (split dfast hash table sizing, port donor single-slot storage, drop window-Vec byte duplication, `compress_slice_to_vec` allocation tightening, `insert_positions` inner-loop hoisting). The new structural work begins with c6eb5a83.
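A hedged sketch tying together two items from the list above: the `InnerExit` enum quoted from commit 900f2114, plus a `gate_len`-parameterised probe gate in the spirit of ecd2bfd7. `MatchCandidate`'s fields and the `probe_gate`/`resolve` helpers are illustrative stand-ins, not the crate's API.

```rust
const DFAST_MIN_MATCH_LEN: usize = 5;

#[allow(dead_code)]
struct MatchCandidate {
    src_pos: usize, // illustrative fields; the real struct is not shown in the PR
    match_pos: usize,
    match_len: usize,
}

// Quoted shape from 900f2114: every `break 'inner` picks a variant.
enum InnerExit {
    Committed(MatchCandidate), // a candidate cleared the gates and was committed
    Tail(usize),               // scan ran out of hashable bytes; seed from this anchor
}

/// Width-parameterised equality gate: 8 bytes for long-hash callers,
/// 4 bytes for short-hash callers, replacing a 1-byte precheck.
fn probe_gate(data: &[u8], cand: usize, ip: usize, gate_len: usize) -> bool {
    match (data.get(cand..cand + gate_len), data.get(ip..ip + gate_len)) {
        (Some(a), Some(b)) => a == b,
        _ => false, // an out-of-range probe can never pass the gate
    }
}

/// Post-loop handling is an exhaustive match; below-floor short hits
/// are rejected before commit.
fn resolve(exit: InnerExit) -> Option<MatchCandidate> {
    match exit {
        InnerExit::Committed(c) if c.match_len >= DFAST_MIN_MATCH_LEN => Some(c),
        InnerExit::Committed(_) => None,  // 4-byte hit that failed the length floor
        InnerExit::Tail(_anchor) => None, // the caller seeds the tail positions
    }
}
```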
Donor-deviation audit
Every commit either (a) moves toward donor parity (most), or (b) does strictly less work and runs faster. Two specific deviations remain:
- `kStepIncr = 64` (donor `256`) — our step ramp is 4× more aggressive. This favours speed at the cost of scan coverage (the ramp shape is sketched after this list).
- dense complementary insertion (vs donor's sparse `curr+2`, `ip-2`, `ip-1`) — tested swapping to donor's sparse pattern; it regressed both axes (+4% slower, +2.3% larger output). The hash coverage from dense insert compensates for our `probe_slot_match` doing a full forward `common_prefix_len` on short hits. Switching all probes to donor's 4-byte-gate-only shape would make the sparse insert the right choice; revisit then.

These two deviations together likely explain part of the −7% size delta vs donor on this corpus (extra hash-table coverage finds matches donor's fast path would have skipped). Treating that delta as a "win" is premature until a per-sequence Rust↔FFI diff harness shows whether donor leaves matches on the table or whether our extra inserts are pure overhead.
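A minimal sketch of that ramp, assuming the donor-style policy the audit describes (step grows by one every `kStepIncr` positions traveled); `probe` stands in for the real match search:

```rust
const K_STEP_INCR: usize = 64; // donor zstd_double_fast.c uses 256

/// Walk `data`, widening the scan stride the longer the search goes
/// without a hit. `probe` is an illustrative stand-in for the hash probe.
fn scan_with_ramp(data: &[u8], probe: impl Fn(usize) -> bool) -> Option<usize> {
    let mut ip = 0;
    let mut step = 1;
    let mut next_step = K_STEP_INCR;
    while ip < data.len() {
        if probe(ip) {
            return Some(ip);
        }
        if ip >= next_step {
            step += 1;                // grow the stride...
            next_step += K_STEP_INCR; // ...every K_STEP_INCR positions traveled
        }
        ip += step;
    }
    None
}
```

With `K_STEP_INCR = 64` the stride reaches a given width four times sooner than donor's 256, which is exactly the coverage-for-speed trade the audit calls out.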
Deferred (follow-up PRs)
- decompress-side donor parity (`FSE_decompress_usingDTable` and `HUF_decompress*`).
- medium/large memory: `compare_ffi_memory.rs` to identify the source of the residual gap before further work.
- small-input per-frame setup: after the `&'static` refactor the small-input bench is 1.83× FFI; the remaining ~2.8 µs is `CompressedBlockScratch::default()` Vec init + `MatchGeneratorDriver` Simple-variant init that resets to the level's actual backend on first `compress()` + per-frame header/hash setup. Each needs its own profile + measurement loop.
- rep1-only simplification for the fast/row/lazy paths, analogous to dfast.
- `simd_copy.rs` (~1% decompress win).
- `block_content_buffer` 0-fill skip via `set_len` + `read_exact` (~1% decompress win; a safe variant is sketched below).
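On that last deferred item: the PR proposes `set_len` + `read_exact`. The sketch below is a safe alternative with the same zero-fill-skipping effect, deliberately substituting `Read::take` + `read_to_end` (since `set_len` over uninitialized bytes needs care); `read_block` and `block_size` are illustrative names:

```rust
use std::io::Read;

/// Read up to `block_size` bytes without pre-zeroing the buffer:
/// `read_to_end` appends into the reserved spare capacity directly.
fn read_block(src: impl Read, block_size: u64) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::with_capacity(block_size as usize);
    // `take` caps the read; the Vec's capacity is filled, never 0-initialized.
    src.take(block_size).read_to_end(&mut buf)?;
    Ok(buf)
}
```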
Verification gates

- `cargo nextest run --lib --features hash,std,dict_builder` — 481 passed, 0 failed (includes the new `interop_7_byte_input_does_not_oob_in_dfast_fast_loop` regression).
- `cargo clippy --all-targets --features hash,std,dict_builder` — 0 warnings.
- `cargo fmt --check` — clean.
- `level22_sequences_match_donor_on_corpus_proxy` — donor-parity canary at level 22 (untouched by this PR).
- `huff0::level22_emitted_literal_sections_are_accepted_by_donor_huf_reader` — donor reader-acceptance canary.
- roundtrip suites (`roundtrip_compressible_data_1000_iterations`, `roundtrip_random_data_1000_iterations`, `roundtrip_streaming_api_1000_iterations`) — all green.
- `interop` smoke job — now passes after the `HASH_READ_SIZE` guard fix; previously failed on the 7-byte input.
Test plan

- `repcode_candidate_shared` unchanged for lazy/opt callers
- `cargo bench --bench compare_ffi` re-run on a fresh checkout; numbers match the table above