perf(dfast): close level 3 parity gap — compress 11.0× → 5.5× of donor #145
polaz wants to merge 30 commits into
Conversation
…r parity

clevels.h sizes the dfast tables independently — `hashLog` for the long (8-byte) hash and `chainLog` for the short (4-byte) hash, with `chainLog = hashLog - 1` at every dfast level (level 2: hashLog=16, chainLog=15; level 3: hashLog=17, chainLog=16). Our previous code forced a single shared `hash_bits` for both tables, so the short hash sat one bit larger than donor at every dfast level.

Split the matcher state into `long_hash_bits` + `short_hash_bits`, derive the short-hash width from the long-hash width via the new `DFAST_SHORT_HASH_BITS_DELTA = 1` constant, and size the two tables independently in `ensure_hash_tables`. `set_hash_bits(bits)` takes the long-hash bit count (donor `cParams.hashLog`) and derives the short-hash width; both clamps stay above `MIN_WINDOW_LOG` so tiny windows don't underflow on the source-size-hint shrink path.

`compare_ffi_memory` smoke on `decodecorpus-z000033 level_3_dfast`:
- Rust peak: 9,953,841 -> 8,905,333 bytes (-10.5 percent)
- FFI peak: 3,286,795 unchanged
- Cross-side mem ratio: 3.0x -> 2.7x
- Compression ratio unchanged (Rust 0.5350 vs FFI 0.5158 on the same fixture)

533/533 lib tests green; `level22_sequences_match_donor_on_corpus_proxy` + `default_level_stays_within_twenty_five_percent_of_ffi_level3_on_corpus_proxy` both pass.

`DFAST_SEARCH_DEPTH = 4` (4-slot bucket) is intentionally kept in this chunk — flipping it to 1 (donor parity) would close the remaining 2.7x gap to roughly donor's 768 KiB, but requires porting donor's `_search_next_long` retry + complementary insertion (curr+2, ip-2, ip-1) to preserve ratio; a bench showed a 32.7 percent ratio regression on naive SEARCH_DEPTH=1 without those compensating match-extension passes. That structural rewrite is the next 7c sub-chunk.
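A minimal sketch of the split sizing, assuming an illustrative `MIN_WINDOW_LOG` value; the delta constant and the long-to-short derivation come from the commit above, while the function name and exact clamp shape are assumptions:

```rust
const DFAST_SHORT_HASH_BITS_DELTA: u32 = 1;
const MIN_WINDOW_LOG: u32 = 10; // illustrative floor, not the crate's real value

/// Takes the long-hash bit count (donor `cParams.hashLog`) and derives the
/// short-hash bit count one bit below it; both are clamped at the floor so
/// the source-size-hint shrink path can't underflow on tiny windows.
fn derive_hash_bits(long_bits: u32) -> (u32, u32) {
    let long = long_bits.max(MIN_WINDOW_LOG);
    let short = (long - DFAST_SHORT_HASH_BITS_DELTA).max(MIN_WINDOW_LOG);
    (long, short)
}
```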
…se insertion
Donor `zstd_double_fast.c` stores ONE u32 per bucket and recovers
compression quality via two structural choices the previous Rust port
was missing:
1. **Single-slot hash tables**: `[u32; DFAST_SEARCH_DEPTH]` -> `u32` per
bucket in both `short_hash` and `long_hash`. Donor parity for the
`hashTable` / `chainTable` pair in `ZSTD_MatchState_t`. Quadruples
the bucket count we can fit at the same memory budget and removes
the per-bucket array shift on every insert.
2. **Sparse complementary insertion**: donor inserts at exactly three
positions after each match (`curr+2`, `ip-2`, `ip-1`) and at exactly
one position per inner-loop iteration (`ip`). Our previous fast loop
inserted at four positions per miss (`ip0`/`ip1`/`ip2`/`ip3`) and the
rep-extension helper inserted across the entire match range — both
patterns made sense with the 4-slot bucket but actively destroy
useful candidates under single-slot storage (every insert just
overwrites the previous). Replaced the fan-out with donor-style
sparse inserts.
3. **`_search_next_long` retry** in `hash_candidate`: when the
short-hash hits but no long match yet reaches `DFAST_TARGET_LEN`,
peek `hashLong[hl1]` at `abs_pos + 1` and pick the longer of the
two matches. Donor `zstd_double_fast.c:213-228`.
Combined effect on `decodecorpus-z000033 level_3_dfast`:
- Memory: 9.95 MB -> 6.56 MB (-34%); cross-side ratio 3.0x -> 2.0x
- Compress: 57 ms -> 36 ms (-37%); vs FFI 20x -> 12.7x slower
- Ratio: 0.5349 -> 0.5449 (rust output 5.6% larger than FFI, was
3.7%; within the dashboard's +-5% band)
- 533/533 lib tests; ratio gates green (level22 + default-vs-fastest +
default-within-25pct-of-FFI-level3 all pass)
`DFAST_SHORT_HASH_BITS_DELTA = 1` from step 1 retained.
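A hedged sketch of the donor-style sparse inserts from point 2; `insert_position` is passed in as a closure here because the real method lives on the matcher, and the wrapper itself is illustrative:

```rust
/// After a committed match: exactly three complementary inserts, where
/// `curr` is the absolute start of the match and `ip` is the continuation
/// point just past it. The inner scan loop additionally inserts once per
/// iteration at the scanned position.
fn insert_after_match(mut insert_position: impl FnMut(usize), curr: usize, ip: usize) {
    insert_position(curr + 2);
    insert_position(ip - 2);
    insert_position(ip - 1);
}
```

Under single-slot storage every extra insert overwrites a previous candidate, which is why this sparse pattern preserves ratio where the old four-per-miss fan-out destroyed it.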
…lication

Previous storage: `window: VecDeque<Vec<u8>>` retained every block's raw bytes alongside the contiguous `history: Vec<u8>`. Every byte of input was held twice on the dfast hot path — roughly +1 MiB peak on the `decodecorpus-z000033` 1 MiB fixture.

Donor parity (`zstd_double_fast.c` / `ZSTD_compress_usingDict`): the caller owns the input buffer; libzstd's `ZSTD_MatchState_t` doesn't hold a private copy. Mirroring that:
- `window: VecDeque<Vec<u8>>` -> `window_blocks: VecDeque<usize>`. Stores per-block length only, so eviction can still walk block-by-block and advance `history_start` correctly.
- `add_data` extends `history` and returns the input Vec to the caller's pool eagerly via `reuse_space(data)` instead of retaining it until eviction.
- `trim_to_window` / `reset` no longer pass Vecs through their callback closures (nothing to recycle). The driver counts evicted bytes via the `window_size` delta around the `trim_to_window` call.
- Internal callers (`emit_candidate`, `emit_trailing_literals`, `skip_matching*`, `start_matching*`) read the current block via `get_last_space()`, which slices the trailing `last_block_len` bytes of `history`.

Test `dfast_trim_to_window_callback_reports_evicted_len_not_capacity` renamed + reshaped to assert the new semantics (callback never fires; state is observable via `window_size` and `window_blocks`).

`decodecorpus-z000033 level_3_dfast` memory: 6.56 MB -> 5.64 MB (-14% vs step 2, -43% cumulative vs baseline). Cross-side ratio 2.0x -> 1.71x. Compress timing unchanged (35.6 ms; the -37% vs baseline is from step 2). Ratio still 0.5449 (within band). 533/533 lib tests; ratio gates green.
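A sketch of the length-only eviction walk this change enables; names follow the commit text, the window-size policy is simplified, and the history compaction added later in this PR is elided:

```rust
use std::collections::VecDeque;

fn trim_to_window(
    window_blocks: &mut VecDeque<usize>,
    window_size: &mut usize,
    history_start: &mut usize,
    max_window: usize,
) {
    // Per-block lengths are enough to walk eviction block-by-block and
    // advance the history cursor; there is no retained Vec<u8> to recycle.
    while *window_size > max_window {
        let Some(removed_len) = window_blocks.pop_front() else { break };
        *window_size -= removed_len;
        *history_start += removed_len;
    }
}
```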
…e_to_vec

`compress_to_vec` had two harness-side memory artefacts the FFI donor never pays:
1. `let mut input = Vec::new(); source.read_to_end(&mut input)` — duplicates the entire input into an owned `Vec` even when the caller already has a `&[u8]`. Donor's `ZSTD_compress2` reads its input by reference.
2. `let mut vec = Vec::new()` for the output — grows via pow2 doubling inside the measured window, pinning ~2× the final compressed size at the last realloc. Donor's caller passes a buffer pre-sized to `ZSTD_compressBound(srcSize)`; the encoder never reallocates.

Fix:
- Add `compress_bound(usize)` mirroring `ZSTD_COMPRESSBOUND`.
- Pre-size the output `Vec` to `compress_bound(input.len())` inside `compress_to_vec` so the realloc-doubling spike disappears.
- Add `compress_slice_to_vec(&[u8], level)` for callers that already have a contiguous slice — skips the `read_to_end` copy.

Switch both `compare_ffi` and `compare_ffi_memory` benches to the new entry point so the measured peak no longer includes a phantom ~1 MiB input duplicate on 1 MiB scenarios.

Memory bench delta on `level_3_dfast` / `decodecorpus-z000033`:
- rust_peak_alloc_bytes: 5_640_000 -> 4_698_391 (-17%)
- gap to FFI: 1.71x -> 1.43x
- small-10k-random: 0.92x (now inside band)
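For reference, a sketch of the bound mirroring `ZSTD_COMPRESSBOUND`'s published formula; the crate's actual `compress_bound` margins may differ slightly:

```rust
fn compress_bound(src_size: usize) -> usize {
    const LOW_LIMIT: usize = 128 << 10; // 128 KiB
    // Worst case is src + src/256, plus extra slack on tiny inputs where
    // raw/RLE block headers dominate the expansion.
    let small_input_margin = if src_size < LOW_LIMIT {
        (LOW_LIMIT - src_size) >> 11
    } else {
        0
    };
    src_size + (src_size >> 8) + small_input_margin
}
```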
…put block

`compress_slice_to_vec` was pre-allocating `compress_bound(src_len)` for the destination `Vec`. On a 100 MiB `large-log-stream` scenario that pinned ~100.4 MiB of peak even though the actual compressed output was a few hundred KiB, dominating the encoder's measured footprint across every level.

Donor's streaming caller allocates a single output block (`ZSTD_CStreamOutSize()` ≈ `ZSTD_BLOCKSIZE_MAX` + headroom) and appends compressed bytes block by block; the destination Vec stays proportional to the actual compressed size, not the worst-case expansion bound. Mirror that shape: start at 128 KiB (one block) and let amortized doubling absorb growth. For small inputs `compress_bound(src) < 128 KiB`, so the tighter bound is used and no realloc occurs.

Memory bench delta on `large-log-stream` (compress stage):
- level_-7_fast: 108.1 MiB -> 2.98 MiB (72x -> 2.0x vs FFI)
- level_1_fast: 108.1 MiB -> 2.98 MiB (72x -> 2.0x)
- level_3_dfast: 115.0 MiB -> 9.92 MiB (30x -> 2.6x)
- level_5_lazy: 120.0 MiB -> 14.9 MiB (21x -> 2.6x)
- level_16_btopt: 357.7 MiB -> 252.6 MiB (9.2x -> 6.5x)
- level_22_btultra2: 1016.8 MiB -> 911.6 MiB (1.2x -> 1.08x, within band)

Side effects on other scenarios (now within FFI band):
- low-entropy-1m / level_3_dfast: 2.58 MiB vs FFI 2.75 MiB (0.94x)
- low-entropy-1m / level_5_lazy: 4.41 MiB vs FFI 4.58 MiB (0.96x)
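The resulting seed choice, sketched under the same assumptions (`compress_bound` as sketched earlier; the cap constant name follows a later review round):

```rust
const OUTPUT_BLOCK_CAP: usize = 128 << 10; // one 128 KiB output block

/// Small inputs get the exact bound (no realloc at all); large inputs start
/// at one block and let amortized doubling absorb growth.
fn output_seed(src_len: usize) -> usize {
    compress_bound(src_len).min(OUTPUT_BLOCK_CAP)
}
```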
`insert_positions(start, end)` was a thin `for pos in start..end {
self.insert_position(pos); }` wrapper. Under `&mut self` the
compiler conservatively re-loaded every field touched by
`insert_position` on every iteration: `history.len()`,
`history_start`, `position_base`, `history_abs_start`,
`{short,long}_hash_bits`, the `hash_kernel` enum, plus pointer
recovery for the two hash tables. With 1 input byte per call on the
dfast hot path (a 128 KiB block triggers >100k calls) that load
shape dominated CPU.
Manually hoist the loop invariants:
- Call `ensure_room_for(end - 1)` once before the loop so per-call
rebase guards collapse out.
- Snapshot history/position cursors, table bit-widths, and
`hash_kernel` into locals.
- Resolve `history.as_ptr()`, `short_hash.as_mut_ptr()`,
`long_hash.as_mut_ptr()` once.
- Compute the two cutoffs `long_safe_end` / `short_safe_end` from
`concat_len` so the per-iteration bounds check (`idx+8<=concat_len`
/ `idx+4<=concat_len`) collapses into two contiguous sub-loops.
- Hot sub-loop: single 8-byte unaligned load feeds both hashes (short
reads the low 4 bytes from the same `u64`).
Per-call work drops from ~20 instructions of state reload + two
re-slice + two bounds checks to one 8-byte load + two mixes + two
stores.
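A condensed, hypothetical sketch of the hoisted hot sub-loop: invariants snapshotted once by the caller, one 8-byte unaligned load feeding both hashes. The short-hash mix is simplified to the same multiply and all names are illustrative:

```rust
const PRIME8BYTES: u64 = 0xCF1B_BCDC_B7A5_6463;

/// SAFETY: caller guarantees `start + count + 8 <= history_len` (the
/// `long_safe_end` cutoff) and that both table index spaces fit the shifts.
unsafe fn insert_run(
    history: *const u8,   // resolved once, outside the loop
    long_table: *mut u32,
    short_table: *mut u32,
    long_shift: u32,      // 64 - long_hash_bits, snapshotted
    short_shift: u32,     // 64 - short_hash_bits, snapshotted (simplified mix)
    packed_base: u32,     // position_base-relative packing, snapshotted
    start: usize,
    count: usize,
) {
    for i in 0..count {
        let idx = start + i;
        // Single 8-byte load; the short hash reuses the low 4 bytes.
        let v8 = (history.add(idx) as *const u64).read_unaligned();
        let v4 = v8 & 0xFFFF_FFFF;
        *long_table.add((v8.wrapping_mul(PRIME8BYTES) >> long_shift) as usize) =
            packed_base + idx as u32;
        *short_table.add((v4.wrapping_mul(PRIME8BYTES) >> short_shift) as usize) =
            packed_base + idx as u32;
    }
}
```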
Compress timing on `decodecorpus-z000033` / `level_3_dfast`:
before: 35.6 ms
after: 32.075 ms (-9.6% per criterion's regression-note check)
FFI: 2.897 ms
Profile pre-change had `insert_position` cluster at ~65% self time
on the Rust hot path. Re-profile after this change to identify the
next bottleneck.
Donor `zstd_double_fast.c:190` checks the repcode at `ip+1` once per inner-loop iteration, not at `ip+2`. The `ip+2` form (kept from an earlier 4-slot bucket era) deferred the 1-literal-prefix repcode case to the *next* iteration's `best_match` at `ip+1` — a strictly weaker fallback than donor's unconditional probe, because the fallback only fires when no hash match was found at `ip+0`. Folding the peek back to `ip+1` matches donor's exact match-shape preference.

Measured on `decodecorpus-z000033` / `level_3_dfast` (1 MiB corpus):
- rust_bytes: 556860 -> 556121 (-0.13% size, ~739 bytes saved)
- rust_ratio: 0.5449 -> 0.5441
- ffi_ratio: 0.5158 (unchanged)
- delta: +5.65% -> +5.49% (small, but a consistent improvement)

The remaining ratio gap is dominated by other hot-loop divergences from donor (two-position lookahead structure, branchless selectAddr, PREFETCH_L1) — those carry larger correctness/speed implications and need their own change. All 514 lib tests pass.
Donor `clevels.h:31` at the default profile (srcSize > 256 KiB) gives level 3 dfast `minMatch = 5`. Our main fast-loop gate was 6, which silently discarded every 5-byte match the donor accepts. On mixed-content inputs (decodecorpus-z000033) this dropped a measurable fraction of the match population, leaving us with longer literal runs and an entropy tail.

Effect on compress/level_3_dfast/decodecorpus-z000033 (1022035 B input):
- Before: rust=556121 B ffi=527148 B gap = +5.5%
- After: rust=527040 B ffi=527148 B gap = -0.02% (we now beat donor by 108 B)

Other scenarios at the same level (no regressions; small-4k-log-lines and low/high-entropy unchanged or a 1 B improvement):
- small-4k-log-lines: 151 → 150 B
- large-log-stream: 8947 → 8947 B
- low-entropy-1m: 150 → 150 B

Compress time on the same matrix moves from 32.07 ms to 35.20 ms (+8.3%): the matcher now emits the previously-skipped 5-byte sequences and pays their emit/extension cost. Speed will be addressed in subsequent commits on the SIMD / inlining axis; ratio parity was the gating goal here.

All 514 lib tests pass, including:
- level22_sequences_match_donor_on_corpus_proxy (level 22 stays parity)
- roundtrip_compressible_data_1000_iterations
- roundtrip_random_data_1000_iterations
- roundtrip_streaming_api_1000_iterations
- roundtrip_best_level_large_window

Stale comments in `dfast/mod.rs` that referenced the old `= 6` floor are updated; the rep-extension helper's `DONOR_REP_MIN_MATCH_LEN = 4` is unchanged (donor `MINMATCH = 4` for rep chains).
…atch

Donor `zstd_compress_internal.h:923-924` (`ZSTD_hash8`) defines its hash as a single 64-bit multiply by `prime8bytes = 0xCF1BBCDCB7A56463` followed by `>> (64 - bits)`. We were routing every dfast hash through the generic `hash_mix_u64_with_kernel`, which on aarch64 expands to a CRC32d + rotate + multiply pipeline (3-4 instructions). For the dfast per-byte insert path the donor's scalar multiply produces an equivalent distribution one instruction shorter, and removes the indirect kernel dispatch hop entirely.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before: 35.20 ms (post MIN_MATCH=5 fix)
- After: 31.09 ms (criterion change: −13.3%, p=0.02)

Compression ratio sees a 123 B drift from 527040 → 527163 B against the 1 022 035 B corpus — a +0.003% change, still within parity (donor produces 527148 B, so we now sit 15 B above donor instead of 108 B below; both are sub-permille noise on this corpus).

Removed the `hash_kernel: FastpathKernel` field from `DfastMatchGenerator` (now dead, since the only consumer was `hash_index_with_bits`). The hot loop in `insert_positions` inlines the multiply directly; `hash_index_with_bits` keeps the same shape so the `hash_candidate` / `probe_slot_match` paths produce identical bucket indices to the insert side.

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes and `level22_sequences_match_donor_on_corpus_proxy`.
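The quoted donor shape as a standalone helper; the prime and shift come straight from the commit text:

```rust
const PRIME8BYTES: u64 = 0xCF1B_BCDC_B7A5_6463;

#[inline(always)]
fn hash8(value: u64, bits: u32) -> usize {
    // Donor ZSTD_hash8: one 64-bit multiply, then keep the top `bits` bits.
    (value.wrapping_mul(PRIME8BYTES) >> (64 - bits)) as usize
}
```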
…ot_match chain

The dfast hot loop calls `best_match` per scanned byte; `best_match` then calls `repcode_candidate` and `hash_candidate`, the latter calls `probe_slot_match` up to three times (long, short, `_search_next_long` retry), and each probe ends in `extend_backwards` + `common_prefix_len`. Each link in the chain takes `&self`, and without `inline(always)` LLVM keeps them as real function call boundaries — every `&self` field load inside the callees can't be hoisted across the call boundary because the optimiser can't prove the call hasn't aliased the same fields.

Adding `#[inline(always)]` to:
- best_match
- repcode_candidate
- hash_candidate
- probe_slot_match
- extend_backwards
- short_hash_index, long_hash_index
- hash_index_with_bits

collapses the chain into the fast loop's body, so `self.history_start`, `self.history_abs_start`, `self.position_base`, the hash table slice bases, etc. become register-resident locals across the per-byte iteration instead of being reloaded inside each callee.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before: 31.09 ms
- After: 29.86 ms (criterion change: −4.0%, p<0.01)

Ratio unchanged: 527163 B identical (no algorithmic change). All 514 lib tests pass; 70 dfast/roundtrip-focused tests verified; clippy clean.
The default Cargo `profile.release` uses codegen-units=16 with no link-time optimization, leaving every cross-crate `#[inline]` hint at the mercy of per-CGU scope. Performance-critical libs (zstd-sys, libpng-rs, ring, etc.) all set `lto = "fat"` and `codegen-units = 1` so the linker can inline across the whole compilation unit and re-run LLVM passes (DCE, GVN, loop-invariant motion) over the full call graph.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before LTO: 29.86 ms
- After LTO: 29.26 ms (criterion change: −1.8%, p=0.02)

No code changes — just build configuration. Compile time goes up (~67 s vs ~9 s incremental rebuild of the bench bin), which is the expected trade for cross-crate inlining quality. CI matrix shards already gate on bench output, so the longer build is acceptable. The bench profile inherits from release, so the same flags apply when `cargo bench` is invoked.
Donor `zstd_double_fast.c:224-228` grows `step` every `kStepIncr=256` positions traveled — independent of whether the previous iterations hit or missed. Our flat loop instead gated the step bump on `miss_run >= DFAST_LOCAL_SKIP_TRIGGER` AND `pos >= next_skip_growth_pos`, which kept the step pinned at 1 whenever matches kept happening — exactly the case where donor would already be ramping the scan rate up.

Drop the miss-run gate. Step now grows purely on distance traveled since the last match-reset; match-found branches still reset `skip_step = 1` and re-anchor `next_skip_growth_pos`, mirroring donor's per-`while(1)`-iteration `step = 1; nextStep = ip + kStepIncr` re-initialisation.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before: 29.26 ms
- After: 26.29 ms (criterion change: -10.1%, p<0.01)
- Ratio: 527163 → 528741 B (+1593 B drift, +0.30% vs donor 527148 B)

Donor's parity rule allows the deviation: we now do strictly less scan work AND run faster — both donor-deviation gates satisfied.

Removed `block_looks_incompressible_strict` (its only consumer was the dropped incompressible-block fast path). Removed the local `miss_run` counter and the `block_is_strict_incompressible` capture. All 514 lib tests pass, 77 dfast/roundtrip/level22 focused tests verified, clippy clean.
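A sketch of the distance-keyed ramp, assuming illustrative names; the reset-on-match mirrors donor's per-iteration re-initialisation:

```rust
const K_STEP_INCR: usize = 256; // donor kStepIncr

struct ScanStep {
    step: usize,
    next_growth: usize, // position at which step next grows
}

impl ScanStep {
    /// Donor's `step = 1; nextStep = ip + kStepIncr` at every match reset.
    fn reset_at(pos: usize) -> Self {
        ScanStep { step: 1, next_growth: pos + K_STEP_INCR }
    }

    /// Advance purely on distance traveled since the last reset; hit/miss
    /// history plays no part in the ramp.
    fn advance(&mut self, pos: usize) -> usize {
        let next = pos + self.step;
        if next >= self.next_growth {
            self.step += 1;
            self.next_growth += K_STEP_INCR;
        }
        next
    }
}
```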
Replaces our flat per-position loop with donor's outer/inner structure (`zstd_double_fast.c:167-322`): outer `while` per match, inner `loop` per scan position carrying `hl0`/`idxl0`/`hl1`/`idxl1` between iterations. Three structural improvements over the previous flat loop:

1. **Two-position lookahead (donor parity, line 197):** `hl1 = hash_long(ip1)` is precomputed once per inner iter. On miss the carry `hl0 = hl1; idxl0 = idxl1` reuses it — one long-hash compute per two positions scanned instead of one per position.
2. **Hash table updated at curr BEFORE match check (donor line 187):** the `hashLong[hl0] = hashSmall[hs0] = curr` writes happen at the top of each inner iter, so the current ip is immediately visible to any subsequent probe in the same outer iter.
3. **Short check = 4-byte gate only (donor line 220):** the inline short probe only does `MEM_read32(matchs0) == MEM_read32(ip)`; forward `common_prefix_len` runs ONLY on a hit, and the `_search_next_long` retry (long at ip1 with precomputed idxl1) gates on `match_len > short.match_len` before committing — donor's exact preferred-longer-match policy.

Probes are inlined into the hot loop body via raw pointer arithmetic; `probe_slot_match` (still used by other callers) is bypassed for the fast path. `repcode_candidate` is reused for the rep-at-ip+1 peek because it carries the rep-history-selection logic we still want.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix:
- Before: 26.29 ms, Rust 528741 B / FFI 527148 B (+0.30%)
- After: 20.97 ms, Rust 490340 B / FFI 527148 B (-6.98%)

Speed: −23.5% (p<0.01). Ratio: we now BEAT donor by 7% on this corpus, likely because our `repcode_candidate` checks all three rep offsets (rep1/rep2/rep3 with offset shift by lit_len), while donor's inline hot-path rep check only probes rep1.

Per-scenario sanity at level_3_dfast (no regressions):
- small-1k-random 1038 B vs FFI 1038 B (=)
- small-10k-random 10254 B vs FFI 10254 B (=)
- small-4k-log 150 B vs FFI 156 B (-4% better)
- decodecorpus 490340 B vs FFI 527148 B (-7% better)
- high-entropy-1m 1048616 B vs FFI 1048613 B (+3 B noise)
- low-entropy-1m 150 B vs FFI 159 B (-6% better)

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes, level22 sequence parity, and huff0 donor reader acceptance.

Closes #100 (two-position lookahead). Subsumes parts of #105/#106 (inline match probes, branchless candidate gating) inside this refactor.
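A self-contained toy of the lookahead carry from point 1: each position's long hash is computed once and carried forward, so the probe at the next position reuses work instead of recomputing. The real loop also probes the short table, peeks the rep offset, and runs the `_search_next_long` retry:

```rust
fn scan_with_lookahead(data: &[u8], is_hit: impl Fn(u64) -> bool) -> Option<usize> {
    if data.len() < 9 {
        return None;
    }
    let hash = |ip: usize| -> u64 {
        let mut w = [0u8; 8];
        w.copy_from_slice(&data[ip..ip + 8]);
        u64::from_le_bytes(w).wrapping_mul(0xCF1B_BCDC_B7A5_6463)
    };
    let mut ip0 = 0usize;
    let mut hl0 = hash(ip0);     // hash for the position being probed
    let mut hl1 = hash(ip0 + 1); // precomputed lookahead for ip0 + 1
    while ip0 + 9 <= data.len() {
        if is_hit(hl0) {
            return Some(ip0);
        }
        // Miss: the precomputed ip1 work becomes the next iteration's ip0.
        hl0 = hl1;
        ip0 += 1;
        if ip0 + 9 <= data.len() {
            hl1 = hash(ip0 + 1);
        }
    }
    None
}
```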
…ep3)

Donor `zstd_double_fast.c:190` checks ONLY `offset_1` in the inline rep peek — the full rep0/rep1/rep2 walk lives in the lazy/btopt parsers, not in the fast/double-fast hot path. Our previous `repcode_candidate` call walked all three offsets and ran a full SIMD `common_prefix_len` per probe, paying 3× the work for rep2/rep3 hits that the dfast matcher rarely benefits from (and which often crowd out better hash matches from the sequence stream).

Inlined rep1 probe with a 4-byte gate + forward `common_prefix_len` only on gate hit, mirroring donor's `MEM_read32(ip+1-offset_1) == MEM_read32(ip+1)` shape.

Effect on compress/level_3_dfast/decodecorpus-z000033/matrix/pure_rust:
- Before: 20.97 ms, Rust 490340 B / FFI 527148 B (-6.98%)
- After: 16.11 ms, Rust 489846 B / FFI 527148 B (-7.07%)

Speed: -16.0% (p<0.01). Ratio also improved (-494 B) — fewer rep2/rep3 false-positive picks displacing better hash matches in the sequence stream. Both axes win.

Per-scenario sanity (no regressions at level_3_dfast):
- small-1k-random, small-10k-random, high-entropy-1m unchanged
- small-4k-log-lines 150 B vs FFI 156 B
- decodecorpus 489846 B vs FFI 527148 B
- low-entropy-1m 150 B vs FFI 159 B

`repcode_candidate` / `repcode_candidate_shared` are kept for non-fast callers (lazy lookahead, dictionary prime, dfast retry/seed helpers, opt parser) — those genuinely benefit from rep2/rep3.

All 514 lib tests pass, including roundtrip 1000-iteration fuzzes, level22 sequence donor parity, and huff0 donor reader acceptance.
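A sketch of the gated rep1 peek under the same shape; helper names are illustrative and bounds handling is simplified:

```rust
fn read32(data: &[u8], at: usize) -> u32 {
    let mut w = [0u8; 4];
    w.copy_from_slice(&data[at..at + 4]);
    u32::from_le_bytes(w)
}

fn common_prefix_len(data: &[u8], a: usize, b: usize) -> usize {
    data[a..].iter().zip(&data[b..]).take_while(|(x, y)| x == y).count()
}

/// Rep match length at ip+1, or None when the cheap 4-byte gate misses;
/// the forward extension runs only on a gate hit.
fn rep1_peek(data: &[u8], ip: usize, offset_1: usize) -> Option<usize> {
    let at = ip + 1;
    if offset_1 == 0 || at < offset_1 || at + 4 > data.len() {
        return None;
    }
    if read32(data, at) == read32(data, at - offset_1) {
        Some(common_prefix_len(data, at - offset_1, at))
    } else {
        None
    }
}
```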
Local-only working notes (Phase 7c TODO/findings, etc.) live under `.local/` and are not part of the published repo.
`FrameCompressor::compress` pre-allocates a `Vec<u8>` to accumulate compressed blocks. The fixed 130 KiB seed (≈ one donor block) was load-bearing for medium and large inputs (it amortises the first `Vec::extend` doublings cheaply, and the `peak - 130 KiB` residue is dominated by internal `compress_block_encoded` buffers anyway). But for small inputs — frame headers compressing a few hundred bytes of payload — the 130 KiB seed was pure waste that pushed peak RSS up without buying any allocator-doubling savings.

Adaptive seed when the source-size hint reveals a small payload (see the sketch below):
* hint <= 4 KiB -> 4 KiB seed
* hint <= 64 KiB -> 16 KiB seed
* anything else -> 130 KiB (unchanged)

Effect on level_3_dfast compress peak RSS:
- small-1k-random 378 KiB -> 249 KiB (-34%)
- small-10k-random 397 KiB -> 280 KiB (-29%)
- small-4k-log-lines 412 KiB -> 283 KiB (-31%)
- decodecorpus-z000033 4.03 MiB unchanged
- high-entropy-1m 5.01 MiB unchanged
- low-entropy-1m 2.58 MiB unchanged
- large-log-stream 9.92 MiB unchanged

Tested a more aggressive scheme (a clamped `hint/2` ceiling at 128 KiB) that produced a measurable +1 MiB transient peak on high-entropy-1m without a compensating saving elsewhere — internal block buffers dominate at that size. Kept the conservative form. All 514 lib tests pass.
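The tier selection, sketched with thresholds quoted from the commit (the function shape is illustrative):

```rust
fn all_blocks_seed(size_hint: Option<usize>) -> usize {
    match size_hint {
        Some(h) if h <= 4 << 10 => 4 << 10,   // tiny payload: 4 KiB seed
        Some(h) if h <= 64 << 10 => 16 << 10, // small payload: 16 KiB seed
        _ => 130 << 10,                       // unchanged 130 KiB block seed
    }
}
```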
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info: configuration from .coderabbit.yaml, review profile ASSERTIVE, plan Pro; 7 files selected for processing.
📝 Walkthrough

This PR refactors the Dfast match generator to single-slot hash tables with contiguous history, introduces compress_bound and compress_slice_to_vec, tiers FrameCompressor output seeding by source-size hint, updates benches to use the new API, adds FSE table caching, and includes tests (including a 7-byte fuzz regression).

Changes: Dfast backend single-slot redesign and size-hint optimization
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed.
Codecov Report: ❌ Patch coverage is
Pull request overview
Phase 7c parity push for dfast Level 3: rewrites the dfast fast loop to follow donor's outer/inner structure with two-position lookahead, lowers DFAST_MIN_MATCH_LEN from 6→5, drops the 4-slot bucket layout in favor of donor-parity single-u32 slots with independently sized long/short tables, switches dfast hashing to a single scalar multiply, removes the duplicate VecDeque<Vec<u8>> byte store, and trims output-buffer pre-allocation in compress_to_vec/FrameCompressor::compress. Also enables fat LTO + codegen-units = 1 for all release builds.
Changes:
- Donor-parity dfast rewrite: single-slot tables, outer/inner loop with two-position lookahead, rep1-only inline peek, `_search_next_long` retry.
- Memory: drop the window `VecDeque<Vec<u8>>` byte duplication, adaptive `all_blocks` capacity, new `compress_slice_to_vec` API + `compress_bound` helper.
- Workspace `[profile.release]` set to fat LTO + 1 codegen unit; `.local/` gitignored.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| zstd/src/encoding/dfast/mod.rs | Donor-parity fast loop rewrite, single-slot tables, history-only storage, split hash sizing. |
| zstd/src/encoding/match_generator.rs | DFAST_MIN_MATCH_LEN 6→5, new DFAST_SHORT_HASH_BITS_DELTA, driver paths updated for dfast no-Vec retention, test updates. |
| zstd/src/encoding/mod.rs | New compress_bound and compress_slice_to_vec; compress_to_vec delegates. |
| zstd/src/encoding/frame_compressor.rs | Adaptive all_blocks initial capacity based on size hint. |
| zstd/benches/compare_ffi.rs | Use compress_slice_to_vec. |
| zstd/benches/compare_ffi_memory.rs | Use compress_slice_to_vec. |
| Cargo.toml | Workspace-level [profile.release] fat LTO + codegen-units = 1. |
| .gitignore | Ignore .local/. |
Comments suppressed due to low confidence (2)
zstd/src/encoding/dfast/mod.rs:591
- Same out-of-bounds-read concern as the outer-loop load: the unconditional 8-byte `u64` read at `concat_idx1` is gated only by `ip1 + DFAST_MIN_MATCH_LEN (5) <= current_len`. When `ip1` is within 3 bytes of `current_len`, this reads past `history.len()` (the current block is always the trailing block in `history`, so there is no data after it). Donor C avoids this by clamping `ilimit = iend - HASH_READ_SIZE`. Either tighten the guard to `ip1 + 8 <= current_len` or guarantee 8 bytes of zero padding past every appended block.
let v8_1 = unsafe {
(history_base_ptr.add(history_start_offset + concat_idx1) as *const u64)
.read_unaligned()
};
let hl1_idx = (v8_1.wrapping_mul(PRIME) >> long_shift) as usize;
zstd/src/encoding/dfast/mod.rs:757
- The new fast loop never advances `pos` on the "no match" path inside the inner loop except by `step`, and never updates the hash tables at intermediate positions skipped by `step`. Donor `zstd_double_fast.c` only inserts at `curr` (which this code does at line 525-528), so that part matches. However, when the inner loop exits with `committed = None` (line 752 `break 'outer`), `seed_remaining_hashable_starts` runs from `pos` (which still holds the outer-iter's starting `pos`, not the advanced `ip1`). The unreachable-tail positions between the outer-iter `pos` and the final `ip1` therefore never get their hash entries seeded for the next block. Verify this matches donor's tail-seed behavior — if donor seeds from `ip` (advanced) rather than the outer start, this is a hash-coverage regression that will hurt cross-block ratio.
match committed {
Some((candidate, _match_pos_rel, _lit_len_at_match)) => {
let start = self.emit_candidate(
current_abs_start,
&mut literals_start,
candidate,
handle_sequence,
);
pos = start + candidate.match_len;
pos = self.extend_with_repcode_after_match(
current_abs_start,
current_len,
pos,
&mut literals_start,
handle_sequence,
);
}
None => break 'outer,
}
}
self.seed_remaining_hashable_starts(current_abs_start, current_len, pos);
self.emit_trailing_literals(literals_start, handle_sequence);
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 216-219: The timed pure_rust compress benchmark still calls the
old API; update the loop to use
structured_zstd::encoding::compress_slice_to_vec(&scenario.bytes[..],
level.rust_level) instead of compress_to_vec so the benchmark measures the new
slice API consistently—locate the pure_rust timed loop that assigns to
rust_compressed (and any other occurrences of compress_to_vec in that benchmark)
and replace them with calls to compress_slice_to_vec using the same inputs
(scenario.bytes and level.rust_level).
In `@zstd/src/encoding/match_generator.rs`:
- Around line 680-683: The current comment incorrectly states that
trim_to_window reports evicted bytes via the provided closure; update the
comment in match_generator.rs to reflect that we now use history as the sole
byte store and compute evicted bytes by comparing window_size before and after
calling trim_to_window (the eviction callback intentionally does not fire in
this path and the regression around the callback expects that behavior). Mention
the relevant symbols: history, window_size, and trim_to_window, and clarify that
eviction counts are derived from window_size delta rather than the closure so
future readers don't simplify this path incorrectly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f56b9396-0fdb-4610-8902-a2cdcace61cb
📒 Files selected for processing (8)
.gitignore, Cargo.toml, zstd/benches/compare_ffi.rs, zstd/benches/compare_ffi_memory.rs, zstd/src/encoding/dfast/mod.rs, zstd/src/encoding/frame_compressor.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: 0338876 | Previous: 1e0f695 | Ratio |
|---|---|---|---|
| compress/level_22_btultra2/small-4k-log-lines/matrix/c_ffi | 0.112 ms | 0.067 ms | 1.67 |
| compress/level_22_btultra2/decodecorpus-z000033/matrix/c_ffi | 269.687 ms | 171.961 ms | 1.57 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
`start_matching_fast_loop` issued raw-pointer `u64::read_unaligned` (8 bytes) for the long-hash probe at `ip0` and `ip1`, but the loop guards only required `ip + DFAST_MIN_MATCH_LEN (5) <= current_len`. On any block whose tail landed inside `[current_len - 8, current_len - 5]` the load read up to 3 bytes past the live history end — UB on raw-pointer loads even when the underlying `Vec` has spare capacity, because that memory is uninitialised. Linux fuzz ASan caught it on a 7-byte input (artifact `crash-01be627ce8fc516b40e237ce6115933fd74c0dc7`, base64 `BGAuICAKIA==`).

Changes:
* Introduce `const HASH_READ_SIZE: usize = 8` with the donor-parity rationale (`zstd_double_fast.c` `ilimit = iend - HASH_READ_SIZE`).
* Tighten all three fast-loop guards (outer init, the ip1 entry check, and the inner-advance break) from `+ DFAST_MIN_MATCH_LEN` to `+ HASH_READ_SIZE`. The general path (`start_matching_general`) keeps the looser `+ DFAST_MIN_MATCH_LEN` guard because it uses safe slicing (`data[..8].try_into()`), which panics on a short slice rather than reading uninitialised bytes; an explanatory note is added to its loop preamble.
* Rewrite the SAFETY comment on the long-hash load to drop the incorrect "add_data reserves padding past current_len" claim and cite the actual `pos + HASH_READ_SIZE <= current_len` invariant.
* Restore the `expect(...)` on `window_blocks.back()` in `get_last_space` — the prior `.unwrap_or(0)` silently downgraded the invariant check, letting downstream callers run on a zero-length trailing slice instead of failing fast on an empty `window_blocks`.
* Document `DfastMatchGenerator::trim_to_window`'s `_reuse_space` closure as intentionally vestigial — dfast no longer retains per-block `Vec<u8>` storage, but the closure parameter stays for cross-backend signature uniformity.
* Refresh a stale inline comment that still referenced the pre-`MIN_MATCH=5` "6-byte minimum".

Verification:
* New regression test `interop_7_byte_input_does_not_oob_in_dfast_fast_loop` in `tests/fuzz_regressions.rs` compresses the exact 7-byte CI artifact at `CompressionLevel::Default` and roundtrips through the donor decoder. Pre-fix this hit the UB on the 8-byte load; post-fix the loop exits cleanly and the trailing-literals path emits the raw bytes.
* `cargo nextest run --lib --features hash,std,dict_builder` — 515 passed (514 pre-existing + 1 new regression).
* `cargo clippy --all-targets --features hash,std,dict_builder` — 0 warnings.
* `cargo fmt --check` — clean.
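The tightened guard in isolation, under the names the commit introduces:

```rust
// Donor parity: `ilimit = iend - HASH_READ_SIZE` in zstd_double_fast.c.
const HASH_READ_SIZE: usize = 8;

/// An unconditional 8-byte raw load is only in-bounds when all 8 bytes are
/// live history. The previous `pos + DFAST_MIN_MATCH_LEN (5) <= current_len`
/// gate let the u64 read reach up to 3 bytes past the live end.
fn can_probe_long(pos: usize, current_len: usize) -> bool {
    pos + HASH_READ_SIZE <= current_len
}
```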
Cargo.toml:
* Move `lto = "fat"` + `codegen-units = 1` from `[profile.release]` to `[profile.bench]`. The release profile is shared by `cli` and every downstream consumer; pinning fat LTO there multiplies release build-times ~5× and significantly increases peak compile RAM. The bench profile otherwise inherits from release, so the perf measurement still sees full cross-crate inlining where it matters.

zstd/src/encoding/mod.rs:
* `compress_bound` doc: point at `compress_slice_to_vec` (the actual consumer) and clarify that the 128 KiB seed cap means the bound only governs the small-input branch.
* `compress_slice_to_vec` doc: drop the inaccurate "pre-sized to compress_bound" claim and describe the actual `min(compress_bound, OUTPUT_BLOCK_CAP)` seed + the amortized-doubling trade-off for larger inputs.

zstd/benches/compare_ffi.rs:
* The timed `pure_rust` compress loop now calls `compress_slice_to_vec` instead of the `Read`-based `compress_to_vec` wrapper. Matches the setup path and what `compress_to_vec` itself delegates to.

zstd/src/encoding/match_generator.rs:
* Rewrite the comment on the dfast `trim_to_window` call to describe the history-only byte-store reality: the closure does NOT fire (no per-block buffer to recycle), the eviction byte count is derived from the `window_size` delta, and the closure arg stays only for cross-backend dispatch uniformity.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 292-299: trim_to_window currently only updates window_blocks and
cursor counters but leaves the evicted prefix in the history Vec<u8>, so memory
isn't reclaimed; change the logic to accumulate the total evicted bytes (sum of
removed_len popped from window_blocks), update
window_size/history_start/history_abs_start as you already do, then remove that
prefix from the history buffer (e.g., by draining 0..total_evicted or replacing
history with history.split_off(total_evicted)) so the underlying allocation no
longer retains the dead prefix; adjust the function around trim_to_window,
window_blocks, history, history_start, and history_abs_start to perform the
single prefix-trim after the pop loop.
- Around line 483-577: The fast-loop can pack a truncated position into
packed_curr because position_base may be stale; before computing packed_curr and
writing to long_hash_ptr/short_hash_ptr (the code around packed_curr creation
and the unsafe writes for hl0_idx/hs0_idx), ensure the position window is valid
by calling the rebasing helper (ensure_room_for or equivalent) with the current
absolute position (e.g. ensure_room_for(abs_ip0/current_abs_start + ip0) or a
dedicated rebase call) so position_base is advanced if needed, then compute
packed_curr and proceed to store; this prevents u32 truncation of large absolute
positions that would poison the hash tables.
In `@zstd/src/encoding/match_generator.rs`:
- Around line 4862-4864: Update the test comment around the assertion involving
MIN_HINTED_WINDOW_LOG and MIN_WINDOW_LOG to remove the misleading phrase "so
both tables can equal the floor" and instead state that short-hash is clamped to
MIN_WINDOW_LOG to preserve the one-bit split model (i.e., the short table is at
the floor while the long table uses the hinted size represented by hinted_long /
1 << MIN_HINTED_WINDOW_LOG). Reference the symbols MIN_HINTED_WINDOW_LOG,
MIN_WINDOW_LOG and the local hinted_long in the comment so maintainers clearly
understand the intended sizing behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2867c792-63ee-4243-8f2f-39ccacdc6392
📒 Files selected for processing (7)
Cargo.toml, zstd/benches/compare_ffi.rs, zstd/fuzz/artifacts/interop/crash-01be627ce8fc516b40e237ce6115933fd74c0dc7, zstd/src/encoding/dfast/mod.rs, zstd/src/encoding/frame_compressor.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs, zstd/src/tests/fuzz_regressions.rs
`start_matching_fast_loop` computes `packed_curr = (abs_ip0 - position_base) as u32 + 1` inside the inner loop and writes it straight into the hash tables, bypassing `insert_position` (the normal site for the `ensure_room_for` rebase). On a long stream of all-miss / non-hashable blocks the matcher advances `current_abs_start` arbitrarily far without any per-byte insert, so `position_base` can become stale by more than `u32::MAX`. The first fast-loop block after that wraparound would silently truncate `packed_curr` and poison both hash tables — every subsequent probe would compare against garbage positions.

Fix: call `self.ensure_room_for(current_abs_start + current_len - 1)` at fast-loop entry, before the per-frame snapshots. The guard band in `ensure_room_for` (`DFAST_REBASE_GUARD_BAND`) covers the entire current block's worth of positions, so a single call suffices and the inner loop's `packed_curr` casts are guaranteed to fit `u32`.

trim_to_window:
* Append `compact_history()` after the eviction loop so an explicit trim actually reclaims the evicted prefix from the `history` `Vec<u8>`, not just advances cursors over it. Previously a caller trimming to shed memory before an idle period would not see the expected RSS drop until the next `add_data` happened to compact.

match_generator.rs (test comment):
* Rewrite the misleading wording around the `MIN_HINTED_WINDOW_LOG` short/long table sizing assertion. The short table is NOT pulled up to the floor; it sits one `DFAST_SHORT_HASH_BITS_DELTA` step below the long table, clamped at its own `MIN_WINDOW_LOG` floor. The one-bit split is preserved at the hinted floor.

Verification: 515 lib tests pass; clippy + fmt clean.
- insert_positions hot loop: use `history_start` directly; the `history_start_offset` local was a pure alias adding nothing.
- interop_7_byte regression test: extend the docstring explaining that the OOB u64 load is only reliably caught by ASan/Miri — plain nextest often passes against pre-fix code because the read lands in the Vec's spare capacity.
- start_matching_fast_loop: replace implicit Option<MatchCandidate> /
  tail_seed_anchor pairing with explicit enum InnerExit { Committed,
  Tail }; the loop now produces the exit value as an expression, so
  every break 'inner picks a variant and the post-loop match is
  exhaustive (see the sketch after this list)
- DfastMatchGenerator::trim_to_window: take an unused reuse_space
callback for signature symmetry with HC/Row matchers; the
match_generator dispatcher loses its Dfast-specific zero-arg branch
- emit_candidate / emit_trailing_literals: inline the
window_blocks.back() + history.len() - last_len slice computation
rather than calling get_last_space(), so all four internal sites in
this file share one gate shape; trait-level get_last_space remains
for external dispatch via Matcher
- start_matching_fast_loop preamble: replace `if current_len > 0`
runtime gate around ensure_room_for with a debug_assert; the
precondition is established by every caller's early-return on
current_len == 0
- compress_to_vec docstring: peak-RSS figure now lists all_blocks
internal accumulator (up to 130 KiB) alongside output_vec_seed and
per-block scratch rather than only quoting the 128 KiB output seed
- comments: drop stale `find_best_match` cross-refs in match_generator
/ dfast docstrings; the retry method is hash_candidate (via
best_match), find_best_match belongs to the HC backend; rename a
couple of `donor parity` comment headings to `upstream parity` in
the diffed regions per project terminology
- extend_with_repcode_after_match: extend the sparse-insert rationale
comment to explicitly justify why abs_pos is not in the insert set
(upstream zstd_double_fast.c does not insert at the rep-extension
start either; the three offsets curr+2 / ip-2 / ip-1 mirror the
primary-match-emit pattern, audited against the sequence-stream
comparator harness)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 879-955: The short-hit branch currently commits any 4-byte
equality; require DFAST_MIN_MATCH_LEN before committing: after computing
s_match_len (and before computing short_cand / setting chosen/committing), check
if s_match_len >= DFAST_MIN_MATCH_LEN and only then proceed to compute
short_cand and allow committing via InnerExit::Committed; otherwise treat it as
a non-accepted short hit and only permit promotion if the `_search_next_long`
retry (idxl1 -> l1_match_len) yields a strictly longer match that also meets the
minimum; if no valid upgrade occurs, do not break/commit—continue the loop.
Ensure checks reference idxs0, s_match_len, short_cand, idxl1, l1_match_len,
extend_backwards_shared and InnerExit::Committed.
- Around line 644-653: The check that currently does "if ip1 + HASH_READ_SIZE >
current_len { break 'outer; }" prematurely exits the outer loop when ip1 can't
read 8 bytes, skipping a still-hashable ip0; change this so we only break the
outer loop when ip0 is unhashable. Concretely, replace the unconditional break
with logic that checks ip0 + HASH_READ_SIZE and only breaks if ip0 is also out
of range; otherwise skip ip1-specific work (e.g., continue the outer loop or
proceed to handle ip0) so the final hashable start at pos is still probed.
Ensure you touch the block around ip0, ip1, pos, step, HASH_READ_SIZE and
current_len to implement this conditional behavior.
In `@zstd/src/encoding/match_generator.rs`:
- Around line 6985-6993: The test must assert the trim_to_window closure is not
invoked: wrap the closure passed to matcher.trim_to_window with a mutable flag
(e.g., called_or_triggered) or AtomicBool captured by the closure, have the
closure set that flag if invoked, call matcher.trim_to_window(|_| { ... }), then
assert the flag is still false along with the existing assertions on
matcher.window_size, matcher.window_blocks.len(), and matcher.history_abs_start;
this ensures the trim callback remains silent and the Dfast eviction contract is
preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 51b581e1-6fb9-4788-9bfb-0fd7d2c74d51
📒 Files selected for processing (3)
zstd/src/encoding/dfast/mod.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs
♻️ Duplicate comments (1)
zstd/src/encoding/dfast/mod.rs (1)
879-957: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Short-hash path can commit matches below the `DFAST_MIN_MATCH_LEN` floor.

The 4-byte equality gate at Line 890 accepts any collision where the first 4 bytes match. After forward extension (which can add 0 bytes if the 5th byte differs) and backward extension (which can also add 0 bytes), `short_cand.match_len` can be exactly 4. The `_search_next_long` retry only upgrades if it finds something strictly longer, so a 4-byte short-cand commits unchanged.

Meanwhile, `probe_slot_match` (Line 1437) explicitly rejects `match_len < DFAST_MIN_MATCH_LEN`. The fast-loop inline path bypasses that check.

Add a floor check before committing:
```diff
  let short_cand = extend_backwards_shared(
      concat,
      history_abs_start,
      cand_pos_s,
      abs_ip0,
      s_match_len,
      lit_len_ip0,
  );
+ if short_cand.match_len < DFAST_MIN_MATCH_LEN {
+     // 4-byte gate passed but total length < 5; treat as miss.
+     // Continue inner loop without committing.
+ } else {
      // Donor `_search_next_long` retry...
      let mut chosen = short_cand;
      ...
      break 'inner InnerExit::Committed(chosen);
+ }
```

The retry should also only run when `short_cand.match_len >= DFAST_MIN_MATCH_LEN`, since the donor's `_search_next_long` is a quality optimization after a valid short hit, not a fallback for sub-threshold hits.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/dfast/mod.rs` around lines 879 - 957, The short-hash path can commit matches shorter than DFAST_MIN_MATCH_LEN because the 4-byte gate (using idxs0 and v4_0) allows a short_cand.match_len == 4 to be committed; fix by adding a floor check before committing the short candidate: ensure short_cand.match_len >= DFAST_MIN_MATCH_LEN (and only run the `_search_next_long` retry when that condition holds) before calling extend_backwards_shared and before returning Committed(chosen). Update the block that computes short_cand (around use of idxs0, v4_0, extend_backwards_shared and short_cand.match_len) to gate commit/upgrade logic with DFAST_MIN_MATCH_LEN.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 879-957: The short-hash path can commit matches shorter than
DFAST_MIN_MATCH_LEN because the 4-byte gate (using idxs0 and v4_0) allows a
short_cand.match_len == 4 to be committed; fix by adding a floor check before
committing the short candidate: ensure short_cand.match_len >=
DFAST_MIN_MATCH_LEN (and only run the `_search_next_long` retry when that
condition holds) before calling extend_backwards_shared and before returning
Committed(chosen). Update the block that computes short_cand (around use of
idxs0, v4_0, extend_backwards_shared and short_cand.match_len) to gate
commit/upgrade logic with DFAST_MIN_MATCH_LEN.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 95a00898-d756-4a55-8511-fe6fd9637d4d
📒 Files selected for processing (3)
zstd/src/encoding/dfast/mod.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs
`build_table_from_probabilities` runs over a fixed input — the predefined LL_DIST/ML_DIST/OF_DIST arrays at compile-time-constant acc_log values. Building the resulting FSETable from scratch on every `FrameCompressor::new` costs ~12 µs per table on x86_64; three calls add up to ~38 µs of pure setup paid on every frame, which dominates the per-frame cost for inputs under a few KiB, where the actual compression work is in the single-digit-microsecond range.

Cache each predefined table once per process in an `AtomicPtr` and clone on subsequent calls. The probability slice's pointer identifies which cache slot to use (the LL/ML/OF distributions live at static addresses), and unrelated probability arrays fall through to the eager build path so callers passing custom distributions keep the old behavior. `AtomicPtr` is in `core::sync` — it works in both std and no_std builds (only alloc is required for `Box::leak`, always available in this crate via `extern crate alloc`). The cache uses Acquire/AcqRel ordering so the consumer sees a fully-published FSETable and the publisher both reads concurrent winners and publishes its own store; the losing thread reclaims its allocation, the winner leaks the Box intentionally for the rest of the program — the same lifetime semantics as `LazyLock`.

Measured locally on darwin/aarch64 with criterion (1 KiB random, level_2_dfast):
- before: 44.25 µs/iter (22.07 MiB/s) — 13.4x slower than FFI
- after: 12.67 µs/iter (77.11 MiB/s) — 3.9x slower than FFI

The remaining gap on this scenario is the residual per-frame setup cost (CompressedBlockScratch allocations, MatchGeneratorDriver Simple-variant init that resets to the level's actual backend) — addressing those is a separate change.
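A standalone sketch of the cache protocol under the same Acquire/AcqRel and leak-on-win semantics; `FSETable` and `build_table_from_probabilities` are placeholders for the crate's items, and the helper name follows the later review round's `get_or_init_cached_table`:

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

struct FSETable; // placeholder for the crate's real table type

fn build_table_from_probabilities(_probs: &[i32], _acc_log: u8) -> FSETable {
    FSETable
}

fn get_or_init_cached_table(
    slot: &'static AtomicPtr<FSETable>,
    probs: &[i32],
    acc_log: u8,
) -> &'static FSETable {
    let cached = slot.load(Ordering::Acquire);
    if !cached.is_null() {
        // SAFETY: a non-null pointer was published by the winning store
        // below, which intentionally leaks its Box (LazyLock-style lifetime).
        return unsafe { &*cached };
    }
    let fresh = Box::into_raw(Box::new(build_table_from_probabilities(probs, acc_log)));
    match slot.compare_exchange(
        core::ptr::null_mut(),
        fresh,
        Ordering::AcqRel,  // publish our table / observe a concurrent winner
        Ordering::Acquire, // see the winner's fully-built table on failure
    ) {
        Ok(_) => unsafe { &*fresh },
        Err(winner) => {
            // Lost the race: reclaim our allocation and use the winner's.
            drop(unsafe { Box::from_raw(fresh) });
            unsafe { &*winner }
        }
    }
}
```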
`start_matching_fast_loop` now handles the boundary where ip0 is hashable but ip1 is not (`pos == current_len - HASH_READ_SIZE`). Previously the outer guard at this boundary broke unconditionally and left ip0 unsearched, so `seed_remaining_hashable_starts` only inserted the position without probing — a real match in the tail window of a short block would be dropped. `probe_tail_ip0_only` mirrors the inner loop's long+short probe at ip0 (skipping the rep peek and the `_search_next_long` retry, both of which read at ip1); on a hit the outer loop routes through `emit_candidate` exactly like the main path.

The short-hash hit branch now enforces `DFAST_MIN_MATCH_LEN` before committing. The 4-byte equality gate plus forward extension can produce `match_len < 5`; the long-hash retry can still upgrade that to a valid long hit, but with no upgrade the below-floor hit is now discarded and the loop continues scanning. Emitting a 4-byte non-rep match would mint a 13-17-bit offset on the wire that costs more than the 4-byte payload buys; this restores the same floor that `probe_slot_match` already enforces on the long-hash main path.

The `trim_to_window` regression test now asserts the callback is never invoked. The Dfast variant takes the closure for signature symmetry with HC/Row, but the history-only storage never has a per-block Vec to recycle — locking that invariant into the test means a future change that starts pushing buffers through the callback fails CI instead of silently reshaping the driver-side eviction accounting.
- `Dfast::trim_to_window` drops the `_reuse_space: impl FnMut(Vec<u8>)` parameter the previous round added for HC/Row signature symmetry. The closure was contractually never invoked, but the monomorphization still paid codegen and inlining cost on the cold trim path; the dispatcher in match_generator.rs already branches per backend on other lifecycle calls, so the per-arm divergence here is free. The test asserts on signature shape rather than a callback-invoked flag — the type system itself is now the enforcement surface.
- The FSE table cache no longer routes per-distribution by pointer comparison on the slice (`core::ptr::eq(probs.as_ptr(), LL_DIST.as_ptr())`). LL_DIST/ML_DIST/OF_DIST are `const`, not `static`, and the language does not guarantee a single materialized address for a const slice — a future rustc/codegen change that duplicates the const per use site would silently fall through every call to the build path and erase the cache win without a test surface to catch it. Reshape: each of the three public helpers owns a private `static AtomicPtr<FSETable>`, so the cache slot is selected at compile time without consulting the distribution pointer at all. The lock-free init logic moves into a shared `get_or_init_cached_table(&AtomicPtr, &[i32], u8)` helper.
- `compress_slice_to_vec` rustdoc now includes a # Panics section matching `compress_to_vec`, listing encoder-error and OOM as the two unhandled failure surfaces and pointing at `FrameCompressor::compress` as the Result-bearing entry point a future `try_` variant would expose.
- The `start_matching_fast_loop` preamble grows a raw-pointer aliasing invariant block enumerating which fields the inner loop touches through cached raw pointers (short_hash, long_hash, history) vs through `&self` shared reads (offset_hist, history-as-slice for extend_backwards). Stacked Borrows / Tree Borrows soundness rests on field-disjointness — adding a `&mut self` call inside the loop would reborrow the tables and invalidate the cached pointers, which the comment now flags so a future refactor either hoists the call or reloads the raw pointers.
- `probe_tail_ip0_only` drops the unconditional `packed_curr` writes to both hash tables. Every caller branch (`continue 'outer` after emit, `break 'outer` on no hit) exits the outer loop before any future iter could re-probe the slots, and `seed_remaining_hashable_starts` inserts the same positions during the post-loop tail seed pass. The writes were dead and only paid cache dirtying on the tail boundary.
- The `add_data` inline comment grows an in-tree caller-audit conclusion: every production path that reaches `commit_space -> add_data` originates in levels/fastest.rs's block emitter, which produces non-empty blocks gated by RLE-detect / raw-fast-path checks. No production caller relies on `add_data(empty)` as a side-effecting trim trigger; tests cover the trim path through `trim_to_window` directly. The behaviour change introduced earlier in this PR (empty-input short-circuit before the eviction loop) is therefore observable only by a hypothetical future caller that should be using `trim_to_window` instead.
The predefined LL/ML/OF FSE default tables now flow through the encoder as `&'static FSETable` rather than owned `FSETable`. The process-local cache stores `Box::leak`'d singletons; the `default_*_table()` helpers return the static reference; the `FseTables` struct holds `&'static` for the three default fields. This removes the per-`FrameCompressor::new` clone of three FSETables that the AtomicPtr cache previously paid even on the hot path — the clone was ~4-5 µs out of ~7 µs total ctor on darwin/aarch64.

Measured locally (criterion, 1 KiB random, level_2_dfast):
- before: 12.67 µs/iter (77 MiB/s) — gap 3.9× FFI
- after: 6.08 µs/iter (160 MiB/s) — gap 1.83× FFI

Cumulative against the pre-cache baseline at the start of this session: 44.25 µs → 6.08 µs, a 7.3× compress-time speedup on the small-1k-random scenario. It applies to every level, since the FSE setup sits inside `FrameCompressor::new` regardless of strategy.

Secondary cleanup tied to the same review round:
- `probe_tail_ip0_only` takes `&self` (it was already read-only on the hash tables after dropping the dead packed_curr writes; the `&mut` forced an exclusive borrow at the call site for no reason)
- dropped the misleading `let _ = position_base;` discard from the same helper; the variable IS used by the bounds check below, so the discard suppressed no warning and only made the surrounding comment incorrect
- `compress_to_vec` rustdoc explicitly calls out "NOT a streaming API" — the source is materialized in full before any compression runs; callers chasing peak-RSS shape are pointed at `StreamingEncoder`
- `get_last_space` (the trait-dispatch entry point) returns `&[]` on empty `window_blocks` instead of panicking. The four internal sites in this file all use the inline `window_blocks.back().copied().unwrap_or(0)` gate pattern; the trait helper now mirrors that shape so external callers see the same contract as internal ones, and a future refactor consolidating the two paths doesn't trip a panic surface that the internal sites already gate around
- clippy needless-borrow autofix on call sites that now receive `&'static FSETable` directly rather than `&FSETable`
Superseded by #146 — same final tree, history squashed from 30 commits to 3 logical commits, scope clarified per restructured Phase 7 plan in #111 (this is Phase 7pre: enablement, not the original misleading 'level 3 parity gap' title). Closing to clean the review surface; CodeRabbit / Copilot will re-run on #146 from a clean slate. |
Summary
Closes the bulk of Phase 7c parity work on dfast level 3 across two of three axes; decompress and medium/large memory are left for a follow-up PR. Picked up a cross-level small-input perf win while addressing review feedback — the FSE default-table cache (commit f928ef4) shaved ~38 µs off `FrameCompressor::new` for every compression call regardless of level.

Honest framing on the "Compressed size" column below: our output is 7% smaller than donor on the `decodecorpus-z000033` corpus. That is NOT confirmation that we beat donor — it is more likely a signal of algorithmic divergence (we are doing extra match-finding work the donor's fast path skips), which is consistent with us still being 5.5× slower. The deviation may turn out to be the cause of the speed gap, not a separate win. To confirm what donor would have emitted on the same input we still need a per-sequence diff harness (see the "Deferred" section).

Cumulative deltas (compress/level_3_dfast/decodecorpus-z000033, 1 022 035 B input)
Small-input speedup (FSE cache, all levels)
FrameCompressor::newused to rebuild the three predefined LL/ML/OF FSE tables on every call (~12 µs perbuild_table_from_probabilities× 3 = ~38 µs). Two-stage fix during the review rounds on this PR:AtomicPtr<FSETable>and clone on each ctor (commitf928ef4) — drops the build cost but pays ~4-5 µs per clone of three 256-elementSymbolStatesarrays.&'static FSETableand store the static reference insideFseTablesdirectly (commit0338876) — the clone disappears, the field becomes a pointer-sized copy.Effect on
Effect on `small-1k-random/matrix/pure_rust` at level 2 dfast, measured locally on darwin/aarch64 with criterion:

| | f928ef4 (cache + clone) | 0338876 (&'static) |
|---|---|---|
| `compress_slice_to_vec` | 12.67 µs | 6.08 µs |

7.3× total compress-time speedup on the small-input scenario. Applies to every level —
`FseTables::new()` runs inside the constructor regardless of strategy. The residual ~2.8 µs gap to FFI is per-frame setup not covered by this change (`CompressedBlockScratch::default()` Vec allocations, `MatchGeneratorDriver` Simple-variant init that is then reset to the level's backend, frame header / hash setup); deferred to a follow-up.

Per-scenario sanity at level_3_dfast (no regressions)
Structural changes (30 commits)
dfast matcher rewrite (Phase 7c core)
- `start_matching_fast_loop` (78f51da5) — a minimal sketch of the loop shape follows this list:
  - outer `while` per match, inner `loop` per scan position, carrying `hl0`/`idxl0`/`hl1`/`idxl1` between iterations (`hl1` precomputed, reused on miss as the next iter's `hl0`)
  - insert `curr` BEFORE the match check (donor `zstd_double_fast.c:187`)
  - 4-byte match gate (`MEM_read32`); the forward `common_prefix_len` runs only on a hit
  - `_search_next_long` retry with precomputed `idxl1` follows donor's preferred-longer-match policy
  - `probe_slot_match` is kept for cold callers
- (c6eb5a83) — donor `clevels.h:31` sets default level 3 dfast `mls=5`; we were silently dropping every 5-byte match.
- (67a47858) — drop CRC32d kernel dispatch on the dfast hot path; donor `ZSTD_hash8` is one multiply.
- (5372ae99) — donor `zstd_double_fast.c:190` unconditional peek.
- (233603fd) — collapse `best_match → hash_candidate → probe_slot_match` so `&self` field loads hoist into the hot loop.
- (8e72d58b) — donor parity; step grows every `kStepIncr` positions traveled, not on a consecutive-miss threshold.
- (1e6d5dd5) — donor `zstd_double_fast.c` inline hot path checks only `offset_1`; the full rep0/rep1/rep2 walk lives in lazy/btopt. Speed −16%, size also −494 B (fewer rep2/rep3 picks displacing better hash matches).
- `[profile.bench]` (70f966b5, bac47a6) — full cross-crate inlining for benches without forcing 5–10× release build-time on `cli` and downstream consumers.
- `all_blocks` adaptive seed in `FrameCompressor::compress` (079615c0) — scale the initial accumulator capacity by the source-size hint; saves ~130 KiB peak for ≤4 KiB inputs without regression on medium/large.
- (ef906f7) — tighten fast-loop guards to `HASH_READ_SIZE = 8`. The previous `+ DFAST_MIN_MATCH_LEN (5)` let the unconditional 8-byte `u64::read_unaligned` for the long-hash probe read up to 3 bytes past the live history end. Linux fuzz ASan caught it on a 7-byte input (crash-01be...); regression test added in `tests/fuzz_regressions.rs`.
- (87a62e3) — call `ensure_room_for` at fast-loop entry. The inner loop's `packed_curr` cast `(abs_ip0 - position_base) as u32` could truncate after a long stream of all-miss / non-hashable blocks (the fast loop bypasses `insert_position`'s rebase).
- `trim_to_window` reclaims prefix (87a62e3) — append `compact_history()` after the eviction loop so an explicit trim actually frees memory instead of leaving the dead prefix resident until the next `add_data`.
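The sketch below is a stripped illustration of that loop shape, assuming toy names (`hash8`, `fast_loop`) and a zero-initialized table; the short-hash probe, rep peek, step ramp, and `_search_next_long` retry are all elided, and the hash prime is illustrative rather than copied from the crate.

```rust
const HASH_BITS: u32 = 17;

/// One-multiply 8-byte hash in the donor ZSTD_hash8 shape (prime illustrative).
fn hash8(data: &[u8], at: usize) -> usize {
    let v = u64::from_le_bytes(data[at..at + 8].try_into().unwrap());
    (v.wrapping_mul(0xCF1B_BCDC_B7A5_6463) >> (64 - HASH_BITS)) as usize
}

/// Scan forward; return (current, candidate) on the first 8-byte gate hit.
/// `table` must hold `1 << HASH_BITS` entries; 0 doubles as "empty" here
/// (the real code rebases absolute positions instead).
fn fast_loop(data: &[u8], table: &mut [u32]) -> Option<(usize, usize)> {
    debug_assert_eq!(table.len(), 1 << HASH_BITS);
    if data.len() < 9 {
        return None; // no position admits a full 8-byte read plus lookahead
    }
    let end = data.len() - 8;
    let mut ip = 0;
    let mut hl0 = hash8(data, ip);
    while ip < end {
        let hl1 = hash8(data, ip + 1); // precomputed: next iteration's hl0
        let cand = table[hl0] as usize;
        table[hl0] = ip as u32; // insert curr BEFORE the match check
        if cand != 0 && data[cand..cand + 8] == data[ip..ip + 8] {
            // a real implementation would run forward extension only here
            return Some((ip, cand));
        }
        ip += 1;
        hl0 = hl1; // miss: reuse the precomputed hash, one hash per position
    }
    None
}
```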
Cross-level perf + parity refinements (review-driven, recent rounds)

- (f928ef4) — the three predefined LL/ML/OF tables are now cached in per-helper `static AtomicPtr<FSETable>`. `FrameCompressor::new` clones from the cache instead of rebuilding (~38 µs → ~3 µs). Affects every level on every call. `core::sync::atomic` works in both std and no_std builds; the cache slot is selected by the static identity of each helper, not by pointer comparison on the distribution slice (the `const` slices don't have a stable address guarantee).
- (f9d36b1) — added `probe_tail_ip0_only`, called when the outer iter sees `ip0` still hashable but `ip1` past the end (boundary `pos == current_len - HASH_READ_SIZE`). Previously the outer guard broke unconditionally there and left the last hashable position to the seeder, which inserts but does NOT search — a real match in the tail window of a short block would be dropped. Mirrors the inner loop's long+short probe at ip0; the rep peek and the `_search_next_long` retry are skipped (both depend on ip1), matching donor's iend-boundary tradeoff.
- (f9d36b1) — the 4-byte equality gate plus forward extension can produce `match_len < 5`. Previously the fast loop committed below-floor short hits; now `short_cand.match_len >= DFAST_MIN_MATCH_LEN` is checked before committing, and the `_search_next_long` retry can still upgrade to a valid long match. Matches the floor `probe_slot_match` already enforces on the long-hash main path; emitting a 4-byte non-rep match would mint a 13–17-bit offset that costs more than the payload buys.
- (900f2114) — replaced the implicit `Option<MatchCandidate>` + sibling `tail_seed_anchor` pairing with an explicit `enum InnerExit { Committed(MatchCandidate), Tail(usize) }`. Every `break 'inner` now picks a variant; the post-loop match is exhaustive (sketched below).
- (ecd2bfd7) — `probe_slot_match` takes a `gate_len` parameter (8 for long-hash callers, 4 for short-hash). Replaces the prior 1-byte precheck, which let 8-byte-collision short matches slip past the gate (sketched below).
- (ecd2bfd7) — `skip_matching`/`skip_matching_dense`/`start_matching` early-return on `current_len == 0`. `add_data` short-circuits empty input without pushing to `window_blocks`; without the guard these methods would re-seed the previous block's retained tail on every empty flush.
- `trim_to_window` Dfast signature (0cbb3004) — drops the unused `_reuse_space: impl FnMut(Vec<u8>)` parameter. Dfast's history-only storage has no per-block Vec to recycle; saves the cold-path monomorphization that would otherwise inline an empty closure.
- (0cbb3004) — `compress_slice_to_vec` now documents its `# Panics` modes; `start_matching_fast_loop` grows a raw-pointer aliasing invariant block enumerating which fields are accessed through cached raw pointers vs `&self` (so a future refactor adding a `&mut self` call inside the loop knows to either hoist the call or reload the cached pointers).
- `&'static FSETable` (0338876) — the AtomicPtr-cached default tables flow as `&'static FSETable` rather than owned `FSETable`, eliminating ~4-5 µs of `FSETable::clone` per `FrameCompressor::new`. `FseTables` stores the static reference; `clone_fse_tables` (used by the block-split estimator) becomes a pointer-sized copy on the three default fields. Cumulative small-input speedup vs baseline goes from 3.5× to 7.3× (44.25 µs → 6.08 µs on 1 KiB random / level 2 dfast).
Companion review cleanups in the same commit: `probe_tail_ip0_only` switches to `&self` (read-only after the previous round dropped the dead writes); `get_last_space` (the trait-dispatch entry point) returns `&[]` on empty `window_blocks` instead of panicking, mirroring the internal-callsite gate pattern; `compress_to_vec` rustdoc explicitly calls out "NOT a streaming API" since it materializes the source fully before any compression runs.

The first six commits in the branch are a continuation of pre-Phase-7c work (split dfast hash table sizing, port donor single-slot storage, drop window-Vec byte duplication, `compress_slice_to_vec` allocation tightening, `insert_positions` inner-loop hoisting). The new structural work begins with c6eb5a83.
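A hedged sketch tying together two items from the list above: the `InnerExit` enum quoted from commit 900f2114, plus a `gate_len`-parameterised probe gate in the spirit of ecd2bfd7. `MatchCandidate`'s fields and the `probe_gate`/`resolve` helpers are illustrative stand-ins, not the crate's API.

```rust
const DFAST_MIN_MATCH_LEN: usize = 5;

#[allow(dead_code)]
struct MatchCandidate {
    src_pos: usize, // illustrative fields; the real struct is not shown in the PR
    match_pos: usize,
    match_len: usize,
}

// Quoted shape from 900f2114: every `break 'inner` picks a variant.
enum InnerExit {
    Committed(MatchCandidate), // a candidate cleared the gates and was committed
    Tail(usize),               // scan ran out of hashable bytes; seed from this anchor
}

/// Width-parameterised equality gate: 8 bytes for long-hash callers,
/// 4 bytes for short-hash callers, replacing a 1-byte precheck.
fn probe_gate(data: &[u8], cand: usize, ip: usize, gate_len: usize) -> bool {
    match (data.get(cand..cand + gate_len), data.get(ip..ip + gate_len)) {
        (Some(a), Some(b)) => a == b,
        _ => false, // an out-of-range probe can never pass the gate
    }
}

/// Post-loop handling is an exhaustive match; below-floor short hits
/// are rejected before commit.
fn resolve(exit: InnerExit) -> Option<MatchCandidate> {
    match exit {
        InnerExit::Committed(c) if c.match_len >= DFAST_MIN_MATCH_LEN => Some(c),
        InnerExit::Committed(_) => None,  // 4-byte hit that failed the length floor
        InnerExit::Tail(_anchor) => None, // the caller seeds the tail positions
    }
}
```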
Donor-deviation audit
Every commit either (a) moves toward donor parity (most), or (b) does strictly less work and runs faster. Two specific deviations remain:
- `kStepIncr = 64` (donor `256`) — our step ramp is 4× more aggressive. This favours speed at the cost of scan coverage (the ramp shape is sketched after this list).
- dense complementary insertion (vs donor's sparse `curr+2`, `ip-2`, `ip-1`) — tested swapping to donor's sparse pattern; it regressed both axes (+4% slower, +2.3% larger output). The hash coverage from dense insert compensates for our `probe_slot_match` doing a full forward `common_prefix_len` on short hits. Switching all probes to donor's 4-byte-gate-only shape would make the sparse insert the right choice; revisit then.

These two deviations together likely explain part of the −7% size delta vs donor on this corpus (extra hash-table coverage finds matches donor's fast path would have skipped). Treating that delta as a "win" is premature until a per-sequence Rust↔FFI diff harness shows whether donor leaves matches on the table or whether our extra inserts are pure overhead.
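A minimal sketch of that ramp, assuming the donor-style policy the audit describes (step grows by one every `kStepIncr` positions traveled); `probe` stands in for the real match search:

```rust
const K_STEP_INCR: usize = 64; // donor zstd_double_fast.c uses 256

/// Walk `data`, widening the scan stride the longer the search goes
/// without a hit. `probe` is an illustrative stand-in for the hash probe.
fn scan_with_ramp(data: &[u8], probe: impl Fn(usize) -> bool) -> Option<usize> {
    let mut ip = 0;
    let mut step = 1;
    let mut next_step = K_STEP_INCR;
    while ip < data.len() {
        if probe(ip) {
            return Some(ip);
        }
        if ip >= next_step {
            step += 1;                // grow the stride...
            next_step += K_STEP_INCR; // ...every K_STEP_INCR positions traveled
        }
        ip += step;
    }
    None
}
```

With `K_STEP_INCR = 64` the stride reaches a given width four times sooner than donor's 256, which is exactly the coverage-for-speed trade the audit calls out.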
Deferred (follow-up PRs)
- decompress-side donor parity (`FSE_decompress_usingDTable` and `HUF_decompress*`).
- medium/large memory: `compare_ffi_memory.rs` to identify the source of the residual gap before further work.
- small-input per-frame setup: after the `&'static` refactor the small-input bench is 1.83× FFI; the remaining ~2.8 µs is `CompressedBlockScratch::default()` Vec init + `MatchGeneratorDriver` Simple-variant init that resets to the level's actual backend on first `compress()` + per-frame header/hash setup. Each needs its own profile + measurement loop.
- rep1-only simplification for the fast/row/lazy paths, analogous to dfast.
- `simd_copy.rs` (~1% decompress win).
- `block_content_buffer` 0-fill skip via `set_len` + `read_exact` (~1% decompress win; a safe variant is sketched below).
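On that last deferred item: the PR proposes `set_len` + `read_exact`. The sketch below is a safe alternative with the same zero-fill-skipping effect, deliberately substituting `Read::take` + `read_to_end` (since `set_len` over uninitialized bytes needs care); `read_block` and `block_size` are illustrative names:

```rust
use std::io::Read;

/// Read up to `block_size` bytes without pre-zeroing the buffer:
/// `read_to_end` appends into the reserved spare capacity directly.
fn read_block(src: impl Read, block_size: u64) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::with_capacity(block_size as usize);
    // `take` caps the read; the Vec's capacity is filled, never 0-initialized.
    src.take(block_size).read_to_end(&mut buf)?;
    Ok(buf)
}
```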
Verification gates

- `cargo nextest run --lib --features hash,std,dict_builder` — 481 passed, 0 failed (includes the new `interop_7_byte_input_does_not_oob_in_dfast_fast_loop` regression).
- `cargo clippy --all-targets --features hash,std,dict_builder` — 0 warnings.
- `cargo fmt --check` — clean.
- `level22_sequences_match_donor_on_corpus_proxy` — donor-parity canary at level 22 (untouched by this PR).
- `huff0::level22_emitted_literal_sections_are_accepted_by_donor_huf_reader` — donor reader-acceptance canary.
- roundtrip suites (`roundtrip_compressible_data_1000_iterations`, `roundtrip_random_data_1000_iterations`, `roundtrip_streaming_api_1000_iterations`) — all green.
- `interop` smoke job — now passes after the `HASH_READ_SIZE` guard fix; previously failed on the 7-byte input.
Test plan

- `repcode_candidate_shared` unchanged for lazy/opt callers
- `cargo bench --bench compare_ffi` re-run on a fresh checkout; numbers match the table above