perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b)#217
Conversation
…#198 phase 1b) Phase 1b of #198 — wires the donor-shape kernel modules from phase 1a (#215, merged in fc63464) into the production hot path. The legacy SuffixStore-based MatchGenerator in simple/mod.rs is fully removed; MatcherStorage::Simple now holds a FastKernelMatcher that drives compress_block_fast<MLS> once per block. Selected for every Fast-strategy level — CompressionLevel::Uncompressed, CompressionLevel::Fastest, CompressionLevel::Level(1), and the negative CompressionLevel::Level(-7..=-1) variants. All Fast levels currently resolve to the same matcher with donor level-1 hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found) land in phase 3. ## What changes - simple/mod.rs collapses from 496 → 14 lines (only module declarations + docstring remain). SuffixStore, WindowEntry, MatchGenerator, repcode_candidate, add_data, next_sequence, add_suffixes_till, insert_suffix_if_absent, add_suffixes_interleaved_fast, offset_match_len, reserve — all gone with the wiring. - simple/fast_matcher.rs is the new active matcher: full inherent surface (accept_data, start_matching, skip_matching_with_hint, trim_to_window, last_committed_space, reset, prime_offset_history, take_recycled_space). - MatchGeneratorDriver Simple-arm wiring: commit_space → m.accept_data(space) with eager pre-commit eviction; start_matching::<Fast> → m.start_matching(handler); skip_matching_with_hint → m.skip_matching_with_hint(hint); reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window(); prime_with_dictionary → m.prime_offset_history(offset_hist). - Per-block input Vec recycled via take_recycled_space() → vec_pool (zero zero-fill cost — buffer pushed with len=0, get_next_space resizes on pop). ## Invariants - prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The first byte of history is a reserved dummy (sentinel-0 guard); real input data starts at history[1]. Donor C zstd achieves the same effect via a virtual base pointer; the flat Vec<u8> model here pays one byte of memory overhead for the same correctness property (no missed matches at segment boundaries). - history.len() bounded by 2 × max_window_size post-append, even for oversize committed blocks (retain_real = cap.saturating_sub( space.len()).min(max_window_size)). - Eviction preserves the dummy AND rebases prefix_start_index back to RESERVED on every drain — cumulative growth would push the filter past every valid history index and reject all match candidates wholesale. - Hash table rehashed after drain so retained tail bytes stay matchable. Amortised O(1) per byte of input. - rep[0..2] ↔ offset_hist[0..2] in lockstep on the common (lit_len > 0) path. Known divergence on back-to-back repcode matches (lit_len == 0 emits): kernel's rep unchanged, wire encoder per RFC 8878 §3.1.2.5 remaps codes and rotates offset_hist — marginal compression hit, output still correct. Phase 3 collapses these at the kernel level. - prime_offset_history seeds BOTH rep[0..2] and offset_hist atomically from a dictionary load. ## Defensive validation - MatchGeneratorDriver::new asserts slice_size > 0, max_slices_in_window > 0, checked_mul for the product, and checked_next_power_of_two for window_log_init derivation — catches all four overflow / degenerate paths with a clear domain-specific panic instead of a deep matcher-internal failure. - FastHashTable construction-time mls / hash_log validation unchanged from phase 1a. ## Tests 573/573 pass on the full workspace nextest suite: - 21 unit tests on FastKernelMatcher (lifecycle, accept + start, skip flavors, dict-prime hash population, eviction, boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync, prefix-eviction during dict-priming, drain prefix_start_index runaway, trim_to_window/last_block_start drift, oversize-block eviction bound). - 32 unit tests on the underlying kernel (donor-formula parity, prefix-filter, repcode backward extension three-piece proof, short-input early-return uniformity). - All frame_compressor integration tests (raw-block detection, hinted source-size matrix, level-1 round-trips through both the in-tree decoder and FFI decode). - All cross_validation Rust-encoded → FFI-decoded round-trips (every level 1..=22, dict + no-dict, encoded by Rust then read by the C reference decoder verbatim). 18 legacy tests in match_generator.rs that exercised SuffixStore-specific behavior or required block.len() < HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory comments — their substance either has equivalent coverage in the new tests or relied on algorithm-specific quirks the donor-shape kernel doesn't reproduce by design. ## Benchmark (i9-9900K) cargo bench deltas vs main (fc63464) on compress/level_1_fast/*/matrix/pure_rust: | Scenario | Δ time | Throughput | Note | |----------|-------:|-----------:|------| | low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster | | decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster | | high-entropy-1m | -67.3% | 633 MiB/s | 3× faster | | small-10k-random | -36.0% | 768 MiB/s | 1.6× faster | | small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster | | small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead | | large-log-stream | +122.3% | 235 MiB/s | regression — see below | Large-log-stream regression is expected at this phase: legacy SuffixStore used effectively window_log-sized hash slots (512K for level-1 window_log=19), while phase 1b uses donor-parity hash_log=14 (16K slots). 25 MiB dense-compressible log content hits ~1500 collisions/slot. Donor C zstd shows the same trade-off (~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining + cmov) closes the gap per the documented roadmap (#198 items 2/3/5). ## What's NOT in this PR - Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead, cmov match-found variant, per-level mls dispatch, kSearchStrength acceleration gradient for negative Fast levels. - LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes donor level-1 defaults (14/7) today. - hash_fill_step stride for dict-priming — still hard-coded to 1 (LevelParams field is wired in but the Fast matcher always strides at 1). Closes #198 phase 1b. Related: #178 (umbrella regression issue), #215 (phase 1a — kernel modules, merged in fc63464).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughReplace the Simple backend (SuffixStore/MatchGenerator) with donor-aligned FastKernelMatcher, add fast_matcher module, wire FastKernelMatcher into MatchGeneratorDriver (construction, reset, commit, skip, matching, priming, eviction), update simple/ module routing, and adjust/disable legacy Simple-specific tests while adding FastKernelMatcher tests. ChangesSimple backend refactor: MatchGenerator → FastKernelMatcher
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/encoding/match_generator.rs (1)
598-613:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winKeep the initial reported window in sync with the instantiated matcher.
FastKernelMatcheris constructed fromnext_pow2, butreported_window_sizekeeps the unroundedmax_window_size. For non-power-of-two constructor inputs,window_size()will report a smaller window than the active Simple backend actually has until the firstreset().Suggested fix
Self { vec_pool: Vec::new(), storage: MatcherStorage::Simple(FastKernelMatcher::with_params( window_log_init, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS, )), strategy_tag: super::strategy::StrategyTag::Fast, slice_size, base_slice_size: slice_size, - reported_window_size: max_window_size, + reported_window_size: next_pow2, dictionary_retained_budget: 0, source_size_hint: None, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/match_generator.rs` around lines 598 - 613, The reported_window_size is set to the unrounded max_window_size while the Simple backend (FastKernelMatcher) is initialized with the next power-of-two (next_pow2/window_log_init), causing a mismatch; update the MatchGeneratorDriver construction so reported_window_size is set to the rounded window (use next_pow2 or (1u64 << window_log_init)) to match the instantiated FastKernelMatcher (constructor call FastKernelMatcher::with_params and field reported_window_size).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 920-944: The Simple matcher instantiation is hardcoding donor
level-1 constants instead of using the resolved fast-level tuning computed by
resolve_level_params(); update the
MatcherStorage::Simple(FastKernelMatcher::with_params(...)) call(s) to pass the
resolved fast parameters from the params struct (e.g. use params.window_log and
the resolved params.hash_log / params.mls and also thread params.hash_fill_step
or any other fast-level knobs) instead of FAST_LEVEL_1_HASH_LOG and
FAST_LEVEL_1_MLS so CompressionLevel::Fastest, Level(1) and negative levels get
the per-level tuning; apply the same change to the other occurrences around
FastKernelMatcher::with_params to ensure the backend swap and reset paths both
use the resolved fast-level fields.
---
Outside diff comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 598-613: The reported_window_size is set to the unrounded
max_window_size while the Simple backend (FastKernelMatcher) is initialized with
the next power-of-two (next_pow2/window_log_init), causing a mismatch; update
the MatchGeneratorDriver construction so reported_window_size is set to the
rounded window (use next_pow2 or (1u64 << window_log_init)) to match the
instantiated FastKernelMatcher (constructor call FastKernelMatcher::with_params
and field reported_window_size).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1fa461d0-9e51-40d8-b146-eae647c895a8
📒 Files selected for processing (3)
zstd/src/encoding/match_generator.rszstd/src/encoding/simple/fast_matcher.rszstd/src/encoding/simple/mod.rs
There was a problem hiding this comment.
Pull request overview
Wires the donor-shape Fast (ZSTD_fast) kernel into the production encoder path by replacing the legacy SuffixStore-based Simple matcher with FastKernelMatcher, and updating MatchGeneratorDriver to drive the new per-block matcher lifecycle.
Changes:
- Replaces the Simple backend’s legacy
MatchGeneratorimplementation withFastKernelMatcherand routescommit_space/start_matching/skip_matching_with_hintthrough it. - Simplifies
encoding/simple/mod.rsto module declarations + updated backend documentation, removing the legacy matcher implementation. - Adds a new
encoding/simple/fast_matcher.rsimplementing the Simple/Fast-strategy matcher state machine (history buffer, hash table, eviction, dict priming, buffer recycling) aroundcompress_block_fast.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| zstd/src/encoding/simple/mod.rs | Removes the legacy Simple matcher implementation; leaves module wiring and updated backend docs. |
| zstd/src/encoding/simple/fast_matcher.rs | Introduces FastKernelMatcher, managing flat history + flat hash table and adapting the donor kernel to the driver’s matcher lifecycle. |
| zstd/src/encoding/match_generator.rs | Rewires MatcherStorage::Simple and MatchGeneratorDriver to use FastKernelMatcher, removes suffix-store pooling, and updates reset/commit/start/skip plumbing accordingly. |
…217 review round 1) Four findings from CodeRabbit + Copilot's first review pass on the squashed PR #217. All four touch comment / metadata accuracy or a single-line correctness issue; no behavioural change beyond the window_size sync. **CR outside-diff (match_generator.rs:598-613)** — correctness: `reported_window_size` was using the unrounded `max_window_size` while the matcher itself was constructed from `next_pow2` (rounded up). For non-power-of-two constructor inputs (e.g. `slice_size * max_slices_in_window = 100_000`), `window_size()` would report 65_536 (un-rounded floor) while the active backend actually carried 131_072 (rounded-up next_pow2). The drift held until the first `reset()` overwrote both sides from LevelParams. Fix: report `next_pow2` so the two stay in lockstep at construction time. **CR #1 (match_generator.rs:944)** — declined, deferred to phase 3: Fast levels (Uncompressed, Fastest, Level(-7..=1)) all hard-code donor level-1 cParams. The acceleration gradient between negative- level fast modes and Level(1) lands when phase 3 ports donor's 4-cursor lookahead + cmov match-found + per-level kSearchStrength dispatch (issue #198 items 2/3/5). Updated the inline code comment to scrub the closed-PR reference and frame the deferral against phase 3 directly. **Copilot #2 (match_generator.rs:572)** — doc: the validation-guard comment described `next_power_of_two` returning 0 on overflow, which was old-Rust behaviour. Modern Rust panics; we now use `checked_next_power_of_two` (commit landed in the squash). Rewrote the comment to enumerate the three actual failure modes (zero args, mul overflow, next-pow2 overflow) and the three guards that catch them. **Copilot #3 (fast_matcher.rs:303)** — doc: `last_committed_space`'s pre-`accept_data` state description claimed `last_block_start = 0 / history.len() = 0`, but post-RESERVED_PREFIX_BYTES-seed construction (#216 / phase 1b) leaves both at `RESERVED_PREFIX_BYTES`. The returned slice is still empty (the `history[last_block_start..]` range is empty), just for a different reason. Updated the doc to reflect the seeded-dummy invariant. 573/573 tests pass; clippy clean.
|
@coderabbitai Re: match_generator.rs:598-613 outside-diff (reported_window_size) — fixed in commit 60a0076. Now reports next_pow2 so it matches the actual matcher's window from construction onward. The drift was a real bug for non-power-of-two slice_size × max_slices_in_window products. |
|
Tip For best results, initiate chat on the files or code changes.
|
… compactness rule) Doc-only change: collapse the verbose multi-paragraph last_committed_space docstring (added during PR #216 review rounds) into a 6-line three-bullet form. Same semantic content, zero narrative. Apply the new docstring compactness rule (one-two phrases default, multi-paragraph only for non-obvious invariants). 573/573 tests pass; clippy clean.
…RVED_PREFIX_BYTES seed (#217 Copilot #4, #5) Doc-only update aligning the dict-prime boundary test comment + inline note with the RESERVED_PREFIX_BYTES seed (post-phase-1b range is [RESERVED..=RESERVED], not [0..=0]). Compactness rule applied — concise two-block form. 573/573 tests pass; clippy clean.
|
@coderabbitai Re: match_generator.rs:598-613 outside-diff (reported_window_size) — stale. Fixed in commit 60a0076 ( |
|
✅ Actions performedComments resolved. Approval is disabled; enable |
… bumps (#217 Copilot #6, #7) Two doc comments described prefix_start_index as 'bumped forward' or 'advances' as history is evicted, implying a monotonic absolute index. Actual code (drain_real_prefix) rebases it back to RESERVED_PREFIX_BYTES on every drain — the retained tail is re-indexed in the new coordinate space. Updated both sites (struct field doc + trim_to_window header) to match. Compactness rule applied — trim_to_window header collapsed from 10 lines to 5. 573/573 tests pass; clippy clean.
… TODO marker (#217 Copilot #8, #9) #8 — header for FAST_LEVEL_1_HASH_LOG said 'reset path rebinds hash_log proportionally on source-size hint'. Untrue today: driver passes only window_log per-level, hash_log + mls hard-coded. Pin per-level scaling to phase 3. #9 — same lit_len=0 / back-to-back-rep1 concern as the prior PR's #21. Inline + module docs already explain, but verbose prose isn't anchored. Collapsed inline comment to a single '// TODO(#198 phase 3):' line so the deferral marker is unambiguous. 573/573 tests pass; clippy clean.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
zstd/src/encoding/simple/fast_matcher.rs (2)
327-331:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast on duplicate
accept_datacalls.In release builds this
debug_assert!disappears, and Line 370 will silently replaceself.pending, dropping an entire committed block. This protocol violation should hard-fail instead of corrupting the stream state.Suggested change
- debug_assert!( + assert!( self.pending.is_none(), "FastKernelMatcher: accept_data called with a still-pending buffer; \ the driver must run start_matching / skip_matching between commits", );Also applies to: 370-370
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/simple/fast_matcher.rs` around lines 327 - 331, Replace the non-failing debug_assert! in FastKernelMatcher::accept_data with a runtime check that hard-fails on protocol violation: if self.pending.is_some() then panic! with the same message (or use assert!), so duplicate accept_data calls cannot silently overwrite self.pending; also add the same runtime check at the other site where self.pending is being replaced (the code that currently overwrites pending) to ensure any attempt to replace an existing pending buffer fails fast rather than dropping committed data. Ensure messages reference the protocol requirement to run start_matching / skip_matching between commits.
350-363:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnforce the oversized-block precondition here.
If
space.len() > cap,retain_realbecomes0, but the full block is still appended later. That violates this module’s advertised post-append<= 2 * max_window_sizebound and leaves the matcher outside its documented window discipline.Suggested guard
let real_len = self.history.len().saturating_sub(RESERVED_PREFIX_BYTES); let new_real_total = real_len.saturating_add(space.len()); let cap = self.max_window_size.saturating_mul(2); + assert!( + space.len() <= cap, + "FastKernelMatcher requires block_size <= 2 * max_window_size \ + (block_size={}, cap={})", + space.len(), + cap, + ); if new_real_total > cap {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/simple/fast_matcher.rs` around lines 350 - 363, The code computes retain_real using cap and space.len() but never enforces the precondition that the incoming block fits the cap; if space.len() > cap we must not append the full block because that breaks the module invariant. In fast_matcher.rs, before computing or using retain_real (and before appending the incoming block), add a guard that detects space.len() > cap and handles it explicitly (e.g., early-return an error/result indicating an oversized block or otherwise reject/truncate the input per the module’s API) so the matcher never ends up with real_len + space.len() > 2 * self.max_window_size; reference variables: retain_real, space.len(), cap, self.max_window_size.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/simple/fast_matcher.rs`:
- Around line 517-523: The call to encode_offset_with_history(...)
unconditionally rotates offset_hist on lit_len == 0 emits and breaks the rep ↔
offset_hist invariant; update the fast matcher to detect zero-literal
(literals.len() == 0) rep1 emits and either collapse/skip the
encode_offset_with_history call for those cases or explicitly keep self.rep and
offset_hist consistent (e.g., perform the same rotation on self.rep or avoid
mutating offset_hist) before proceeding; reference the
encode_offset_with_history call and the self.rep / offset_hist state in
fast_matcher.rs to implement the guard or state-sync logic.
---
Outside diff comments:
In `@zstd/src/encoding/simple/fast_matcher.rs`:
- Around line 327-331: Replace the non-failing debug_assert! in
FastKernelMatcher::accept_data with a runtime check that hard-fails on protocol
violation: if self.pending.is_some() then panic! with the same message (or use
assert!), so duplicate accept_data calls cannot silently overwrite self.pending;
also add the same runtime check at the other site where self.pending is being
replaced (the code that currently overwrites pending) to ensure any attempt to
replace an existing pending buffer fails fast rather than dropping committed
data. Ensure messages reference the protocol requirement to run start_matching /
skip_matching between commits.
- Around line 350-363: The code computes retain_real using cap and space.len()
but never enforces the precondition that the incoming block fits the cap; if
space.len() > cap we must not append the full block because that breaks the
module invariant. In fast_matcher.rs, before computing or using retain_real (and
before appending the incoming block), add a guard that detects space.len() > cap
and handles it explicitly (e.g., early-return an error/result indicating an
oversized block or otherwise reject/truncate the input per the module’s API) so
the matcher never ends up with real_len + space.len() > 2 *
self.max_window_size; reference variables: retain_real, space.len(), cap,
self.max_window_size.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 34185c8b-d8d0-4c6a-b193-cf50284a5288
📒 Files selected for processing (1)
zstd/src/encoding/simple/fast_matcher.rs
…n + hoist mls dispatch (#217 round 5) Four findings from the latest CR + Copilot review pass. **CR outside-diff (327-331) — assert!, not debug_assert!:** the duplicate-pending guard in accept_data was debug_assert!, so in release builds a double-commit would silently overwrite pending and drop a committed block. Hard-fail instead. **CR outside-diff (358-363) — block_size <= 2 × max_window_size:** the eviction math computes retain_real via saturating_sub, which silently collapses to 0 if space.len() > cap. The full block is still appended afterwards, violating the documented 'history bounded by 2 × max_window_size' invariant. Added an assert! precondition so callers see a clear panic at the boundary instead of an invisible invariant break. **CR #10 — lit_len == 0 offset_hist rotation (third raise):** the prior round added a TODO marker explaining the divergence, but CR correctly pushed back that documentation isn't a fix. Skip encode_offset_with_history when literals.is_empty() so offset_hist stays in lockstep with the kernel's unchanged rep — no divergence on the back-to-back rep1 path. Wire encoder downstream still sees the Triple with raw offset; its own encoding stays correct (lit_len-0 absolute encoding per RFC 8878 §3.1.2.5). Module docstring 'Known divergence' section collapsed accordingly. **Copilot #11 — hoist mls dispatch outside prime_hash loop:** moved the per-MLS match arm OUTSIDE the per-position loop. New prime_hash_table_impl<const MLS: u32> is monomorphised per matcher instance; the hot path is branch-free on mls. 573/573 tests pass; clippy clean.
|
@coderabbitai Outside-diff items fixed in a5b2f6a:
Both panic with clear domain-specific messages on protocol / precondition violation instead of silently corrupting state. |
|
Tip For best results, initiate chat on the files or code changes.
Good hardening — both changes make the contract-violation surface clear and immediate regardless of build profile. |
Copilot #12, revert round 5 #10) Round 5 skipped encode_offset_with_history when literals.is_empty() to keep matcher.offset_hist 'in lockstep' with kernel rep. Copilot correctly pushed back: Dfast / Row / HashChain matchers all call it unconditionally (passing lit_len = 0 when applicable), and the 'lockstep' framing was wrong — matcher.offset_hist tracks the WIRE ENCODER's history while matcher.rep tracks the KERNEL's state. They're not supposed to be the same. Revert the skip. Module docstring rewritten to call out that the two fields reflect DIFFERENT state and may diverge on lit_len = 0 emits per RFC 8878 §3.1.2.5 — both halves stay self-consistent within their own domain. Round 5 #10 reply to CR was based on the wrong mental model; that's reflected in this fix by reverting to the donor / other-matchers behaviour. 573/573 tests pass; clippy clean.
… + per-level cParams + cmov dispatch + window-correctness (#198 phase 3) Closes phase 3 of #198: native Rust port of donor's `ZSTD_compressBlock_fast_noDict_generic` (`zstd_fast.c:192-456`) restoring Fast-strategy throughput after the phase 1b regression and aligning encoder behaviour with donor parity on every front that affects ratio / speed / format correctness. ## Headline results `large-log-stream` (25 MiB dense corpus, i9-9900K): | State | Time vs main | Throughput | |------------------------------|-------------:|---------------:| | main (pre-phase-1b) | 0% | ~290 MiB/s | | #217 merged (phase 1b) | +122% | ~235 MiB/s | | **This PR (phase 3, M1-M9)** | **−22%** | **~790 MiB/s** | Net change phase 1b → phase 3: ~3.4× faster. ## Ratio audit on `decodecorpus-z000033` - L-7..L-1 Fast: monotone acceleration gradient restored (−1.85% to −4.79% vs donor) — pre-M6 these levels all produced identical 585601 bytes - L2-3 Dfast: parity or slight win (-0.94% to -4.61%) - L5-15 Lazy: consistently beat donor by −5.7% to −6.18% - L16-17 btopt: parity / win (−0.33% to −3.59%) - L20-22 btultra2: parity (±0.2%) - L1 Fast: +7.43% residual gap — tracked as follow-up #220 ## Milestones (squashed) - M1: per-level `fast_hash_log`/`fast_mls`/`fast_step_size` threading through `LevelParams` - M2: full 4-cursor `ip0/ip1/ip2/ip3` lookahead body + immediate-rep2 inner loop ported from `zstd_fast.c` - M3: `cmov` match-found variant + per-window dispatch surface (10 monomorphisations across `mls` 4..=8 × `use_cmov` true/false) - M4: beyond-donor `fast_hash_log: 13 → 14` for negative levels (+32 KB memory, 2× fewer collisions on structured corpora) - M5: reverted (adaptive mls peek did not pay off) - M6: per-level `fast_step_size` from donor's `targetLength = -level` formula; restores acceleration gradient - M7: added donor's missing `current0+2` hash insertion after each match emit (`zstd_fast.c:407`); raised L1/decodecorpus sequence-match-rate 43.1% → 57.7% - M8: dropped `RESERVED_PREFIX_BYTES` dummy byte; history layout now donor-parity, sentinel-0 protection via `INITIAL_PREFIX_START_INDEX = 1` filter - M9 (format correctness): sliding prefix floor at scan time enforces the advertised frame window (`1 << window_log`), NOT the dictionary-budget-inflated `max_window_size`. Prevents emitting offsets > advertised window during dictionary-primed compression — would otherwise produce format-invalid frames. Regression test included. ## API changes (driver) - `FastKernelMatcher::with_params(window_log, hash_log, mls, step_size)` — step_size promoted to 4th positional arg; previous `set_step_size` post-init setter removed (closed the gap where new code paths could silently fall back to default step_size=2) - `FastKernelMatcher::reset(window_log, hash_log, mls, step_size)` — same signature change - `RESERVED_PREFIX_BYTES`: 1 → 0 (no dummy region); legacy name retained for drain-offset math - New `INITIAL_PREFIX_START_INDEX = 1` constant for sentinel-0 filter baseline ## Kernel structure - `match_found<USE_CMOV>` — branchless cmov / branch dispatch via const generic; `# Safety` documented; bitwise `&` intentional to preserve cmov ordering (donor `__asm__("")` equivalent) - `MatchFound::{Rep,Explicit}` variants carry explicit `current0` (donor's writeback position) — correct on both probe paths, pre-backward-extension - `ip3 > ilimit` exit (not `>=`) — recovers the last hashable position at end-of-block - Two hash insertions post-match-emit: at `current0+2` (donor zstd_fast.c:407, was missing) and `ip0-2` ## Tests - 577 nextest, all passing - New per-level dispatch test pinning hash_log / mls / step_size for L1, Fastest, Uncompressed, L-1..L-7 - cmov vs branch byte-for-byte equality + cmov out-of-window false-positive regression - explicit_match_backward_extension deterministic via marker-byte layout (asserts match_len ≥ 5 + literals don't end with marker) - start_matching_enforces_max_window_size_offset_bound - start_matching_caps_offsets_at_window_log_not_inflated_max (M9 format-correctness regression) - cross_validation FFI roundtrips on every level 1..=22 (dict + no-dict) — wire-format interop preserved ## Out of scope (follow-ups) - #220 — Fast L1 +7.43% ratio residual on decodecorpus - L18-19 btultra +3.5% — separate ratio investigation
Summary
Phase 1b of the #198 Fast strategy donor port — wires the donor-shape
kernel modules from phase 1a (#215 → merged in fc63464) into the
production hot path. The legacy SuffixStore-based
MatchGeneratorthat lived in
simple/mod.rsis fully removed;MatcherStorage::Simplenow holds a
FastKernelMatcherthat drivescompress_block_fast<MLS>once per block.
What changes
simple/mod.rscollapses from 496 → 14 lines (only moduledeclarations + docstring remain).
SuffixStore,WindowEntry,MatchGenerator,repcode_candidate,add_data,next_sequence,add_suffixes_till,insert_suffix_if_absent,add_suffixes_interleaved_fast,offset_match_len,reserve—all gone with the wiring.
simple/fast_matcher.rsis the new active matcher(introduced incrementally on this branch — see commit log). The
full inherent surface required by the
Matchertrait wiring:accept_data(space)— stash a committed block inpending.start_matching(handler)— extend history, dispatch thekernel on the just-appended block, forward every emitted
Sequence::Tripleto the handler plus the terminalSequence::Literalsfromtail_literals_len.skip_matching_with_hint(hint)—None/Some(true)=history append only;
Some(false)= dictionary-priming paththat ALSO pre-populates the hash table for every position in
the appended range, so cross-block matches against
dict-primed bytes work in subsequent blocks.
trim_to_window()— drop history pastmax_window_size, bumpprefix_start_index, clear the hash table. Donor'sZSTD_window_correctOverflowequivalent. Returns evictedbytes for the dictionary-budget retire loop.
last_committed_space()— returns pending pre-processing ORhistory[last_block_start..]post-processing (legacyMatchGenerator
window.last().dataparity for the framecompressor's raw-block emission path).
reset(window_log, hash_log, mls)— per-frame state reset,keeps the hash table allocation if
(hash_log, mls)unchanged.
MatchGeneratorDriverSimple-arm wiring:commit_space→m.accept_data(space).start_matching::<Fast>→m.start_matching(handler).skip_matching_with_hint→m.skip_matching_with_hint(hint).reset→m.reset(params.window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS).prime_with_dictionary→ field-leveloffset_hist/max_window_sizemutation continues to work (both opened topub(crate)for legacy parity).trim_after_budget_retire→m.trim_to_window().suffix_poolfield removed (no consumer after the legacymatcher's SuffixStore reuse path went away).
Invariants preserved
prefix_start_index >= 1at all times. Position 0 inhistoryispermanently sub-prefix so the hash table's empty-slot sentinel
value
0cannot be confused with a real match position.Pinned in
with_params+reset.history.len()is bounded by2 × max_window_sizepost-append.Eviction in
extend_history_with_pendingenforces.rep[0..2](kernel-tracked) ↔offset_hist[0..2](wire encoder-tracked) stay in sync via per-Triple
encode_offset_with_historycalls. The
lit_len == 0edge case (donor'srep[0]-1shift)is not modeled today — the phase-1 kernel doesn't emit
lit_len == 0Triples. A future cmov / lookahead-pipelinedkernel variant will need explicit handling.
Tests
568 / 568 pass on the full workspace nextest suite:
FastKernelMatcheritself (lifecycle, accept +start, skip flavors, dict-prime hash population, eviction,
boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync,
prefix-eviction during dict-priming).
prefix-filter, repcode backward extension three-piece proof,
short-input early-return uniformity).
frame_compressorintegration tests (raw-block detection,hinted source-size matrix, level-1 round-trips through both
our decoder and FFI decode).
cross_validationRust-encoded → FFI-decoded round-trips(every level 1..=22, dict + no-dict, encoded by Rust then
read by the C reference decoder verbatim).
18 legacy tests in
match_generator.rsthat tested specificallySuffixStoreorMatchGenerator::new(...)behavior are#[cfg(any())]-gated with explanatory comments — their substanceeither has equivalent coverage in the new tests or relied on
algorithm-specific quirks of the SuffixStore matcher that the
donor-shape kernel doesn't reproduce by design (e.g. matching on
8-byte blocks below the kernel's
HASH_READ_SIZE = 8floor).Benchmark (i9-9900K,
compress/level_1_fast/*/matrix/pure_rust)cargo benchdeltas vsmain(fc63464, phase 1a merged but notwired) on the same host. Negative = faster, positive = slower.
low-entropy-1mcount_forward)decodecorpus-z000033high-entropy-1msmall-10k-randomsmall-4k-log-linessmall-1k-randomlarge-log-streamhash_log = 14(16K slots) collides heavily on dense repeating content. See follow-up.Decompress side picks up a small indirect win (−3% to −5%) on
random-data scenarios because the sequence shape coming out of the
donor-shape kernel matches the decoder's optimised paths more
closely (fewer mid-block
Literalsemissions → less RingBufferdispatch).
large-log-streamregression analysisThe regression is expected given the trade-off chosen at the
phase 1b kickoff (donor parity over speed):
SuffixStorematcher used an effectivelywindow_log-sizedhash store (512K slots for level 1's
window_log = 19) plushash_fill_step = 3interleaving. Collision rate stayed low evenon 25 MiB logs.
hash_log = 14(16K slots).Against 25 MiB of compressible log content, the collision rate
per slot climbs into the thousands, and the kernel's single-cursor
loop loses matches the legacy matcher would have caught.
baseline shows donor at ~120 MB/s on similar workloads. The
current main (this PR's baseline) was already faster than donor
on this scenario because the legacy matcher was over-engineered
for speed at the cost of
O(window)memory.Closing the gap is exactly the scope of issue #198 phase 3 (items
2 / 3 / 5 in the issue body): 4-cursor
ip0/ip1/ip2/ip3lookaheadpipelining +
cmovmatch-found variant + per-levelmlsdispatch.Phase 3 lands on a follow-up branch once phase 1b is reviewed and
merged.
The net of phase 1b across the scenario matrix is positive
(geomean ≈ 2× faster, headline scenarios up to 10×). The
regression is bounded to dense-compressible large logs and is on
the documented roadmap.
What's NOT in this PR (lands separately)
ip0/ip1/ip2/ip34-cursor pipelining,
cmovmatch-found variant, mls dispatchknob per level (
hash_fill_stepstride for dict-priming isstill hard-coded to 1 today — the LevelParams field is wired
in but the Fast matcher always strides at 1).
LevelParams.hash_log/LevelParams.mlsfields (currentlyFast hard-codes donor level-1 defaults 14 / 7 from
FAST_LEVEL_1_HASH_LOG/FAST_LEVEL_1_MLS). Per-level scalingfor small-source-hint windows lands when
LevelParamsgrowsthese fields.
Closes #198 phase 1b (the per-issue plan calls phase 3 a separate
follow-up branch).
Related: #178 (umbrella regression issue), #215 (phase 1a — kernel
modules, merged in fc63464).
Summary by CodeRabbit
Refactor
Tests