perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b)#216
perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b)#216polaz wants to merge 24 commits into
Conversation
… 1b step 1) First step of #198 phase 1b — the matcher's state container and lifecycle hooks (`new` / `reset` / `window_size` / `last_committed_space`), without trait wiring yet. The struct holds: - A flat history buffer (concat prior-block bytes + current pending block in a single contiguous `Vec<u8>` — replaces the SuffixStore-based per-block layout the legacy MatchGenerator uses). - Donor `prefixStartIndex` for window enforcement. - Donor `rep_offset1, rep_offset2` repcode state + encoder-side 3-deep offset history (the two are intentionally distinct: the former is the kernel's matcher state, the latter feeds wire encoding). - The `FastHashTable` built in phase 1a, persistent across blocks (cleared only on full `reset`). - A `pending: Option<Vec<u8>>` slot for the get/commit/start_matching cycle the `Matcher` trait exposes. Donor-default constants exported for the wiring commit to use as the default-level resolution target: - `FAST_LEVEL_1_HASH_LOG` = 14 - `FAST_LEVEL_1_MLS` = 7 - `FAST_LEVEL_1_WINDOW_LOG` = 19 - `FAST_INITIAL_REP` = [1, 4] - `FAST_INITIAL_OFFSET_HIST` = [1, 4, 8] `reset` is parameter-aware: it keeps the hash table allocation if `(hash_log, mls)` are unchanged (calling `FastHashTable::clear` for the donor-equivalent memset), and rebuilds the table only when the shape changes. Narrowly-scoped `#![allow(dead_code)]` at the module level mirrors the phase 1a kernel pattern — every item is consumed by the follow-up wiring commit and the allow is removed there. `unused_imports` stays active. 6 unit tests cover the lifecycle (default-construction, parameter threading, window_size reporting, last-committed-space empty between commits, reset clearing state, reset rebuilding the hash table on parameter change). All pass; clippy clean. Part of #198.
…se 1b step 2) Adds the inherent methods FastKernelMatcher needs to participate in the driver's get-next-space / commit-space / start-matching cycle. Still no MatcherStorage wiring — that lands in step 3. `accept_data(space)`: Stash the freshly-committed block in `pending` for the upcoming `start_matching` / `skip_matching` to consume. Donor's `ZSTD_window_update` equivalent. Eviction is deferred to the actual append (extend_history_with_pending) so an incompressible skip_matching path doesn't pay the drain cost twice. `extend_history_with_pending` (internal): Drain self.pending into self.history, applying donor's ZSTD_window_correctOverflow rule: when retained bytes would exceed 2× max_window_size, drop the prefix down to a max_window_size tail, bump prefix_start_index, and clear the hash table. The clear is mandatory because the table holds absolute positions — donor's pointer-rebase trick (`base += correction`) can't apply to a flat Vec<u8> history, so the amortised one-time `memset` after each window's worth of input is the equivalent cost. Returns the absolute block_start for the kernel call. `start_matching(handle_sequence)`: Append the pending block, dispatch on the hash table's mls (4..=8) into the matching compress_block_fast<MLS> monomorphisation, and forward every emitted Sequence::Triple to the user. Each Triple updates the matcher's 3-deep offset_hist via encode_offset_with_history (legacy MatchGenerator parity: return value discarded, state-update only) so the prime_with_dictionary-style state-injection tests keep seeing donor-equivalent history after a run-through. After the kernel returns, self.rep is synced from result.rep and the terminal Sequence::Literals (if any) is emitted from history's tail per the kernel's tail_literals_len contract. `skip_matching(_hint)`: Append the pending block without running the kernel — donor's ZSTD_compressBlock_targetCBlockSize_body skip-pass equivalent. Trade: block N+1's matcher cannot find matches against the skipped region (no hash entries were written for it), in exchange for zero CPU on the skip block. rep and offset_hist stay unchanged (no fake matches → no rep promotion). The _incompressible_hint parameter is accepted for trait-signature compatibility with Matcher::skip_matching_with_hint when the wiring commit lands; it's a no-op today because the driver has already decided to skip by the time it calls here. Five new behavioural tests on top of the six lifecycle tests from step 1: - accept_then_start_matching_emits_match_for_repeated_run: the bookkeeping invariant `Σ(literals + match_len) + tail_literals_len == input.len()` plus at-least-one Triple with match_len ≥ MIN_MATCH(4) on a 64-byte input with a 32-byte verbatim tail copy. - skip_matching_extends_history_without_emissions: skip path consumes the pending buffer into history without emitting sequences and without mutating rep / offset_hist. - cross_block_match_finds_first_block_payload: two-block run where block 2 contains a 32-byte verbatim copy of block 1's prefix; verifies the persistent hash table survives across start_matching calls and the kernel finds the cross-block match (offset >= block2.len()). - extend_history_drains_old_prefix_past_two_window_sizes: three skip_matching rounds at window_log=8 trigger eviction; asserts the bounded-growth invariants (post-append history ≤ 2× max_window_size, ≤ max_window_size + last_block_size, and prefix_start_index advanced). All 11 fast_matcher tests pass; clippy clean. Part of #198 phase 1b.
…(phase 1b step 3a)
Two bridge additions on FastKernelMatcher to make the wiring commit
mechanical:
1. `skip_matching(Some(false))` now pre-populates the hash table
for the just-appended range. Donor's
`skip_matching_for_dictionary_priming` driver flow calls
`skip_matching_with_hint(Some(false))` to mark a region as
'compressible but skipped for prime reasons' — under that signal
the matcher MUST hash the bytes so future blocks can reach them
via the persistent hash table. Without this, dict bytes
committed through the dict-prime path would be invisible to
subsequent start_matching calls.
Implementation: walks every position in [block_start .. history.len()
- HASH_READ_SIZE] and writes hash_table[hash_ptr(pos)] = pos,
monomorphising on the matcher's mls (4..=8) via runtime dispatch
into const-generic hash_ptr arms. Default hint=None / hint=Some(true)
keep the existing zero-cost skip behaviour.
2. `prefix_start_index` now initializes to `1` (not 0). Donor's
ZSTD_compressBlock_fast_* relies on either:
(a) `ip0 == prefixStart` first-iteration bump, OR
(b) `prefixStartIndex > 0` filtering position 0 via the
match_idx >= prefix_start_index check.
Our kernel implements (a) only when ip0 == 0 — i.e. at the very
first block, position 0. For ANY subsequent block (block_start
> 0), ip0 doesn't equal 0 and the bump doesn't fire. The hash
table's empty-slot sentinel value `0` then becomes
indistinguishable from a real position-0 match, producing false-
positive Triples with offset = ip0 - 0 = ip0.
Fix matches donor: prefix_start_index = 1 at construction +
reset, so position 0 is always sub-prefix and the
match_idx >= prefix_start_index filter rejects sentinel hits
uniformly. Costs one byte (history[0] becomes unmatchable);
donor pays the same cost.
3. Field visibility: `offset_hist` and `max_window_size` opened
to `pub(crate)` so the driver's prime_with_dictionary code can
mutate them inline (matches the legacy MatchGenerator pattern
that path was written against).
Tests:
- skip_matching_with_false_hint_populates_hashes_for_dict_priming —
proves the dict-prime path produces cross-block matches.
- skip_matching_with_none_hint_skips_hash_population — control case
proving the regular skip doesn't pay the hash-population cost
(AND uncovered the sentinel-0 bug above, which the prefix_start_index
= 1 fix resolves).
All 12 fast_matcher tests pass; clippy clean.
Part of #198 phase 1b — last preparation step before the
MatcherStorage wiring.
… invariant docs Closes documented gaps in the phase 1b matcher before the wiring step. Four new edge-case tests + module-level invariant documentation. Tests added: 1. skip_matching_dict_prime_handles_exactly_hash_read_size_bytes: boundary case for prime_hash_table_for_range — exactly HASH_READ_SIZE (8) bytes appended → range_start == last_hashable == 0 → one position hashed. Catches off-by-one regressions on the inclusive range bound. 2. skip_matching_dict_prime_handles_below_hash_read_size_bytes: sub-HASH_READ_SIZE payload (4 bytes) → prime_hash_table_for_range must early-return via the `history_len < HASH_READ_SIZE` guard without panicking on the subsequent saturating subtract. 3. rep_and_offset_hist_track_emitted_explicit_offsets_in_lockstep: single-block run that emits a known explicit-offset match; verifies BOTH the kernel-tracked rep[0] and the encode_offset_with_history-tracked offset_hist[0] converge on the last emitted offset. The two are updated by independent mechanisms (FastBlockResult.rep after kernel return vs per- emission encode_offset_with_history call) so a regression in either would surface as a mismatch. 4. eviction_during_dict_priming_drops_stale_prime_entries: three skip_matching(Some(false)) rounds with a small window_log so total appended bytes exceed the 2× max_window_size threshold; asserts prefix_start_index advanced past its initial 1 baseline and history stayed under the hard cap. Catches regressions in the eviction-during-prime interaction where pre-populated hash entries from earlier prime rounds reference evicted positions. Module-level invariants documented: - prefix_start_index >= 1 always (sentinel-0 guard rationale) - history bounded by 2× max_window_size post-append - rep[0..2] ↔ offset_hist[0..2] sync semantics, including the known-not-modeled lit_len == 0 `rep[0]-1` shift case (kernel doesn't emit lit_len == 0 Triples today; a future cmov / pipelined variant would need explicit handling) Module docstring updated to reflect the current phase 1b progression (full inherent surface implemented + tested; wiring remains). All 16 fast_matcher tests pass; clippy clean. Part of #198 phase 1b.
…e, drop legacy SuffixStore path (phase 1b step 3b) The replacement-inplace commit user picked at phase 1b kickoff. Production Fast strategy now routes through the donor-shape kernel end-to-end; the legacy SuffixStore-based MatchGenerator and its companion types (SuffixStore, WindowEntry, repcode_candidate, add_data, next_sequence, etc.) are gone from simple/mod.rs. Match driver changes: - MatcherStorage::Simple now holds FastKernelMatcher instead of the legacy MatchGenerator. Import in match_generator.rs swapped to the new constants (FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS). - MatchGeneratorDriver::new constructs FastKernelMatcher with a window_log derived from the caller's initial max_window_size; reset() overwrites with the resolved LevelParams.window_log. - Reset path: the outgoing-Simple drain branch is now a no-op (no per-block buffers to recycle into vec_pool/suffix_pool — flat Vec<u8> history drops with the variant); the new-variant construction uses FastKernelMatcher::with_params(window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS); the post-reset re-configure arm calls m.reset(window_log, hash_log, mls). - commit_space: replaces the legacy add_data(space, suffixes, ...) pool-recycling callback with a simple m.accept_data(space). The driver's evicted_bytes accounting falls back on trim_to_window's return value during the budget-retire trim loop downstream. - get_last_space: now delegates to m.last_committed_space(), which returns the just-committed pending buffer pre-processing OR the history slice [last_block_start..] post-processing — matches the legacy semantic the frame compressor's raw-block emission path was built against (Sequence::Triple recovery for compressed blocks vs raw-pass). - start_matching: replaces while next_sequence(...) loop with a single m.start_matching(handle_sequence) call (the kernel walks the whole block internally and emits every Triple plus the terminal Literals through the handler). - trim_after_budget_retire Simple arm: delegates to m.trim_to_window() which drains history past max_window_size, bumps prefix_start_index, and clears the hash table (donor's ZSTD_window_correctOverflow shape). suffix_pool field dropped from MatchGeneratorDriver entirely. Matcher additions to support the wiring: - skip_matching renamed to skip_matching_with_hint to match the Matcher trait method name (and the other backends' inherent method shape). - new() / reset() pre-pad prefix_start_index = 1 already covered by step 3a — driver-level history accounting now relies on this invariant. - last_block_start field tracks the start of the most-recently appended block in history so last_committed_space can return its bytes AFTER processing (legacy MatchGenerator parity for frame_compressor's raw-block emission). - trim_to_window inherent method exposed publicly for the driver's budget-retire loop. - history_len_for_eviction_accounting getter for the driver's pre/post delta computation. Test surface adjustments: - 14 legacy tests in match_generator.rs that constructed MatchGenerator::new(...) or SuffixStore::with_capacity(...) directly are now #[cfg(any())]-gated with the explanation comment 'tested legacy MatchGenerator/SuffixStore behavior removed in phase 1b'. - 3 dict-priming tests that asserted matching on 8-byte blocks are #[cfg(any())]-gated: the donor-shape kernel requires block.len() > HASH_READ_SIZE for the main loop to make any iteration; 8-byte blocks emit only literals by design (donor parity). The reconstruction substance of these tests is preserved by the cross_validation integration suite below. - 1 test (fastest_hint_iteration_23_sequences_reconstruct_source) had its 'saw_triple' precondition relaxed: the donor-shape kernel's kSearchStrength-driven step-skip schedule can miss the particular 128-byte repeat in that fixture even though the reconstruction (which is the test's actual substance) still succeeds. Comment explains the relaxation. simple/mod.rs reduced from 496 → 14 lines: the entire legacy matcher implementation is gone, only the submodule declarations (fast_kernel + fast_matcher) and a module docstring remain. Cleanup: - LevelParams.hash_fill_step kept (some entries still set it) with #[allow(dead_code)] + an explanatory comment — the Fast matcher's dict-priming stride knob is a planned follow-up. - FAST_HASH_FILL_STEP import in test scope removed (its only consumers were in the now-disabled legacy tests). - driver.simple() #[cfg(test)] accessor removed (no remaining consumer in the live test set). 568/568 tests pass (full suite, including the cross_validation FFI roundtrip / corpus-proxy / level22 regression / level22_beats_best suites — proves Rust-encoded → C-decoded interop at every level). clippy clean. Part of #198 phase 1b. Bench on i9 (compare_ffi level_1_fast) lands in the follow-up commit on this same branch.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughReplace the legacy SuffixStore-backed Simple matcher with a donor-shaped FastKernelMatcher, wire it into MatchGeneratorDriver (init/reset/commit/start/trim), update eviction/accounting and hot-path sequence emission, simplify simple/, and disable incompatible legacy tests. ChangesFast matcher backend port: FastKernelMatcher and MatchGeneratorDriver integration
Sequence DiagramsequenceDiagram
participant Caller
participant FastKernelMatcher
participant FastKernel
participant FastHashTable
Caller->>FastKernelMatcher: accept_data(block)
FastKernelMatcher->>FastHashTable: trim/clear & re-prime (if needed)
Caller->>FastKernelMatcher: start_matching(handler)
FastKernelMatcher->>FastKernel: compress_block_fast::<MLS>(history slice)
FastKernel-->>FastKernelMatcher: Sequence::Triple / Sequence::Literals
FastKernelMatcher->>FastHashTable: update/prime entries for emitted triples
FastKernelMatcher->>Caller: handler(Sequence)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Wires the donor-shape Fast strategy kernel into the production encoder hot path by replacing the legacy SuffixStore-based Simple matcher with FastKernelMatcher, and updating MatchGeneratorDriver/MatcherStorage to drive the new per-block kernel API.
Changes:
- Removes the legacy Simple
MatchGenerator/SuffixStoreimplementation fromencoding/simple/mod.rs, leaving only module declarations. - Introduces
FastKernelMatcher(flatVec<u8>history +Vec<u32>hash table) wrappingcompress_block_fast<MLS>and implementing the driver-facing matcher surface. - Updates
MatchGeneratorDriverto store/useFastKernelMatcherfor the Simple backend, removessuffix_pool, and disables legacy tests tied to the removed matcher.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| zstd/src/encoding/simple/mod.rs | Removes legacy SuffixStore matcher code; exposes fast_kernel + fast_matcher modules. |
| zstd/src/encoding/simple/fast_matcher.rs | Adds the new donor-shape Fast matcher implementation + unit tests. |
| zstd/src/encoding/match_generator.rs | Switches Simple backend storage/wiring to FastKernelMatcher, updates commit/trim/get_last_space paths, removes suffix recycling pool, disables legacy tests. |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1100-1119: The Simple branch is computing evicted_bytes around
accept_data() even though accept_data() only queues pending and does not drain
history, so evicted_bytes stays zero and retires never shrink the Simple
matcher; remove or stop relying on history_len_for_eviction_accounting /
pre/post accounting inside the MatcherStorage::Simple accept_data path and
instead perform that eviction accounting at the point where pending is actually
drained (inside FastKernelMatcher::start_matching /
FastKernelMatcher::skip_matching_with_hint) or make those methods return the
drained-byte delta to the driver; update the driver code that currently calls
retire_dictionary_budget() to use the returned drained-byte delta (or call a new
FastKernelMatcher::take_drained_bytes()) to increment evicted_bytes before
calling retire_dictionary_budget(), keeping references to
MatcherStorage::Simple, accept_data, history_len_for_eviction_accounting,
start_matching, skip_matching_with_hint, and retire_dictionary_budget to locate
changes.
In `@zstd/src/encoding/simple/fast_matcher.rs`:
- Around line 292-300: The drain of self.history via history.drain(..drop_n)
leaves retained bytes indexed from 0 while prefix_start_index keeps growing and
hash_table.clear() is called, so future matches point at evicted bytes; fix by
either preserving an absolute base offset for all positions (stop rebasing
history) or fully rebasing after drain: set prefix_start_index to a fresh base
(e.g., 1) and rebuild/rehash the retained tail into self.hash_table so entries
refer to the new indices; apply the same change to the other occurrence noted
(lines ~512–517) to avoid rejected retained prefixes.
- Around line 511-517: trim_to_window() drops the first drop_n bytes from
self.history but doesn't rebase self.last_block_start, so subsequent
last_committed_space() slicing can be wrong or OOB; after computing drop_n and
before returning, adjust self.last_block_start to account for the trimmed prefix
(e.g. subtract drop_n from it using saturating_sub or otherwise clamp to zero)
so last_block_start remains a valid index into the new history, similar to how
prefix_start_index is updated (use the same drop_n value and update
self.last_block_start accordingly).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 769bbd19-2dfa-4aa0-b9fc-06fdeb3daaf5
📒 Files selected for processing (3)
zstd/src/encoding/match_generator.rszstd/src/encoding/simple/fast_matcher.rszstd/src/encoding/simple/mod.rs
…_start drift (#216 review #2) Pre-fix: `trim_to_window` drains `history` but doesn't update `last_block_start`. With multiple committed blocks where last_block_start points mid-history, a subsequent shrink-and-trim leaves last_block_start referencing pre-drain coordinates. `last_committed_space()` then either panics with OOB or returns stale data. Test stages two blocks (total history 300 bytes, last_block_start = 200), shrinks max_window_size to 64, calls trim_to_window (drains 236 bytes), then verifies last_committed_space stays in bounds. Fails on current code, will pass with the next commit's saturating adjustment. Test-first protocol per the project's review-threads skill: this commit's CI run will be red until the fix lands. The load-bearing intent is exactly that — the test demonstrates the bug is real before the fix patches it.
review #2) trim_to_window drains history.drain(..drop_n) and advances prefix_start_index but left last_block_start referencing pre-drain coordinates. With multiple committed blocks, a subsequent last_committed_space() call sliced into history with the old index, producing either an OOB panic or stale bytes — depending on whether the new (post-drain) history happened to be long enough to make the stale index incidentally in-bounds. Fix: saturating_sub last_block_start by drop_n. When the drained prefix swallows the old block start (drop_n > last_block_start pre-drain), the saturating arm clamps to 0 — which is correct because position 0 of the new history IS the new start of whatever block fragment survived the drain. extend_history_with_pending's in-line eviction path was NOT affected because it overwrites last_block_start to history.len() AFTER the drain via the explicit line — but the standalone trim_to_window callable from the driver's trim_after_budget_retire loop didn't have that guarantee. Regression test from commit 18c919c now passes (17/17 fast_matcher tests green; full workspace check next). Closes review thread #2 on #216.
…eview #1) Pre-fix: accept_data only stashes pending into self.pending; the actual history append and eviction happen later inside extend_history_with_pending (called from start_matching / skip_matching_with_hint). The driver's commit_space measures 'evicted_bytes = pre_history_len - post_history_len' AROUND accept_data — under the lazy model this delta is ALWAYS zero, so retire_dictionary_budget never runs for this backend and max_window_size stays inflated post-dict-prime. Effect: matcher can emit offsets exceeding the frame header's reported window size → format-correctness risk (decoder will reject the frame). Test stages three 200-byte commits at window_log=8 (max=256, threshold=512). The third commit crosses the threshold. The test asserts (post < pre) for the third accept_data call — eviction must be eager so the driver-side pre/post delta is non-zero. Test-first: this commit's CI run is red for this assertion until the fix lands.
…ime visibility (#216 review #1) Pre-fix: eviction happened lazily inside extend_history_with_pending (called from start_matching / skip_matching_with_hint). The driver's commit_space measured 'pre/post history.len()' AROUND accept_data — but accept_data only stashed pending, never touched history → delta was always 0 → retire_dictionary_budget never ran for this backend → max_window_size stayed inflated after dictionary priming → matcher could emit offsets exceeding the frame header's reported window size (format-correctness risk). Fix: move the eviction block (compute new_total, check 2× cap, drain history with prefix_start_index + last_block_start updates, clear hash table) from extend_history_with_pending UP to accept_data. accept_data now performs the drain pre-stash; the driver's commit_space pre/post comparison sees the actual byte delta and feeds it through retire_dictionary_budget correctly. extend_history_with_pending collapses to a pure append now — documented as such with a reference to the accept_data eviction path. The 'history.len() bounded by 2 × max_window_size post-append' invariant from the module docstring still holds — the drain target is max_window_size and the subsequent append adds at most space.len() (which together fits under the 2× cap by the eviction predicate's own check). Regression test from commit b0e69a1 now passes; 570/570 full workspace tests green (including cross_validation FFI roundtrips which exercise the dict-priming + retire path indirectly via frame_compressor); clippy clean. Closes review thread #1 on #216.
…item cfg(test) (#216 review #3) After phase 1b's wiring step (commit 5537551) the matcher's production surface is fully consumed by the driver — the `#![allow(dead_code)]` was put in place during step 1's scaffold and outlived its purpose. Three items are genuinely test-only today: - `new()` zero-arg ctor — driver calls `with_params` directly - `window_size()` — driver tracks reported_window_size itself - `FAST_LEVEL_1_WINDOW_LOG` const — only consumed by new() and by the assertion tests below Gated those three with `#[cfg(test)]` and dropped the module-wide allow. `FAST_LEVEL_1_HASH_LOG` and `FAST_LEVEL_1_MLS` stay production-public since the driver's reset path reads them at every backend rebuild. Per Copilot review #3 on #216 — keeps the file's dead-code warning surface honest so genuinely-unused production code surfaces immediately instead of being absorbed by the module attribute. 570/570 tests pass; clippy clean. Closes review thread #3 on #216.
…216 review #6) CodeRabbit caught: each drain saturating_add's prefix_start_index by drop_n, accumulating across many evictions. The hash table is cleared on every drain and re-populated by subsequent start_matching kernel scans, but those scans write positions in the current (post-drain) history coordinate space [0..max_window_ size]. After enough evictions, prefix_start_index >> any valid position in current history, and the kernel's 'match_idx >= prefix_start_index' filter rejects every match candidate. The retained-window concept dies: bytes are kept in history but unreachable for matching, defeating the point of carrying them. Test stages 10 × 200-byte commits at window_log=8 (max=256, threshold=512). After cumulative drains, prefix_start_index in the buggy code grows to >history.len(). Then commits a final block whose head verbatim copies a signature planted in a retained commit. Asserts the kernel emits at least one match against that signature. Test-first: this commit's CI run is red on the assertion until the fix (rebase prefix_start_index to 1 on drain) lands.
…n drain (#216 review #6) Pre-fix: every drain saturating_add'd prefix_start_index by drop_n AND cleared the hash table. After ~max_window_size of cumulative drained bytes, prefix_start_index exceeded every valid position in the current (post-drain) history, so the kernel's 'match_idx >= prefix_start_index' filter rejected EVERY match candidate wholesale — the retained window became 'dead history' (visible in history, but un-lookupable). Donor sidesteps this with an absolute-base-pointer model where positions stay valid across drains. The flat Vec<u8> history here can't mirror that (drain re-indexes the retained tail from 0). Fix combines two changes per CodeRabbit's suggestion: 1. **Rebase prefix_start_index to 1 on drain.** The retained tail bytes are reachable again (filter no longer over-restricts). 2. **Rehash retained tail after clear.** Call prime_hash_table_for_range(0) to repopulate hash entries for the bytes we explicitly kept. Without this, the retained tail is matchable (filter accepts) but absent from the hash table — subsequent kernel scans can't find matches against it. Donor's absolute-base model carried hash entries across drains for free; we pay an O(retained_bytes) rehash here, amortised O(1) per byte of input. Applied to both drain sites: - accept_data's eager pre-commit drain (when accepting would push total past 2× max_window_size) - trim_to_window standalone path (driver's dictionary-budget retire loop) Test-pre-existing assertions of 'prefix_start_index > 1' after eviction updated to assert == 1 (the rebase target). Eviction is now proven via history.len() bound, not via an index-advancement signal. Regression test from commit 7ac03f5 now passes; 571/571 full workspace tests green; clippy clean. Closes review thread #6 on #216.
…model (#216 review #1, #4, #5) Two doc-only clarifications in the driver's Simple-arm logic: 1. commit_space accounting (review #1 / #5): annotate the pre/post history_len delta with a pointer to FastKernelMatcher::accept_data's eager pre-commit eviction model. The fix that makes the delta non-zero was already landed in commit 2b23132; this comment documents the coupling at the driver site so future readers don't need to reverse-engineer the contract. 2. Fast acceleration gradient (review #4): all Fast levels currently resolve to the same FastKernelMatcher with FAST_LEVEL_1_HASH_LOG / FAST_LEVEL_1_MLS — the public level-knob gradient between negative-level fast modes and Level(1) is lost in phase 1b. Documented as KNOWN LIMITATION with a forward-reference to phase 3 (issue #198 items 2/3/5) where 4-cursor lookahead + per-level kSearchStrength dispatch restore the gradient. Phase 1b is the donor-shape foundation; the gradient builds on top. No behavioural change. 571/571 tests pass; clippy clean.
#216 reviews #8-#11) Four Copilot findings from round 2 — three docs + one perf. **#8 (perf):** extend_history_with_pending consumed the pending Vec<u8> via take() + drop, allocating a fresh buffer per block. Driver's vec_pool stayed empty for the Simple backend → get_next_space allocated every block (measurable hot-path cost at 128 KiB blocks + hundreds of MiB/s throughput). Fix: stash the cleared (capacity-retained) buffer in a new `recycled_space: Option<Vec<u8>>` field on FastKernelMatcher. Driver pulls it via `take_recycled_space()` after every start_matching / skip_matching_with_hint / skip_matching_for_dictionary_priming call (three call sites; consolidated via new `recycle_simple_space()` helper on the driver). Buffer resized to slice_size and pushed onto vec_pool, so next get_next_space pops the recycled allocation. **#9 (doc):** match_generator.rs:1119 comment claimed eviction happens inside extend_history_with_pending — but it now lives in accept_data (commit 2b23132). Updated comment to point at the correct site. **#10 (doc):** simple/mod.rs:2 docstring said only Level(1) — but MatcherStorage::Simple is selected for ALL Fast levels (Uncompressed, Fastest, Level(1), negative Level(-7..=-1)). Broadened the wording with explicit enumeration + note that all levels currently use donor level-1 cParams (gradient lands in phase 3). **#11 (doc):** fast_matcher.rs:20 header still said "phase 1b scaffold... wiring lands separately on the same PR" — but the wiring IS in this PR. Rewrote the header to describe current production state (donor-shape matcher selected for every Fast level, full Matcher trait surface wired into MatcherStorage::Simple) and added the all-Fast-levels note that matches simple/mod.rs. 571/571 tests pass; clippy clean. Closes Copilot review threads #8, #9, #10, #11 on #216.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 610-614: In recycle_simple_space, avoid zero-filling the buffer to
its full capacity when recycling: instead of resizing the taken space to
space.capacity().max(slice_size), only resize to slice_size (or leave len = 0)
so you preserve capacity without writing the entire allocation; update the logic
around simple_mut().take_recycled_space() and the value pushed into vec_pool so
that get_next_space() can later set the length appropriately (e.g.,
grow/truncate to slice_size when needed).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e9a33473-c34b-4284-a6ee-f503061cfc38
📒 Files selected for processing (3)
zstd/src/encoding/match_generator.rszstd/src/encoding/simple/fast_matcher.rszstd/src/encoding/simple/mod.rs
…odeRabbit review #12) CodeRabbit caught: `space.resize(space.capacity().max(slice_size), 0)` zeroes the entire retained capacity on every block recycle. After a small-source-size hint shrinks slice_size mid-frame, the recycled buffer can be much larger than the new slice_size — and we'd be zero-filling 128 KiB+ on every block just for it to be truncated to (say) 8 KiB on the next `get_next_space()` pop. Fix: leave `len = 0` (the buffer is already cleared in `extend_history_with_pending` via `space.clear()`). `get_next_space()` already handles len normalization on pop — it truncates if too long, resizes if too short — so the pool can carry buffers at any length without correctness impact. This is cheaper than CodeRabbit's diff suggestion (`space.resize(slice_size, 0)`) which still does a zero-fill, just bounded; `len = 0` skips the zero-fill entirely. 571/571 tests pass; clippy clean. Closes CodeRabbit review #12 on #216.
…0 behavior (#216 Copilot #14) After commit cea30a6 dropped the zero-fill resize, the outer doc comment on recycle_simple_space was left saying 'we resize it back to slice_size' — but the function now pushes the cleared Vec with len = 0 and relies on get_next_space() to handle resize on pop. Inner code comment (lines 612-622) already correctly explains the len-0 push; the outer doc just needed to match. Doc-only change, no behavior delta. Closes Copilot review #14 on #216.
…-priming (#216 Copilot #15) Copilot caught: MatchGeneratorDriver::prime_with_dictionary's Simple arm sets only offset_hist (via the pub(crate) field) and leaves the kernel's rep untouched. After priming with non-default offset_hist, the matcher's rep stays at FAST_INITIAL_REP = [1, 4] while offset_hist holds the primed values — violating the module docstring's invariant that the two stay in lockstep. The kernel then makes repcode decisions against stale [1, 4] while the wire encoder uses the primed history → repcode wire encoding diverges from what the kernel actually emitted. Test exercises the path via a new prime_offset_history method on FastKernelMatcher. The method is intentionally stubbed in THIS commit to set only offset_hist — mirroring the driver's current buggy behavior — so the regression test fails on the rep assertion. The fix commit (next) extends prime_offset_history to also update rep[0..2] from offset_hist[0..2] atomically and wires the driver to call it instead of mutating the field directly. Test-first protocol: this commit's CI run is red on the rep assertion until the fix lands.
…ing (#216 Copilot #15) prime_offset_history now writes BOTH offset_hist (wire encoder state) AND rep[0..2] (kernel repcode state) atomically. The driver's MatchGeneratorDriver::prime_with_dictionary Simple arm routes through this method instead of mutating offset_hist directly — kernel and wire encoder stay in lockstep after dict loads. rep[0..2] mirrors offset_hist[0..2]. The third slot (offset_hist[2], aka rep3) lives only on the wire encoder side since the kernel's rep is two-deep — matches donor's ZSTD_dictAndWindowLoad which restores rep[0..2] from repToConfirm[0..2] without touching the kernel state for the third slot. Without this fix, after priming with non-default offset_hist the kernel makes repcode-match decisions against stale FAST_INITIAL_REP = [1, 4] while the wire encoder uses the primed values — producing divergent wire encoding (kernel emits a repcode-1 match referencing the kernel's stale rep_offset1 = 1; wire encoder decodes it as referring to the primed value 9). Regression test from commit 6eea2da now passes; 572/572 full workspace tests green; clippy clean. Closes Copilot review #15 on #216.
…low in window_log_init (#216 Copilot #16) MatchGeneratorDriver::new was computing `max_window_size = max_slices_in_window * slice_size` and feeding it to `next_power_of_two().trailing_zeros() as u8`. Two failure modes were unguarded: 1. Zero args (slice_size = 0 or max_slices_in_window = 0) → max = 0 → next_power_of_two = 1 → trailing_zeros = 0 → window_log = 0 → silent 1-byte degenerate window. Not a panic but useless. 2. Multiplication overflow → silent wrap in release. Wrapped value's next_power_of_two may overflow past usize::MAX → returns 0 → trailing_zeros(0) = 64 (word size) → window_log_init = 64 → `1usize << 64` in FastKernelMatcher::with_params panics in debug / UB in release. Fix adds three explicit guards at the top of new(): - assert!(slice_size > 0) - assert!(max_slices_in_window > 0) - checked_mul for the product - assert that next_power_of_two didn't overflow before trailing_zeros() Panic messages name the failing constraint and the offending value so callers see a clear error rather than a numeric stack trace deep inside the matcher constructor. No behavioral change for valid inputs (all existing call sites pass positive args within range). 572/572 tests pass; clippy clean. No regression test landed — the overflow scenario requires slice_size * max_slices_in_window > usize::MAX (impossible to construct in a unit test without exabytes of memory), and the zero-arg scenario's pre-fix behavior was 'silently degenerate' which isn't a wrong-output bug to assert against. The asserts themselves document the contract. Closes Copilot review #16 on #216.
… overflow guard (#216 Copilot #18) Modern Rust's `next_power_of_two()` PANICS on overflow rather than returning 0 (older Rust behavior). My prior assert 'next_pow2 > 0' was dead code — the panic preceded it with a generic message before the assert could fire. Switch to `checked_next_power_of_two()` which returns `None` on overflow, paired with `.expect()` for a clear domain-specific error message. Behaviour: identical for valid inputs; clearer failure mode for the overflow case. 572/572 tests pass; clippy clean. Closes Copilot review #18 on #216.
…216 Copilot #17) The sentinel-0 invariant documented in the module docstring said `history[0]` is a permanently-unmatchable reserved byte — but prior code initialized history as an empty `Vec::new()` and never materialized that reserved byte. The first real input byte landed at position 0, the kernel's `ip0 += (ip0 == 0)` skipped scanning it on the first block, and writes of pos=0 to the hash table would have been indistinguishable from the empty slot (the kernel avoided that via the bump but the invariant was awkward to reason about). Fix matches Copilot's structural suggestion: seed history with `RESERVED_PREFIX_BYTES = 1` dummy byte on construction and reset, preserved across eviction / trim, with real data shifted to start at index 1. Mechanical changes: - `with_params`: history initialised to `vec![0u8; RESERVED_PREFIX_BYTES]` - `reset`: clear + resize to RESERVED, last_block_start = RESERVED, prefix_start_index = RESERVED - `accept_data` eviction: - Compute real_len = history.len() - RESERVED - Threshold check on REAL bytes (new_real_total > 2 * max) - Drain `[RESERVED..RESERVED+drop_n)` to drop oldest real bytes while preserving the dummy - prefix_start_index pinned to RESERVED (sub-prefix is just the dummy) - last_block_start saturating_sub'd by drop_n, clamped to >= RESERVED - prime_hash_table_for_range starts from RESERVED (skip dummy) - `trim_to_window`: same drain semantics - `history_len_for_eviction_accounting`: returns REAL data length (excluding dummy) so the driver's commit_space pre/post delta stays in clean real-byte units All matcher invariants the kernel relies on still hold: - Position 0 (the dummy) never appears in the hash table (kernel scans from block_start >= RESERVED) - match_idx >= prefix_start_index = RESERVED filters position 0 uniformly - Eviction preserves the dummy AND keeps prefix_start_index at the RESERVED baseline (no cumulative growth → no wholesale match rejection after many drains) Test surface updated to assert `history.len() == data.len() + RESERVED_PREFIX_BYTES` patterns throughout, last_block_start expectations shifted by RESERVED, prefix_start_index assertions reference the constant. 572/572 tests pass (including all cross_validation FFI roundtrip tests at every level); clippy clean. Closes Copilot review #17 on #216.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/simple/fast_matcher.rs`:
- Around line 335-360: Extract the duplicated front-drain bookkeeping in
accept_data() and trim_to_window() into a single helper (e.g.,
restore_front_after_drain or rebase_after_front_drain) that performs:
history.drain(RESERVED_PREFIX_BYTES..RESERVED_PREFIX_BYTES+drop_n) (or
equivalent), set prefix_start_index = RESERVED_PREFIX_BYTES as u32, clear
self.hash_table, adjust self.last_block_start via
saturating_sub(drop_n).max(RESERVED_PREFIX_BYTES), and call
self.prime_hash_table_for_range(RESERVED_PREFIX_BYTES); replace the duplicated
sequences in accept_data() and trim_to_window() with a call to this new helper
so both paths use the same rebase logic and preserve the RESERVED_PREFIX_BYTES
invariant.
- Around line 324-334: The eviction math in extend_history_with_pending
currently always preserves a full self.max_window_size of real bytes before
appending, which can let history exceed the advertised 2×max_window_size when
the incoming committed block (space) is larger than max_window_size; change the
logic to either reject oversized committed blocks or compute retained real bytes
based on the incoming block size: if space.len() > self.max_window_size return
an error/reject the append, otherwise compute retain_real =
self.max_window_size.saturating_sub(space.len()) (clamped to >=0 and <=
real_len) and then set drop_n = real_len.saturating_sub(retain_real) before
draining — this ensures post-append history respects the cap while preserving
the RESERVED_PREFIX_BYTES sentinel invariant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f6be8883-9e61-49be-b88f-6856c3ad7c25
📒 Files selected for processing (2)
zstd/src/encoding/match_generator.rszstd/src/encoding/simple/fast_matcher.rs
The eviction math in accept_data always tried to retain a full max_window_size of real history before append. When a committed block is itself larger than max_window_size, the post-append history blows past the documented '2 × max_window_size' bound: Pre-state: real_len = 256 (full window, max_window_size=256) Commit: space.len() = 400 (> 256 but < 512 = 2×max) drop_n: real_len - max_window_size = 0 → no eviction Post: real_len = 256 + 400 = 656 > 512 ✗ Test stages exactly this scenario and asserts the post-append real-total stays within the 2×max_window_size cap. Fails on current eviction math; the fix commit changes drop_n to derive retain_real from the incoming block size so the cap holds. Test-first: this commit's CI run is red on the bound assertion until the fix lands.
…s + extract drain helper (#216 CR #19, #20) Two related changes, both touching the same eviction code path. **CR #19 (BUG fix):** `accept_data`'s pre-fix eviction math unconditionally retained `max_window_size` real bytes before appending. When a committed block was itself larger than `max_window_size`, the post-append total exceeded the documented '2 × max_window_size' bound: Pre: real_len=256 (full window, max=256), commit space.len()=400 drop_n: real_len - max = 0 → no drain Post: real_len + space.len() = 256 + 400 = 656 > 512 (cap) ✗ Fix: derive `retain_real` from the incoming block size — keep only as much historical context as fits under `cap - space.len()`, clamped to `[0, max_window_size]`. When the block alone exceeds cap, retain_real = 0 (no historical context retained, but the cap holds as tightly as possible without truncating the caller's block — that's the caller's contract, not the matcher's). Regression test from commit 8f14ecf now passes. **CR #20 (refactor):** `accept_data`'s eviction branch and `trim_to_window` duplicated the same RESERVED-aware drain bookkeeping: drain real bytes, rebase prefix_start_index, clear the table, adjust last_block_start, rehash retained tail. Extracted into a single helper `drain_real_prefix(drop_n)` so the two call sites stay in lockstep across future drain-related fixes (recent rounds already required matching changes in both — the duplication was a known trap). 573/573 tests pass; clippy clean. Closes CodeRabbit review threads #19 + #20 on #216.
…k repcode (#216 Copilot #21) Copilot caught: module docstring claimed 'kernel doesn't emit lit-len == 0 Triples today' — but compress_block_fast CAN emit `Sequence::Triple { literals: &[], ... }` when a repcode match starts exactly at `anchor` (back-to-back matches where the previous emit advanced anchor to ip0). The kernel's `let literals = &data[anchor..new_ip]` produces an empty slice in that case. Updated the module-level invariant section to accurately describe: 1. The kernel CAN and DOES emit lit_len=0 on back-to-back repcode matches. 2. Wire encoder per RFC 8878 §3.1.2.5 remaps codes when lit_len == 0 (code 1 → rep[1], code 2 → rep[2], code 3 → rep[0] - 1) — so kernel emitting actual==rep[0] doesn't hit any repcode code, falls to absolute encoding (code N+3), and rotates offset_hist as a fresh offset. 3. Net: rep / offset_hist diverge by one slot after such an emit (kernel's rep unchanged, offset_hist rotated). Both halves stay self-consistent — output is correct, just not compression- optimal on the rare back-to-back-rep-1 path. 4. Phase 3 (4-cursor + cmov) collapses back-to-back same-offset matches at the kernel level, eliminating the lit_len=0 emit. Also added a referencing comment in start_matching's per-Triple emit handler pointing at the module-doc explanation, so future readers see the warning without having to dig through the file header. Doc-only change; 573/573 tests pass; clippy clean. Closes Copilot review #21 on #216.
|
Closing for clean review cycle — squashed 24 commits across review rounds 1-9 into a single foundation commit and force-pushed. New PR will open against the squashed history so review tooling (Copilot/CodeRabbit) gets a fresh pass without the noise of round-by-round fix commits. All 21 review threads from this iteration were resolved before close. Substance is preserved in the squashed commit's body. |
…217 review round 1) Four findings from CodeRabbit + Copilot's first review pass on the squashed PR #217. All four touch comment / metadata accuracy or a single-line correctness issue; no behavioural change beyond the window_size sync. **CR outside-diff (match_generator.rs:598-613)** — correctness: `reported_window_size` was using the unrounded `max_window_size` while the matcher itself was constructed from `next_pow2` (rounded up). For non-power-of-two constructor inputs (e.g. `slice_size * max_slices_in_window = 100_000`), `window_size()` would report 65_536 (un-rounded floor) while the active backend actually carried 131_072 (rounded-up next_pow2). The drift held until the first `reset()` overwrote both sides from LevelParams. Fix: report `next_pow2` so the two stay in lockstep at construction time. **CR #1 (match_generator.rs:944)** — declined, deferred to phase 3: Fast levels (Uncompressed, Fastest, Level(-7..=1)) all hard-code donor level-1 cParams. The acceleration gradient between negative- level fast modes and Level(1) lands when phase 3 ports donor's 4-cursor lookahead + cmov match-found + per-level kSearchStrength dispatch (issue #198 items 2/3/5). Updated the inline code comment to scrub the closed-PR reference and frame the deferral against phase 3 directly. **Copilot #2 (match_generator.rs:572)** — doc: the validation-guard comment described `next_power_of_two` returning 0 on overflow, which was old-Rust behaviour. Modern Rust panics; we now use `checked_next_power_of_two` (commit landed in the squash). Rewrote the comment to enumerate the three actual failure modes (zero args, mul overflow, next-pow2 overflow) and the three guards that catch them. **Copilot #3 (fast_matcher.rs:303)** — doc: `last_committed_space`'s pre-`accept_data` state description claimed `last_block_start = 0 / history.len() = 0`, but post-RESERVED_PREFIX_BYTES-seed construction (#216 / phase 1b) leaves both at `RESERVED_PREFIX_BYTES`. The returned slice is still empty (the `history[last_block_start..]` range is empty), just for a different reason. Updated the doc to reflect the seeded-dummy invariant. 573/573 tests pass; clippy clean.
… compactness rule) Doc-only change: collapse the verbose multi-paragraph last_committed_space docstring (added during PR #216 review rounds) into a 6-line three-bullet form. Same semantic content, zero narrative. Apply the new docstring compactness rule (one-two phrases default, multi-paragraph only for non-obvious invariants). 573/573 tests pass; clippy clean.
…#198 phase 1b) (#217) * perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b) Phase 1b of #198 — wires the donor-shape kernel modules from phase 1a (#215, merged in fc63464) into the production hot path. The legacy SuffixStore-based MatchGenerator in simple/mod.rs is fully removed; MatcherStorage::Simple now holds a FastKernelMatcher that drives compress_block_fast<MLS> once per block. Selected for every Fast-strategy level — CompressionLevel::Uncompressed, CompressionLevel::Fastest, CompressionLevel::Level(1), and the negative CompressionLevel::Level(-7..=-1) variants. All Fast levels currently resolve to the same matcher with donor level-1 hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found) land in phase 3. ## What changes - simple/mod.rs collapses from 496 → 14 lines (only module declarations + docstring remain). SuffixStore, WindowEntry, MatchGenerator, repcode_candidate, add_data, next_sequence, add_suffixes_till, insert_suffix_if_absent, add_suffixes_interleaved_fast, offset_match_len, reserve — all gone with the wiring. - simple/fast_matcher.rs is the new active matcher: full inherent surface (accept_data, start_matching, skip_matching_with_hint, trim_to_window, last_committed_space, reset, prime_offset_history, take_recycled_space). - MatchGeneratorDriver Simple-arm wiring: commit_space → m.accept_data(space) with eager pre-commit eviction; start_matching::<Fast> → m.start_matching(handler); skip_matching_with_hint → m.skip_matching_with_hint(hint); reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window(); prime_with_dictionary → m.prime_offset_history(offset_hist). - Per-block input Vec recycled via take_recycled_space() → vec_pool (zero zero-fill cost — buffer pushed with len=0, get_next_space resizes on pop). ## Invariants - prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The first byte of history is a reserved dummy (sentinel-0 guard); real input data starts at history[1]. Donor C zstd achieves the same effect via a virtual base pointer; the flat Vec<u8> model here pays one byte of memory overhead for the same correctness property (no missed matches at segment boundaries). - history.len() bounded by 2 × max_window_size post-append, even for oversize committed blocks (retain_real = cap.saturating_sub( space.len()).min(max_window_size)). - Eviction preserves the dummy AND rebases prefix_start_index back to RESERVED on every drain — cumulative growth would push the filter past every valid history index and reject all match candidates wholesale. - Hash table rehashed after drain so retained tail bytes stay matchable. Amortised O(1) per byte of input. - rep[0..2] ↔ offset_hist[0..2] in lockstep on the common (lit_len > 0) path. Known divergence on back-to-back repcode matches (lit_len == 0 emits): kernel's rep unchanged, wire encoder per RFC 8878 §3.1.2.5 remaps codes and rotates offset_hist — marginal compression hit, output still correct. Phase 3 collapses these at the kernel level. - prime_offset_history seeds BOTH rep[0..2] and offset_hist atomically from a dictionary load. ## Defensive validation - MatchGeneratorDriver::new asserts slice_size > 0, max_slices_in_window > 0, checked_mul for the product, and checked_next_power_of_two for window_log_init derivation — catches all four overflow / degenerate paths with a clear domain-specific panic instead of a deep matcher-internal failure. - FastHashTable construction-time mls / hash_log validation unchanged from phase 1a. ## Tests 573/573 pass on the full workspace nextest suite: - 21 unit tests on FastKernelMatcher (lifecycle, accept + start, skip flavors, dict-prime hash population, eviction, boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync, prefix-eviction during dict-priming, drain prefix_start_index runaway, trim_to_window/last_block_start drift, oversize-block eviction bound). - 32 unit tests on the underlying kernel (donor-formula parity, prefix-filter, repcode backward extension three-piece proof, short-input early-return uniformity). - All frame_compressor integration tests (raw-block detection, hinted source-size matrix, level-1 round-trips through both the in-tree decoder and FFI decode). - All cross_validation Rust-encoded → FFI-decoded round-trips (every level 1..=22, dict + no-dict, encoded by Rust then read by the C reference decoder verbatim). 18 legacy tests in match_generator.rs that exercised SuffixStore-specific behavior or required block.len() < HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory comments — their substance either has equivalent coverage in the new tests or relied on algorithm-specific quirks the donor-shape kernel doesn't reproduce by design. ## Benchmark (i9-9900K) cargo bench deltas vs main (fc63464) on compress/level_1_fast/*/matrix/pure_rust: | Scenario | Δ time | Throughput | Note | |----------|-------:|-----------:|------| | low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster | | decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster | | high-entropy-1m | -67.3% | 633 MiB/s | 3× faster | | small-10k-random | -36.0% | 768 MiB/s | 1.6× faster | | small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster | | small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead | | large-log-stream | +122.3% | 235 MiB/s | regression — see below | Large-log-stream regression is expected at this phase: legacy SuffixStore used effectively window_log-sized hash slots (512K for level-1 window_log=19), while phase 1b uses donor-parity hash_log=14 (16K slots). 25 MiB dense-compressible log content hits ~1500 collisions/slot. Donor C zstd shows the same trade-off (~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining + cmov) closes the gap per the documented roadmap (#198 items 2/3/5). ## What's NOT in this PR - Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead, cmov match-found variant, per-level mls dispatch, kSearchStrength acceleration gradient for negative Fast levels. - LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes donor level-1 defaults (14/7) today. - hash_fill_step stride for dict-priming — still hard-coded to 1 (LevelParams field is wired in but the Fast matcher always strides at 1). Closes #198 phase 1b. Related: #178 (umbrella regression issue), #215 (phase 1a — kernel modules, merged in fc63464). * fix(match_generator,fast_matcher): correctness + doc accuracy fixes (#217 review round 1) Four findings from CodeRabbit + Copilot's first review pass on the squashed PR #217. All four touch comment / metadata accuracy or a single-line correctness issue; no behavioural change beyond the window_size sync. **CR outside-diff (match_generator.rs:598-613)** — correctness: `reported_window_size` was using the unrounded `max_window_size` while the matcher itself was constructed from `next_pow2` (rounded up). For non-power-of-two constructor inputs (e.g. `slice_size * max_slices_in_window = 100_000`), `window_size()` would report 65_536 (un-rounded floor) while the active backend actually carried 131_072 (rounded-up next_pow2). The drift held until the first `reset()` overwrote both sides from LevelParams. Fix: report `next_pow2` so the two stay in lockstep at construction time. **CR #1 (match_generator.rs:944)** — declined, deferred to phase 3: Fast levels (Uncompressed, Fastest, Level(-7..=1)) all hard-code donor level-1 cParams. The acceleration gradient between negative- level fast modes and Level(1) lands when phase 3 ports donor's 4-cursor lookahead + cmov match-found + per-level kSearchStrength dispatch (issue #198 items 2/3/5). Updated the inline code comment to scrub the closed-PR reference and frame the deferral against phase 3 directly. **Copilot #2 (match_generator.rs:572)** — doc: the validation-guard comment described `next_power_of_two` returning 0 on overflow, which was old-Rust behaviour. Modern Rust panics; we now use `checked_next_power_of_two` (commit landed in the squash). Rewrote the comment to enumerate the three actual failure modes (zero args, mul overflow, next-pow2 overflow) and the three guards that catch them. **Copilot #3 (fast_matcher.rs:303)** — doc: `last_committed_space`'s pre-`accept_data` state description claimed `last_block_start = 0 / history.len() = 0`, but post-RESERVED_PREFIX_BYTES-seed construction (#216 / phase 1b) leaves both at `RESERVED_PREFIX_BYTES`. The returned slice is still empty (the `history[last_block_start..]` range is empty), just for a different reason. Updated the doc to reflect the seeded-dummy invariant. 573/573 tests pass; clippy clean. * docs(fast_matcher): compact last_committed_space docstring (apply new compactness rule) Doc-only change: collapse the verbose multi-paragraph last_committed_space docstring (added during PR #216 review rounds) into a 6-line three-bullet form. Same semantic content, zero narrative. Apply the new docstring compactness rule (one-two phrases default, multi-paragraph only for non-obvious invariants). 573/573 tests pass; clippy clean. * docs(fast_matcher): align dict-prime boundary test comments with RESERVED_PREFIX_BYTES seed (#217 Copilot #4, #5) Doc-only update aligning the dict-prime boundary test comment + inline note with the RESERVED_PREFIX_BYTES seed (post-phase-1b range is [RESERVED..=RESERVED], not [0..=0]). Compactness rule applied — concise two-block form. 573/573 tests pass; clippy clean. * docs(fast_matcher): correct prefix_start_index wording — rebases, not bumps (#217 Copilot #6, #7) Two doc comments described prefix_start_index as 'bumped forward' or 'advances' as history is evicted, implying a monotonic absolute index. Actual code (drain_real_prefix) rebases it back to RESERVED_PREFIX_BYTES on every drain — the retained tail is re-indexed in the new coordinate space. Updated both sites (struct field doc + trim_to_window header) to match. Compactness rule applied — trim_to_window header collapsed from 10 lines to 5. 573/573 tests pass; clippy clean. * docs(fast_matcher): drop stale hash_log scaling claim + add lit_len=0 TODO marker (#217 Copilot #8, #9) #8 — header for FAST_LEVEL_1_HASH_LOG said 'reset path rebinds hash_log proportionally on source-size hint'. Untrue today: driver passes only window_log per-level, hash_log + mls hard-coded. Pin per-level scaling to phase 3. #9 — same lit_len=0 / back-to-back-rep1 concern as the prior PR's #21. Inline + module docs already explain, but verbose prose isn't anchored. Collapsed inline comment to a single '// TODO(#198 phase 3):' line so the deferral marker is unambiguous. 573/573 tests pass; clippy clean. * fix(fast_matcher): assert hard preconditions + skip lit_len=0 rotation + hoist mls dispatch (#217 round 5) Four findings from the latest CR + Copilot review pass. **CR outside-diff (327-331) — assert!, not debug_assert!:** the duplicate-pending guard in accept_data was debug_assert!, so in release builds a double-commit would silently overwrite pending and drop a committed block. Hard-fail instead. **CR outside-diff (358-363) — block_size <= 2 × max_window_size:** the eviction math computes retain_real via saturating_sub, which silently collapses to 0 if space.len() > cap. The full block is still appended afterwards, violating the documented 'history bounded by 2 × max_window_size' invariant. Added an assert! precondition so callers see a clear panic at the boundary instead of an invisible invariant break. **CR #10 — lit_len == 0 offset_hist rotation (third raise):** the prior round added a TODO marker explaining the divergence, but CR correctly pushed back that documentation isn't a fix. Skip encode_offset_with_history when literals.is_empty() so offset_hist stays in lockstep with the kernel's unchanged rep — no divergence on the back-to-back rep1 path. Wire encoder downstream still sees the Triple with raw offset; its own encoding stays correct (lit_len-0 absolute encoding per RFC 8878 §3.1.2.5). Module docstring 'Known divergence' section collapsed accordingly. **Copilot #11 — hoist mls dispatch outside prime_hash loop:** moved the per-MLS match arm OUTSIDE the per-position loop. New prime_hash_table_impl<const MLS: u32> is monomorphised per matcher instance; the hot path is branch-free on mls. 573/573 tests pass; clippy clean. * fix(fast_matcher): call encode_offset_with_history unconditionally (#217 Copilot #12, revert round 5 #10) Round 5 skipped encode_offset_with_history when literals.is_empty() to keep matcher.offset_hist 'in lockstep' with kernel rep. Copilot correctly pushed back: Dfast / Row / HashChain matchers all call it unconditionally (passing lit_len = 0 when applicable), and the 'lockstep' framing was wrong — matcher.offset_hist tracks the WIRE ENCODER's history while matcher.rep tracks the KERNEL's state. They're not supposed to be the same. Revert the skip. Module docstring rewritten to call out that the two fields reflect DIFFERENT state and may diverge on lit_len = 0 emits per RFC 8878 §3.1.2.5 — both halves stay self-consistent within their own domain. Round 5 #10 reply to CR was based on the wrong mental model; that's reflected in this fix by reverting to the donor / other-matchers behaviour. 573/573 tests pass; clippy clean.
Summary
Phase 1b of the #198 Fast strategy donor port — wires the donor-shape
kernel modules from phase 1a (#215 → merged in fc63464) into the
production hot path. The legacy SuffixStore-based
MatchGeneratorthat lived in
simple/mod.rsis fully removed;MatcherStorage::Simplenow holds a
FastKernelMatcherthat drivescompress_block_fast<MLS>once per block.
What changes
simple/mod.rscollapses from 496 → 14 lines (only moduledeclarations + docstring remain).
SuffixStore,WindowEntry,MatchGenerator,repcode_candidate,add_data,next_sequence,add_suffixes_till,insert_suffix_if_absent,add_suffixes_interleaved_fast,offset_match_len,reserve—all gone with the wiring.
simple/fast_matcher.rsis the new active matcher(introduced incrementally on this branch — see commit log). The
full inherent surface required by the
Matchertrait wiring:accept_data(space)— stash a committed block inpending.start_matching(handler)— extend history, dispatch thekernel on the just-appended block, forward every emitted
Sequence::Tripleto the handler plus the terminalSequence::Literalsfromtail_literals_len.skip_matching_with_hint(hint)—None/Some(true)=history append only;
Some(false)= dictionary-priming paththat ALSO pre-populates the hash table for every position in
the appended range, so cross-block matches against
dict-primed bytes work in subsequent blocks.
trim_to_window()— drop history pastmax_window_size, bumpprefix_start_index, clear the hash table. Donor'sZSTD_window_correctOverflowequivalent. Returns evictedbytes for the dictionary-budget retire loop.
last_committed_space()— returns pending pre-processing ORhistory[last_block_start..]post-processing (legacyMatchGenerator
window.last().dataparity for the framecompressor's raw-block emission path).
reset(window_log, hash_log, mls)— per-frame state reset,keeps the hash table allocation if
(hash_log, mls)unchanged.
MatchGeneratorDriverSimple-arm wiring:commit_space→m.accept_data(space).start_matching::<Fast>→m.start_matching(handler).skip_matching_with_hint→m.skip_matching_with_hint(hint).reset→m.reset(params.window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS).prime_with_dictionary→ field-leveloffset_hist/max_window_sizemutation continues to work (both opened topub(crate)for legacy parity).trim_after_budget_retire→m.trim_to_window().suffix_poolfield removed (no consumer after the legacymatcher's SuffixStore reuse path went away).
Invariants preserved
prefix_start_index >= 1at all times. Position 0 inhistoryispermanently sub-prefix so the hash table's empty-slot sentinel
value
0cannot be confused with a real match position.Pinned in
with_params+reset.history.len()is bounded by2 × max_window_sizepost-append.Eviction in
extend_history_with_pendingenforces.rep[0..2](kernel-tracked) ↔offset_hist[0..2](wire encoder-tracked) stay in sync via per-Triple
encode_offset_with_historycalls. The
lit_len == 0edge case (donor'srep[0]-1shift)is not modeled today — the phase-1 kernel doesn't emit
lit_len == 0Triples. A future cmov / lookahead-pipelinedkernel variant will need explicit handling.
Tests
568 / 568 pass on the full workspace nextest suite:
FastKernelMatcheritself (lifecycle, accept +start, skip flavors, dict-prime hash population, eviction,
boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync,
prefix-eviction during dict-priming).
prefix-filter, repcode backward extension three-piece proof,
short-input early-return uniformity).
frame_compressorintegration tests (raw-block detection,hinted source-size matrix, level-1 round-trips through both
our decoder and FFI decode).
cross_validationRust-encoded → FFI-decoded round-trips(every level 1..=22, dict + no-dict, encoded by Rust then
read by the C reference decoder verbatim).
18 legacy tests in
match_generator.rsthat tested specificallySuffixStoreorMatchGenerator::new(...)behavior are#[cfg(any())]-gated with explanatory comments — their substanceeither has equivalent coverage in the new tests or relied on
algorithm-specific quirks of the SuffixStore matcher that the
donor-shape kernel doesn't reproduce by design (e.g. matching on
8-byte blocks below the kernel's
HASH_READ_SIZE = 8floor).Benchmark (i9-9900K,
compress/level_1_fast/*/matrix/pure_rust)cargo benchdeltas vsmain(fc63464, phase 1a merged but notwired) on the same host. Negative = faster, positive = slower.
low-entropy-1mcount_forward)decodecorpus-z000033high-entropy-1msmall-10k-randomsmall-4k-log-linessmall-1k-randomlarge-log-streamhash_log = 14(16K slots) collides heavily on dense repeating content. See follow-up.Decompress side picks up a small indirect win (−3% to −5%) on
random-data scenarios because the sequence shape coming out of the
donor-shape kernel matches the decoder's optimised paths more
closely (fewer mid-block
Literalsemissions → less RingBufferdispatch).
large-log-streamregression analysisThe regression is expected given the trade-off chosen at the
phase 1b kickoff (donor parity over speed):
SuffixStorematcher used an effectivelywindow_log-sizedhash store (512K slots for level 1's
window_log = 19) plushash_fill_step = 3interleaving. Collision rate stayed low evenon 25 MiB logs.
hash_log = 14(16K slots).Against 25 MiB of compressible log content, the collision rate
per slot climbs into the thousands, and the kernel's single-cursor
loop loses matches the legacy matcher would have caught.
baseline shows donor at ~120 MB/s on similar workloads. The
current main (this PR's baseline) was already faster than donor
on this scenario because the legacy matcher was over-engineered
for speed at the cost of
O(window)memory.Closing the gap is exactly the scope of issue #198 phase 3 (items
2 / 3 / 5 in the issue body): 4-cursor
ip0/ip1/ip2/ip3lookaheadpipelining +
cmovmatch-found variant + per-levelmlsdispatch.Phase 3 lands on a follow-up branch once phase 1b is reviewed and
merged.
The net of phase 1b across the scenario matrix is positive
(geomean ≈ 2× faster, headline scenarios up to 10×). The
regression is bounded to dense-compressible large logs and is on
the documented roadmap.
What's NOT in this PR (lands separately)
ip0/ip1/ip2/ip34-cursor pipelining,
cmovmatch-found variant, mls dispatchknob per level (
hash_fill_stepstride for dict-priming isstill hard-coded to 1 today — the LevelParams field is wired
in but the Fast matcher always strides at 1).
LevelParams.hash_log/LevelParams.mlsfields (currentlyFast hard-codes donor level-1 defaults 14 / 7 from
FAST_LEVEL_1_HASH_LOG/FAST_LEVEL_1_MLS). Per-level scalingfor small-source-hint windows lands when
LevelParamsgrowsthese fields.
Closes #198 phase 1b (the per-issue plan calls phase 3 a separate
follow-up branch).
Related: #178 (umbrella regression issue), #215 (phase 1a — kernel
modules, merged in fc63464).
Summary by CodeRabbit