perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b) by polaz · Pull Request #217 · structured-world/structured-zstd

polaz · 2026-05-21T11:55:52Z

Summary

Phase 1b of the #198 Fast strategy donor port — wires the donor-shape
kernel modules from phase 1a (#215 → merged in fc63464) into the
production hot path. The legacy SuffixStore-based MatchGenerator
that lived in simple/mod.rs is fully removed; MatcherStorage::Simple
now holds a FastKernelMatcher that drives compress_block_fast<MLS>
once per block.

What changes

simple/mod.rs collapses from 496 → 14 lines (only module
declarations + docstring remain). SuffixStore, WindowEntry,
MatchGenerator, repcode_candidate, add_data, next_sequence,
add_suffixes_till, insert_suffix_if_absent,
add_suffixes_interleaved_fast, offset_match_len, reserve —
all gone with the wiring.
simple/fast_matcher.rs is the new active matcher
(introduced incrementally on this branch — see commit log). The
full inherent surface required by the Matcher trait wiring:
- accept_data(space) — stash a committed block in pending.
- start_matching(handler) — extend history, dispatch the
  kernel on the just-appended block, forward every emitted
  Sequence::Triple to the handler plus the terminal
  Sequence::Literals from tail_literals_len.
- skip_matching_with_hint(hint) — None / Some(true) =
  history append only; Some(false) = dictionary-priming path
  that ALSO pre-populates the hash table for every position in
  the appended range, so cross-block matches against
  dict-primed bytes work in subsequent blocks.
- trim_to_window() — drop history past max_window_size, bump
  prefix_start_index, clear the hash table. Donor's
  ZSTD_window_correctOverflow equivalent. Returns evicted
  bytes for the dictionary-budget retire loop.
- last_committed_space() — returns pending pre-processing OR
  history[last_block_start..] post-processing (legacy
  MatchGenerator window.last().data parity for the frame
  compressor's raw-block emission path).
- reset(window_log, hash_log, mls) — per-frame state reset,
  keeps the hash table allocation if (hash_log, mls)
  unchanged.
MatchGeneratorDriver Simple-arm wiring:
- commit_space → m.accept_data(space).
- start_matching::<Fast> → m.start_matching(handler).
- skip_matching_with_hint → m.skip_matching_with_hint(hint).
- reset → m.reset(params.window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS).
- prime_with_dictionary → field-level offset_hist /
  max_window_size mutation continues to work (both opened to
  pub(crate) for legacy parity).
- trim_after_budget_retire → m.trim_to_window().
- suffix_pool field removed (no consumer after the legacy
  matcher's SuffixStore reuse path went away).
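
The tri-state hint handled by skip_matching_with_hint can be pictured as a small dispatch. This is a minimal sketch of the mapping stated above only; the `SkipAction` enum and the `skip_action` function are hypothetical names for illustration and are not part of fast_matcher.rs.

```rust
/// Hypothetical label for what a skipped block does to matcher state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SkipAction {
    /// Append the block to history; leave the hash table untouched.
    AppendOnly,
    /// Append the block AND insert every position of it into the hash table
    /// (the dictionary-priming path).
    AppendAndPrimeHash,
}

/// Map the hint to an action, following the bullet in the PR description:
/// None / Some(true) = history append only, Some(false) = dictionary priming.
fn skip_action(hint: Option<bool>) -> SkipAction {
    match hint {
        None | Some(true) => SkipAction::AppendOnly,
        Some(false) => SkipAction::AppendAndPrimeHash,
    }
}

fn main() {
    for hint in [None, Some(true), Some(false)] {
        println!("{:?} -> {:?}", hint, skip_action(hint));
    }
}
```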

Invariants preserved

prefix_start_index >= 1 at all times. Position 0 in history is
permanently sub-prefix so the hash table's empty-slot sentinel
value 0 cannot be confused with a real match position.
Pinned in with_params + reset.
history.len() is bounded by 2 × max_window_size post-append.
Eviction in extend_history_with_pending enforces.
rep[0..2] (kernel-tracked) ↔ offset_hist[0..2] (wire encoder-
tracked) stay in sync via per-Triple encode_offset_with_history
calls. The lit_len == 0 edge case (donor's rep[0]-1 shift)
is not modeled today — the phase-1 kernel doesn't emit
lit_len == 0 Triples. A future cmov / lookahead-pipelined
kernel variant will need explicit handling.
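
The sentinel-0 invariant can be illustrated with a toy table. This is a sketch under stated assumptions: `ToyHashTable` and its methods are hypothetical stand-ins, not the real FastHashTable API, and the slot index is passed in directly instead of being hashed.

```rust
/// Toy hash table in which a stored value of 0 means "slot is empty".
struct ToyHashTable {
    slots: Vec<u32>,
}

impl ToyHashTable {
    fn new(hash_log: u32) -> Self {
        Self { slots: vec![0; 1usize << hash_log] }
    }

    /// Record that `pos` (an index into history) hashed to `slot`.
    fn put(&mut self, slot: usize, pos: u32) {
        self.slots[slot] = pos;
    }

    /// Return the stored position only if it lies inside the valid prefix.
    /// Because prefix_start_index >= 1, an empty slot (0) is always rejected.
    fn candidate(&self, slot: usize, prefix_start_index: u32) -> Option<u32> {
        let pos = self.slots[slot];
        (pos >= prefix_start_index).then_some(pos)
    }
}

fn main() {
    let mut table = ToyHashTable::new(4);
    // Empty slot: the stored 0 is below prefix_start_index = 1.
    assert_eq!(table.candidate(3, 1), None);
    table.put(3, 5);
    assert_eq!(table.candidate(3, 1), Some(5));
    // A filter value above the stored position rejects the entry as well.
    assert_eq!(table.candidate(3, 6), None);
    println!("sentinel filter behaves as expected");
}
```

With the filter in place, no separate "occupied" bit is needed per slot, at the cost of never being able to match against position 0.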

Tests

568 / 568 pass on the full workspace nextest suite:

16 unit tests on FastKernelMatcher itself (lifecycle, accept +
start, skip flavors, dict-prime hash population, eviction,
boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync,
prefix-eviction during dict-priming).
32 unit tests on the underlying kernel (donor-formula parity,
prefix-filter, repcode backward extension three-piece proof,
short-input early-return uniformity).
All frame_compressor integration tests (raw-block detection,
hinted source-size matrix, level-1 round-trips through both
our decoder and FFI decode).
All cross_validation Rust-encoded → FFI-decoded round-trips
(every level 1..=22, dict + no-dict, encoded by Rust then
read by the C reference decoder verbatim).

18 legacy tests in match_generator.rs that tested specifically
SuffixStore or MatchGenerator::new(...) behavior are
#[cfg(any())]-gated with explanatory comments — their substance
either has equivalent coverage in the new tests or relied on
algorithm-specific quirks of the SuffixStore matcher that the
donor-shape kernel doesn't reproduce by design (e.g. matching on
blocks shorter than the kernel's HASH_READ_SIZE = 8 floor).

Benchmark (i9-9900K, `compress/level_1_fast/*/matrix/pure_rust`)

cargo bench deltas vs main (fc63464, phase 1a merged but not
wired) on the same host. Negative = faster, positive = slower.

Scenario	Δ time	Throughput	Note
`low-entropy-1m`	−90.7%	3.3 GiB/s	10× faster (hot-path dominated by `count_forward`)
`decodecorpus-z000033`	−83.4%	156 MiB/s	6× faster (mixed corpus, donor-shape gets the typical case right)
`high-entropy-1m`	−67.3%	633 MiB/s	3× faster (no matches → kernel walks fast, exits early)
`small-10k-random`	−36.0%	768 MiB/s	small random — kernel's step-skip helps
`small-4k-log-lines`	−15.4%	154 MiB/s	small compressible — marginal
`small-1k-random`	+42.2%	130 MiB/s	tiny block (1 KiB input): per-block kernel overhead exceeds the work
`large-log-stream`	+122.3%	235 MiB/s	regression — 25 MiB log; donor-parity `hash_log = 14` (16K slots) collides heavily on dense repeating content. See follow-up.

Decompress side picks up a small indirect win (−3% to −5%) on
random-data scenarios because the sequence shape coming out of the
donor-shape kernel matches the decoder's optimised paths more
closely (fewer mid-block Literals emissions → less RingBuffer
dispatch).

`large-log-stream` regression analysis

The regression is expected given the trade-off chosen at the
phase 1b kickoff (donor parity over speed):

Legacy SuffixStore matcher used an effectively window_log-sized
hash store (512K slots for level 1's window_log = 19) plus
hash_fill_step = 3 interleaving. Collision rate stayed low even
on 25 MiB logs.
Phase 1b uses the donor's level-1 hash_log = 14 (16K slots).
Against 25 MiB of compressible log content, the collision rate
per slot climbs into the thousands, and the kernel's single-cursor
loop loses matches the legacy matcher would have caught.
Donor C zstd at level 1 Fast also pays this — the issue body's
baseline shows donor at ~120 MB/s on similar workloads. The
current main (this PR's baseline) was already faster than donor
on this scenario because the legacy matcher was over-engineered
for speed at the cost of O(window) memory.
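
The slot-pressure figures above follow from simple division. The sketch below assumes, as a simplification, that every input position is inserted and that hashes spread uniformly; `positions_per_slot` is a hypothetical helper, and the result is an average load per slot, not a measured collision count.

```rust
/// Average number of input positions competing for each hash slot,
/// assuming one insertion per byte position and a uniform hash.
fn positions_per_slot(input_bytes: u64, hash_log: u32) -> u64 {
    input_bytes / (1u64 << hash_log)
}

fn main() {
    let input = 25 * 1024 * 1024; // the 25 MiB log from `large-log-stream`

    // Donor level-1 table: hash_log = 14, i.e. 16K slots.
    println!("hash_log 14: {}", positions_per_slot(input, 14)); // 1600

    // A window_log-sized store: 2^19 = 512K slots.
    println!("hash_log 19: {}", positions_per_slot(input, 19)); // 50
}
```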

Closing the gap is exactly the scope of issue #198 phase 3 (items
2 / 3 / 5 in the issue body): 4-cursor ip0/ip1/ip2/ip3 lookahead
pipelining + cmov match-found variant + per-level mls dispatch.
Phase 3 lands on a follow-up branch once phase 1b is reviewed and
merged.

The net of phase 1b across the scenario matrix is positive
(geomean ≈ 2× faster, headline scenarios up to 10×). The
regression is bounded to dense-compressible large logs and is on
the documented roadmap.
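
The geomean claim can be cross-checked from the Δ time column of the benchmark table. This is a sketch that treats each delta as a time ratio (new / old) and takes the geometric mean; `geomean_time_ratio` is a hypothetical helper and the check uses only the seven table rows.

```rust
/// Geometric mean of time ratios, where each delta is a percentage change
/// in time (negative = faster). A result of 0.5 means "half the time".
fn geomean_time_ratio(deltas_pct: &[f64]) -> f64 {
    let sum_ln: f64 = deltas_pct.iter().map(|d| (1.0 + d / 100.0).ln()).sum();
    (sum_ln / deltas_pct.len() as f64).exp()
}

fn main() {
    // Δ time column of the benchmark table, in table order.
    let deltas = [-90.7, -83.4, -67.3, -36.0, -15.4, 42.2, 122.3];
    let ratio = geomean_time_ratio(&deltas);
    // The ratio comes out at roughly 0.51, i.e. close to 2x faster overall.
    println!("geomean time ratio = {ratio:.3}, speedup = {:.2}x", 1.0 / ratio);
}
```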

What's NOT in this PR (lands separately)

Phase 1 → phase 3 follow-ups from #198 ("perf(fast): port donor
ZSTD_compressBlock_fast — 22× regression on Fast strategy"):
ip0/ip1/ip2/ip3 4-cursor pipelining, cmov match-found variant, and a
per-level mls dispatch knob.
hash_fill_step stride for dict-priming, which is still hard-coded to
1 today (the LevelParams field is wired in, but the Fast matcher
always strides at 1).
LevelParams.hash_log / LevelParams.mls fields (currently
Fast hard-codes donor level-1 defaults 14 / 7 from
FAST_LEVEL_1_HASH_LOG / FAST_LEVEL_1_MLS). Per-level scaling
for small-source-hint windows lands when LevelParams grows
these fields.

Closes #198 phase 1b (the per-issue plan calls phase 3 a separate
follow-up branch).

Related: #178 (umbrella regression issue), #215 (phase 1a — kernel
modules, merged in fc63464).

Summary by CodeRabbit

Refactor
- Replaced the simple-level matcher with a new, faster backend featuring persistent per-block match state, eager window eviction, single-start-per-block matching, streamlined dictionary priming to keep encoder histories synchronized, and removal of legacy buffering to reduce memory churn.
Tests
- Updated unit and regression tests for the new backend; several legacy assertions were disabled and one regression relaxed an emission-count expectation while preserving reconstruction correctness.

…#198 phase 1b)

Phase 1b of #198: wires the donor-shape kernel modules from phase 1a (#215, merged in fc63464) into the production hot path. The legacy SuffixStore-based MatchGenerator in simple/mod.rs is fully removed; MatcherStorage::Simple now holds a FastKernelMatcher that drives compress_block_fast<MLS> once per block.

Selected for every Fast-strategy level: CompressionLevel::Uncompressed, CompressionLevel::Fastest, CompressionLevel::Level(1), and the negative CompressionLevel::Level(-7..=-1) variants. All Fast levels currently resolve to the same matcher with donor level-1 hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found) land in phase 3.

## What changes

- simple/mod.rs collapses from 496 → 14 lines (only module declarations + docstring remain). SuffixStore, WindowEntry, MatchGenerator, repcode_candidate, add_data, next_sequence, add_suffixes_till, insert_suffix_if_absent, add_suffixes_interleaved_fast, offset_match_len, reserve: all gone with the wiring.
- simple/fast_matcher.rs is the new active matcher: full inherent surface (accept_data, start_matching, skip_matching_with_hint, trim_to_window, last_committed_space, reset, prime_offset_history, take_recycled_space).
- MatchGeneratorDriver Simple-arm wiring: commit_space → m.accept_data(space) with eager pre-commit eviction; start_matching::<Fast> → m.start_matching(handler); skip_matching_with_hint → m.skip_matching_with_hint(hint); reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window(); prime_with_dictionary → m.prime_offset_history(offset_hist).
- Per-block input Vec recycled via take_recycled_space() → vec_pool (zero zero-fill cost: buffer pushed with len=0, get_next_space resizes on pop).

## Invariants

- prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The first byte of history is a reserved dummy (sentinel-0 guard); real input data starts at history[1]. Donor C zstd achieves the same effect via a virtual base pointer; the flat Vec<u8> model here pays one byte of memory overhead for the same correctness property (no missed matches at segment boundaries).
- history.len() bounded by 2 × max_window_size post-append, even for oversize committed blocks (retain_real = cap.saturating_sub(space.len()).min(max_window_size)).
- Eviction preserves the dummy AND rebases prefix_start_index back to RESERVED on every drain; cumulative growth would push the filter past every valid history index and reject all match candidates wholesale.
- Hash table rehashed after drain so retained tail bytes stay matchable. Amortised O(1) per byte of input.
- rep[0..2] ↔ offset_hist[0..2] in lockstep on the common (lit_len > 0) path. Known divergence on back-to-back repcode matches (lit_len == 0 emits): kernel's rep unchanged, wire encoder per RFC 8878 §3.1.2.5 remaps codes and rotates offset_hist. Marginal compression hit, output still correct. Phase 3 collapses these at the kernel level.
- prime_offset_history seeds BOTH rep[0..2] and offset_hist atomically from a dictionary load.

## Defensive validation

- MatchGeneratorDriver::new asserts slice_size > 0, max_slices_in_window > 0, checked_mul for the product, and checked_next_power_of_two for window_log_init derivation. This catches all four overflow / degenerate paths with a clear domain-specific panic instead of a deep matcher-internal failure.
- FastHashTable construction-time mls / hash_log validation unchanged from phase 1a.

## Tests

573/573 pass on the full workspace nextest suite:

- 21 unit tests on FastKernelMatcher (lifecycle, accept + start, skip flavors, dict-prime hash population, eviction, boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync, prefix-eviction during dict-priming, drain prefix_start_index runaway, trim_to_window/last_block_start drift, oversize-block eviction bound).
- 32 unit tests on the underlying kernel (donor-formula parity, prefix-filter, repcode backward extension three-piece proof, short-input early-return uniformity).
- All frame_compressor integration tests (raw-block detection, hinted source-size matrix, level-1 round-trips through both the in-tree decoder and FFI decode).
- All cross_validation Rust-encoded → FFI-decoded round-trips (every level 1..=22, dict + no-dict, encoded by Rust then read by the C reference decoder verbatim).

18 legacy tests in match_generator.rs that exercised SuffixStore-specific behavior or required block.len() < HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory comments. Their substance either has equivalent coverage in the new tests or relied on algorithm-specific quirks the donor-shape kernel doesn't reproduce by design.

## Benchmark (i9-9900K)

cargo bench deltas vs main (fc63464) on compress/level_1_fast/*/matrix/pure_rust:

| Scenario | Δ time | Throughput | Note |
|----------|-------:|-----------:|------|
| low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster |
| decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster |
| high-entropy-1m | -67.3% | 633 MiB/s | 3× faster |
| small-10k-random | -36.0% | 768 MiB/s | 1.6× faster |
| small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster |
| small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead |
| large-log-stream | +122.3% | 235 MiB/s | regression (see below) |

Large-log-stream regression is expected at this phase: legacy SuffixStore used effectively window_log-sized hash slots (512K for level-1 window_log=19), while phase 1b uses donor-parity hash_log=14 (16K slots). 25 MiB dense-compressible log content hits ~1500 collisions/slot. Donor C zstd shows the same trade-off (~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining + cmov) closes the gap per the documented roadmap (#198 items 2/3/5).

## What's NOT in this PR

- Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead, cmov match-found variant, per-level mls dispatch, kSearchStrength acceleration gradient for negative Fast levels.
- LevelParams.hash_log / LevelParams.mls fields: Fast hard-codes donor level-1 defaults (14/7) today.
- hash_fill_step stride for dict-priming: still hard-coded to 1 (LevelParams field is wired in but the Fast matcher always strides at 1).

Closes #198 phase 1b. Related: #178 (umbrella regression issue), #215 (phase 1a, kernel modules, merged in fc63464).
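
The lit_len == 0 divergence noted under Invariants comes from the repeat-offset rules in RFC 8878. The sketch below shows those rules from the decoder's point of view, which is what the wire encoder's offset_hist has to mirror. `resolve_offset` and the array layout (most recent offset first) are illustrative assumptions; this is not the PR's encode_offset_with_history.

```rust
/// Turn a decoded offset_value into an actual match offset and update the
/// three-entry repeat-offset history (index 0 = most recent).
fn resolve_offset(hist: &mut [u32; 3], offset_value: u32, lit_len: u32) -> u32 {
    assert!(offset_value >= 1, "offset_value 0 is not valid");
    if offset_value > 3 {
        // A new offset: push it to the front of the history.
        let actual = offset_value - 3;
        *hist = [actual, hist[0], hist[1]];
        return actual;
    }
    // Repeat codes 1..=3. With zero literals the codes shift by one:
    // 1 -> second entry, 2 -> third entry, 3 -> first entry minus 1.
    let idx = if lit_len == 0 { offset_value } else { offset_value - 1 };
    let actual = match idx {
        0 => hist[0],
        1 => hist[1],
        2 => hist[2],
        _ => hist[0] - 1,
    };
    match idx {
        0 => {}                                  // most recent reused: no change
        1 => *hist = [actual, hist[0], hist[2]], // swap first two entries
        _ => *hist = [actual, hist[0], hist[1]], // rotate the new value to the front
    }
    actual
}

fn main() {
    let mut hist = [1u32, 4, 8]; // initial repeat offsets defined by the format
    // Code 1 with literals reuses the most recent offset and changes nothing.
    assert_eq!(resolve_offset(&mut hist, 1, 5), 1);
    assert_eq!(hist, [1, 4, 8]);
    // Code 1 with lit_len == 0 means the second entry and swaps the pair.
    assert_eq!(resolve_offset(&mut hist, 1, 0), 4);
    assert_eq!(hist, [4, 1, 8]);
    println!("{hist:?}");
}
```

A kernel that leaves its own rep array untouched on a zero-literal repcode match while the wire side rotates as above is the divergence the commit message describes.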

coderabbitai · 2026-05-21T11:56:07Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4c0ec497-fae4-483a-a353-af00e6bbd2c5

📥 Commits

Reviewing files that changed from the base of the PR and between a5b2f6a and d4428a4.

📒 Files selected for processing (1)

zstd/src/encoding/simple/fast_matcher.rs

📝 Walkthrough

Replace the Simple backend (SuffixStore/MatchGenerator) with donor-aligned FastKernelMatcher, add fast_matcher module, wire FastKernelMatcher into MatchGeneratorDriver (construction, reset, commit, skip, matching, priming, eviction), update simple/ module routing, and adjust/disable legacy Simple-specific tests while adding FastKernelMatcher tests.

Changes

Simple backend refactor: MatchGenerator → FastKernelMatcher

Layer / File(s)	Summary
Fast constants, module docs `zstd/src/encoding/simple/fast_matcher.rs`, `zstd/src/encoding/simple/mod.rs`	Add Fast-level constants, initial seeds, reserved dummy prefix and update simple/ module docs to route to fast_kernel + fast_matcher.
FastKernelMatcher struct and state layout `zstd/src/encoding/simple/fast_matcher.rs`	Define `FastKernelMatcher` with persistent `history`, `prefix_start_index`, kernel `rep`, `offset_hist`, `FastHashTable`, `pending`/recycled buffers, and bookkeeping.
Constructors and reset `zstd/src/encoding/simple/fast_matcher.rs`	Implement `new`/`with_params`/`reset`, allocate/clear hash table, seed reserved prefix, and restore invariants for new frames.
Accessors `zstd/src/encoding/simple/fast_matcher.rs`	Expose `window_size()` (test-only) and `last_committed_space()` for driver usage.
accept_data and eviction helpers `zstd/src/encoding/simple/fast_matcher.rs`	Implement `accept_data` that stages pending bytes and performs eager eviction enforcing the 2× max-window cap; do not update hot-path hashes until matching.
Draining/appending and buffer recycle `zstd/src/encoding/simple/fast_matcher.rs`	Implement `drain_real_prefix`, prefix rebasing, hash rehash/clear for retained tail, `extend_history_with_pending`, `take_recycled_space`.
start_matching and emission wiring `zstd/src/encoding/simple/fast_matcher.rs`	Append pending into history, dispatch `compress_block_fast::<MLS>`, forward `Sequence` emissions while updating `offset_hist` for `Triple` emissions (skip zero-literal triples), persist kernel `rep`, and emit terminal `Literals`.
skip_matching_with_hint, prime_offset_history, trim_to_window `zstd/src/encoding/simple/fast_matcher.rs`	Implement `skip_matching_with_hint` (no kernel run; optional hash prepopulation for priming), `prime_offset_history` to sync wire and kernel rep/history, `history_len_for_eviction_accounting`, and `trim_to_window` eviction helper.
Dictionary-priming hash prepopulation `zstd/src/encoding/simple/fast_matcher.rs`	Implement `prime_hash_table_for_range` with `HASH_READ_SIZE` bounds and MLS-specialized hash insertion via unsafe reads and `hash_ptr`/`put`.
FastKernelMatcher tests `zstd/src/encoding/simple/fast_matcher.rs`	Add comprehensive unit/regression tests for init/reset, matching/skip semantics, dict-priming, eviction/trim invariants, rep/offset sync, and oversized-block invariants.
simple/mod.rs module restructuring `zstd/src/encoding/simple/mod.rs`	Replace prior in-file Simple implementation with module routing for `fast_kernel` + `fast_matcher`; remove legacy `SuffixStore`, `WindowEntry`, `MatchGenerator`.
MatchGeneratorDriver imports and struct layout `zstd/src/encoding/match_generator.rs`	Update imports to donor Fast constants and `FastKernelMatcher`; change `MatcherStorage::Simple` to hold `FastKernelMatcher`; remove `suffix_pool` field.
MatchGeneratorDriver constructor and accessors `zstd/src/encoding/match_generator.rs`	Rewrite `MatchGeneratorDriver::new` to validate inputs, compute initial window size, initialize Simple via `FastKernelMatcher::with_params`, and add `simple_mut()` and `recycle_simple_space()`.
Matcher::reset backend wiring for Simple `zstd/src/encoding/match_generator.rs`	Adjust backend-switch behavior to drop legacy pre-drain for Simple, construct Simple with `FastKernelMatcher::with_params`, and call `FastKernelMatcher::reset(...)` on resets.
Driver per-operation integrations `zstd/src/encoding/match_generator.rs`	`get_last_space` returns `last_committed_space()`; `commit_space` computes evicted bytes via `history_len_for_eviction_accounting` delta around `accept_data`; `skip_matching_with_hint` calls matcher skip and recycles buffers; `prime_with_dictionary` calls `prime_offset_history`.
Commit-time eviction and hot path changes `zstd/src/encoding/match_generator.rs`	Eviction uses `FastKernelMatcher::trim_to_window` and accounting deltas; per-block matching uses a single `start_matching` call then recycles cleared history buffers (remove legacy per-sequence loop).
Documentation and dead-code markers `zstd/src/encoding/match_generator.rs`	Remove `FAST_HASH_FILL_STEP` re-export and annotate `LevelParams.hash_fill_step` as retained/dead-code for future use.
Test adjustments `zstd/src/encoding/match_generator.rs`	Disable many legacy SuffixStore/MatchGenerator tests with `#[cfg(any())]` and relax a regression test by removing the mandatory `Triple` emission assertion while preserving reconstruction correctness.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

perf(encoder): Fast band — close ratio gap at L1 and add donor-shaped skip-on-miss for negatives #188: Overlaps the Simple matching path; this PR replaces the legacy SuffixStore/MatchGenerator with FastKernelMatcher addressing the same code area.
feat: encoder architecture rewrite (split monolith, const-generic Strategy, arena allocator) #111: Related modularization of match generator into per-backend modules; the PR implements that split for Simple.

Possibly related PRs

structured-world/structured-zstd#215: Adds donor-shaped Fast kernel pieces (FastHashTable, compress_block_fast) that FastKernelMatcher depends on.
structured-world/structured-zstd#187: Related window-parameter and level tuning changes that interact with FastKernelMatcher construction.
structured-world/structured-zstd#125: Earlier backend/MatcherStorage dispatch refactor overlapping driver integration changes.

"🐰 I munched old suffix vines, then leapt—
FastKernel sprouted, neat and apt.
Hashes hum, history stays bright,
Blocks match swiftly through the night.
Hooray — the simple path takes flight!"

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly describes the main change: wiring the donor-shaped Fast kernel into MatchGeneratorDriver as phase 1b of issue `#198`.
Linked Issues check	✅ Passed	The PR successfully implements the core objectives from `#198` phase 1b: replaces SuffixStore with Vec hash table, implements donor-style matching loop, integrates compress_block_fast kernel, and achieves significant speedups while maintaining correctness.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to the phase 1b integration: matcher refactoring, fast_matcher.rs module creation, and MatchGeneratorDriver updates. No unrelated changes detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


codecov · 2026-05-21T11:58:10Z

Codecov Report

❌ Patch coverage is 93.41737% with 47 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/encoding/simple/fast_matcher.rs	92.29%	46 Missing ⚠️
zstd/src/encoding/match_generator.rs	99.14%	1 Missing ⚠️

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

zstd/src/encoding/match_generator.rs (1)

598-613: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the initial reported window in sync with the instantiated matcher.

FastKernelMatcher is constructed from next_pow2, but reported_window_size keeps the unrounded max_window_size. For non-power-of-two constructor inputs, window_size() will report a smaller window than the active Simple backend actually has until the first reset().

Suggested fix

         Self {
             vec_pool: Vec::new(),
             storage: MatcherStorage::Simple(FastKernelMatcher::with_params(
                 window_log_init,
                 FAST_LEVEL_1_HASH_LOG,
                 FAST_LEVEL_1_MLS,
             )),
             strategy_tag: super::strategy::StrategyTag::Fast,
             slice_size,
             base_slice_size: slice_size,
-            reported_window_size: max_window_size,
+            reported_window_size: next_pow2,
             dictionary_retained_budget: 0,
             source_size_hint: None,
         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/match_generator.rs` around lines 598 - 613, The
reported_window_size is set to the unrounded max_window_size while the Simple
backend (FastKernelMatcher) is initialized with the next power-of-two
(next_pow2/window_log_init), causing a mismatch; update the MatchGeneratorDriver
construction so reported_window_size is set to the rounded window (use next_pow2
or (1u64 << window_log_init)) to match the instantiated FastKernelMatcher
(constructor call FastKernelMatcher::with_params and field
reported_window_size).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 920-944: The Simple matcher instantiation is hardcoding donor
level-1 constants instead of using the resolved fast-level tuning computed by
resolve_level_params(); update the
MatcherStorage::Simple(FastKernelMatcher::with_params(...)) call(s) to pass the
resolved fast parameters from the params struct (e.g. use params.window_log and
the resolved params.hash_log / params.mls and also thread params.hash_fill_step
or any other fast-level knobs) instead of FAST_LEVEL_1_HASH_LOG and
FAST_LEVEL_1_MLS so CompressionLevel::Fastest, Level(1) and negative levels get
the per-level tuning; apply the same change to the other occurrences around
FastKernelMatcher::with_params to ensure the backend swap and reset paths both
use the resolved fast-level fields.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1fa461d0-9e51-40d8-b146-eae647c895a8

📥 Commits

Reviewing files that changed from the base of the PR and between fc63464 and d5056b6.

📒 Files selected for processing (3)

zstd/src/encoding/match_generator.rs
zstd/src/encoding/simple/fast_matcher.rs
zstd/src/encoding/simple/mod.rs

Copilot

Pull request overview

Wires the donor-shape Fast (ZSTD_fast) kernel into the production encoder path by replacing the legacy SuffixStore-based Simple matcher with FastKernelMatcher, and updating MatchGeneratorDriver to drive the new per-block matcher lifecycle.

Changes:

Replaces the Simple backend’s legacy MatchGenerator implementation with FastKernelMatcher and routes commit_space/start_matching/skip_matching_with_hint through it.
Simplifies encoding/simple/mod.rs to module declarations + updated backend documentation, removing the legacy matcher implementation.
Adds a new encoding/simple/fast_matcher.rs implementing the Simple/Fast-strategy matcher state machine (history buffer, hash table, eviction, dict priming, buffer recycling) around compress_block_fast.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
zstd/src/encoding/simple/mod.rs	Removes the legacy Simple matcher implementation; leaves module wiring and updated backend docs.
zstd/src/encoding/simple/fast_matcher.rs	Introduces `FastKernelMatcher`, managing flat history + flat hash table and adapting the donor kernel to the driver’s matcher lifecycle.
zstd/src/encoding/match_generator.rs	Rewires `MatcherStorage::Simple` and `MatchGeneratorDriver` to use `FastKernelMatcher`, removes suffix-store pooling, and updates reset/commit/start/skip plumbing accordingly.

…217 review round 1)

Four findings from CodeRabbit + Copilot's first review pass on the squashed PR #217. All four touch comment / metadata accuracy or a single-line correctness issue; no behavioural change beyond the window_size sync.

**CR outside-diff (match_generator.rs:598-613)** (correctness): `reported_window_size` was using the unrounded `max_window_size` while the matcher itself was constructed from `next_pow2` (rounded up). For non-power-of-two constructor inputs (e.g. `slice_size * max_slices_in_window = 100_000`), `window_size()` would report 65_536 (un-rounded floor) while the active backend actually carried 131_072 (rounded-up next_pow2). The drift held until the first `reset()` overwrote both sides from LevelParams. Fix: report `next_pow2` so the two stay in lockstep at construction time.

**CR #1 (match_generator.rs:944)** (declined, deferred to phase 3): Fast levels (Uncompressed, Fastest, Level(-7..=1)) all hard-code donor level-1 cParams. The acceleration gradient between negative-level fast modes and Level(1) lands when phase 3 ports donor's 4-cursor lookahead + cmov match-found + per-level kSearchStrength dispatch (issue #198 items 2/3/5). Updated the inline code comment to scrub the closed-PR reference and frame the deferral against phase 3 directly.

**Copilot #2 (match_generator.rs:572)** (doc): the validation-guard comment described `next_power_of_two` returning 0 on overflow, which was old-Rust behaviour. Modern Rust panics; we now use `checked_next_power_of_two` (commit landed in the squash). Rewrote the comment to enumerate the three actual failure modes (zero args, mul overflow, next-pow2 overflow) and the three guards that catch them.

**Copilot #3 (fast_matcher.rs:303)** (doc): `last_committed_space`'s pre-`accept_data` state description claimed `last_block_start = 0 / history.len() = 0`, but post-RESERVED_PREFIX_BYTES-seed construction (#216 / phase 1b) leaves both at `RESERVED_PREFIX_BYTES`. The returned slice is still empty (the `history[last_block_start..]` range is empty), just for a different reason. Updated the doc to reflect the seeded-dummy invariant.

573/573 tests pass; clippy clean.
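
The construction-time guards and the power-of-two rounding discussed in this commit can be sketched as follows. `derive_window` is a hypothetical helper, not the actual MatchGeneratorDriver::new; only the use of checked_mul and checked_next_power_of_two is taken from the text, and the panic messages are illustrative.

```rust
/// Derive the rounded window size and its log2 from the constructor inputs,
/// panicking with a domain-specific message on each degenerate case.
fn derive_window(slice_size: usize, max_slices_in_window: usize) -> (usize, u32) {
    assert!(slice_size > 0, "slice_size must be non-zero");
    assert!(max_slices_in_window > 0, "max_slices_in_window must be non-zero");
    let max_window_size = slice_size
        .checked_mul(max_slices_in_window)
        .expect("slice_size * max_slices_in_window overflows usize");
    let next_pow2 = max_window_size
        .checked_next_power_of_two()
        .expect("rounded window size overflows usize");
    // For a power of two, trailing_zeros is exactly log2.
    (next_pow2, next_pow2.trailing_zeros())
}

fn main() {
    // The non-power-of-two product from the commit message: 100_000 bytes.
    let (window, window_log) = derive_window(1_000, 100);
    println!("window = {window}, window_log = {window_log}"); // 131072, 17
}
```

Reporting the first tuple element as the window size keeps it identical to what a matcher built from the second element actually holds.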

polaz · 2026-05-21T12:16:10Z

@coderabbitai Re: match_generator.rs:598-613 outside-diff (reported_window_size) — fixed in commit 60a0076. Now reports next_pow2 so it matches the actual matcher's window from construction onward. The drift was a real bug for non-power-of-two slice_size × max_slices_in_window products.

coderabbitai · 2026-05-21T12:16:38Z

@polaz: Thanks for the confirmation! Using next_pow2 to align reported_window_size with the matcher's actual construction-time window is the right fix — without it, any caller inspecting the reported size for a non-power-of-two slice_size × max_slices_in_window product would see a value that diverges from what FastKernelMatcher is actually configured with, which could cause subtle mismatch bugs downstream (e.g. incorrect eviction accounting or window-bound checks). Good catch and clean resolution.

… compactness rule) Doc-only change: collapse the verbose multi-paragraph last_committed_space docstring (added during PR #216 review rounds) into a 6-line three-bullet form. Same semantic content, zero narrative. Apply the new docstring compactness rule (one-two phrases default, multi-paragraph only for non-obvious invariants). 573/573 tests pass; clippy clean.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

…RVED_PREFIX_BYTES seed (#217 Copilot #4, #5) Doc-only update aligning the dict-prime boundary test comment + inline note with the RESERVED_PREFIX_BYTES seed (post-phase-1b range is [RESERVED..=RESERVED], not [0..=0]). Compactness rule applied — concise two-block form. 573/573 tests pass; clippy clean.

polaz · 2026-05-21T12:51:33Z

@coderabbitai Re: match_generator.rs:598-613 outside-diff (reported_window_size) — stale. Fixed in commit 60a0076 (reported_window_size: next_pow2). Current code already matches your suggested diff.

coderabbitai · 2026-05-21T12:51:51Z

@polaz: Confirmed — the reported_window_size: next_pow2(...) fix is already live in commit 60a00764 and the current code matches the suggested diff. I'll mark this comment as resolved.

✅ Actions performed

Comments resolved. Approval is disabled; enable reviews.request_changes_workflow to allow explicit top-level @coderabbitai resolve or @coderabbitai approve commands.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

… bumps (#217 Copilot #6, #7) Two doc comments described prefix_start_index as 'bumped forward' or 'advances' as history is evicted, implying a monotonic absolute index. Actual code (drain_real_prefix) rebases it back to RESERVED_PREFIX_BYTES on every drain — the retained tail is re-indexed in the new coordinate space. Updated both sites (struct field doc + trim_to_window header) to match. Compactness rule applied — trim_to_window header collapsed from 10 lines to 5. 573/573 tests pass; clippy clean.
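The rebasing behaviour described above can be pictured with a small model. This is a sketch under assumptions: the `History` struct and its fields are hypothetical stand-ins, the value `1` for `RESERVED_PREFIX_BYTES` reflects the phase 1b seed mentioned elsewhere in this thread, and the real `drain_real_prefix` may differ in signature and side effects.

```rust
/// Reserved seed length during phase 1b (assumed to be 1 for this sketch).
const RESERVED_PREFIX_BYTES: usize = 1;

/// Hypothetical model of the history buffer, not the crate's struct.
struct History {
    bytes: Vec<u8>,
    prefix_start_index: usize,
}

impl History {
    fn new() -> Self {
        Self {
            bytes: vec![0; RESERVED_PREFIX_BYTES],
            prefix_start_index: RESERVED_PREFIX_BYTES,
        }
    }

    fn append(&mut self, data: &[u8]) {
        self.bytes.extend_from_slice(data);
    }

    /// Drop up to `evict` of the oldest real bytes and return how many were dropped.
    fn drain_real_prefix(&mut self, evict: usize) -> usize {
        let real_len = self.bytes.len() - RESERVED_PREFIX_BYTES;
        let evict = evict.min(real_len);
        self.bytes
            .drain(RESERVED_PREFIX_BYTES..RESERVED_PREFIX_BYTES + evict);
        // The retained tail now starts right after the seed, so the index is
        // reset to the seed length rather than advanced by `evict`.
        self.prefix_start_index = RESERVED_PREFIX_BYTES;
        evict
    }
}
```

Because every surviving byte moves to a new index, any table of absolute positions built before the drain is stale afterwards, which is consistent with the hash table being cleared on trim.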

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

… TODO marker (#217 Copilot #8, #9) #8 — header for FAST_LEVEL_1_HASH_LOG said 'reset path rebinds hash_log proportionally on source-size hint'. Untrue today: driver passes only window_log per-level, hash_log + mls hard-coded. Pin per-level scaling to phase 3. #9 — same lit_len=0 / back-to-back-rep1 concern as the prior PR's #21. Inline + module docs already explain, but verbose prose isn't anchored. Collapsed inline comment to a single '// TODO(#198 phase 3):' line so the deferral marker is unambiguous. 573/573 tests pass; clippy clean.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

zstd/src/encoding/simple/fast_matcher.rs (2)

327-331: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on duplicate accept_data calls.

In release builds this debug_assert! disappears, and Line 370 will silently replace self.pending, dropping an entire committed block. This protocol violation should hard-fail instead of corrupting the stream state.

Suggested change

-        debug_assert!(
+        assert!(
             self.pending.is_none(),
             "FastKernelMatcher: accept_data called with a still-pending buffer; \
              the driver must run start_matching / skip_matching between commits",
         );

Also applies to: 370-370

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/simple/fast_matcher.rs` around lines 327 - 331, Replace the
non-failing debug_assert! in FastKernelMatcher::accept_data with a runtime check
that hard-fails on protocol violation: if self.pending.is_some() then panic!
with the same message (or use assert!), so duplicate accept_data calls cannot
silently overwrite self.pending; also add the same runtime check at the other
site where self.pending is being replaced (the code that currently overwrites
pending) to ensure any attempt to replace an existing pending buffer fails fast
rather than dropping committed data. Ensure messages reference the protocol
requirement to run start_matching / skip_matching between commits.

350-363: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce the oversized-block precondition here.

If space.len() > cap, retain_real becomes 0, but the full block is still appended later. That violates this module’s advertised post-append <= 2 * max_window_size bound and leaves the matcher outside its documented window discipline.

Suggested guard

         let real_len = self.history.len().saturating_sub(RESERVED_PREFIX_BYTES);
         let new_real_total = real_len.saturating_add(space.len());
         let cap = self.max_window_size.saturating_mul(2);
+        assert!(
+            space.len() <= cap,
+            "FastKernelMatcher requires block_size <= 2 * max_window_size \
+             (block_size={}, cap={})",
+            space.len(),
+            cap,
+        );
         if new_real_total > cap {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/simple/fast_matcher.rs` around lines 350 - 363, The code
computes retain_real using cap and space.len() but never enforces the
precondition that the incoming block fits the cap; if space.len() > cap we must
not append the full block because that breaks the module invariant. In
fast_matcher.rs, before computing or using retain_real (and before appending the
incoming block), add a guard that detects space.len() > cap and handles it
explicitly (e.g., early-return an error/result indicating an oversized block or
otherwise reject/truncate the input per the module’s API) so the matcher never
ends up with real_len + space.len() > 2 * self.max_window_size; reference
variables: retain_real, space.len(), cap, self.max_window_size.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_matcher.rs`:
- Around line 517-523: The call to encode_offset_with_history(...)
unconditionally rotates offset_hist on lit_len == 0 emits and breaks the rep ↔
offset_hist invariant; update the fast matcher to detect zero-literal
(literals.len() == 0) rep1 emits and either collapse/skip the
encode_offset_with_history call for those cases or explicitly keep self.rep and
offset_hist consistent (e.g., perform the same rotation on self.rep or avoid
mutating offset_hist) before proceeding; reference the
encode_offset_with_history call and the self.rep / offset_hist state in
fast_matcher.rs to implement the guard or state-sync logic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 34185c8b-d8d0-4c6a-b193-cf50284a5288

📥 Commits

Reviewing files that changed from the base of the PR and between c1c6a9e and 4e497d8.

📒 Files selected for processing (1)

zstd/src/encoding/simple/fast_matcher.rs

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

…n + hoist mls dispatch (#217 round 5)

Four findings from the latest CR + Copilot review pass.

**CR outside-diff (327-331) - assert!, not debug_assert!:** the duplicate-pending guard in accept_data was debug_assert!, so in release builds a double-commit would silently overwrite pending and drop a committed block. Hard-fail instead.

**CR outside-diff (358-363) - block_size <= 2 × max_window_size:** the eviction math computes retain_real via saturating_sub, which silently collapses to 0 if space.len() > cap. The full block is still appended afterwards, violating the documented 'history bounded by 2 × max_window_size' invariant. Added an assert! precondition so callers see a clear panic at the boundary instead of an invisible invariant break.

**CR #10 - lit_len == 0 offset_hist rotation (third raise):** the prior round added a TODO marker explaining the divergence, but CR correctly pushed back that documentation isn't a fix. Skip encode_offset_with_history when literals.is_empty() so offset_hist stays in lockstep with the kernel's unchanged rep; no divergence on the back-to-back rep1 path. Wire encoder downstream still sees the Triple with raw offset; its own encoding stays correct (lit_len-0 absolute encoding per RFC 8878 §3.1.2.5). Module docstring 'Known divergence' section collapsed accordingly.

**Copilot #11 - hoist mls dispatch outside prime_hash loop:** moved the per-MLS match arm OUTSIDE the per-position loop. New prime_hash_table_impl<const MLS: u32> is monomorphised per matcher instance; the hot path is branch-free on mls.

573/573 tests pass; clippy clean.
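The dispatch hoisting from the Copilot #11 item can be sketched as below. Everything here is illustrative: `hash_at`, `prime_impl`, and `prime_hash_table` are hypothetical names, the multiplier is an arbitrary odd constant rather than the kernel's hash, and the sketch assumes `table.len() == 1 << hash_log` with `hash_log` between 1 and 63.

```rust
/// Hypothetical hash: read MLS bytes little-endian, multiply, keep the top bits.
fn hash_at<const MLS: u32>(data: &[u8], pos: usize, hash_log: u32) -> usize {
    let mut v: u64 = 0;
    for i in 0..MLS as usize {
        v |= (data[pos + i] as u64) << (8 * i);
    }
    (v.wrapping_mul(0x9E37_79B1_85EB_CA87) >> (64 - hash_log)) as usize
}

/// Per-position loop, monomorphised per MLS, so the loop body has no branch on `mls`.
fn prime_impl<const MLS: u32>(table: &mut [u32], data: &[u8], hash_log: u32) {
    let n = MLS as usize;
    if data.len() < n {
        return;
    }
    for pos in 0..=data.len() - n {
        table[hash_at::<MLS>(data, pos, hash_log)] = pos as u32;
    }
}

/// Runtime dispatch happens once, outside the loop.
fn prime_hash_table(table: &mut [u32], data: &[u8], hash_log: u32, mls: u32) {
    match mls {
        4 => prime_impl::<4>(table, data, hash_log),
        5 => prime_impl::<5>(table, data, hash_log),
        6 => prime_impl::<6>(table, data, hash_log),
        7 => prime_impl::<7>(table, data, hash_log),
        8 => prime_impl::<8>(table, data, hash_log),
        _ => panic!("unsupported mls {}", mls),
    }
}
```

The trade-off is one copy of the loop per supported `mls` value in exchange for a loop body the compiler can specialise on the constant.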

polaz · 2026-05-21T14:37:56Z

@coderabbitai Outside-diff items fixed in a5b2f6a:

327-331 — debug_assert! → assert! on accept_data pending guard
358-363 — added assert!(space.len() <= cap) precondition before the eviction math

Both panic with clear domain-specific messages on protocol / precondition violation instead of silently corrupting state.

coderabbitai · 2026-05-21T14:38:14Z

@polaz: Confirmed — both fixes look correct:

327-331: Upgrading debug_assert! → assert! ensures the pending-block guard fires in release builds too, which is appropriate since a violation here indicates a caller protocol error that should never be silently swallowed.
358-363: The explicit assert!(space.len() <= cap) precondition before the eviction math gives a precise, early failure point with a meaningful message, rather than letting the arithmetic produce a silent underflow or wrong eviction count.

Good hardening — both changes make the contract-violation surface clear and immediate regardless of build profile.
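The two hardened preconditions discussed in this exchange can be sketched together. This is a simplified model, not the PR's code: `MatcherSketch`, `real_len`, and `bytes_to_evict` are hypothetical, and the thread does not show which method hosts the second check, so placing it in a separate helper is an assumption.

```rust
/// Hypothetical model of the matcher state relevant to the two guards.
struct MatcherSketch {
    real_len: usize,        // real (non-seed) history bytes currently held
    max_window_size: usize, // window bound; history may reach 2x this
    pending: Option<Vec<u8>>,
}

impl MatcherSketch {
    fn new(max_window_size: usize) -> Self {
        Self { real_len: 0, max_window_size, pending: None }
    }

    /// Guard 1: `assert!` (not `debug_assert!`) so a double commit also
    /// panics in release builds instead of overwriting `pending`.
    fn accept_data(&mut self, space: Vec<u8>) {
        assert!(
            self.pending.is_none(),
            "accept_data called with a still-pending buffer"
        );
        self.pending = Some(space);
    }

    /// Guard 2: reject oversized blocks before the eviction arithmetic.
    /// Returns how many old bytes must go so the total stays within the cap.
    fn bytes_to_evict(&self, block_len: usize) -> usize {
        let cap = self.max_window_size.saturating_mul(2);
        assert!(
            block_len <= cap,
            "block_size {} exceeds 2 * max_window_size ({})",
            block_len,
            cap
        );
        self.real_len.saturating_add(block_len).saturating_sub(cap)
    }
}
```

With the second assertion in place, the eviction count never exceeds `real_len`, so the post-append total is at most the cap.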

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Copilot #12, revert round 5 #10)

Round 5 skipped encode_offset_with_history when literals.is_empty() to keep matcher.offset_hist 'in lockstep' with kernel rep. Copilot correctly pushed back: Dfast / Row / HashChain matchers all call it unconditionally (passing lit_len = 0 when applicable), and the 'lockstep' framing was wrong: matcher.offset_hist tracks the WIRE ENCODER's history while matcher.rep tracks the KERNEL's state. They're not supposed to be the same.

Revert the skip. Module docstring rewritten to call out that the two fields reflect DIFFERENT state and may diverge on lit_len = 0 emits per RFC 8878 §3.1.2.5; both halves stay self-consistent within their own domain.

Round 5 #10 reply to CR was based on the wrong mental model; that's reflected in this fix by reverting to the donor / other-matchers behaviour.

573/573 tests pass; clippy clean.
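To make the divergence concrete, here is a sketch of a repeat-offset encoder written from my reading of the RFC 8878 repeat-offset rules. It is not the crate's `encode_offset_with_history`: the name `encode_offset_sketch`, its signature, and the three-entry array layout are assumptions for illustration.

```rust
/// Hypothetical encoder: returns the offset code and updates the wire history.
fn encode_offset_sketch(actual_offset: u32, lit_len: u32, hist: &mut [u32; 3]) -> u32 {
    let code = if lit_len > 0 {
        // Normal case: codes 1..=3 name rep1, rep2, rep3.
        if actual_offset == hist[0] {
            1
        } else if actual_offset == hist[1] {
            2
        } else if actual_offset == hist[2] {
            3
        } else {
            actual_offset + 3
        }
    } else if actual_offset == hist[1] {
        // lit_len == 0 shifts the meaning: code 1 is rep2, code 2 is rep3,
        // code 3 is rep1 - 1. A plain rep1 has no repeat code here.
        1
    } else if actual_offset == hist[2] {
        2
    } else if hist[0] > 1 && actual_offset == hist[0] - 1 {
        3
    } else {
        actual_offset + 3
    };

    match (lit_len > 0, code) {
        // rep1 with literals: history unchanged.
        (true, 1) => {}
        // rep2 used: swap the first two entries.
        (true, 2) | (false, 1) => hist.swap(0, 1),
        // Everything else pushes the used offset to the front.
        _ => {
            hist[2] = hist[1];
            hist[1] = hist[0];
            hist[0] = actual_offset;
        }
    }
    code
}
```

In this model, a zero-literal match at an offset equal to rep1 is encoded as a fresh offset and pushed to the front of the wire history (for example `[4, 1, 8]` becomes `[4, 4, 1]`), while a kernel that simply reuses its rep1 leaves its own state untouched. That is one way the two fields can legitimately differ.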

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

… + per-level cParams + cmov dispatch + window-correctness (#198 phase 3)

Closes phase 3 of #198: native Rust port of donor's `ZSTD_compressBlock_fast_noDict_generic` (`zstd_fast.c:192-456`) restoring Fast-strategy throughput after the phase 1b regression and aligning encoder behaviour with donor parity on every front that affects ratio / speed / format correctness.

## Headline results

`large-log-stream` (25 MiB dense corpus, i9-9900K):

| State                        | Time vs main |     Throughput |
|------------------------------|-------------:|---------------:|
| main (pre-phase-1b)          |           0% |     ~290 MiB/s |
| #217 merged (phase 1b)       |        +122% |     ~235 MiB/s |
| **This PR (phase 3, M1-M9)** |     **−22%** | **~790 MiB/s** |

Net change phase 1b → phase 3: ~3.4× faster.

## Ratio audit on `decodecorpus-z000033`

- L-7..L-1 Fast: monotone acceleration gradient restored (−1.85% to −4.79% vs donor); pre-M6 these levels all produced identical 585601 bytes
- L2-3 Dfast: parity or slight win (-0.94% to -4.61%)
- L5-15 Lazy: consistently beat donor by −5.7% to −6.18%
- L16-17 btopt: parity / win (−0.33% to −3.59%)
- L20-22 btultra2: parity (±0.2%)
- L1 Fast: +7.43% residual gap, tracked as follow-up #220

## Milestones (squashed)

- M1: per-level `fast_hash_log`/`fast_mls`/`fast_step_size` threading through `LevelParams`
- M2: full 4-cursor `ip0/ip1/ip2/ip3` lookahead body + immediate-rep2 inner loop ported from `zstd_fast.c`
- M3: `cmov` match-found variant + per-window dispatch surface (10 monomorphisations across `mls` 4..=8 × `use_cmov` true/false)
- M4: beyond-donor `fast_hash_log: 13 → 14` for negative levels (+32 KB memory, 2× fewer collisions on structured corpora)
- M5: reverted (adaptive mls peek did not pay off)
- M6: per-level `fast_step_size` from donor's `targetLength = -level` formula; restores acceleration gradient
- M7: added donor's missing `current0+2` hash insertion after each match emit (`zstd_fast.c:407`); raised L1/decodecorpus sequence-match-rate 43.1% → 57.7%
- M8: dropped `RESERVED_PREFIX_BYTES` dummy byte; history layout now donor-parity, sentinel-0 protection via `INITIAL_PREFIX_START_INDEX = 1` filter
- M9 (format correctness): sliding prefix floor at scan time enforces the advertised frame window (`1 << window_log`), NOT the dictionary-budget-inflated `max_window_size`. Prevents emitting offsets > advertised window during dictionary-primed compression, which would otherwise produce format-invalid frames. Regression test included.

## API changes (driver)

- `FastKernelMatcher::with_params(window_log, hash_log, mls, step_size)`: step_size promoted to 4th positional arg; previous `set_step_size` post-init setter removed (closed the gap where new code paths could silently fall back to default step_size=2)
- `FastKernelMatcher::reset(window_log, hash_log, mls, step_size)`: same signature change
- `RESERVED_PREFIX_BYTES`: 1 → 0 (no dummy region); legacy name retained for drain-offset math
- New `INITIAL_PREFIX_START_INDEX = 1` constant for sentinel-0 filter baseline

## Kernel structure

- `match_found<USE_CMOV>`: branchless cmov / branch dispatch via const generic; `# Safety` documented; bitwise `&` intentional to preserve cmov ordering (donor `__asm__("")` equivalent)
- `MatchFound::{Rep,Explicit}` variants carry explicit `current0` (donor's writeback position); correct on both probe paths, pre-backward-extension
- `ip3 > ilimit` exit (not `>=`): recovers the last hashable position at end-of-block
- Two hash insertions post-match-emit: at `current0+2` (donor zstd_fast.c:407, was missing) and `ip0-2`

## Tests

- 577 nextest, all passing
- New per-level dispatch test pinning hash_log / mls / step_size for L1, Fastest, Uncompressed, L-1..L-7
- cmov vs branch byte-for-byte equality + cmov out-of-window false-positive regression
- explicit_match_backward_extension deterministic via marker-byte layout (asserts match_len ≥ 5 + literals don't end with marker)
- start_matching_enforces_max_window_size_offset_bound
- start_matching_caps_offsets_at_window_log_not_inflated_max (M9 format-correctness regression)
- cross_validation FFI roundtrips on every level 1..=22 (dict + no-dict); wire-format interop preserved

## Out of scope (follow-ups)

- #220: Fast L1 +7.43% ratio residual on decodecorpus
- L18-19 btultra +3.5%: separate ratio investigation
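The M9 window floor can be illustrated with a small sketch. The helper names `window_floor` and `candidate_in_window` are hypothetical, and the real kernel works on its own cursor and index types; the point is only that the bound comes from `1 << window_log`, not from a larger internal history budget.

```rust
/// Lowest history index a match may start at so that
/// `current - candidate <= 1 << window_log`, never below the prefix start.
fn window_floor(current: usize, window_log: u32, prefix_start_index: usize) -> usize {
    let window = 1usize << window_log;
    current.saturating_sub(window).max(prefix_start_index)
}

/// A candidate is usable only if it lies before `current` and at or above the floor.
fn candidate_in_window(
    candidate: usize,
    current: usize,
    window_log: u32,
    prefix_start_index: usize,
) -> bool {
    candidate < current && candidate >= window_floor(current, window_log, prefix_start_index)
}
```

With a prefix start of 1, index 0 is always rejected, which lines up with the sentinel-0 filter described under M8: an empty hash slot reads as position 0 and is never treated as a match.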

Copilot AI review requested due to automatic review settings May 21, 2026 11:55

Copilot started reviewing on behalf of polaz May 21, 2026 11:56

coderabbitai Bot reviewed May 21, 2026

Comment thread zstd/src/encoding/match_generator.rs Outdated