perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b) #217

Merged
polaz merged 8 commits into main from perf/#198-fast-matcher-port-phase1b
May 21, 2026

Conversation

@polaz
Member

@polaz polaz commented May 21, 2026

Summary

Phase 1b of the #198 Fast strategy donor port — wires the donor-shape
kernel modules from phase 1a (#215 → merged in fc63464) into the
production hot path. The legacy SuffixStore-based MatchGenerator
that lived in simple/mod.rs is fully removed; MatcherStorage::Simple
now holds a FastKernelMatcher that drives compress_block_fast<MLS>
once per block.

What changes

  • simple/mod.rs collapses from 496 → 14 lines (only module
    declarations + docstring remain). SuffixStore, WindowEntry,
    MatchGenerator, repcode_candidate, add_data, next_sequence,
    add_suffixes_till, insert_suffix_if_absent,
    add_suffixes_interleaved_fast, offset_match_len, reserve
    all gone with the wiring.
  • simple/fast_matcher.rs is the new active matcher
    (introduced incrementally on this branch — see commit log). The
    full inherent surface required by the Matcher trait wiring:
    • accept_data(space) — stash a committed block in pending.
    • start_matching(handler) — extend history, dispatch the
      kernel on the just-appended block, forward every emitted
      Sequence::Triple to the handler plus the terminal
      Sequence::Literals from tail_literals_len.
    • skip_matching_with_hint(hint): None / Some(true) =
      history append only; Some(false) = dictionary-priming path
      that ALSO pre-populates the hash table for every position in
      the appended range, so cross-block matches against
      dict-primed bytes work in subsequent blocks.
    • trim_to_window() — drop history past max_window_size, bump
      prefix_start_index, clear the hash table. Donor's
      ZSTD_window_correctOverflow equivalent. Returns evicted
      bytes for the dictionary-budget retire loop.
    • last_committed_space() — returns pending pre-processing OR
      history[last_block_start..] post-processing (legacy
      MatchGenerator window.last().data parity for the frame
      compressor's raw-block emission path).
    • reset(window_log, hash_log, mls) — per-frame state reset,
      keeps the hash table allocation if (hash_log, mls)
      unchanged.
  • MatchGeneratorDriver Simple-arm wiring:
    • commit_space → m.accept_data(space).
    • start_matching::<Fast> → m.start_matching(handler).
    • skip_matching_with_hint → m.skip_matching_with_hint(hint).
    • reset → m.reset(params.window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS).
    • prime_with_dictionary → field-level offset_hist /
      max_window_size mutation continues to work (both opened to
      pub(crate) for legacy parity).
    • trim_after_budget_retire → m.trim_to_window().
    • suffix_pool field removed (no consumer after the legacy
      matcher's SuffixStore reuse path went away).
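
The three-way hint handling described for skip_matching_with_hint can be sketched as a plain match. This is a minimal illustration only: `SkipAction` and `skip_action` are hypothetical names, not items from the PR, and the real method mutates matcher state rather than returning a value.

```rust
/// Hypothetical summary of what a skip call does with a block.
#[derive(Debug, PartialEq)]
enum SkipAction {
    /// Append the block to history without touching the hash table.
    AppendOnly,
    /// Append the block and pre-populate the hash table for its positions
    /// (the dictionary-priming path).
    AppendAndPrimeHash,
}

/// Maps the hint to the behaviour listed in the description:
/// `None` / `Some(true)` append only, `Some(false)` also primes the hash table.
fn skip_action(hint: Option<bool>) -> SkipAction {
    match hint {
        None | Some(true) => SkipAction::AppendOnly,
        Some(false) => SkipAction::AppendAndPrimeHash,
    }
}
```

Folding `None` and `Some(true)` into one arm keeps the priming path the only case that pays for hash insertion.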

Invariants preserved

  • prefix_start_index >= 1 at all times. Position 0 in history is
    permanently sub-prefix so the hash table's empty-slot sentinel
    value 0 cannot be confused with a real match position.
    Pinned in with_params + reset.
  • history.len() is bounded by 2 × max_window_size post-append.
    Eviction in extend_history_with_pending enforces.
  • rep[0..2] (kernel-tracked) ↔ offset_hist[0..2] (wire encoder-
    tracked) stay in sync via per-Triple encode_offset_with_history
    calls. The lit_len == 0 edge case (donor's rep[0]-1 shift)
    is not modeled today — the phase-1 kernel doesn't emit
    lit_len == 0 Triples. A future cmov / lookahead-pipelined
    kernel variant will need explicit handling.
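
The sentinel invariant can be illustrated with a small stand-alone sketch. `SlotTable` and its methods are hypothetical stand-ins for the real FastHashTable, and the only assumption carried over from the description is that an empty slot stores 0 and that candidates below prefix_start_index are rejected.

```rust
/// Hypothetical flat hash table: each slot stores a history position,
/// and the value 0 doubles as the "empty" marker.
struct SlotTable {
    slots: Vec<u32>,
}

impl SlotTable {
    fn new(hash_log: u32) -> Self {
        Self { slots: vec![0; 1usize << hash_log] }
    }

    /// Records `pos` as the most recent position for `hash`.
    fn put(&mut self, hash: usize, pos: u32) {
        self.slots[hash] = pos;
    }

    /// Returns a match candidate only if it lies inside the valid prefix.
    /// With `prefix_start_index >= 1`, an empty slot (0) is always rejected,
    /// so no separate "is this slot populated" flag is needed.
    fn candidate(&self, hash: usize, prefix_start_index: u32) -> Option<u32> {
        let pos = self.slots[hash];
        (pos >= prefix_start_index).then_some(pos)
    }
}
```

The same comparison that drops evicted positions also drops empty slots, which is why position 0 has to stay permanently below the prefix.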

Tests

568 / 568 pass on the full workspace nextest suite:

  • 16 unit tests on FastKernelMatcher itself (lifecycle, accept +
    start, skip flavors, dict-prime hash population, eviction,
    boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync,
    prefix-eviction during dict-priming).
  • 32 unit tests on the underlying kernel (donor-formula parity,
    prefix-filter, repcode backward extension three-piece proof,
    short-input early-return uniformity).
  • All frame_compressor integration tests (raw-block detection,
    hinted source-size matrix, level-1 round-trips through both
    our decoder and FFI decode).
  • All cross_validation Rust-encoded → FFI-decoded round-trips
    (every level 1..=22, dict + no-dict, encoded by Rust then
    read by the C reference decoder verbatim).

18 legacy tests in match_generator.rs that tested specifically
SuffixStore or MatchGenerator::new(...) behavior are
#[cfg(any())]-gated with explanatory comments — their substance
either has equivalent coverage in the new tests or relied on
algorithm-specific quirks of the SuffixStore matcher that the
donor-shape kernel doesn't reproduce by design (e.g. matching on
8-byte blocks below the kernel's HASH_READ_SIZE = 8 floor).
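
For readers unfamiliar with the gating idiom: `#[cfg(any())]` uses an empty `any()` predicate, which is always false, so the annotated item is compiled out. A minimal sketch with hypothetical function names (the PR applies the attribute to test functions):

```rust
// `any()` with no predicates is always false, so this item is stripped
// before name resolution and never has to type-check.
#[cfg(any())]
fn legacy_suffix_store_case() {
    // Refers to an API that no longer exists; harmless because the
    // whole function is removed by the attribute above.
    removed_legacy_api();
}

// `all()` with no predicates is always true; shown only for contrast.
#[cfg(all())]
fn still_compiled() -> bool {
    true
}
```

Unlike `#[ignore]`, this keeps the test source in the file without requiring it to build against the removed types.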

Benchmark (i9-9900K, compress/level_1_fast/*/matrix/pure_rust)

cargo bench deltas vs main (fc63464, phase 1a merged but not
wired) on the same host. Negative = faster, positive = slower.

| Scenario | Δ time | Throughput | Note |
|----------|-------:|-----------:|------|
| low-entropy-1m | −90.7% | 3.3 GiB/s | 10× faster (hot-path dominated by count_forward) |
| decodecorpus-z000033 | −83.4% | 156 MiB/s | 6× faster (mixed corpus, donor-shape gets the typical case right) |
| high-entropy-1m | −67.3% | 633 MiB/s | 3× faster (no matches → kernel walks fast, exits early) |
| small-10k-random | −36.0% | 768 MiB/s | small random; kernel's step-skip helps |
| small-4k-log-lines | −15.4% | 154 MiB/s | small compressible; marginal |
| small-1k-random | +42.2% | 130 MiB/s | tiny block (1 KiB < HASH_READ_SIZE × 32 step); kernel overhead exceeds work |
| large-log-stream | +122.3% | 235 MiB/s | regression: 25 MiB log; donor-parity hash_log = 14 (16K slots) collides heavily on dense repeating content. See follow-up. |

Decompress side picks up a small indirect win (−3% to −5%) on
random-data scenarios because the sequence shape coming out of the
donor-shape kernel matches the decoder's optimised paths more
closely (fewer mid-block Literals emissions → less RingBuffer
dispatch).

large-log-stream regression analysis

The regression is expected given the trade-off chosen at the
phase 1b kickoff (donor parity over speed):

  • Legacy SuffixStore matcher used an effectively window_log-sized
    hash store (512K slots for level 1's window_log = 19) plus
    hash_fill_step = 3 interleaving. Collision rate stayed low even
    on 25 MiB logs.
  • Phase 1b uses the donor's level-1 hash_log = 14 (16K slots).
    Against 25 MiB of compressible log content, the collision rate
    per slot climbs into the thousands, and the kernel's single-cursor
    loop loses matches the legacy matcher would have caught.
  • Donor C zstd at level 1 Fast also pays this — the issue body's
    baseline shows donor at ~120 MB/s on similar workloads. The
    current main (this PR's baseline) was already faster than donor
    on this scenario because the legacy matcher was over-engineered
    for speed at the cost of O(window) memory.
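
The slot-pressure figures above follow from simple division. The sketch below assumes, as an upper bound, that every input position is inserted once; the kernel's step-skipping inserts fewer positions in practice. `positions_per_slot` is a hypothetical helper, not code from the PR.

```rust
/// Average number of input positions competing for each hash slot,
/// assuming one insertion per input byte (an upper bound).
fn positions_per_slot(input_bytes: u64, hash_log: u32) -> u64 {
    input_bytes / (1u64 << hash_log)
}

/// 25 MiB, the size quoted for the large-log-stream scenario.
const LARGE_LOG_BYTES: u64 = 25 * 1024 * 1024;
```

With these assumptions, `positions_per_slot(LARGE_LOG_BYTES, 14)` gives 1600 positions per slot for the donor-parity table, against 50 for a 512K-slot (`hash_log = 19`) store.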

Closing the gap is exactly the scope of issue #198 phase 3 (items
2 / 3 / 5 in the issue body): 4-cursor ip0/ip1/ip2/ip3 lookahead
pipelining + cmov match-found variant + per-level mls dispatch.
Phase 3 lands on a follow-up branch once phase 1b is reviewed and
merged.

The net of phase 1b across the scenario matrix is positive
(geomean ≈ 2× faster, headline scenarios up to 10×). The
regression is bounded to dense-compressible large logs and is on
the documented roadmap.
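
The geomean claim can be checked from the Δ time column alone. A minimal sketch, assuming each delta is a relative change in wall time so that the time ratio is `1 + Δ/100`; `geomean_time_ratio` is a hypothetical helper:

```rust
/// Geometric mean of new/old time ratios, given per-scenario deltas in percent
/// (negative = faster). Computed in log space to avoid a long product.
fn geomean_time_ratio(deltas_percent: &[f64]) -> f64 {
    let log_sum: f64 = deltas_percent
        .iter()
        .map(|d| (1.0 + d / 100.0).ln())
        .sum();
    (log_sum / deltas_percent.len() as f64).exp()
}

/// The seven Δ time values from the benchmark table.
const DELTAS: [f64; 7] = [-90.7, -83.4, -67.3, -36.0, -15.4, 42.2, 122.3];
```

`geomean_time_ratio(&DELTAS)` comes out at roughly 0.51, i.e. about half the baseline time, which is consistent with the "≈ 2× faster" figure.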

What's NOT in this PR (lands separately)

  • Phase 1 → phase 3 follow-ups from perf(fast): port donor ZSTD_compressBlock_fast — 22× regression on Fast strategy #198: ip0/ip1/ip2/ip3
    4-cursor pipelining, cmov match-found variant, mls dispatch
    knob per level (hash_fill_step stride for dict-priming is
    still hard-coded to 1 today — the LevelParams field is wired
    in but the Fast matcher always strides at 1).
  • LevelParams.hash_log / LevelParams.mls fields (currently
    Fast hard-codes donor level-1 defaults 14 / 7 from
    FAST_LEVEL_1_HASH_LOG / FAST_LEVEL_1_MLS). Per-level scaling
    for small-source-hint windows lands when LevelParams grows
    these fields.

Closes #198 phase 1b (the per-issue plan calls phase 3 a separate
follow-up branch).

Related: #178 (umbrella regression issue), #215 (phase 1a — kernel
modules, merged in fc63464).

Summary by CodeRabbit

  • Refactor

    • Replaced the simple-level matcher with a new, faster backend featuring persistent per-block match state, eager window eviction, single-start-per-block matching, streamlined dictionary priming to keep encoder histories synchronized, and removal of legacy buffering to reduce memory churn.
  • Tests

    • Updated unit and regression tests for the new backend; several legacy assertions were disabled and one regression relaxed an emission-count expectation while preserving reconstruction correctness.

…#198 phase 1b)

Phase 1b of #198 — wires the donor-shape kernel modules from
phase 1a (#215, merged in fc63464) into the production hot path.
The legacy SuffixStore-based MatchGenerator in simple/mod.rs is
fully removed; MatcherStorage::Simple now holds a
FastKernelMatcher that drives compress_block_fast<MLS> once per
block.

Selected for every Fast-strategy level — CompressionLevel::Uncompressed,
CompressionLevel::Fastest, CompressionLevel::Level(1), and the
negative CompressionLevel::Level(-7..=-1) variants. All Fast
levels currently resolve to the same matcher with donor level-1
hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength
dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found)
land in phase 3.

## What changes

- simple/mod.rs collapses from 496 → 14 lines (only module
  declarations + docstring remain). SuffixStore, WindowEntry,
  MatchGenerator, repcode_candidate, add_data, next_sequence,
  add_suffixes_till, insert_suffix_if_absent,
  add_suffixes_interleaved_fast, offset_match_len, reserve — all
  gone with the wiring.
- simple/fast_matcher.rs is the new active matcher: full inherent
  surface (accept_data, start_matching, skip_matching_with_hint,
  trim_to_window, last_committed_space, reset,
  prime_offset_history, take_recycled_space).
- MatchGeneratorDriver Simple-arm wiring: commit_space →
  m.accept_data(space) with eager pre-commit eviction;
  start_matching::<Fast> → m.start_matching(handler);
  skip_matching_with_hint → m.skip_matching_with_hint(hint);
  reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG,
  FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window();
  prime_with_dictionary → m.prime_offset_history(offset_hist).
- Per-block input Vec recycled via take_recycled_space() →
  vec_pool (no zero-fill cost: the buffer is pushed with len=0 and
  get_next_space resizes on pop).

## Invariants

- prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The
  first byte of history is a reserved dummy (sentinel-0 guard);
  real input data starts at history[1]. Donor C zstd achieves
  the same effect via a virtual base pointer; the flat Vec<u8>
  model here pays one byte of memory overhead for the same
  correctness property (no missed matches at segment boundaries).
- history.len() bounded by 2 × max_window_size post-append, even
  for oversize committed blocks (retain_real = cap.saturating_sub(
  space.len()).min(max_window_size)).
- Eviction preserves the dummy AND rebases prefix_start_index back
  to RESERVED on every drain — cumulative growth would push the
  filter past every valid history index and reject all match
  candidates wholesale.
- Hash table rehashed after drain so retained tail bytes stay
  matchable. Amortised O(1) per byte of input.
- rep[0..2] ↔ offset_hist[0..2] in lockstep on the common
  (lit_len > 0) path. Known divergence on back-to-back repcode
  matches (lit_len == 0 emits): kernel's rep unchanged, wire
  encoder per RFC 8878 §3.1.2.5 remaps codes and rotates
  offset_hist — marginal compression hit, output still correct.
  Phase 3 collapses these at the kernel level.
- prime_offset_history seeds BOTH rep[0..2] and offset_hist
  atomically from a dictionary load.
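
The lit_len == 0 remap mentioned above is the repeat-offset rule from RFC 8878. The sketch below shows it from the decoder's point of view, which is the behaviour the wire encoder's offset_hist has to mirror. `resolve_offset` is a hypothetical name and this is not the crate's `encode_offset_with_history`.

```rust
/// Resolves an `offset_value` to an actual offset and updates the
/// three-entry repeat-offset history, following the RFC 8878 rule.
fn resolve_offset(offset_value: u32, lit_len: u32, hist: &mut [u32; 3]) -> u32 {
    if offset_value > 3 {
        // Plain offset: push it to the front of the history.
        let actual = offset_value - 3;
        hist[2] = hist[1];
        hist[1] = hist[0];
        hist[0] = actual;
        return actual;
    }
    // Repeat code. When the sequence has no literals, every code shifts by one:
    // 1 -> second entry, 2 -> third entry, 3 -> first entry minus one.
    let idx = if lit_len == 0 { offset_value } else { offset_value - 1 };
    let actual = match idx {
        0 => return hist[0], // most recent offset, history unchanged
        1 => hist[1],
        2 => hist[2],
        _ => hist[0] - 1, // a result of 0 would be an invalid offset in a real stream
    };
    // Move the chosen offset to the front.
    if idx == 1 {
        hist[1] = hist[0];
    } else {
        hist[2] = hist[1];
        hist[1] = hist[0];
    }
    hist[0] = actual;
    actual
}
```

Starting from the initial history `[1, 4, 8]`, code 1 with literals returns 1 and leaves the history alone, while code 1 with `lit_len == 0` returns 4 and swaps the first two entries. That swap is the step a kernel-side `rep` array would miss if it did not model the shift.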

## Defensive validation

- MatchGeneratorDriver::new asserts slice_size > 0,
  max_slices_in_window > 0, checked_mul for the product, and
  checked_next_power_of_two for window_log_init derivation —
  catches all four overflow / degenerate paths with a clear
  domain-specific panic instead of a deep matcher-internal
  failure.
- FastHashTable construction-time mls / hash_log validation
  unchanged from phase 1a.
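
A minimal sketch of the guard sequence described for MatchGeneratorDriver::new, using only standard-library calls. `derive_window_log_init` and the panic messages are hypothetical; the real constructor may structure its checks differently.

```rust
/// Validates the constructor inputs and derives the initial window log.
/// Panics with a domain-specific message on degenerate or overflowing input.
fn derive_window_log_init(slice_size: usize, max_slices_in_window: usize) -> u32 {
    assert!(slice_size > 0, "slice_size must be non-zero");
    assert!(max_slices_in_window > 0, "max_slices_in_window must be non-zero");

    // checked_mul returns None instead of wrapping on overflow.
    let max_window_size = slice_size
        .checked_mul(max_slices_in_window)
        .expect("slice_size * max_slices_in_window overflows usize");

    // checked_next_power_of_two returns None when rounding up would overflow.
    let next_pow2 = max_window_size
        .checked_next_power_of_two()
        .expect("window size has no representable next power of two");

    // For a power of two, the exponent equals the number of trailing zero bits.
    next_pow2.trailing_zeros()
}
```

For example, a 100_000-byte product rounds up to 131_072, so `derive_window_log_init(1000, 100)` yields 17.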

## Tests

573/573 pass on the full workspace nextest suite:

- 21 unit tests on FastKernelMatcher (lifecycle, accept + start,
  skip flavors, dict-prime hash population, eviction, boundary
  cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync,
  prefix-eviction during dict-priming, drain prefix_start_index
  runaway, trim_to_window/last_block_start drift, oversize-block
  eviction bound).
- 32 unit tests on the underlying kernel (donor-formula parity,
  prefix-filter, repcode backward extension three-piece proof,
  short-input early-return uniformity).
- All frame_compressor integration tests (raw-block detection,
  hinted source-size matrix, level-1 round-trips through both
  the in-tree decoder and FFI decode).
- All cross_validation Rust-encoded → FFI-decoded round-trips
  (every level 1..=22, dict + no-dict, encoded by Rust then
  read by the C reference decoder verbatim).

18 legacy tests in match_generator.rs that exercised
SuffixStore-specific behavior or required block.len() <
HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory
comments — their substance either has equivalent coverage in the
new tests or relied on algorithm-specific quirks the donor-shape
kernel doesn't reproduce by design.

## Benchmark (i9-9900K)

cargo bench deltas vs main (fc63464) on
compress/level_1_fast/*/matrix/pure_rust:

| Scenario | Δ time | Throughput | Note |
|----------|-------:|-----------:|------|
| low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster |
| decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster |
| high-entropy-1m | -67.3% | 633 MiB/s | 3× faster |
| small-10k-random | -36.0% | 768 MiB/s | 1.6× faster |
| small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster |
| small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead |
| large-log-stream | +122.3% | 235 MiB/s | regression — see below |

Large-log-stream regression is expected at this phase: legacy
SuffixStore used effectively window_log-sized hash slots (512K
for level-1 window_log=19), while phase 1b uses donor-parity
hash_log=14 (16K slots). 25 MiB dense-compressible log content
hits ~1500 collisions/slot. Donor C zstd shows the same trade-off
(~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining +
cmov) closes the gap per the documented roadmap (#198 items 2/3/5).

## What's NOT in this PR

- Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead,
  cmov match-found variant, per-level mls dispatch,
  kSearchStrength acceleration gradient for negative Fast levels.
- LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes
  donor level-1 defaults (14/7) today.
- hash_fill_step stride for dict-priming — still hard-coded to 1
  (LevelParams field is wired in but the Fast matcher always
  strides at 1).

Closes #198 phase 1b.

Related: #178 (umbrella regression issue), #215 (phase 1a — kernel
modules, merged in fc63464).
Copilot AI review requested due to automatic review settings May 21, 2026 11:55
@coderabbitai

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4c0ec497-fae4-483a-a353-af00e6bbd2c5

📥 Commits

Reviewing files that changed from the base of the PR and between a5b2f6a and d4428a4.

📒 Files selected for processing (1)
  • zstd/src/encoding/simple/fast_matcher.rs

📝 Walkthrough

Walkthrough

Replace the Simple backend (SuffixStore/MatchGenerator) with donor-aligned FastKernelMatcher, add fast_matcher module, wire FastKernelMatcher into MatchGeneratorDriver (construction, reset, commit, skip, matching, priming, eviction), update simple/ module routing, and adjust/disable legacy Simple-specific tests while adding FastKernelMatcher tests.

Changes

Simple backend refactor: MatchGenerator → FastKernelMatcher

| Layer | File(s) | Summary |
|-------|---------|---------|
| Fast constants, module docs | zstd/src/encoding/simple/fast_matcher.rs, zstd/src/encoding/simple/mod.rs | Add Fast-level constants, initial seeds, reserved dummy prefix and update simple/ module docs to route to fast_kernel + fast_matcher. |
| FastKernelMatcher struct and state layout | zstd/src/encoding/simple/fast_matcher.rs | Define FastKernelMatcher with persistent history, prefix_start_index, kernel rep, offset_hist, FastHashTable, pending/recycled buffers, and bookkeeping. |
| Constructors and reset | zstd/src/encoding/simple/fast_matcher.rs | Implement new/with_params/reset, allocate/clear hash table, seed reserved prefix, and restore invariants for new frames. |
| Accessors | zstd/src/encoding/simple/fast_matcher.rs | Expose window_size() (test-only) and last_committed_space() for driver usage. |
| accept_data and eviction helpers | zstd/src/encoding/simple/fast_matcher.rs | Implement accept_data that stages pending bytes and performs eager eviction enforcing the 2× max-window cap; do not update hot-path hashes until matching. |
| Draining/appending and buffer recycle | zstd/src/encoding/simple/fast_matcher.rs | Implement drain_real_prefix, prefix rebasing, hash rehash/clear for retained tail, extend_history_with_pending, take_recycled_space. |
| start_matching and emission wiring | zstd/src/encoding/simple/fast_matcher.rs | Append pending into history, dispatch compress_block_fast::<MLS>, forward Sequence emissions while updating offset_hist for Triple emissions (skip zero-literal triples), persist kernel rep, and emit terminal Literals. |
| skip_matching_with_hint, prime_offset_history, trim_to_window | zstd/src/encoding/simple/fast_matcher.rs | Implement skip_matching_with_hint (no kernel run; optional hash prepopulation for priming), prime_offset_history to sync wire and kernel rep/history, history_len_for_eviction_accounting, and trim_to_window eviction helper. |
| Dictionary-priming hash prepopulation | zstd/src/encoding/simple/fast_matcher.rs | Implement prime_hash_table_for_range with HASH_READ_SIZE bounds and MLS-specialized hash insertion via unsafe reads and hash_ptr/put. |
| FastKernelMatcher tests | zstd/src/encoding/simple/fast_matcher.rs | Add comprehensive unit/regression tests for init/reset, matching/skip semantics, dict-priming, eviction/trim invariants, rep/offset sync, and oversized-block invariants. |
| simple/mod.rs module restructuring | zstd/src/encoding/simple/mod.rs | Replace prior in-file Simple implementation with module routing for fast_kernel + fast_matcher; remove legacy SuffixStore, WindowEntry, MatchGenerator. |
| MatchGeneratorDriver imports and struct layout | zstd/src/encoding/match_generator.rs | Update imports to donor Fast constants and FastKernelMatcher; change MatcherStorage::Simple to hold FastKernelMatcher; remove suffix_pool field. |
| MatchGeneratorDriver constructor and accessors | zstd/src/encoding/match_generator.rs | Rewrite MatchGeneratorDriver::new to validate inputs, compute initial window size, initialize Simple via FastKernelMatcher::with_params, and add simple_mut() and recycle_simple_space(). |
| Matcher::reset backend wiring for Simple | zstd/src/encoding/match_generator.rs | Adjust backend-switch behavior to drop legacy pre-drain for Simple, construct Simple with FastKernelMatcher::with_params, and call FastKernelMatcher::reset(...) on resets. |
| Driver per-operation integrations | zstd/src/encoding/match_generator.rs | get_last_space returns last_committed_space(); commit_space computes evicted bytes via history_len_for_eviction_accounting delta around accept_data; skip_matching_with_hint calls matcher skip and recycles buffers; prime_with_dictionary calls prime_offset_history. |
| Commit-time eviction and hot path changes | zstd/src/encoding/match_generator.rs | Eviction uses FastKernelMatcher::trim_to_window and accounting deltas; per-block matching uses a single start_matching call then recycles cleared history buffers (remove legacy per-sequence loop). |
| Documentation and dead-code markers | zstd/src/encoding/match_generator.rs | Remove FAST_HASH_FILL_STEP re-export and annotate LevelParams.hash_fill_step as retained/dead-code for future use. |
| Test adjustments | zstd/src/encoding/match_generator.rs | Disable many legacy SuffixStore/MatchGenerator tests with #[cfg(any())] and relax a regression test by removing the mandatory Triple emission assertion while preserving reconstruction correctness. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

"🐰 I munched old suffix vines, then leapt—
FastKernel sprouted, neat and apt.
Hashes hum, history stays bright,
Blocks match swiftly through the night.
Hooray — the simple path takes flight!"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly describes the main change: wiring the donor-shaped Fast kernel into MatchGeneratorDriver as phase 1b of issue #198. |
| Linked Issues check | ✅ Passed | The PR successfully implements the core objectives from #198 phase 1b: replaces SuffixStore with Vec hash table, implements donor-style matching loop, integrates compress_block_fast kernel, and achieves significant speedups while maintaining correctness. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the phase 1b integration: matcher refactoring, fast_matcher.rs module creation, and MatchGeneratorDriver updates. No unrelated changes detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@codecov

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 93.41737% with 47 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|--------------------------|--------:|-------|
| zstd/src/encoding/simple/fast_matcher.rs | 92.29% | 46 Missing ⚠️ |
| zstd/src/encoding/match_generator.rs | 99.14% | 1 Missing ⚠️ |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/encoding/match_generator.rs (1)

598-613: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the initial reported window in sync with the instantiated matcher.

FastKernelMatcher is constructed from next_pow2, but reported_window_size keeps the unrounded max_window_size. For non-power-of-two constructor inputs, window_size() will report a smaller window than the active Simple backend actually has until the first reset().

Suggested fix
         Self {
             vec_pool: Vec::new(),
             storage: MatcherStorage::Simple(FastKernelMatcher::with_params(
                 window_log_init,
                 FAST_LEVEL_1_HASH_LOG,
                 FAST_LEVEL_1_MLS,
             )),
             strategy_tag: super::strategy::StrategyTag::Fast,
             slice_size,
             base_slice_size: slice_size,
-            reported_window_size: max_window_size,
+            reported_window_size: next_pow2,
             dictionary_retained_budget: 0,
             source_size_hint: None,
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/match_generator.rs` around lines 598 - 613, The
reported_window_size is set to the unrounded max_window_size while the Simple
backend (FastKernelMatcher) is initialized with the next power-of-two
(next_pow2/window_log_init), causing a mismatch; update the MatchGeneratorDriver
construction so reported_window_size is set to the rounded window (use next_pow2
or (1u64 << window_log_init)) to match the instantiated FastKernelMatcher
(constructor call FastKernelMatcher::with_params and field
reported_window_size).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 920-944: The Simple matcher instantiation is hardcoding donor
level-1 constants instead of using the resolved fast-level tuning computed by
resolve_level_params(); update the
MatcherStorage::Simple(FastKernelMatcher::with_params(...)) call(s) to pass the
resolved fast parameters from the params struct (e.g. use params.window_log and
the resolved params.hash_log / params.mls and also thread params.hash_fill_step
or any other fast-level knobs) instead of FAST_LEVEL_1_HASH_LOG and
FAST_LEVEL_1_MLS so CompressionLevel::Fastest, Level(1) and negative levels get
the per-level tuning; apply the same change to the other occurrences around
FastKernelMatcher::with_params to ensure the backend swap and reset paths both
use the resolved fast-level fields.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1fa461d0-9e51-40d8-b146-eae647c895a8

📥 Commits

Reviewing files that changed from the base of the PR and between fc63464 and d5056b6.

📒 Files selected for processing (3)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/simple/fast_matcher.rs
  • zstd/src/encoding/simple/mod.rs

Comment thread zstd/src/encoding/match_generator.rs Outdated
Copilot AI left a comment


Pull request overview

Wires the donor-shape Fast (ZSTD_fast) kernel into the production encoder path by replacing the legacy SuffixStore-based Simple matcher with FastKernelMatcher, and updating MatchGeneratorDriver to drive the new per-block matcher lifecycle.

Changes:

  • Replaces the Simple backend’s legacy MatchGenerator implementation with FastKernelMatcher and routes commit_space/start_matching/skip_matching_with_hint through it.
  • Simplifies encoding/simple/mod.rs to module declarations + updated backend documentation, removing the legacy matcher implementation.
  • Adds a new encoding/simple/fast_matcher.rs implementing the Simple/Fast-strategy matcher state machine (history buffer, hash table, eviction, dict priming, buffer recycling) around compress_block_fast.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| zstd/src/encoding/simple/mod.rs | Removes the legacy Simple matcher implementation; leaves module wiring and updated backend docs. |
| zstd/src/encoding/simple/fast_matcher.rs | Introduces FastKernelMatcher, managing flat history + flat hash table and adapting the donor kernel to the driver’s matcher lifecycle. |
| zstd/src/encoding/match_generator.rs | Rewires MatcherStorage::Simple and MatchGeneratorDriver to use FastKernelMatcher, removes suffix-store pooling, and updates reset/commit/start/skip plumbing accordingly. |

Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
…217 review round 1)

Four findings from CodeRabbit + Copilot's first review pass on the
squashed PR #217. All four touch comment / metadata accuracy or
a single-line correctness issue; no behavioural change beyond the
window_size sync.

**CR outside-diff (match_generator.rs:598-613)** — correctness:
`reported_window_size` was using the unrounded `max_window_size`
while the matcher itself was constructed from `next_pow2` (rounded
up). For non-power-of-two constructor inputs (e.g. `slice_size *
max_slices_in_window = 100_000`), `window_size()` would report
100_000 (the unrounded product) while the active backend actually carried
131_072 (rounded-up next_pow2). The drift held until the first
`reset()` overwrote both sides from LevelParams. Fix: report
`next_pow2` so the two stay in lockstep at construction time.

**CR #1 (match_generator.rs:944)** — declined, deferred to phase 3:
Fast levels (Uncompressed, Fastest, Level(-7..=1)) all hard-code
donor level-1 cParams. The acceleration gradient between negative-
level fast modes and Level(1) lands when phase 3 ports donor's
4-cursor lookahead + cmov match-found + per-level kSearchStrength
dispatch (issue #198 items 2/3/5). Updated the inline code
comment to scrub the closed-PR reference and frame the deferral
against phase 3 directly.

**Copilot #2 (match_generator.rs:572)** — doc: the validation-guard
comment described `next_power_of_two` returning 0 on overflow.
That is only the release-build behaviour; debug builds panic. We
now use `checked_next_power_of_two` (commit landed in the squash),
which returns `None` on overflow in either profile. Rewrote the
comment to enumerate the three actual failure modes (zero args,
mul overflow, next-pow2 overflow) and the three guards that
catch them.

**Copilot #3 (fast_matcher.rs:303)** — doc: `last_committed_space`'s
pre-`accept_data` state description claimed `last_block_start = 0
/ history.len() = 0`, but post-RESERVED_PREFIX_BYTES-seed
construction (#216 / phase 1b) leaves both at
`RESERVED_PREFIX_BYTES`. The returned slice is still empty (the
`history[last_block_start..]` range is empty), just for a
different reason. Updated the doc to reflect the seeded-dummy
invariant.

573/573 tests pass; clippy clean.
@polaz
Member Author

polaz commented May 21, 2026

@coderabbitai Re: match_generator.rs:598-613 outside-diff (reported_window_size) — fixed in commit 60a0076. Now reports next_pow2 so it matches the actual matcher's window from construction onward. The drift was a real bug for non-power-of-two slice_size × max_slices_in_window products.

@coderabbitai

coderabbitai Bot commented May 21, 2026

@polaz: Thanks for the confirmation! Using next_pow2 to align reported_window_size with the matcher's actual construction-time window is the right fix — without it, any caller inspecting the reported size for a non-power-of-two slice_size × max_slices_in_window product would see a value that diverges from what FastKernelMatcher is actually configured with, which could cause subtle mismatch bugs downstream (e.g. incorrect eviction accounting or window-bound checks). Good catch and clean resolution.

… compactness rule)

Doc-only change: collapse the verbose multi-paragraph
last_committed_space docstring (added during PR #216 review
rounds) into a 6-line three-bullet form. Same semantic content,
zero narrative.

Apply the new docstring compactness rule (one-two phrases default,
multi-paragraph only for non-obvious invariants).

573/573 tests pass; clippy clean.
Copilot AI review requested due to automatic review settings May 21, 2026 12:21

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
…RVED_PREFIX_BYTES seed (#217 Copilot #4, #5)

Doc-only update aligning the dict-prime boundary test
comment + inline note with the RESERVED_PREFIX_BYTES seed
(post-phase-1b range is [RESERVED..=RESERVED], not [0..=0]).
Compactness rule applied — concise two-block form.

573/573 tests pass; clippy clean.
@polaz
Member Author

polaz commented May 21, 2026

@coderabbitai Re: match_generator.rs:598-613 outside-diff (reported_window_size) — stale. Fixed in commit 60a0076 (reported_window_size: next_pow2). Current code already matches your suggested diff.

@coderabbitai

coderabbitai Bot commented May 21, 2026

@polaz: Confirmed — the reported_window_size: next_pow2(...) fix is already live in commit 60a00764 and the current code matches the suggested diff. I'll mark this comment as resolved.

✅ Actions performed

Comments resolved. Approval is disabled; enable reviews.request_changes_workflow to allow explicit top-level @coderabbitai resolve or @coderabbitai approve commands.


Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
… bumps (#217 Copilot #6, #7)

Two doc comments described prefix_start_index as 'bumped forward'
or 'advances' as history is evicted, implying a monotonic absolute
index. Actual code (drain_real_prefix) rebases it back to
RESERVED_PREFIX_BYTES on every drain — the retained tail is
re-indexed in the new coordinate space.

Updated both sites (struct field doc + trim_to_window header) to
match. Compactness rule applied — trim_to_window header collapsed
from 10 lines to 5.
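
The rebasing behaviour described above can be modelled in a few lines. This is a sketch under assumptions: `HistoryModel` and `push` are hypothetical, the constant's value is assumed, and the real `drain_real_prefix` would presumably also have to rebase hash-table entries, which is omitted here.

```rust
/// Assumed value for illustration only.
const RESERVED_PREFIX_BYTES: usize = 1;

/// Hypothetical model of the history buffer and its prefix index.
struct HistoryModel {
    bytes: Vec<u8>,
    prefix_start_index: usize,
}

impl HistoryModel {
    fn new() -> Self {
        Self {
            bytes: vec![0; RESERVED_PREFIX_BYTES],
            prefix_start_index: RESERVED_PREFIX_BYTES,
        }
    }

    fn push(&mut self, data: &[u8]) {
        self.bytes.extend_from_slice(data);
    }

    /// Drop up to `n` of the oldest real bytes while keeping the
    /// reserved prefix. The retained tail is re-indexed, so the prefix
    /// index is reset to RESERVED_PREFIX_BYTES instead of growing.
    fn drain_real_prefix(&mut self, n: usize) {
        let start = RESERVED_PREFIX_BYTES;
        let end = start + n.min(self.bytes.len() - start);
        self.bytes.drain(start..end);
        self.prefix_start_index = RESERVED_PREFIX_BYTES;
    }
}
```

After any number of drains the index is the same constant, which is why describing it as monotonically advancing was misleading.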

573/573 tests pass; clippy clean.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
… TODO marker (#217 Copilot #8, #9)

#8 — header for FAST_LEVEL_1_HASH_LOG said 'reset path rebinds
hash_log proportionally on source-size hint'. Untrue today: driver
passes only window_log per-level, hash_log + mls hard-coded. Pin
per-level scaling to phase 3.

#9 — same lit_len=0 / back-to-back-rep1 concern as the prior PR's
#21. Inline + module docs already explain, but verbose prose isn't
anchored. Collapsed inline comment to a single
'// TODO(#198 phase 3):' line so the deferral marker is
unambiguous.

573/573 tests pass; clippy clean.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
zstd/src/encoding/simple/fast_matcher.rs (2)

327-331: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on duplicate accept_data calls.

In release builds this debug_assert! disappears, and Line 370 will silently replace self.pending, dropping an entire committed block. This protocol violation should hard-fail instead of corrupting the stream state.

Suggested change
-        debug_assert!(
+        assert!(
             self.pending.is_none(),
             "FastKernelMatcher: accept_data called with a still-pending buffer; \
              the driver must run start_matching / skip_matching between commits",
         );

Also applies to: 370-370

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/simple/fast_matcher.rs` around lines 327 - 331, Replace the
non-failing debug_assert! in FastKernelMatcher::accept_data with a runtime check
that hard-fails on protocol violation: if self.pending.is_some() then panic!
with the same message (or use assert!), so duplicate accept_data calls cannot
silently overwrite self.pending; also add the same runtime check at the other
site where self.pending is being replaced (the code that currently overwrites
pending) to ensure any attempt to replace an existing pending buffer fails fast
rather than dropping committed data. Ensure messages reference the protocol
requirement to run start_matching / skip_matching between commits.

350-363: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce the oversized-block precondition here.

If space.len() > cap, retain_real becomes 0, but the full block is still appended later. That violates this module’s advertised post-append <= 2 * max_window_size bound and leaves the matcher outside its documented window discipline.

Suggested guard
         let real_len = self.history.len().saturating_sub(RESERVED_PREFIX_BYTES);
         let new_real_total = real_len.saturating_add(space.len());
         let cap = self.max_window_size.saturating_mul(2);
+        assert!(
+            space.len() <= cap,
+            "FastKernelMatcher requires block_size <= 2 * max_window_size \
+             (block_size={}, cap={})",
+            space.len(),
+            cap,
+        );
         if new_real_total > cap {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/simple/fast_matcher.rs` around lines 350 - 363, The code
computes retain_real using cap and space.len() but never enforces the
precondition that the incoming block fits the cap; if space.len() > cap we must
not append the full block because that breaks the module invariant. In
fast_matcher.rs, before computing or using retain_real (and before appending the
incoming block), add a guard that detects space.len() > cap and handles it
explicitly (e.g., early-return an error/result indicating an oversized block or
otherwise reject/truncate the input per the module’s API) so the matcher never
ends up with real_len + space.len() > 2 * self.max_window_size; reference
variables: retain_real, space.len(), cap, self.max_window_size.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_matcher.rs`:
- Around line 517-523: The call to encode_offset_with_history(...)
unconditionally rotates offset_hist on lit_len == 0 emits and breaks the rep ↔
offset_hist invariant; update the fast matcher to detect zero-literal
(literals.len() == 0) rep1 emits and either collapse/skip the
encode_offset_with_history call for those cases or explicitly keep self.rep and
offset_hist consistent (e.g., perform the same rotation on self.rep or avoid
mutating offset_hist) before proceeding; reference the
encode_offset_with_history call and the self.rep / offset_hist state in
fast_matcher.rs to implement the guard or state-sync logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 34185c8b-d8d0-4c6a-b193-cf50284a5288

📥 Commits

Reviewing files that changed from the base of the PR and between c1c6a9e and 4e497d8.

📒 Files selected for processing (1)
  • zstd/src/encoding/simple/fast_matcher.rs

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
…n + hoist mls dispatch (#217 round 5)

Four findings from the latest CR + Copilot review pass.

**CR outside-diff (327-331) — assert!, not debug_assert!:** the
duplicate-pending guard in accept_data was debug_assert!, so in
release builds a double-commit would silently overwrite pending
and drop a committed block. Hard-fail instead.

**CR outside-diff (358-363) — block_size <= 2 × max_window_size:**
the eviction math computes retain_real via saturating_sub, which
silently collapses to 0 if space.len() > cap. The full block is
still appended afterwards, violating the documented
'history bounded by 2 × max_window_size' invariant. Added an
assert! precondition so callers see a clear panic at the boundary
instead of an invisible invariant break.
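
A minimal sketch of the two hard-fail guards, assuming a simplified slot type: `PendingSlot`, `new`, and `take_pending` are hypothetical names, and the messages are shortened. The point it illustrates is that `assert!` stays active in release builds, while `debug_assert!` is compiled out there.

```rust
/// Hypothetical stand-in for the matcher's pending-block slot.
struct PendingSlot {
    pending: Option<Vec<u8>>,
    max_window_size: usize,
}

impl PendingSlot {
    fn new(max_window_size: usize) -> Self {
        Self { pending: None, max_window_size }
    }

    fn accept_data(&mut self, space: Vec<u8>) {
        // Protocol guard: a second commit must not overwrite a block
        // that has not been matched or skipped yet.
        assert!(
            self.pending.is_none(),
            "accept_data called with a still-pending buffer",
        );
        // Precondition guard: an oversized block would break the
        // "history bounded by 2 * max_window_size" invariant.
        let cap = self.max_window_size.saturating_mul(2);
        assert!(
            space.len() <= cap,
            "block_size {} exceeds cap {}",
            space.len(),
            cap,
        );
        self.pending = Some(space);
    }

    /// Models start_matching / skip_matching consuming the block.
    fn take_pending(&mut self) -> Option<Vec<u8>> {
        self.pending.take()
    }
}
```

Both checks run before any state is modified, so a violation leaves the slot exactly as it was.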

**CR #10 — lit_len == 0 offset_hist rotation (third raise):** the
prior round added a TODO marker explaining the divergence, but CR
correctly pushed back that documentation isn't a fix. Skip
encode_offset_with_history when literals.is_empty() so
offset_hist stays in lockstep with the kernel's unchanged rep —
no divergence on the back-to-back rep1 path. Wire encoder
downstream still sees the Triple with raw offset; its own
encoding stays correct (lit_len = 0 absolute encoding per RFC
8878 §3.1.1.5). Module docstring 'Known divergence' section
collapsed accordingly.

**Copilot #11 — hoist mls dispatch outside prime_hash loop:**
moved the per-MLS match arm OUTSIDE the per-position loop. New
prime_hash_table_impl<const MLS: u32> is monomorphised per
matcher instance; the hot path is branch-free on mls.

573/573 tests pass; clippy clean.
@polaz
Member Author

polaz commented May 21, 2026

@coderabbitai Outside-diff items fixed in a5b2f6a:

  • 327-331: debug_assert! → assert! on accept_data pending guard
  • 358-363: added assert!(space.len() <= cap) precondition before the eviction math

Both panic with clear domain-specific messages on protocol / precondition violation instead of silently corrupting state.

@coderabbitai

coderabbitai Bot commented May 21, 2026

@polaz: Confirmed — both fixes look correct:

  • 327-331: Upgrading debug_assert! → assert! ensures the pending-block guard fires in release builds too, which is appropriate since a violation here indicates a caller protocol error that should never be silently swallowed.
  • 358-363: The explicit assert!(space.len() <= cap) precondition before the eviction math gives a precise, early failure point with a meaningful message, rather than letting the arithmetic produce a silent underflow or wrong eviction count.

Good hardening — both changes make the contract-violation surface clear and immediate regardless of build profile.


Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
 Copilot #12, revert round 5 #10)

Round 5 skipped encode_offset_with_history when literals.is_empty()
to keep matcher.offset_hist 'in lockstep' with kernel rep. Copilot
correctly pushed back: Dfast / Row / HashChain matchers all call
it unconditionally (passing lit_len = 0 when applicable), and the
'lockstep' framing was wrong — matcher.offset_hist tracks the
WIRE ENCODER's history while matcher.rep tracks the KERNEL's
state. They're not supposed to be the same.

Revert the skip. Module docstring rewritten to call out that the
two fields reflect DIFFERENT state and may diverge on lit_len = 0
emits per RFC 8878 §3.1.1.5; both halves stay self-consistent
within their own domain.

Round 5 #10 reply to CR was based on the wrong mental model;
that's reflected in this fix by reverting to the donor /
other-matchers behaviour.
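
For readers unfamiliar with why a zero literal length matters, here is a decoder-side sketch of the repeat-offset rule from the Zstandard format (RFC 8878, "Repeat Offsets"). It is not the crate's `encode_offset_with_history`; the function name `resolve_offset` and its signature are hypothetical.

```rust
/// Resolve one sequence's `offset_value` against the three-entry
/// repeat-offset history and update that history, decoder side.
/// The format's initial history is [1, 4, 8].
pub fn resolve_offset(offset_value: u32, lit_len: u32, hist: &mut [u32; 3]) -> u32 {
    assert!(offset_value != 0, "offset_value 0 is not a valid code");
    if offset_value > 3 {
        // Regular offset: it becomes the newest history entry.
        let actual = offset_value - 3;
        *hist = [actual, hist[0], hist[1]];
        return actual;
    }
    // Repeat codes 1..=3. With a zero literal length the meaning is
    // shifted by one: 1 -> rep2, 2 -> rep3, 3 -> rep1 - 1.
    let idx = if lit_len == 0 { offset_value } else { offset_value - 1 };
    match idx {
        0 => hist[0], // rep1: history unchanged
        1 => {
            hist.swap(0, 1); // rep2 moves to the front
            hist[0]
        }
        2 => {
            let actual = hist[2]; // rep3 moves to the front
            *hist = [actual, hist[0], hist[1]];
            actual
        }
        _ => {
            let actual = hist[0] - 1; // rep1 - 1, treated as a new offset
            assert!(actual != 0, "offset 0 indicates corrupt input");
            *hist = [actual, hist[0], hist[1]];
            actual
        }
    }
}
```

The history changes on every code except a plain rep1 with a nonzero literal length, which is one way to see how a wire-side history and a kernel-side rep value can legitimately differ after a `lit_len = 0` emit.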

573/573 tests pass; clippy clean.
Copilot AI review requested due to automatic review settings May 21, 2026 15:05

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 6d9ac63 into main May 21, 2026
23 checks passed
@polaz polaz deleted the perf/#198-fast-matcher-port-phase1b branch May 21, 2026 15:22
polaz added a commit that referenced this pull request May 21, 2026
… + per-level cParams + cmov dispatch + window-correctness (#198 phase 3)

Closes phase 3 of #198: native Rust port of donor's
`ZSTD_compressBlock_fast_noDict_generic` (`zstd_fast.c:192-456`)
restoring Fast-strategy throughput after the phase 1b regression
and aligning encoder behaviour with donor parity on every front
that affects ratio / speed / format correctness.

## Headline results

`large-log-stream` (25 MiB dense corpus, i9-9900K):

| State                        | Time vs main | Throughput     |
|------------------------------|-------------:|---------------:|
| main (pre-phase-1b)          |      0%      | ~290 MiB/s     |
| #217 merged (phase 1b)       |    +122%     | ~235 MiB/s     |
| **This PR (phase 3, M1-M9)** |    **−22%**  | **~790 MiB/s** |

Net change phase 1b → phase 3: ~3.4× faster.

## Ratio audit on `decodecorpus-z000033`

- L-7..L-1 Fast: monotone acceleration gradient restored
  (−1.85% to −4.79% vs donor) — pre-M6 these levels all produced
  identical 585601 bytes
- L2-3 Dfast: parity or slight win (−0.94% to −4.61%)
- L5-15 Lazy: consistently beat donor by −5.7% to −6.18%
- L16-17 btopt: parity / win (−0.33% to −3.59%)
- L20-22 btultra2: parity (±0.2%)
- L1 Fast: +7.43% residual gap — tracked as follow-up #220

## Milestones (squashed)

- M1: per-level `fast_hash_log`/`fast_mls`/`fast_step_size`
  threading through `LevelParams`
- M2: full 4-cursor `ip0/ip1/ip2/ip3` lookahead body +
  immediate-rep2 inner loop ported from `zstd_fast.c`
- M3: `cmov` match-found variant + per-window dispatch surface
  (10 monomorphisations across `mls` 4..=8 × `use_cmov` true/false)
- M4: beyond-donor `fast_hash_log: 13 → 14` for negative levels
  (+32 KB memory, 2× fewer collisions on structured corpora)
- M5: reverted (adaptive mls peek did not pay off)
- M6: per-level `fast_step_size` from donor's
  `targetLength = -level` formula; restores acceleration gradient
- M7: added donor's missing `current0+2` hash insertion after
  each match emit (`zstd_fast.c:407`); raised L1/decodecorpus
  sequence-match-rate 43.1% → 57.7%
- M8: dropped `RESERVED_PREFIX_BYTES` dummy byte; history layout
  now donor-parity, sentinel-0 protection via
  `INITIAL_PREFIX_START_INDEX = 1` filter
- M9 (format correctness): sliding prefix floor at scan time
  enforces the advertised frame window (`1 << window_log`), NOT
  the dictionary-budget-inflated `max_window_size`. Prevents
  emitting offsets > advertised window during dictionary-primed
  compression — would otherwise produce format-invalid frames.
  Regression test included.
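
The M9 bound can be expressed as a small predicate. This is a sketch, not the kernel's actual scan-time check: the function name and the use of absolute positions are assumptions for illustration.

```rust
/// Hypothetical check: may a match at `candidate` be emitted from
/// position `cur` without exceeding the advertised frame window
/// (`1 << window_log`)? Positions are absolute indices into the
/// history; `window_log` is assumed smaller than the pointer width.
pub fn offset_within_advertised_window(cur: usize, candidate: usize, window_log: u32) -> bool {
    let window = 1usize << window_log;
    // A match must point strictly backwards and reach no further
    // than one advertised window.
    candidate < cur && cur - candidate <= window
}
```

The check deliberately uses the advertised window rather than any larger internal budget, since a decoder is only required to retain `1 << window_log` bytes.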

## API changes (driver)

- `FastKernelMatcher::with_params(window_log, hash_log, mls, step_size)`
  — step_size promoted to 4th positional arg; previous `set_step_size`
  post-init setter removed (closed the gap where new code paths
  could silently fall back to default step_size=2)
- `FastKernelMatcher::reset(window_log, hash_log, mls, step_size)`
  — same signature change
- `RESERVED_PREFIX_BYTES`: 1 → 0 (no dummy region); legacy name
  retained for drain-offset math
- New `INITIAL_PREFIX_START_INDEX = 1` constant for sentinel-0
  filter baseline

## Kernel structure

- `match_found<USE_CMOV>` — branchless cmov / branch dispatch via
  const generic; `# Safety` documented; bitwise `&` intentional
  to preserve cmov ordering (donor `__asm__("")` equivalent)
- `MatchFound::{Rep,Explicit}` variants carry explicit `current0`
  (donor's writeback position) — correct on both probe paths,
  pre-backward-extension
- `ip3 > ilimit` exit (not `>=`) — recovers the last hashable
  position at end-of-block
- Two hash insertions post-match-emit: at `current0+2` (donor
  zstd_fast.c:407, was missing) and `ip0-2`
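
The const-generic dispatch between a branchless and a branching probe might look like the following safe-Rust sketch. The signature, the `read4` helper, and the fallback-to-`cur` trick are assumptions; the real `match_found` is documented as unsafe and may differ, and whether the compiler actually emits a `cmov` is not guaranteed by this code.

```rust
/// Read four bytes at `pos` as a little-endian u32.
/// Assumes pos + 4 <= data.len().
fn read4(data: &[u8], pos: usize) -> u32 {
    u32::from_le_bytes([data[pos], data[pos + 1], data[pos + 2], data[pos + 3]])
}

/// Hypothetical 4-byte match probe, selected at compile time.
pub fn match_found<const USE_CMOV: bool>(
    data: &[u8],
    cur: usize,
    cand: usize,
    low_limit: usize,
) -> bool {
    if USE_CMOV {
        // Select a load index that is always valid, then combine the
        // window test and the comparison with bitwise `&` so both
        // operands are evaluated without short-circuiting.
        let in_window = cand >= low_limit;
        let safe = if in_window { cand } else { cur };
        in_window & (read4(data, safe) == read4(data, cur))
    } else {
        // Branching variant: skip the load when out of window.
        cand >= low_limit && read4(data, cand) == read4(data, cur)
    }
}
```

When the candidate is out of window the branchless variant compares `cur` with itself, which is trivially equal, so the `in_window` term is what prevents a false positive.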

## Tests

- 577 nextest, all passing
- New per-level dispatch test pinning hash_log / mls / step_size
  for L1, Fastest, Uncompressed, L-1..L-7
- cmov vs branch byte-for-byte equality + cmov out-of-window
  false-positive regression
- explicit_match_backward_extension deterministic via marker-byte
  layout (asserts match_len ≥ 5 + literals don't end with marker)
- start_matching_enforces_max_window_size_offset_bound
- start_matching_caps_offsets_at_window_log_not_inflated_max
  (M9 format-correctness regression)
- cross_validation FFI roundtrips on every level 1..=22 (dict +
  no-dict) — wire-format interop preserved

## Out of scope (follow-ups)

- #220 — Fast L1 +7.43% ratio residual on decodecorpus
- L18-19 btultra +3.5% — separate ratio investigation

Development

Successfully merging this pull request may close these issues.

perf(fast): port donor ZSTD_compressBlock_fast — 22× regression on Fast strategy

2 participants