Skip to content

perf(encoder): port donor ZSTD_compressBlock_fast — 4-cursor + per-level cParams + cmov + window-correctness (#198 phase 3)#221

Closed
polaz wants to merge 16 commits into
mainfrom
perf/#198-fast-matcher-phase3
Closed

perf(encoder): port donor ZSTD_compressBlock_fast — 4-cursor + per-level cParams + cmov + window-correctness (#198 phase 3)#221
polaz wants to merge 16 commits into
mainfrom
perf/#198-fast-matcher-phase3

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 21, 2026

Summary

Native Rust port of donor's ZSTD_compressBlock_fast_noDict_generic (zstd_fast.c:192-456) — closes phase 3 of #198. Restores Fast-strategy throughput after the phase 1b regression and aligns encoder behaviour with donor parity on every front that affects ratio / speed / format correctness.

Squashed from PR #219 (41 review threads addressed across 10 rounds) into a single coherent commit.

Headline results

large-log-stream (25 MiB dense corpus, i9-9900K):

State Time vs main Throughput
main (pre-phase-1b) 0% ~290 MiB/s
#217 merged (phase 1b) +122% ~235 MiB/s
This PR −22% ~790 MiB/s

Net change phase 1b → phase 3: ~3.4× faster.

Ratio audit on decodecorpus-z000033

  • L-7..L-1 Fast: monotone acceleration gradient restored (−1.85% to −4.79% vs donor)
  • L2-3 Dfast: parity or slight win (−0.94% to −4.61%)
  • L5-15 Lazy: consistently beat donor by −5.7..−6.18%
  • L16-17 btopt: parity / win (−0.33% to −3.59%)
  • L20-22 btultra2: parity (±0.2%)
  • L1 Fast: +7.43% residual gap → follow-up perf(encoder): Fast L1 +7.43% ratio gap vs donor on decodecorpus-z000033 #220

Milestones (squashed from PR #219)

  • M1 — per-level fast_hash_log/fast_mls/fast_step_size threading through LevelParams
  • M2 — full 4-cursor ip0/ip1/ip2/ip3 lookahead body + immediate-rep2 inner loop
  • M3cmov match-found variant + per-window dispatch (10 monomorphisations: mls 4..=8 × use_cmov true/false)
  • M4 — beyond-donor fast_hash_log: 13 → 14 for negative levels (+32 KB memory, 2× fewer collisions on structured corpora)
  • M5 — reverted (adaptive mls peek did not pay off)
  • M6 — per-level fast_step_size from donor's targetLength = -level formula; restores acceleration gradient
  • M7 — added donor's missing current0+2 hash insertion after each match emit (zstd_fast.c:407); raised L1/decodecorpus sequence-match-rate 43.1% → 57.7%
  • M8 — dropped RESERVED_PREFIX_BYTES dummy byte; history layout now donor-parity, sentinel-0 protection via INITIAL_PREFIX_START_INDEX = 1 filter
  • M9 — sliding prefix floor enforces the advertised frame window (1 << window_log), NOT the dictionary-budget-inflated max_window_size. Prevents emitting offsets > advertised window during dictionary-primed compression — would otherwise produce format-invalid frames.

API changes

  • FastKernelMatcher::with_params(window_log, hash_log, mls, step_size) — step_size promoted to 4th positional arg
  • FastKernelMatcher::reset(...) — same signature change
  • set_step_size post-init setter removed (closed gap where new code paths could silently fall back to step_size=2)
  • RESERVED_PREFIX_BYTES: 1 → 0; new INITIAL_PREFIX_START_INDEX = 1 constant

Tests

  • 577 nextest, all passing (+13 new tests covering per-level dispatch, cmov vs branch parity, cmov out-of-window false-positive, backward extension, sliding-floor enforcement on inflated max_window_size)
  • cross_validation FFI roundtrips on every level 1..=22 (dict + no-dict) — wire-format interop preserved
  • clippy clean

Out of scope (follow-ups)

Closes #198 phase 3.

Summary by CodeRabbit

  • Bug Fixes

    • Stronger window-bound enforcement and out-of-window match rejection so emitted offsets never exceed the advertised decoder window.
  • Refactor

    • Per-level fast tuning (including explicit step-size) exposed and threaded through match engines and driver.
    • History/sentinel layout simplified to remove reserved dummy bytes and tighten hash/drain invariants.
    • Inner compression loop reorganized into a donor-style pipeline with explicit step handling and safer bounds checks.
  • New Features

    • Optional branchless (cmov) match path with parity validation versus the branch variant.
  • Tests

    • New unit and regression tests for per-level tuning, driver wiring, cmov parity, match-found correctness, and window-bound behavior.

Review Change Stack

… + per-level cParams + cmov dispatch + window-correctness (#198 phase 3)

Closes phase 3 of #198: native Rust port of donor's
`ZSTD_compressBlock_fast_noDict_generic` (`zstd_fast.c:192-456`)
restoring Fast-strategy throughput after the phase 1b regression
and aligning encoder behaviour with donor parity on every front
that affects ratio / speed / format correctness.

## Headline results

`large-log-stream` (25 MiB dense corpus, i9-9900K):

| State                        | Time vs main | Throughput     |
|------------------------------|-------------:|---------------:|
| main (pre-phase-1b)          |      0%      | ~290 MiB/s     |
| #217 merged (phase 1b)       |    +122%     | ~235 MiB/s     |
| **This PR (phase 3, M1-M9)** |    **−22%**  | **~790 MiB/s** |

Net change phase 1b → phase 3: ~3.4× faster.

## Ratio audit on `decodecorpus-z000033`

- L-7..L-1 Fast: monotone acceleration gradient restored
  (−1.85% to −4.79% vs donor) — pre-M6 these levels all produced
  identical 585601 bytes
- L2-3 Dfast: parity or slight win (-0.94% to -4.61%)
- L5-15 Lazy: consistently beat donor by −5.7% to −6.18%
- L16-17 btopt: parity / win (−0.33% to −3.59%)
- L20-22 btultra2: parity (±0.2%)
- L1 Fast: +7.43% residual gap — tracked as follow-up #220

## Milestones (squashed)

- M1: per-level `fast_hash_log`/`fast_mls`/`fast_step_size`
  threading through `LevelParams`
- M2: full 4-cursor `ip0/ip1/ip2/ip3` lookahead body +
  immediate-rep2 inner loop ported from `zstd_fast.c`
- M3: `cmov` match-found variant + per-window dispatch surface
  (10 monomorphisations across `mls` 4..=8 × `use_cmov` true/false)
- M4: beyond-donor `fast_hash_log: 13 → 14` for negative levels
  (+32 KB memory, 2× fewer collisions on structured corpora)
- M5: reverted (adaptive mls peek did not pay off)
- M6: per-level `fast_step_size` from donor's
  `targetLength = -level` formula; restores acceleration gradient
- M7: added donor's missing `current0+2` hash insertion after
  each match emit (`zstd_fast.c:407`); raised L1/decodecorpus
  sequence-match-rate 43.1% → 57.7%
- M8: dropped `RESERVED_PREFIX_BYTES` dummy byte; history layout
  now donor-parity, sentinel-0 protection via
  `INITIAL_PREFIX_START_INDEX = 1` filter
- M9 (format correctness): sliding prefix floor at scan time
  enforces the advertised frame window (`1 << window_log`), NOT
  the dictionary-budget-inflated `max_window_size`. Prevents
  emitting offsets > advertised window during dictionary-primed
  compression — would otherwise produce format-invalid frames.
  Regression test included.

## API changes (driver)

- `FastKernelMatcher::with_params(window_log, hash_log, mls, step_size)`
  — step_size promoted to 4th positional arg; previous `set_step_size`
  post-init setter removed (closed the gap where new code paths
  could silently fall back to default step_size=2)
- `FastKernelMatcher::reset(window_log, hash_log, mls, step_size)`
  — same signature change
- `RESERVED_PREFIX_BYTES`: 1 → 0 (no dummy region); legacy name
  retained for drain-offset math
- New `INITIAL_PREFIX_START_INDEX = 1` constant for sentinel-0
  filter baseline

## Kernel structure

- `match_found<USE_CMOV>` — branchless cmov / branch dispatch via
  const generic; `# Safety` documented; bitwise `&` intentional
  to preserve cmov ordering (donor `__asm__("")` equivalent)
- `MatchFound::{Rep,Explicit}` variants carry explicit `current0`
  (donor's writeback position) — correct on both probe paths,
  pre-backward-extension
- `ip3 > ilimit` exit (not `>=`) — recovers the last hashable
  position at end-of-block
- Two hash insertions post-match-emit: at `current0+2` (donor
  zstd_fast.c:407, was missing) and `ip0-2`

## Tests

- 577 nextest, all passing
- New per-level dispatch test pinning hash_log / mls / step_size
  for L1, Fastest, Uncompressed, L-1..L-7
- cmov vs branch byte-for-byte equality + cmov out-of-window
  false-positive regression
- explicit_match_backward_extension deterministic via marker-byte
  layout (asserts match_len ≥ 5 + literals don't end with marker)
- start_matching_enforces_max_window_size_offset_bound
- start_matching_caps_offsets_at_window_log_not_inflated_max
  (M9 format-correctness regression)
- cross_validation FFI roundtrips on every level 1..=22 (dict +
  no-dict) — wire-format interop preserved

## Out of scope (follow-ups)

- #220 — Fast L1 +7.43% ratio residual on decodecorpus
- L18-19 btultra +3.5% — separate ratio investigation
Copilot AI review requested due to automatic review settings May 21, 2026 23:02
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4dc128a3-ed6b-4b09-85e3-e82872fee342

📥 Commits

Reviewing files that changed from the base of the PR and between 3d150f1 and e6153f2.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

📝 Walkthrough

Walkthrough

LevelParams now exposes per-level Fast tuning and resolve_level_params/LEVEL_TABLE populate them; MatchGeneratorDriver threads fast_* into FastKernelMatcher::with_params/reset; FastKernelMatcher stores step_size/use_cmov and removes the dummy-prefix model; compress_block_fast becomes a 4-cursor donor pipeline with a USE_CMOV const generic; tests updated.

Changes

Per-level Fast kernel parameter dispatch and donor pipeline

Layer / File(s) Summary
LevelParams schema and per-level fast tuning
zstd/src/encoding/match_generator.rs
LevelParams removes hash_fill_step and adds fast_hash_log, fast_mls, fast_step_size; LEVEL_TABLE and resolve_level_params updated (Uncompressed, Fastest, negative levels, Level(22) tier); unit tests validate resolved fast params.
MatchGeneratorDriver parameter threading
zstd/src/encoding/match_generator.rs
Driver new/reset now pass per-level params.fast_* into FastKernelMatcher::with_params/reset instead of hardcoded donor-level defaults.
Fast kernel 4-cursor donor pipeline and cmov variant
zstd/src/encoding/simple/fast_kernel/kernel.rs
compress_block_fast gains const generic USE_CMOV and step_size parameter; K_STEP_INCR and CMOV_DUMMY introduced; match_found<const USE_CMOV> replaces prior helper; main loop rewritten to a 4-cursor donor pipeline with MatchFound-driven probes, backward/forward extension, and immediate rep2 inner loop; tests updated (cmov vs branch equivalence, sentinel rejection).
FastKernelMatcher state and per-level threading
zstd/src/encoding/simple/fast_matcher.rs
Matcher stores use_cmov (derived from window_log) and step_size; with_params/reset signatures accept step_size and validate it; history/prefix invariants change (HISTORY_DRAIN_BASE=0, INITIAL_PREFIX_START_INDEX=1); drain_real_prefix, eviction, and start_matching reworked; kernel dispatch selects compress_block_fast::<MLS, use_cmov> at runtime and passes step_size.
Prefix/layout and window safety tests
zstd/src/encoding/simple/fast_matcher.rs, zstd/src/encoding/simple/fast_kernel/kernel.rs
Reserved dummy prefix removed and many tests updated to the new baseline; added regression tests asserting emitted offsets are capped by the advertised frame window after dictionary priming; test harness call sites updated to new kernel signature and matcher ctor/reset.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

"I hop through hashes, four cursors in flight,
No dummy prefix now — the window's kept tight,
Step-size and cmov hum tuned per level,
Sequences march out, precise and on bevel,
A rabbit's small cheer for a kernel done right." 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: porting donor ZSTD_compressBlock_fast with 4-cursor optimization, per-level compression parameters, cmov variant, and window correctness fix. It is specific, concise, and clearly summarizes the primary contribution.
Linked Issues check ✅ Passed All primary objectives from issue #198 are met: Vec hash table [✓], 4-cursor pipeline [✓], donor-style matching [✓], backward extension [✓], cmov variant [✓], per-level parameter threading [✓], and window-correctness enforcement [✓].
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #198: match_generator.rs refactors level parameter handling, kernel.rs ports the 4-cursor fast compression logic, and fast_matcher.rs integrates per-level parameters and window-correctness fixes. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#198-fast-matcher-phase3

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Ports donor’s ZSTD_compressBlock_fast_noDict_generic behavior into the Rust Fast encoder path to restore Fast-strategy throughput and improve donor-parity correctness (including window/offset correctness under dictionary priming).

Changes:

  • Extend FastKernelMatcher to carry per-level Fast cParams (hash_log, mls, step_size) and derive use_cmov from window_log.
  • Replace the Fast kernel’s phase-1 single-cursor loop with the donor-style 4-cursor (ip0/ip1/ip2/ip3) pipeline, step acceleration, immediate-rep2 loop, and branch/cmov match-found variants via const generics.
  • Wire per-level Fast parameters through LevelParams and MatchGeneratorDriver reset/backend swap paths; add regression tests for per-level dispatch.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
zstd/src/encoding/simple/fast_matcher.rs Threads per-level params into matcher lifecycle, adds cmov + step_size plumbing, and enforces advertised-window prefix flooring for format correctness.
zstd/src/encoding/simple/fast_kernel/kernel.rs Implements donor 4-cursor fast block compressor, adds cmov/branch match-found variants, and step_size-driven acceleration + immediate-rep2 logic.
zstd/src/encoding/match_generator.rs Introduces Fast-specific per-level parameters in LevelParams, wires them into Simple backend creation/reset, and adds dispatch tests.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 91.51625% with 47 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/encoding/simple/fast_matcher.rs 79.58% 39 Missing ⚠️
zstd/src/encoding/simple/fast_kernel/kernel.rs 97.04% 8 Missing ⚠️

📢 Thoughts on this report? Let us know!

@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 21, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 8157-8204: Extend the test to actually exercise the driver wiring
by creating a MatchGeneratorDriver, for several Fast levels call the same
FastKernelMatcher::with_params(...) / m.reset(...) path used in the code (the
sites where you wire resolve_level_params into the matcher) and then assert the
inner matcher observed tuple equals (fast_hash_log, fast_mls, fast_step_size)
from resolve_level_params; in practice, for each chosen CompressionLevel call
resolve_level_params(level, None), reset the driver/matcher the same way the
production code does (the FastKernelMatcher::with_params(...) / m.reset(...)
calls), then read the matcher's effective (hash_log, mls, step_size) fields and
assert they match p.fast_hash_log, p.fast_mls, p.fast_step_size to catch any
swapping or stale values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 65027f1a-c31d-4489-837e-16337515e0bf

📥 Commits

Reviewing files that changed from the base of the PR and between 6d9ac63 and 123cf23.

📒 Files selected for processing (3)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/simple/fast_kernel/kernel.rs
  • zstd/src/encoding/simple/fast_matcher.rs

Comment thread zstd/src/encoding/match_generator.rs
polaz added 2 commits May 22, 2026 03:06
…atorDriver::reset (#221 CR nitpick)

CR noted the existing per-level dispatch test only validated
resolve_level_params() output. If with_params(...) / m.reset(...)
on the Fast wiring sites swapped fast_hash_log and fast_mls, or
kept a stale fast_step_size across resets, the existing test
would still pass — it never inspected the inner matcher.

This adds fast_levels_driver_wiring_threads_cparams_into_inner_matcher:
for every Fast level (Level(1), Fastest, Uncompressed, L-1..L-7)
the test calls Matcher::reset on a real MatchGeneratorDriver,
then reaches into the simple-backend FastKernelMatcher and
asserts (hash_log, mls, step_size) == resolve_level_params(level).

Added test-only crate getters on FastKernelMatcher
(step_size / hash_log / mls) so the assertion can read the
effective tuple without piercing private fields.

578/578 tests pass (+1 new); clippy clean.

Part of #198 phase 3.
Copilot AI review requested due to automatic review settings May 22, 2026 04:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Doc comment listed (window_log, hash_log, mls) triple but the function
signature includes step_size as 4th positional argument.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/encoding/simple/fast_matcher.rs (1)

319-322: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Minor: test helper returns max_window_size, not 1 << window_log.

The docstring says "Equals 1 << window_log", but the implementation returns self.max_window_size. These diverge after dictionary priming inflates max_window_size. Since this is test-only and existing tests only call it before any inflation, this works correctly today, but the implementation could be made consistent with the docstring by returning 1u64 << self.window_log.

Align implementation with docstring
     pub(crate) fn window_size(&self) -> u64 {
-        self.max_window_size as u64
+        1u64 << self.window_log
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/simple/fast_matcher.rs` around lines 319 - 322, The test
helper window_size currently returns self.max_window_size which diverges from
the docstring; change it to compute and return the canonical window size as 1u64
<< self.window_log (use the struct's window_log field) so window_size() matches
the documented "Equals `1 << window_log`" behavior instead of returning
max_window_size.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@zstd/src/encoding/simple/fast_matcher.rs`:
- Around line 319-322: The test helper window_size currently returns
self.max_window_size which diverges from the docstring; change it to compute and
return the canonical window size as 1u64 << self.window_log (use the struct's
window_log field) so window_size() matches the documented "Equals `1 <<
window_log`" behavior instead of returning max_window_size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a45e61e5-6ae4-443a-b46f-4212e0484273

📥 Commits

Reviewing files that changed from the base of the PR and between 21887be and f0e8f48.

📒 Files selected for processing (1)
  • zstd/src/encoding/simple/fast_matcher.rs

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs
polaz added 2 commits May 22, 2026 07:38
window_size docstring states 'Equals 1 << window_log' but body returned
self.max_window_size. Both fields are kept in sync, but reading
window_log makes the helper literally match the documented contract.
…_BASE + hot-path debug_assert

The constant was set to 0 (M8 dropped the dummy-prefix region) so the
old name no longer reflected what the value represents. Renaming to
HISTORY_DRAIN_BASE matches its actual role: the drain start offset
used by eviction paths, with sentinel-0 protection living entirely in
INITIAL_PREFIX_START_INDEX.

Also downgrade the step_size >= 2 check inside compress_block_fast
from assert! to debug_assert! — with_params / reset already validate
this at matcher-construction time, so the release hot path no longer
needs the per-block branch + formatting machinery.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/encoding/simple/fast_kernel/kernel.rs (1)

253-263: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep step_size validated in release builds.

compress_block_fast is still a safe pub(crate) boundary. Downgrading this to debug_assert! makes invalid step_size inputs silently build-dependent for any direct caller that bypasses FastKernelMatcher, even though the loop cadence assumes donor's step_size >= 2 contract.

Suggested fix
-    debug_assert!(
+    assert!(
         step_size >= 2,
         "Fast kernel requires step_size >= 2 (got {step_size}); \
          the donor formula clamps to a min of 2",
     );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/simple/fast_kernel/kernel.rs` around lines 253 - 263, The
check that enforces step_size >= 2 must run in release builds too: replace the
debug-only check in compress_block_fast with a runtime validation (e.g., assert!
or an explicit if that panics with the same message) so invalid callers that
bypass FastKernelMatcher (and its with_params/reset guards) cannot supply
step_size < 2; update the message to match the existing text ("Fast kernel
requires step_size >= 2 (got {step_size}); the donor formula clamps to a min of
2") so callers of compress_block_fast get a clear, release-time failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 253-263: The check that enforces step_size >= 2 must run in
release builds too: replace the debug-only check in compress_block_fast with a
runtime validation (e.g., assert! or an explicit if that panics with the same
message) so invalid callers that bypass FastKernelMatcher (and its
with_params/reset guards) cannot supply step_size < 2; update the message to
match the existing text ("Fast kernel requires step_size >= 2 (got {step_size});
the donor formula clamps to a min of 2") so callers of compress_block_fast get a
clear, release-time failure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 55dbb7ed-3e5b-4bd3-90e0-44c4aca3f9c6

📥 Commits

Reviewing files that changed from the base of the PR and between f0e8f48 and 492e05b.

📒 Files selected for processing (2)
  • zstd/src/encoding/simple/fast_kernel/kernel.rs
  • zstd/src/encoding/simple/fast_matcher.rs

compress_block_fast is a safe pub(crate) boundary, so the step_size
contract must hold in release builds too — a direct caller that
bypasses FastKernelMatcher::with_params / reset would otherwise
silently mis-iterate the loop cadence under release. The earlier
downgrade to debug_assert! traded too much safety for a once-per-
block branch that the predictor and per-block hash/probe work make
negligible.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Comment thread zstd/src/encoding/simple/fast_matcher.rs
…omments with M8 layout

- prefix_start_index is pinned to INITIAL_PREFIX_START_INDEX (1),
  not HISTORY_DRAIN_BASE (0). The '1' floor is what rejects the hash
  table's all-zero empty-slot value, not a dummy prefix byte.
- Pre-accept_data path: history is genuinely empty under M8 (no
  dummy seeding), so the empty slice comes from a zero-length range,
  not from a HISTORY_DRAIN_BASE-seeded dummy region.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
Comment thread zstd/src/encoding/simple/fast_matcher.rs Outdated
polaz added 2 commits May 22, 2026 10:01
…ertions and comments

Three test sites still spelled HISTORY_DRAIN_BASE 'dummy' / 'RESERVED
dummy' even though M8 dropped the dummy region (HISTORY_DRAIN_BASE = 0).
Reword each so the failure message describes what's actually being
checked (history layout = retained prefix + last block / hashed range
in absolute indices), not a layout that no longer exists.
Copilot AI review requested due to automatic review settings May 22, 2026 07:02
Two related correctness gaps surfaced by CR:

- with_params / reset accepted any window_log u8, but the kernel
  stores indices as u32 and computes max_window_size = 1usize <<
  window_log. window_log >= usize::BITS would panic at runtime;
  values above 31 are unrepresentable as u32 indices regardless.
  Reject at construction with assert!(window_log <= 31).

- compress_block_fast's 4-cursor loop computed ip2 = ip0 + step /
  ip3 = ip2 + 1 / next_step = ip0 + K_STEP_INCR with plain usize
  addition. Under a misbehaving caller (huge step_size) or after
  enough step-doubling iterations, the additions could wrap past
  usize::MAX, bypassing the ip3 > ilimit guard and turning the
  range check into a false-pass — then base.add(ip2) / hash reads
  would dereference an out-of-buffer position. Switch the cursor
  advances to checked_add (break the loop on overflow, identical
  drain path as the normal 'ran out of room' exit) and the
  step-doubling / restart paths to saturating_add.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/simple/fast_matcher.rs
…s below u32::MAX

with_params / reset previously accepted window_log up to 31, allowing
max_window_size = 2^31. accept_data lets history grow to roughly
2 * max_window_size before draining (eager-eviction band, donor
parity), so window_log = 31 could push history.len() to 2^32 and
trip the kernel's data.len() > u32::MAX panic.

Lower the cap to 30 — matches donor zstd's ZSTD_WINDOWLOG_MAX_64 —
so the eviction band tops out at 2^31 < u32::MAX with headroom for
the pending block.
@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 22, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_matcher.rs
Comment thread zstd/src/encoding/simple/fast_matcher.rs
…kip unreachable index 0 on rehash

Two CR findings on the post-window_log-cap revision:

- accept_data function-level docs still described the window-budget
  eviction as 'enforced lazily on the actual append (inside
  extend_history_with_pending)'. The actual code does the eviction
  EAGERLY in accept_data so the driver's commit_space can observe
  the pre/post history.len() delta — without that visibility the
  dict-budget retire never fires. Rewrite the doc paragraph to
  match.

- drain_real_prefix rebases prefix_start_index to
  INITIAL_PREFIX_START_INDEX (= 1) so the kernel's
  match_idx >= prefix_start_index filter rejects sentinel-0 lookups.
  The rehash loop was then primed from HISTORY_DRAIN_BASE (= 0),
  inserting one entry at position 0 that the kernel can never read
  back. Start the rehash at INITIAL_PREFIX_START_INDEX instead — same
  range coverage for every kernel-reachable byte, one fewer hash op
  and zero pollution of the freshly-cleared table.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 22, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Full review triggered.

@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 22, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
zstd/src/encoding/match_generator.rs (1)

8215-8262: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Exercise the backend-switch Fast path in this test.

This loop never leaves BackendTag::Simple, so it only validates FastKernelMatcher::reset(...); the FastKernelMatcher::with_params(...) wiring at Line 963 is still untested. Reset through any non-Fast level before each Fast case so the swap-in path is covered too.

🔧 Minimal test adjustment
     for &level in &fast_levels {
         let p = resolve_level_params(level, None);
@@
-        crate::encoding::Matcher::reset(&mut driver, level);
+        crate::encoding::Matcher::reset(&mut driver, CompressionLevel::Default);
+        crate::encoding::Matcher::reset(&mut driver, level);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/match_generator.rs` around lines 8215 - 8262, The loop only
ever lives in BackendTag::Simple so it never exercises the backend-switch path
(FastKernelMatcher::with_params); before calling
crate::encoding::Matcher::reset(&mut driver, level) for each fast_levels entry,
first reset the driver with a non-Fast compression level (e.g., call
crate::encoding::Matcher::reset(&mut driver, CompressionLevel::Level(9)) or any
level that resolves to a non-Fast strategy) to force the matcher to swap back
into the Fast backend; this ensures the FastKernelMatcher::with_params wiring is
exercised for each case resolved by resolve_level_params and validates the
backend-switch path used by MatchGeneratorDriver::new / simple_mut().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 8215-8262: The loop only ever lives in BackendTag::Simple so it
never exercises the backend-switch path (FastKernelMatcher::with_params); before
calling crate::encoding::Matcher::reset(&mut driver, level) for each fast_levels
entry, first reset the driver with a non-Fast compression level (e.g., call
crate::encoding::Matcher::reset(&mut driver, CompressionLevel::Level(9)) or any
level that resolves to a non-Fast strategy) to force the matcher to swap back
into the Fast backend; this ensures the FastKernelMatcher::with_params wiring is
exercised for each case resolved by resolve_level_params and validates the
backend-switch path used by MatchGeneratorDriver::new / simple_mut().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b3355e2b-5232-4664-acf6-ef1ca7a4d10a

📥 Commits

Reviewing files that changed from the base of the PR and between 70cb3e4 and 3d150f1.

📒 Files selected for processing (3)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/simple/fast_kernel/kernel.rs
  • zstd/src/encoding/simple/fast_matcher.rs

…ckend-switch hop

The Fast-levels wiring test stayed inside BackendTag::Simple for the
entire loop, so every iteration validated only
FastKernelMatcher::reset — never the with_params path that
MatchGeneratorDriver::new / simple_mut take when swapping into the
Fast variant from a different backend. Bounce through a non-Fast
strategy (Level(9) → Lazy) before each Fast reset so the next reset
actually triggers the backend-switch + with_params construction,
covering both reset and with_params wiring per Fast level.
@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 22, 2026

@coderabbitai Fixed in e6153f2 — wiring test now bounces through CompressionLevel::Level(9) (resolves to Lazy strategy) before each Fast reset, so the loop exercises the backend-switch path that goes through FastKernelMatcher::with_params on top of the reset-only path it was already covering.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Tip

For best results, initiate chat on the files or code changes.

@polaz: Thanks for the fix — bouncing through Level(9) before each Fast reset is a clean way to exercise the with_params backend-switch path without duplicating test structure. The wiring test now covers both construction-from-scratch and reset-only paths together.

(∧˃ᵕ˂∧)♡

@polaz polaz closed this May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(fast): port donor ZSTD_compressBlock_fast — 22× regression on Fast strategy

2 participants