perf(encoder): port donor ZSTD_compressBlock_fast — 4-cursor + per-level cParams + cmov + window-correctness (#198 phase 3)#221
Conversation
… + per-level cParams + cmov dispatch + window-correctness (#198 phase 3) Closes phase 3 of #198: native Rust port of donor's `ZSTD_compressBlock_fast_noDict_generic` (`zstd_fast.c:192-456`) restoring Fast-strategy throughput after the phase 1b regression and aligning encoder behaviour with donor parity on every front that affects ratio / speed / format correctness. ## Headline results `large-log-stream` (25 MiB dense corpus, i9-9900K): | State | Time vs main | Throughput | |------------------------------|-------------:|---------------:| | main (pre-phase-1b) | 0% | ~290 MiB/s | | #217 merged (phase 1b) | +122% | ~235 MiB/s | | **This PR (phase 3, M1-M9)** | **−22%** | **~790 MiB/s** | Net change phase 1b → phase 3: ~3.4× faster. ## Ratio audit on `decodecorpus-z000033` - L-7..L-1 Fast: monotone acceleration gradient restored (−1.85% to −4.79% vs donor) — pre-M6 these levels all produced identical 585601 bytes - L2-3 Dfast: parity or slight win (-0.94% to -4.61%) - L5-15 Lazy: consistently beat donor by −5.7% to −6.18% - L16-17 btopt: parity / win (−0.33% to −3.59%) - L20-22 btultra2: parity (±0.2%) - L1 Fast: +7.43% residual gap — tracked as follow-up #220 ## Milestones (squashed) - M1: per-level `fast_hash_log`/`fast_mls`/`fast_step_size` threading through `LevelParams` - M2: full 4-cursor `ip0/ip1/ip2/ip3` lookahead body + immediate-rep2 inner loop ported from `zstd_fast.c` - M3: `cmov` match-found variant + per-window dispatch surface (10 monomorphisations across `mls` 4..=8 × `use_cmov` true/false) - M4: beyond-donor `fast_hash_log: 13 → 14` for negative levels (+32 KB memory, 2× fewer collisions on structured corpora) - M5: reverted (adaptive mls peek did not pay off) - M6: per-level `fast_step_size` from donor's `targetLength = -level` formula; restores acceleration gradient - M7: added donor's missing `current0+2` hash insertion after each match emit (`zstd_fast.c:407`); raised L1/decodecorpus sequence-match-rate 43.1% → 57.7% - M8: dropped `RESERVED_PREFIX_BYTES` dummy byte; history layout now donor-parity, sentinel-0 protection via `INITIAL_PREFIX_START_INDEX = 1` filter - M9 (format correctness): sliding prefix floor at scan time enforces the advertised frame window (`1 << window_log`), NOT the dictionary-budget-inflated `max_window_size`. Prevents emitting offsets > advertised window during dictionary-primed compression — would otherwise produce format-invalid frames. Regression test included. ## API changes (driver) - `FastKernelMatcher::with_params(window_log, hash_log, mls, step_size)` — step_size promoted to 4th positional arg; previous `set_step_size` post-init setter removed (closed the gap where new code paths could silently fall back to default step_size=2) - `FastKernelMatcher::reset(window_log, hash_log, mls, step_size)` — same signature change - `RESERVED_PREFIX_BYTES`: 1 → 0 (no dummy region); legacy name retained for drain-offset math - New `INITIAL_PREFIX_START_INDEX = 1` constant for sentinel-0 filter baseline ## Kernel structure - `match_found<USE_CMOV>` — branchless cmov / branch dispatch via const generic; `# Safety` documented; bitwise `&` intentional to preserve cmov ordering (donor `__asm__("")` equivalent) - `MatchFound::{Rep,Explicit}` variants carry explicit `current0` (donor's writeback position) — correct on both probe paths, pre-backward-extension - `ip3 > ilimit` exit (not `>=`) — recovers the last hashable position at end-of-block - Two hash insertions post-match-emit: at `current0+2` (donor zstd_fast.c:407, was missing) and `ip0-2` ## Tests - 577 nextest, all passing - New per-level dispatch test pinning hash_log / mls / step_size for L1, Fastest, Uncompressed, L-1..L-7 - cmov vs branch byte-for-byte equality + cmov out-of-window false-positive regression - explicit_match_backward_extension deterministic via marker-byte layout (asserts match_len ≥ 5 + literals don't end with marker) - start_matching_enforces_max_window_size_offset_bound - start_matching_caps_offsets_at_window_log_not_inflated_max (M9 format-correctness regression) - cross_validation FFI roundtrips on every level 1..=22 (dict + no-dict) — wire-format interop preserved ## Out of scope (follow-ups) - #220 — Fast L1 +7.43% ratio residual on decodecorpus - L18-19 btultra +3.5% — separate ratio investigation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughLevelParams now exposes per-level Fast tuning and resolve_level_params/LEVEL_TABLE populate them; MatchGeneratorDriver threads fast_* into FastKernelMatcher::with_params/reset; FastKernelMatcher stores step_size/use_cmov and removes the dummy-prefix model; compress_block_fast becomes a 4-cursor donor pipeline with a USE_CMOV const generic; tests updated. ChangesPer-level Fast kernel parameter dispatch and donor pipeline
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
Ports donor’s ZSTD_compressBlock_fast_noDict_generic behavior into the Rust Fast encoder path to restore Fast-strategy throughput and improve donor-parity correctness (including window/offset correctness under dictionary priming).
Changes:
- Extend
FastKernelMatcherto carry per-level Fast cParams (hash_log,mls,step_size) and deriveuse_cmovfromwindow_log. - Replace the Fast kernel’s phase-1 single-cursor loop with the donor-style 4-cursor (
ip0/ip1/ip2/ip3) pipeline, step acceleration, immediate-rep2 loop, and branch/cmov match-found variants via const generics. - Wire per-level Fast parameters through
LevelParamsandMatchGeneratorDriverreset/backend swap paths; add regression tests for per-level dispatch.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
zstd/src/encoding/simple/fast_matcher.rs |
Threads per-level params into matcher lifecycle, adds cmov + step_size plumbing, and enforces advertised-window prefix flooring for format correctness. |
zstd/src/encoding/simple/fast_kernel/kernel.rs |
Implements donor 4-cursor fast block compressor, adds cmov/branch match-found variants, and step_size-driven acceleration + immediate-rep2 logic. |
zstd/src/encoding/match_generator.rs |
Introduces Fast-specific per-level parameters in LevelParams, wires them into Simple backend creation/reset, and adds dispatch tests. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 8157-8204: Extend the test to actually exercise the driver wiring
by creating a MatchGeneratorDriver, for several Fast levels call the same
FastKernelMatcher::with_params(...) / m.reset(...) path used in the code (the
sites where you wire resolve_level_params into the matcher) and then assert the
inner matcher observed tuple equals (fast_hash_log, fast_mls, fast_step_size)
from resolve_level_params; in practice, for each chosen CompressionLevel call
resolve_level_params(level, None), reset the driver/matcher the same way the
production code does (the FastKernelMatcher::with_params(...) / m.reset(...)
calls), then read the matcher's effective (hash_log, mls, step_size) fields and
assert they match p.fast_hash_log, p.fast_mls, p.fast_step_size to catch any
swapping or stale values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 65027f1a-c31d-4489-837e-16337515e0bf
📒 Files selected for processing (3)
zstd/src/encoding/match_generator.rszstd/src/encoding/simple/fast_kernel/kernel.rszstd/src/encoding/simple/fast_matcher.rs
…atorDriver::reset (#221 CR nitpick) CR noted the existing per-level dispatch test only validated resolve_level_params() output. If with_params(...) / m.reset(...) on the Fast wiring sites swapped fast_hash_log and fast_mls, or kept a stale fast_step_size across resets, the existing test would still pass — it never inspected the inner matcher. This adds fast_levels_driver_wiring_threads_cparams_into_inner_matcher: for every Fast level (Level(1), Fastest, Uncompressed, L-1..L-7) the test calls Matcher::reset on a real MatchGeneratorDriver, then reaches into the simple-backend FastKernelMatcher and asserts (hash_log, mls, step_size) == resolve_level_params(level). Added test-only crate getters on FastKernelMatcher (step_size / hash_log / mls) so the assertion can read the effective tuple without piercing private fields. 578/578 tests pass (+1 new); clippy clean. Part of #198 phase 3.
Doc comment listed (window_log, hash_log, mls) triple but the function signature includes step_size as 4th positional argument.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/encoding/simple/fast_matcher.rs (1)
319-322: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueMinor: test helper returns
max_window_size, not1 << window_log.The docstring says "Equals
1 << window_log", but the implementation returnsself.max_window_size. These diverge after dictionary priming inflatesmax_window_size. Since this is test-only and existing tests only call it before any inflation, this works correctly today, but the implementation could be made consistent with the docstring by returning1u64 << self.window_log.Align implementation with docstring
pub(crate) fn window_size(&self) -> u64 { - self.max_window_size as u64 + 1u64 << self.window_log }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/simple/fast_matcher.rs` around lines 319 - 322, The test helper window_size currently returns self.max_window_size which diverges from the docstring; change it to compute and return the canonical window size as 1u64 << self.window_log (use the struct's window_log field) so window_size() matches the documented "Equals `1 << window_log`" behavior instead of returning max_window_size.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@zstd/src/encoding/simple/fast_matcher.rs`:
- Around line 319-322: The test helper window_size currently returns
self.max_window_size which diverges from the docstring; change it to compute and
return the canonical window size as 1u64 << self.window_log (use the struct's
window_log field) so window_size() matches the documented "Equals `1 <<
window_log`" behavior instead of returning max_window_size.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a45e61e5-6ae4-443a-b46f-4212e0484273
📒 Files selected for processing (1)
zstd/src/encoding/simple/fast_matcher.rs
window_size docstring states 'Equals 1 << window_log' but body returned self.max_window_size. Both fields are kept in sync, but reading window_log makes the helper literally match the documented contract.
…_BASE + hot-path debug_assert The constant was set to 0 (M8 dropped the dummy-prefix region) so the old name no longer reflected what the value represents. Renaming to HISTORY_DRAIN_BASE matches its actual role: the drain start offset used by eviction paths, with sentinel-0 protection living entirely in INITIAL_PREFIX_START_INDEX. Also downgrade the step_size >= 2 check inside compress_block_fast from assert! to debug_assert! — with_params / reset already validate this at matcher-construction time, so the release hot path no longer needs the per-block branch + formatting machinery.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/encoding/simple/fast_kernel/kernel.rs (1)
253-263:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
step_sizevalidated in release builds.
compress_block_fastis still a safepub(crate)boundary. Downgrading this todebug_assert!makes invalidstep_sizeinputs silently build-dependent for any direct caller that bypassesFastKernelMatcher, even though the loop cadence assumes donor'sstep_size >= 2contract.Suggested fix
- debug_assert!( + assert!( step_size >= 2, "Fast kernel requires step_size >= 2 (got {step_size}); \ the donor formula clamps to a min of 2", );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/simple/fast_kernel/kernel.rs` around lines 253 - 263, The check that enforces step_size >= 2 must run in release builds too: replace the debug-only check in compress_block_fast with a runtime validation (e.g., assert! or an explicit if that panics with the same message) so invalid callers that bypass FastKernelMatcher (and its with_params/reset guards) cannot supply step_size < 2; update the message to match the existing text ("Fast kernel requires step_size >= 2 (got {step_size}); the donor formula clamps to a min of 2") so callers of compress_block_fast get a clear, release-time failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 253-263: The check that enforces step_size >= 2 must run in
release builds too: replace the debug-only check in compress_block_fast with a
runtime validation (e.g., assert! or an explicit if that panics with the same
message) so invalid callers that bypass FastKernelMatcher (and its
with_params/reset guards) cannot supply step_size < 2; update the message to
match the existing text ("Fast kernel requires step_size >= 2 (got {step_size});
the donor formula clamps to a min of 2") so callers of compress_block_fast get a
clear, release-time failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 55dbb7ed-3e5b-4bd3-90e0-44c4aca3f9c6
📒 Files selected for processing (2)
zstd/src/encoding/simple/fast_kernel/kernel.rszstd/src/encoding/simple/fast_matcher.rs
compress_block_fast is a safe pub(crate) boundary, so the step_size contract must hold in release builds too — a direct caller that bypasses FastKernelMatcher::with_params / reset would otherwise silently mis-iterate the loop cadence under release. The earlier downgrade to debug_assert! traded too much safety for a once-per- block branch that the predictor and per-block hash/probe work make negligible.
…omments with M8 layout - prefix_start_index is pinned to INITIAL_PREFIX_START_INDEX (1), not HISTORY_DRAIN_BASE (0). The '1' floor is what rejects the hash table's all-zero empty-slot value, not a dummy prefix byte. - Pre-accept_data path: history is genuinely empty under M8 (no dummy seeding), so the empty slice comes from a zero-length range, not from a HISTORY_DRAIN_BASE-seeded dummy region.
…ertions and comments Three test sites still spelled HISTORY_DRAIN_BASE 'dummy' / 'RESERVED dummy' even though M8 dropped the dummy region (HISTORY_DRAIN_BASE = 0). Reword each so the failure message describes what's actually being checked (history layout = retained prefix + last block / hashed range in absolute indices), not a layout that no longer exists.
Two related correctness gaps surfaced by CR: - with_params / reset accepted any window_log u8, but the kernel stores indices as u32 and computes max_window_size = 1usize << window_log. window_log >= usize::BITS would panic at runtime; values above 31 are unrepresentable as u32 indices regardless. Reject at construction with assert!(window_log <= 31). - compress_block_fast's 4-cursor loop computed ip2 = ip0 + step / ip3 = ip2 + 1 / next_step = ip0 + K_STEP_INCR with plain usize addition. Under a misbehaving caller (huge step_size) or after enough step-doubling iterations, the additions could wrap past usize::MAX, bypassing the ip3 > ilimit guard and turning the range check into a false-pass — then base.add(ip2) / hash reads would dereference an out-of-buffer position. Switch the cursor advances to checked_add (break the loop on overflow, identical drain path as the normal 'ran out of room' exit) and the step-doubling / restart paths to saturating_add.
…s below u32::MAX with_params / reset previously accepted window_log up to 31, allowing max_window_size = 2^31. accept_data lets history grow to roughly 2 * max_window_size before draining (eager-eviction band, donor parity), so window_log = 31 could push history.len() to 2^32 and trip the kernel's data.len() > u32::MAX panic. Lower the cap to 30 — matches donor zstd's ZSTD_WINDOWLOG_MAX_64 — so the eviction band tops out at 2^31 < u32::MAX with headroom for the pending block.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
…kip unreachable index 0 on rehash Two CR findings on the post-window_log-cap revision: - accept_data function-level docs still described the window-budget eviction as 'enforced lazily on the actual append (inside extend_history_with_pending)'. The actual code does the eviction EAGERLY in accept_data so the driver's commit_space can observe the pre/post history.len() delta — without that visibility the dict-budget retire never fires. Rewrite the doc paragraph to match. - drain_real_prefix rebases prefix_start_index to INITIAL_PREFIX_START_INDEX (= 1) so the kernel's match_idx >= prefix_start_index filter rejects sentinel-0 lookups. The rehash loop was then primed from HISTORY_DRAIN_BASE (= 0), inserting one entry at position 0 that the kernel can never read back. Start the rehash at INITIAL_PREFIX_START_INDEX instead — same range coverage for every kernel-reachable byte, one fewer hash op and zero pollution of the freshly-cleared table.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
zstd/src/encoding/match_generator.rs (1)
8215-8262:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winExercise the backend-switch Fast path in this test.
This loop never leaves
BackendTag::Simple, so it only validatesFastKernelMatcher::reset(...); theFastKernelMatcher::with_params(...)wiring at Line 963 is still untested. Reset through any non-Fast level before each Fast case so the swap-in path is covered too.🔧 Minimal test adjustment
for &level in &fast_levels { let p = resolve_level_params(level, None); @@ - crate::encoding::Matcher::reset(&mut driver, level); + crate::encoding::Matcher::reset(&mut driver, CompressionLevel::Default); + crate::encoding::Matcher::reset(&mut driver, level);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/match_generator.rs` around lines 8215 - 8262, The loop only ever lives in BackendTag::Simple so it never exercises the backend-switch path (FastKernelMatcher::with_params); before calling crate::encoding::Matcher::reset(&mut driver, level) for each fast_levels entry, first reset the driver with a non-Fast compression level (e.g., call crate::encoding::Matcher::reset(&mut driver, CompressionLevel::Level(9)) or any level that resolves to a non-Fast strategy) to force the matcher to swap back into the Fast backend; this ensures the FastKernelMatcher::with_params wiring is exercised for each case resolved by resolve_level_params and validates the backend-switch path used by MatchGeneratorDriver::new / simple_mut().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 8215-8262: The loop only ever lives in BackendTag::Simple so it
never exercises the backend-switch path (FastKernelMatcher::with_params); before
calling crate::encoding::Matcher::reset(&mut driver, level) for each fast_levels
entry, first reset the driver with a non-Fast compression level (e.g., call
crate::encoding::Matcher::reset(&mut driver, CompressionLevel::Level(9)) or any
level that resolves to a non-Fast strategy) to force the matcher to swap back
into the Fast backend; this ensures the FastKernelMatcher::with_params wiring is
exercised for each case resolved by resolve_level_params and validates the
backend-switch path used by MatchGeneratorDriver::new / simple_mut().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b3355e2b-5232-4664-acf6-ef1ca7a4d10a
📒 Files selected for processing (3)
zstd/src/encoding/match_generator.rszstd/src/encoding/simple/fast_kernel/kernel.rszstd/src/encoding/simple/fast_matcher.rs
…ckend-switch hop The Fast-levels wiring test stayed inside BackendTag::Simple for the entire loop, so every iteration validated only FastKernelMatcher::reset — never the with_params path that MatchGeneratorDriver::new / simple_mut take when swapping into the Fast variant from a different backend. Bounce through a non-Fast strategy (Level(9) → Lazy) before each Fast reset so the next reset actually triggers the backend-switch + with_params construction, covering both reset and with_params wiring per Fast level.
|
@coderabbitai Fixed in e6153f2 — wiring test now bounces through CompressionLevel::Level(9) (resolves to Lazy strategy) before each Fast reset, so the loop exercises the backend-switch path that goes through FastKernelMatcher::with_params on top of the reset-only path it was already covering. |
|
Tip For best results, initiate chat on the files or code changes.
(∧˃ᵕ˂∧)♡ |
Summary
Native Rust port of donor's
ZSTD_compressBlock_fast_noDict_generic(zstd_fast.c:192-456) — closes phase 3 of #198. Restores Fast-strategy throughput after the phase 1b regression and aligns encoder behaviour with donor parity on every front that affects ratio / speed / format correctness.Squashed from PR #219 (41 review threads addressed across 10 rounds) into a single coherent commit.
Headline results
large-log-stream(25 MiB dense corpus, i9-9900K):Net change phase 1b → phase 3: ~3.4× faster.
Ratio audit on
decodecorpus-z000033Milestones (squashed from PR #219)
fast_hash_log/fast_mls/fast_step_sizethreading throughLevelParamsip0/ip1/ip2/ip3lookahead body + immediate-rep2 inner loopcmovmatch-found variant + per-window dispatch (10 monomorphisations:mls4..=8 ×use_cmovtrue/false)fast_hash_log: 13 → 14for negative levels (+32 KB memory, 2× fewer collisions on structured corpora)fast_step_sizefrom donor'stargetLength = -levelformula; restores acceleration gradientcurrent0+2hash insertion after each match emit (zstd_fast.c:407); raised L1/decodecorpus sequence-match-rate 43.1% → 57.7%RESERVED_PREFIX_BYTESdummy byte; history layout now donor-parity, sentinel-0 protection viaINITIAL_PREFIX_START_INDEX = 1filter1 << window_log), NOT the dictionary-budget-inflatedmax_window_size. Prevents emitting offsets > advertised window during dictionary-primed compression — would otherwise produce format-invalid frames.API changes
FastKernelMatcher::with_params(window_log, hash_log, mls, step_size)— step_size promoted to 4th positional argFastKernelMatcher::reset(...)— same signature changeset_step_sizepost-init setter removed (closed gap where new code paths could silently fall back to step_size=2)RESERVED_PREFIX_BYTES: 1 → 0; newINITIAL_PREFIX_START_INDEX = 1constantTests
max_window_size)Out of scope (follow-ups)
Closes #198 phase 3.
Summary by CodeRabbit
Bug Fixes
Refactor
New Features
Tests