perf(encoder): donor-shape Fast kernel modules (#198 phase 1a) by polaz · Pull Request #215 · structured-world/structured-zstd

polaz · 2026-05-20T16:28:49Z

Summary

Phase 1a of the Fast strategy donor port (issue #198 — 22× regression
vs C zstd on Fast levels). This PR lands the donor-shape kernel
modules with full unit-test coverage; the wiring into
MatchGeneratorDriver / MatcherStorage and the i9-9900K bench land
in the follow-up PR on the next branch.

What's in this PR

Three new modules under zstd/src/encoding/simple/fast_kernel/:

hash_table.rs — FastHashTable
Flat Vec<u32> indexed by donor-parity ZSTD_hash{4,5,6,7,8}Ptr
(multiply-shift on the first mls bytes of the suffix at ptr,
output reduced to hash_log bits). Constants bit-identical to
lib/compress/zstd_compress_internal.h. Monomorphised over MLS
via const-generic so each mls instantiation compiles to a
dedicated body with the prime / shift baked in.
count.rs — count_forward
Port of ZSTD_count from lib/compress/zstd_compress_internal.h.
8-byte chunked XOR loop with trailing_zeros() / 8 to locate the
diverging byte, then u32 / u16 / u8 tail fall-throughs
matching donor's exact progression. Replaces the byte-by-byte
common_prefix_len used by the current SuffixStore-based Fast
matcher.
kernel.rs — compress_block_fast<MLS>
Port of ZSTD_compressBlock_fast_noDict_generic from
lib/compress/zstd_fast.c. Phase 1 keeps the single-ip0 cursor
with raw 4-byte match probe (MEM_read32(ip) == MEM_read32(base + matchIdx)), step-based skip acceleration tied to
kSearchStrength = 6, repcode-at-ip check, backward extension
bounded by anchor / prefix_start, and count_forward for
forward extension. The 4-cursor pipelining (ip0 / ip1 / ip2 / ip3) and cmov match-found variant from donor's full generic
body are deferred to phase 3 per the issue's staged plan — phase 1
exists to validate the data structures and close the bulk of the
22× gap before adding pipelining complexity.

The kernel's compress_block_fast is monomorphised over const MLS: u32 (4..=8), invokes handle_sequence per emitted sequence
(replacing donor's ZSTD_storeSeq), and returns
FastBlockResult { rep, tail_literals_len } so the caller can
thread repcode history across blocks and flush trailing literals.

What's NOT in this PR (lands in follow-up)

Wiring into MatchGeneratorDriver::compress_block::<Fast> — phase
1b. Requires:
- Adding FastHashTable field to MatchGenerator (alongside the
  existing SuffixStore during transition).
- Flat-history-buffer materialisation (concat Vec<WindowEntry>
  into a single Vec<u8> per start_matching call, or replace
  the multi-window layout with a single ring buffer).
- Hash-table alloc / clear in reset keyed off the resolved
  LevelParams (hash_log, mls) — donor uses hash_log = 14,
  mls = 7 for default level 1.
Encoder/decoder roundtrip validation (current cargo nextest run --workspace covers existing SuffixStore path; the kernel hasn't
been wired yet so it can't break the e2e suite from this PR).
i9-9900K bench on compare_ffi level_1_fast — measures the actual
perf delta vs main and against C donor. Phase 1b commit will land
the numbers in its PR body.

Phase 3 (separate later branch per the issue):

4-cursor ip0/ip1/ip2/ip3 pipelining with software-prefetch hash
reads — gives the bulk of donor's noDict_generic throughput on
modern superscalar cores.
cmov match-found variant gated on windowLog < 19.
Cross-block matches via prefix history (phase 1a's kernel is
single-block; cross-block matching at the window boundary is part
of phase 1b's flat-buffer work).

Tests

16 new unit tests covering:

hash_table::tests: donor formula parity on known inputs for both
mls=4 and mls=5, get/put round-trip, clear() resets every
entry to sentinel.
count::tests: empty / full / chunk-boundary / tail-fall-through
divergence at every transition point (u32 → u16 → u8), thousand-
byte long match.
kernel::tests: short-input → literals-only callback path, long
repeat pattern → at least one Triple emitted with match_len ≥ MIN_MATCH (4), plus an accounting invariant
(sum(literals + matches) + tail_literals_len == input.len())
that catches off-by-ones in any future refactor.

All 16 pass; full workspace nextest run remains green at 555/555.

Module gate

fast_kernel/mod.rs opens with a narrowly-scoped
#![allow(dead_code)] (without unused_imports) for this commit
only — every item is consumed by the phase 1b wiring commit and
the allow is removed there. The unused_imports lint stays ACTIVE
across the module so any accidentally-unused use in sibling code
surfaces immediately. Re-exports of the kernel's public surface
(FastHashTable, FastBlockResult, compress_block_fast) are
deferred to the wiring commit since adding them now would force
the wider lint relaxation. Without the dead_code allow, baseline
cargo clippy -D warnings would flag the unused kernel-internal
items; splitting the kernel and the wiring into separate PRs is
the right tradeoff for review clarity, but the cost is the
temporary dead_code mute on this module.

Related: #198 (Fast strategy 22× regression — multi-phase port).

Summary by CodeRabbit

New Features
- Added a donor-style "Fast" block compressor to the simple encoder for improved match finding, rep/offset handling, and faster compression.
- Added an optimized forward match counter and a compact power-of-two hash table to improve match-extension accuracy and throughput.
Tests
- Added unit tests covering matching behavior, hash-table operations, compressor outcomes, and edge cases (empty inputs, early mismatches, chunk/tail boundaries, repcode behavior, long full-match).

…ount) First commit of the #198 Fast strategy port (22× regression vs C zstd on Fast levels). The kernel itself is implemented and unit-tested in isolation; wiring into MatchGeneratorDriver / MatcherStorage lands in the next commit on this branch. What's in this commit: - zstd/src/encoding/simple/fast_kernel/hash_table.rs: FastHashTable — flat Vec<u32> indexed by donor-parity ZSTD_hash{4,5,6,7,8}Ptr (multiply-shift on the first `mls` bytes of the suffix at `ptr`, output reduced to `hash_log` bits). Constants bit-identical to lib/compress/zstd_compress_internal.h. - zstd/src/encoding/simple/fast_kernel/count.rs: count_forward — port of ZSTD_count from lib/compress/zstd_compress_internal.h. 8-byte chunked XOR loop with trailing_zeros / 8 to locate the diverging byte, then u32 / u16 / u8 tail fall-throughs matching donor's exact progression. - zstd/src/encoding/simple/fast_kernel/kernel.rs: compress_block_fast<MLS> — port of ZSTD_compressBlock_fast_noDict_generic from lib/compress/zstd_fast.c. Phase 1 keeps the single-ip0 cursor with raw 4-byte match probe, step-based skip acceleration, repcode- at-ip check, backward extension, and ZSTD_count for forward extension. The 4-cursor (ip0/ip1/ip2/ip3) pipelining and cmov match-found variant from donor's full generic body are deferred to phase 3; this commit exists to validate the data structures and close the bulk of the 22× gap before adding pipelining complexity. Tests: 16 new unit tests cover the hash function donor parity, the count_forward chunk/tail progression, and the kernel's accounting invariant (sum(literals + matches) + tail_literals_len == input.len()). All pass. The module is gated with `#![allow(dead_code, unused_imports)]` for this commit only — every item is used in the wiring commit. The allow is removed there. Related: #198 (Fast strategy 22× regression).

coderabbitai · 2026-05-20T16:29:03Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4692f8aa-a289-4b41-8468-d35f3fafc0b5

📥 Commits

Reviewing files that changed from the base of the PR and between 12e6751 and aa3faf5.

📒 Files selected for processing (2)

zstd/src/encoding/simple/fast_kernel/count.rs
zstd/src/encoding/simple/fast_kernel/hash_table.rs

📝 Walkthrough

Walkthrough

Adds a donor-compatible Fast compression kernel: count_forward for forward match-length counting, FastHashTable for multiply-shift hashing and position storage, and compress_block_fast which probes the table, extends matches, emits sequences, and reports trailing literals. Includes unit tests for each component.

Changes

Fast Compression Kernel Implementation

Layer / File(s)	Summary
Module structure and integration `zstd/src/encoding/simple/mod.rs`, `zstd/src/encoding/simple/fast_kernel/mod.rs`	Adds `fast_kernel` submodule, declares `count`, `hash_table`, and `kernel` submodules, and applies crate-local `#![allow(dead_code)]`.
Match-length counting primitive `zstd/src/encoding/simple/fast_kernel/count.rs`	`count_forward` scans forward using usize-sized XOR chunks and endian-specific zero-counting to locate first mismatch, then checks u32 (on 64-bit), u16, and u8 tails. Returns matched byte length and includes comprehensive unit tests.
Hash table structure and hashing `zstd/src/encoding/simple/fast_kernel/hash_table.rs`	`FastHashTable` is a flat `Vec<u32>` sized by `1 << hash_log` with `0` sentinel, validates parameters, provides `clear`, const-generic `hash_ptr::<MLS>` for MLS=4..=8 using donor primes and unaligned little-endian loads, and unsafe `get`/`put`. Tests verify hashing, get/put/clear, and constructor bounds.
Fast block compression kernel `zstd/src/encoding/simple/fast_kernel/kernel.rs`	`compress_block_fast` implements donor-style fast block compression: 4-byte probing, `match_found` hardening, repcode-at-ip and explicit-match paths, backward/forward extension via `count_forward`, repcode save/restore, sequence emission via handler, and unit tests covering accounting and hardening.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant compress_block_fast
  participant FastHashTable
  participant count_forward
  participant SequenceHandler
  Caller->>compress_block_fast: run_block(data, block_start, prefix_start_index, hash_table, rep, handle_sequence)
  compress_block_fast->>FastHashTable: get(hash) / put(hash, pos)
  compress_block_fast->>count_forward: count_forward(ip, match_ptr, iend)
  compress_block_fast->>SequenceHandler: handle_sequence(Sequence::Triple / Sequence::Literals)
  compress_block_fast-->>Caller: FastBlockResult(rep, tail_literals_len)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

perf(fast): port donor ZSTD_compressBlock_fast — 22× regression on Fast strategy #198: Implements the donor Fast path (ZSTD_count → count_forward, Vec hash table → FastHashTable, and ZSTD_compressBlock_fast → compress_block_fast) described in the issue.

Poem

A rabbit hops through hash tables bright, 🐇
Counting forward, byte by byte at night,
XOR and CTZ find where the bytes part,
Sequences stitched while repcodes keep heart,
Tail literals nap as the kernel takes flight.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding donor-shaped Fast kernel modules to the encoder, with clear reference to the phase (1a) and ticket number (`#198`).
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/#198-fast-matcher-port-phase1

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-20T16:31:30Z

Codecov Report

❌ Patch coverage is 93.84359% with 37 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/encoding/simple/fast_kernel/kernel.rs	91.57%	30 Missing ⚠️
zstd/src/encoding/simple/fast_kernel/hash_table.rs	94.92%	7 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 122-133: The early-return branch in the fast kernel currently
calls handle_sequence(Sequence::Literals { literals: &data[block_start..] }) but
then returns FastBlockResult { rep, tail_literals_len: 0 }, which is
inconsistent with the normal path; change this branch to NOT call
handle_sequence and instead return FastBlockResult { rep, tail_literals_len:
data.len() - block_start } so the caller is responsible for emitting the tail
literals (preserve symbols: data, block_start, HASH_READ_SIZE, handle_sequence,
FastBlockResult, tail_literals_len).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a014b89a-2926-44d6-9245-df8effda3c58

📥 Commits

Reviewing files that changed from the base of the PR and between 969cf64 and b69b48e.

📒 Files selected for processing (5)

zstd/src/encoding/simple/fast_kernel/count.rs
zstd/src/encoding/simple/fast_kernel/hash_table.rs
zstd/src/encoding/simple/fast_kernel/kernel.rs
zstd/src/encoding/simple/fast_kernel/mod.rs
zstd/src/encoding/simple/mod.rs

Copilot

Pull request overview

Adds the “donor-shape” Fast strategy kernel building blocks (flat u32 hash table, ZSTD_count-style forward match counter, and a phase-1 single-cursor fast block compressor) under zstd/src/encoding/simple/fast_kernel/. This is a preparatory step toward wiring a faster Fast-strategy matcher into the encoder in a follow-up PR.

Changes:

Introduce encoding/simple/fast_kernel module and export the new kernel/hash-table APIs.
Implement FastHashTable with donor-parity multiply-shift hashing for MLS 4..=8.
Implement count_forward and a phase-1 compress_block_fast<MLS> plus unit tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
zstd/src/encoding/simple/mod.rs	Exposes the new `fast_kernel` submodule under the Simple encoder.
zstd/src/encoding/simple/fast_kernel/mod.rs	Adds the new fast-kernel module umbrella and re-exports key types/functions.
zstd/src/encoding/simple/fast_kernel/hash_table.rs	Implements the flat `Vec<u32>` hash table and donor-parity hash functions.
zstd/src/encoding/simple/fast_kernel/count.rs	Implements the donor-style forward byte-match counter (`ZSTD_count` port).
zstd/src/encoding/simple/fast_kernel/kernel.rs	Implements the phase-1 fast block compression loop and unit tests.

…tion Four reviewer findings collapsed into one commit because every fix lives in the kernel modules already introduced by the previous baseline commit, and they're not strictly independent (the short-input contract update changes how the test helper observes accounting). 1. Short-input early-return path (kernel.rs:122-133, threads #1 + #5): Previously called handle_sequence(Sequence::Literals { .. }) and then returned tail_literals_len = 0, conflicting with the main-loop exit which never emits the terminal Literals and returns the tail count. Caller wrapping the kernel had no consistent rule for whether to emit the trailing Literals or not. Unify both branches: the kernel NEVER emits the terminal Literals; the caller emits it from tail_literals_len. Updated the function doc to spell out the sequence-emission contract. 2. count_forward endianness (count.rs:68, thread #2): The 8-byte chunked XOR loop used diff.trailing_zeros() / 8 unconditionally. That matches donor's ZSTD_NbCommonBytes only on little-endian (where the first byte of the input is in the low byte of the u64); on big-endian the first byte is in the high byte and the common-bytes count needs leading_zeros instead. Gate via cfg(target_endian) so BE targets get the correct match lengths. 3. hash_log validation (hash_table.rs:50, threads #3 + #4): FastHashTable::new previously only debug_asserted hash_log <= 32. In release, hash_log == 0 would make hash_ptr shift by the full word width (32 for mls=4, 64 for mls>=5) which is panic/UB; and 1usize << hash_log overflows on 32-bit targets for hash_log >= 32. Replace with real asserts: hash_log > 0 AND hash_log < usize::BITS. Donor's ZSTD_HASHLOG_MAX is 30 so sane callers stay well inside this band; the assertion catches outright misuse. 4. Test helper run_block (kernel.rs:339-405, thread #6): The previous helper accumulated tuples but then dropped them and returned an empty seqs Vec, so finds_long_repeat_in_simple_pattern only verified the accounting invariant — never that the kernel actually produced a match. Rewrote the helper to return the captured (tuples, FastBlockResult) tuple; the long-repeat test now asserts at least one Triple with match_len >= 4 and a non-zero offset. short_input_reports_tail_without_emission updated to reflect the new contract from item 1: the kernel emits zero sequences and returns the full input length as tail_literals_len.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 46-63: The constructor FastHashTable::new currently allows
hash_log values up to usize::BITS-1 which still permits invalid values for the
MLS=4 path; change the validation in FastHashTable::new to enforce the donor cap
by requiring hash_log in 1..=ZSTD_HASHLOG_MAX (e.g. <= 30) and keep the existing
check that mls is in 4..=8 so calls like hash_ptr::<4> cannot later overflow or
panic; update the panic message to mention ZSTD_HASHLOG_MAX for clarity.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80bd89b2-9cef-4361-b50d-45e98d8de065

📥 Commits

Reviewing files that changed from the base of the PR and between b69b48e and f841416.

📒 Files selected for processing (3)

zstd/src/encoding/simple/fast_kernel/count.rs
zstd/src/encoding/simple/fast_kernel/hash_table.rs
zstd/src/encoding/simple/fast_kernel/kernel.rs

Previous `hash_log < usize::BITS` check still admitted invalid bands: on 64-bit `hash_log \in 33..=63` passed the constructor but later caused `hash_ptr::<4>` to shift by `32 - hash_log` (panics for `hash_log >= 32`). Pin to donor's `ZSTD_HASHLOG_MAX = 30` so all four MLS instantiations are safe by construction; the per-level table-driven caller (donor `ZSTD_defaultCParameters`) never goes above 14 for Fast strategy so this assertion only catches outright misuse.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Four findings from review round 3 — all in fast_kernel, all about correctness contracts (no functional changes to the algorithm yet, just hardening + docs). 1. hash_ptr safety contract (hash_table.rs:113): The doc said "≥ mls readable bytes" but mls=5/6/7 paths perform an unaligned u64 load (8 bytes), shifting off the unused top bits. Following the doc literally for mls=5/6/7 with only 5/6/7 readable bytes would read past the caller's range — UB. Updated to require ≥4 bytes when MLS=4 and ≥8 bytes when MLS≥5, matching what the implementation actually does. The kernel's ilimit = iend - HASH_READ_SIZE (8) cap already satisfies this uniformly. 2. count_forward safety contract (count.rs:21): The previous doc said match_ptr needs "ip_len + 7 readable bytes" — overspecified. The implementation only reads from match_ptr in chunks for the same byte ranges it reads from ip, and bails on the first mismatching byte. The real requirement is that match_ptr has at least as many readable bytes as ip up to iend, naturally satisfied when match_ptr <= ip within the same buffer (a backward match into the encoder's history). Reworded. 3. Module-level lint allow scope (mod.rs:14): #![allow(dead_code, unused_imports)] disabled unused_imports crate-wide for fast_kernel, masking real unused-import regressions in sibling code. Narrowed to #![allow(dead_code)] only, since the kernel-internal items will be reached in one shot by the wiring commit. Removed the unused pub(crate) use re-exports (FastHashTable, FastBlockResult, compress_block_fast) — they'll be re-added by the wiring commit when there's an actual caller; adding them now would force the wider lint relaxation we're trying to avoid. 4. match_found bounds hardening (kernel.rs:63): The previous filter only rejected match_idx < prefix_start_index, which is necessary but not sufficient. Two additional bands of stale-entry exposure exist if FastHashTable is reused across calls with a shrunk data slice (the kernel itself doesn't do this today, but future cross-block / shared-table call shapes would): - (match_idx + 4) > data_len: 4-byte read at base + match_idx past the buffer end → OOB / UB. - match_idx >= ip_pos: stale forward-pointing entry would make offset = ip_pos - match_idx underflow and produce an invalid offset code downstream. Added explicit branches for both. Strongly biased "not taken" on the well-behaved single-block path (every entry was written by this same scan, indices monotonically below ip0), so branch predictor amortises them to zero cost. Tests: 16/16 pass; the only call site (the explicit-match probe in the kernel main loop) updated to pass the new ip_pos / data_len arguments.

…entries, boundaries Phase 1a follow-up addressing previously-undertested branches. Closes the gap surfaced by an internal audit ("happy path only, no edge cases, no regression test for round-3 hardening"). Adds 16 new unit tests (32 total across the module). Coverage now reaches every behavioural branch in the kernel that does not require wired-up integration: kernel.rs (+8 tests): - repcode_match_emits_with_rep_offset_one — exercises the rep_check branch on the very first loop iteration, asserts a Triple at offset=1 with substantial match_len on uniform data. - explicit_match_backward_extension_extends_by_marker_byte — constructs an "XAAAA … XAAAA" pattern so the explicit-match branch's backward-extension while-loop walks back across the 'X' marker; asserts match_len ≥ 5 and that the literal run does not end with 'X' (proof the backward extension absorbed it). - prefix_start_index_filter_rejects_below_window — pre-populates the hash slot with index 0, runs with prefix_start_index=5, and asserts the kernel never emits a Triple referencing the out-of-window position. - match_found_rejects_stale_forward_entry — REGRESSION TEST for round 3 finding #11. Engineered scenario: uniform data + stale hash entry pointing forward of ip0 (idx=150 when ip0 starts at 1). Without the `match_pos < ip_pos` filter the first iteration would compute offset = 1 - 150 → underflow → huge wrap → emit a bogus Triple. The test asserts every emitted offset stays inside the buffer; before the round 3 hardening this assertion would fail. (The companion data-len bounds check is not directly observable in stable Rust — without it the OOB read returns garbage that almost always fails the raw cmp; Miri / sanitizers would catch the underlying UB.) - block_exactly_hash_read_size_emits_no_sequences — boundary where ilimit = data.len() - HASH_READ_SIZE = 0, the main loop runs zero iterations, the kernel returns the whole input as tail. - block_just_below_hash_read_size_emits_no_sequences — the short-input early-return branch fires; verifies the round-1 uniform tail contract (kernel never emits the terminal Literals, caller emits from tail_literals_len). - rep_offset_save_restore_when_out_of_range — out-of-range rep values (huge=9999, secondary=7) on a 64-byte block must be stashed to offset_saved{1,2} on entry and restored verbatim into result.rep by the _cleanup path. Asserts both slots round-trip. hash_table.rs (+8 tests): - hash6/hash7/hash8 donor-formula parity tests, matching the existing hash4/hash5 pattern (each instantiates the relevant MLS, computes the donor multiply-shift by hand, asserts equality). - hash_log_minimum_one_constructs_two_entry_table — boundary at the lower end (1 is the smallest accepted hash_log; bit-width asserted via the produced hash being < 2). - hash_log_maximum_thirty_is_accepted_by_constructor — boundary at the upper end (ZSTD_HASHLOG_MAX). Allocates the full ~4 GiB table once; if the host can't allocate it the test OOM-skips rather than misreports. - Four `#[should_panic]` tests covering each branch of the round 2 constructor validation: hash_log=0, hash_log=31 (> cap), mls=3 (< 4), mls=9 (> 8). The panic messages are matched as substrings so the asserts catch a wrong-message refactor. All 32 fast_kernel tests pass. Full workspace nextest run remains green at 571/571. Clippy clean.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 321-329: The test
hash_log_maximum_thirty_is_accepted_by_constructor() currently triggers a 2^30
allocation via FastHashTable::new; instead add (or expose) a non-allocating
validation function (e.g., FastHashTable::validate_params(hash_log: usize,
entry_size: usize) -> Result<(), Error> or a private
FastHashTable::check_hash_log) that performs the same parameter assertions
without allocating, update the test to call that helper with 30 and 4 to verify
the accept path, and keep FastHashTable::new unchanged to perform real
allocation only in normal code paths.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7dcfbd0f-d1d8-4707-b28b-423b436ff3ef

📥 Commits

Reviewing files that changed from the base of the PR and between 509374f and 9813007.

📒 Files selected for processing (2)

zstd/src/encoding/simple/fast_kernel/hash_table.rs
zstd/src/encoding/simple/fast_kernel/kernel.rs

…condition rename Three review findings, one commit (all in the same module, all non-functional / docs / test refactor — no algorithm change): 1. hash_log_maximum_thirty_is_accepted_by_constructor (round 4 findings #12 and #15, hash_table.rs): The previous test instantiated FastHashTable::new(30, 4) just to prove the validation accepts the cap value. That allocates a 1<<30 entries × 4 bytes = ~4 GiB Vec<u32> on every test run — above per-test budgets on most CI runners. Extracted FastHashTable::new's two asserts into a private validate_params( hash_log, mls) helper; the test now exercises the same accept path against the helper directly without touching the allocator (validate_params(ZSTD_HASHLOG_MAX, 4) + ...= 8)). FastHashTable:: new delegates to validate_params so the actual constructor still panics identically on bad inputs. 2. count_forward chunk type (round 4 finding #13, count.rs): The previous loop unconditionally loaded u64 chunks. Donor's ZSTD_count uses MEM_readST which is sizeof(size_t) — so on 32-bit targets the donor uses u32 chunks and our u64 loads would degrade to a pair of 32-bit memory ops + a 64-bit XOR. Switched to usize-typed loads with CHUNK_SIZE = size_of::<usize>() (8 on 64-bit, 4 on 32-bit). On 32-bit the u32 tail block becomes dead code (the chunk loop already strides in 4-byte chunks), so cfg(target_pointer_width = "64") gates that block to mirror donor's MEM_64bits() guard. trailing_zeros() / leading_zeros() are already endian-cfg-gated from round 3 and work correctly on either chunk width. 32/32 fast_kernel tests pass; the functional behaviour (match length per input) is unchanged on 64-bit hosts. 3. compress_block_fast "# Safety contract" rename (round 4 finding #14, kernel.rs): compress_block_fast is a safe (non-unsafe) fn, but its rustdoc was titled "# Safety contract" with a MUST requirement. Safe APIs must not put memory-safety preconditions on callers (that's what unsafe is for). The contract is actually about algorithmic correctness: a too-short input is well-defined Rust (falls into the short-input early-return), it just won't produce match sequences. Renamed the section to "# Preconditions / algorithm invariants" and clarified the distinction in the prose: the kernel is safe for every input; the constraint is what the caller usually wants from a Fast strategy compressor.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

…nsion Adds the `(new_ip - 1 - rep_off) >= prefix_start_index` guard to the rep-path single-byte backward extension, matching the explicit-match path's symmetric bound. In the current single-block kernel the guard is provably redundant — the block-entry save/restore zeroes `rep_offset1` whenever `rep_offset1 > ip0 - prefix_start` at block start, and `anchor <= new_ip` keeps `new_ip - 1 >= block_start >= prefix_start`, so the byte the extension would dereference is always inside the window. Future call shapes (cross-block reuse, shared hash table across resets) could legitimately probe at the `ip0 == prefix_start` boundary where the single extension byte would otherwise reach below the window; the explicit check makes the kernel correct under those broader invariants without changing observable behaviour today. Costs a predicted-not-taken branch on the hot rep path; the inline code comment documents the implicit invariants and the symmetry rationale. No regression test lands with this commit: per the bug-reports-get-a- regression-test-first protocol, a test must FAIL on the unguarded code to prove the defect exists. I attempted such a test in a prior session (rep_backward_extension_respects_prefix_start_index) and confirmed by running it that it cannot fail with the current save/restore + anchor invariants — i.e., the unguarded code is already correct in practice. That makes the finding a defensive- hardening / code-symmetry issue, not a logic bug, so the test-first gate does not apply and the no-longer-relevant vacuous test was removed alongside this change.

…refix guard Replaces the redundant prefix_start guard added in the previous commit with a comprehensive doc comment explaining donor parity and the invariant chain that makes the guard unnecessary. Donor parity check: noDict rep path at zstd_fast.c:278 omits any prefix_start bound — just `ip0[-1] == match0[-1]`. Adding a guard here was unjustified divergence from donor. Invariant chain (now spelled out in the inline comment): 1. Block-entry save/restore zeroes rep_offset1 whenever the incoming value exceeds ip0 - prefix_start_index at block start, so any surviving non-zero rep satisfies ip0_start - rep_offset1 >= prefix_start. 2. The explicit-match path's backward extension uses a STRICT `match_pos > prefix_start_index` bound when it promotes a fresh offset into rep_offset1, so new_rep = ip0_promote - match_pos < ip0_promote - prefix_start (strict). At any later iteration ip0' > ip0_promote and rep_offset1 is unchanged, so ip0' - rep_offset1 >= prefix_start + 1, i.e. (ip0' - 1) - rep_offset1 >= prefix_start. Combined: every loop iteration that enters the rep branch already satisfies the bound the dropped guard would have checked. Adding it costs a predicted-not-taken branch on the hot rep path with zero behavioural benefit under the current call shape and donor-parity semantics. The inline comment also names the symmetric bound in the explicit-match path (`match_pos > prefix_start_index`) so a future maintainer changing either invariant has an explicit pointer to the structurally identical guard.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 261-290: Update the long comment explaining why the explicit
prefix-window guard is omitted to also account for the first-iteration "anchor"
gate: mention that offset_saved1 (rep_offset1) alone can only guarantee
ip0_start - rep_offset1 >= prefix_start_index, which may be equal, and therefore
the first loop iteration still relies on the block-entry anchor check (new_ip >
anchor) to avoid a below-window read; either fold this anchor condition into the
invariants or soften the universal claim that every branch entry satisfies
(new_ip - 1 - rep_off) >= prefix_start_index, and point to the related
explicit-match promotion guard (match_pos > prefix_start_index) and the
anchor/offset_saved1 interaction (ip0_start, rep_offset1, new_ip, anchor) as the
structural reasons this is safe today.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9bd5f84c-ffb8-4c5e-88d0-82e6aa67b5a8

📥 Commits

Reviewing files that changed from the base of the PR and between 7ff93ea and b9c5c4b.

📒 Files selected for processing (1)

zstd/src/encoding/simple/fast_kernel/kernel.rs

CodeRabbit (round 5 review) caught a hole in the previous comment: the save/restore at block entry guarantees only NON-STRICT `ip0_start - rep_offset1 >= prefix_start`. When that holds with equality, backward extension by 1 byte would dereference `prefix_start - 1`. The explicit-match path's strict bound (`match_pos > prefix_start_index`) covers iterations after a rep promotion, but does NOT cover the very first loop iteration before any promotion. What actually keeps the first iteration safe is the runtime `new_ip > anchor` gate: at block entry `ip0 == block_start == anchor`, so the gate fails and the backward extension is skipped entirely. From the second iteration onward (after the first emitted sequence), anchor has moved past block_start AND/OR the explicit-match promotion bound from (2) gives the strict version of the invariant. Rewrote the inline comment to make this three-piece argument explicit: (1) non-strict bound from save/restore, (2) strict bound from promotion (kicks in after first promote), (3) anchor gate (covers the boundary iteration). Added an explicit note that a future call shape weakening any of the three (e.g. custom anchor initialisation) requires the explicit guard to be re-added. Pure doc change — no behavioural diff. 32/32 kernel tests pass.

CodeRabbit (outside-diff finding): compress_block_fast is a safe pub entry point but only used debug_assert_eq! to check MLS matches the hash_table's mls. In release builds debug_assert! is stripped, so a mismatched call like: compress_block_fast::<5>(..., &mut FastHashTable::new(_, 4), ...) routes to the kernel's mls=5 monomorphisation (different hash formula, different prime constant, different shift count) but reads entries indexed by mls=4 — garbage match candidates, miscompression instead of a clean panic. The const-generic MLS prevents compile- time misuse for nonexistent values but cannot catch this runtime table/kernel mismatch. Replaced the debug_assert_eq! with two real assert!s: 1. Range check (4..=8).contains(&MLS) — logically redundant with the arm in FastHashTable::hash_ptr, but surfaces the contract at the entry point with a clearer message than the hash-table-internal one. 2. assert_eq!(MLS, hash_table.mls()) — fails immediately on table/kernel mismatch, with both values in the panic message for diagnostics. All 32 fast_kernel tests pass (they instantiate matching MLS/table pairs in run_block; no test exercised the mismatch path).

polaz · 2026-05-20T20:45:50Z

@coderabbitai Fixed — replaced debug_assert_eq!(MLS, hash_table.mls()) with two real assert!s in compress_block_fast: a range check (4..=8).contains(&MLS) and a strict assert_eq! matching MLS against the table's mls. Both run in every build now, so a release-mode compress_block_fast::<5>(..., FastHashTable::new(_, 4), ...) panics with a clear diagnostic instead of silently miscompressing.

coderabbitai · 2026-05-20T20:46:09Z

Tip

For best results, initiate chat on the files or code changes.

@polaz: That's exactly the right call. Promoting those to unconditional assert!s means the mismatched-MLS footgun is caught loudly in release builds too — the (4..=8).contains(&MLS) guard covers the const-generic validity, and the strict assert_eq! against hash_table.mls() catches the caller pairing the wrong instantiation with a mis-configured table. Clear, cheap, and impossible to miss. Thanks for the fix!

(ﾉ◕ヮ◕)ﾉ*:･ﾟ✧

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Round 11 Copilot findings — both doc-only: 1. kernel.rs:149 (#30): compress_block_fast docs claimed memory-safety for every input and didn't enumerate the entry-time assert!s introduced in round 7 + round 10. Added a # Panics section listing all four assertions (block_start out of range, data.len() > u32::MAX, MLS out of range, MLS != hash_table.mls()), each with a one-line explanation of what goes wrong without the check. Also reworded the preceding 'SAFE function' paragraph to qualify it with 'for every input that doesn't trigger one of the entry-time asserts'. 2. hash_table.rs:73 (#31): FastHashTable::new's # Panics section listed only the parameter-range failures (hash_log out of band, mls out of band). The round 8 alloc guard added checked_shl + checked_mul that can panic on 32-bit targets, and the underlying allocator can fail too — both undocumented. Restructured the section into two groups: parameter-range failures (deterministic, depend only on inputs) and target-size/allocation failures (depend on host architecture and runtime memory state). Three new bullets cover usize overflow on shift, usize overflow on multiply, and global allocator failure. Pure documentation — no behaviour change. 32/32 tests pass. clippy clean.

polaz · 2026-05-20T20:56:25Z

@coderabbitai Re: outside-diff MLS contract finding at kernel.rs:188-220 — this is STALE, already fixed in commit 183c36f (round 10). The kernel now does assert!((4..=8).contains(&MLS)) + assert_eq!(MLS, hash_table.mls()) instead of debug_assert_eq!. Both run in every build and panic with diagnostic messages on mismatch.

coderabbitai · 2026-05-20T20:57:44Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find and resolve the stale review comment about MLS contract at kernel.rs:188-220
gh pr view 215 --repo structured-world/structured-zstd --json reviews,comments

# List review comments made by coderabbitai on the PR
gh api repos/structured-world/structured-zstd/pulls/215/comments --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: .body[0:120], path: .path, line: .line}'

Length of output: 51202

@polaz: Confirmed — that outside-diff finding is stale. The round-10 review on commit 183c36f3 already came back clean with no actionable comments on kernel.rs, which means the two unconditional assert!s closed that gap. Since the original finding landed in the "outside diff range" section of the review body rather than as a standalone resolvable inline comment, there's no thread to dismiss on the GitHub side — but the substance is fully addressed.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 89-105: Update the comment to clarify that the earlier checked
multiplication guard (the `entries * size_of::<u32>()` `checked_mul`) already
rejects `hash_log = 30` on 32-bit, so the documented “borderline” / OOM path for
`hash_log = 30` no longer applies; explicitly mention `checked_mul` prevents
that overflow/OOM and therefore the first two failure modes are deterministic
and caught before the `vec![]` allocation while only the global allocator
failure during `Vec::with_capacity` (or `vec![]`) remains dependent on runtime
memory state.

In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 157-159: Update the comment in kernel.rs to replace the narrow
"block_start > data.len()" wraparound explanation with a description of the
broader invalid-range failure: when data.len() < block_start + HASH_READ_SIZE
the early-return branch runs and computing tail_literals_len = data.len() -
block_start underflows (causing panic); reference the same symbols used in this
file (block_start, data.len(), HASH_READ_SIZE, tail_literals_len) so readers
understand the real panic cause rather than only the wraparound corner case.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 027d11cf-e9ec-4bc0-bde2-f9566378cda3

📥 Commits

Reviewing files that changed from the base of the PR and between 183c36f and 0ebe7d9.

📒 Files selected for processing (2)

zstd/src/encoding/simple/fast_kernel/hash_table.rs
zstd/src/encoding/simple/fast_kernel/kernel.rs

…ock_start Round 12 CodeRabbit follow-ups on the round 11 Panics sections — both pure documentation fixes: 1. hash_table.rs #32: round 11 said hash_log=30 was 'borderline' and 'would OOM the allocator' on 32-bit. With the checked_mul guard added in round 8, hash_log=30 is now caught deterministically BEFORE vec! is called (1 << 30 entries * 4 bytes = 4 GiB overflows the usize multiply). Rewrote the bullet to spell out the deterministic-failure framing: checked_shl tripwire is unreachable today (validate_params pins hash_log <= 30, so 1<<30 fits usize), checked_mul is the actual 32-bit hash_log=30 rejection point, and only allocator failure remains runtime-state-dependent. 2. kernel.rs #33: the block_start>data.len() panic rationale only described the main-loop wrap case (block_start + HASH_READ_SIZE wrap → skipped short-input guard → OOB read). It missed the more common path: when data.len() < block_start + HASH_READ_SIZE the short-input branch IS taken and computes tail_literals_len = data.len() - block_start which underflows. Reworded to cover both paths explicitly. Pure documentation — no behaviour change. 32/32 tests pass. clippy clean.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 107-110: The comment wrongly states all three guards fire before
reaching vec![]; update the wording in the comment near the three-guard
explanation so it clarifies that only the first two checks are deterministic
pre-allocation guards, while the third bullet is the actual vec![] allocation
and therefore the only panic that depends on runtime memory/allocator state
(i.e., allocator failure occurs at the vec![] allocation). Keep mention of the
"three guards" and "vec![]" to make the distinction explicit.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8ae0e07f-4d89-4a55-aa51-4f0f73c965f9

📥 Commits

Reviewing files that changed from the base of the PR and between 0ebe7d9 and 12e6751.

📒 Files selected for processing (2)

zstd/src/encoding/simple/fast_kernel/hash_table.rs
zstd/src/encoding/simple/fast_kernel/kernel.rs

…MLS contract Two new Copilot findings from round 13, both legitimate: 1. count.rs:93 (#34): count_forward's two `offset_from(p_start) as usize` calls return `isize`, which is UB when the byte distance exceeds isize::MAX. On 32-bit hosts isize::MAX = 2 GiB - 1, but the kernel's u32-truncation guard allows data.len() up to u32::MAX (4 GiB). A long match spanning >2 GiB would hit the UB path. Switched both call sites to plain `as usize` pointer arithmetic (`(ip as usize) - (p_start as usize)`) — well-defined integer subtraction regardless of distance, no `isize` round trip. Comment explains the rationale at the in-loop site and refers back from the function-tail site. 2. hash_table.rs:167 (#35): hash_ptr's # Safety section only covered the readable-bytes promise on `ptr` — it didn't mention that the const-generic MLS must equal self.mls() and be in 4..=8. Today only debug_assert_eq! checks the equality, so a release-build mismatched call routes to the wrong hash formula (different multiply prime, different shift) and probes a table indexed by a different formula. Extended the # Safety section with the MLS contract, called out that compress_block_fast enforces both invariants with real assert!s before any hash_ptr call (so kernel-mediated callers are safe by construction), and noted that direct callers (tests, future helpers) are responsible for upholding the contract themselves. All 32 fast_kernel tests pass; clippy clean. count.rs change is behaviour-preserving (subtraction result identical to offset_from when distance ≤ isize::MAX, and now also defined for larger distances). hash_table.rs change is doc-only.

Round 13 CodeRabbit follow-up: my previous wording 'All three guards fire BEFORE control reaches vec![]' was self-contradictory because the third bullet IS the vec![] allocation itself. Reworded to distinguish the two pre-allocation deterministic guards from the in-allocation runtime-state-dependent one.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

…#198 phase 1b) Phase 1b of #198 — wires the donor-shape kernel modules from phase 1a (#215, merged in fc63464) into the production hot path. The legacy SuffixStore-based MatchGenerator in simple/mod.rs is fully removed; MatcherStorage::Simple now holds a FastKernelMatcher that drives compress_block_fast<MLS> once per block. Selected for every Fast-strategy level — CompressionLevel::Uncompressed, CompressionLevel::Fastest, CompressionLevel::Level(1), and the negative CompressionLevel::Level(-7..=-1) variants. All Fast levels currently resolve to the same matcher with donor level-1 hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found) land in phase 3. ## What changes - simple/mod.rs collapses from 496 → 14 lines (only module declarations + docstring remain). SuffixStore, WindowEntry, MatchGenerator, repcode_candidate, add_data, next_sequence, add_suffixes_till, insert_suffix_if_absent, add_suffixes_interleaved_fast, offset_match_len, reserve — all gone with the wiring. - simple/fast_matcher.rs is the new active matcher: full inherent surface (accept_data, start_matching, skip_matching_with_hint, trim_to_window, last_committed_space, reset, prime_offset_history, take_recycled_space). - MatchGeneratorDriver Simple-arm wiring: commit_space → m.accept_data(space) with eager pre-commit eviction; start_matching::<Fast> → m.start_matching(handler); skip_matching_with_hint → m.skip_matching_with_hint(hint); reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window(); prime_with_dictionary → m.prime_offset_history(offset_hist). - Per-block input Vec recycled via take_recycled_space() → vec_pool (zero zero-fill cost — buffer pushed with len=0, get_next_space resizes on pop). ## Invariants - prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The first byte of history is a reserved dummy (sentinel-0 guard); real input data starts at history[1]. Donor C zstd achieves the same effect via a virtual base pointer; the flat Vec<u8> model here pays one byte of memory overhead for the same correctness property (no missed matches at segment boundaries). - history.len() bounded by 2 × max_window_size post-append, even for oversize committed blocks (retain_real = cap.saturating_sub( space.len()).min(max_window_size)). - Eviction preserves the dummy AND rebases prefix_start_index back to RESERVED on every drain — cumulative growth would push the filter past every valid history index and reject all match candidates wholesale. - Hash table rehashed after drain so retained tail bytes stay matchable. Amortised O(1) per byte of input. - rep[0..2] ↔ offset_hist[0..2] in lockstep on the common (lit_len > 0) path. Known divergence on back-to-back repcode matches (lit_len == 0 emits): kernel's rep unchanged, wire encoder per RFC 8878 §3.1.2.5 remaps codes and rotates offset_hist — marginal compression hit, output still correct. Phase 3 collapses these at the kernel level. - prime_offset_history seeds BOTH rep[0..2] and offset_hist atomically from a dictionary load. ## Defensive validation - MatchGeneratorDriver::new asserts slice_size > 0, max_slices_in_window > 0, checked_mul for the product, and checked_next_power_of_two for window_log_init derivation — catches all four overflow / degenerate paths with a clear domain-specific panic instead of a deep matcher-internal failure. - FastHashTable construction-time mls / hash_log validation unchanged from phase 1a. ## Tests 573/573 pass on the full workspace nextest suite: - 21 unit tests on FastKernelMatcher (lifecycle, accept + start, skip flavors, dict-prime hash population, eviction, boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync, prefix-eviction during dict-priming, drain prefix_start_index runaway, trim_to_window/last_block_start drift, oversize-block eviction bound). - 32 unit tests on the underlying kernel (donor-formula parity, prefix-filter, repcode backward extension three-piece proof, short-input early-return uniformity). - All frame_compressor integration tests (raw-block detection, hinted source-size matrix, level-1 round-trips through both the in-tree decoder and FFI decode). - All cross_validation Rust-encoded → FFI-decoded round-trips (every level 1..=22, dict + no-dict, encoded by Rust then read by the C reference decoder verbatim). 18 legacy tests in match_generator.rs that exercised SuffixStore-specific behavior or required block.len() < HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory comments — their substance either has equivalent coverage in the new tests or relied on algorithm-specific quirks the donor-shape kernel doesn't reproduce by design. ## Benchmark (i9-9900K) cargo bench deltas vs main (fc63464) on compress/level_1_fast/*/matrix/pure_rust: | Scenario | Δ time | Throughput | Note | |----------|-------:|-----------:|------| | low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster | | decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster | | high-entropy-1m | -67.3% | 633 MiB/s | 3× faster | | small-10k-random | -36.0% | 768 MiB/s | 1.6× faster | | small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster | | small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead | | large-log-stream | +122.3% | 235 MiB/s | regression — see below | Large-log-stream regression is expected at this phase: legacy SuffixStore used effectively window_log-sized hash slots (512K for level-1 window_log=19), while phase 1b uses donor-parity hash_log=14 (16K slots). 25 MiB dense-compressible log content hits ~1500 collisions/slot. Donor C zstd shows the same trade-off (~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining + cmov) closes the gap per the documented roadmap (#198 items 2/3/5). ## What's NOT in this PR - Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead, cmov match-found variant, per-level mls dispatch, kSearchStrength acceleration gradient for negative Fast levels. - LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes donor level-1 defaults (14/7) today. - hash_fill_step stride for dict-priming — still hard-coded to 1 (LevelParams field is wired in but the Fast matcher always strides at 1). Closes #198 phase 1b. Related: #178 (umbrella regression issue), #215 (phase 1a — kernel modules, merged in fc63464).

…#198 phase 1b) (#217) * perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b) Phase 1b of #198 — wires the donor-shape kernel modules from phase 1a (#215, merged in fc63464) into the production hot path. The legacy SuffixStore-based MatchGenerator in simple/mod.rs is fully removed; MatcherStorage::Simple now holds a FastKernelMatcher that drives compress_block_fast<MLS> once per block. Selected for every Fast-strategy level — CompressionLevel::Uncompressed, CompressionLevel::Fastest, CompressionLevel::Level(1), and the negative CompressionLevel::Level(-7..=-1) variants. All Fast levels currently resolve to the same matcher with donor level-1 hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found) land in phase 3. ## What changes - simple/mod.rs collapses from 496 → 14 lines (only module declarations + docstring remain). SuffixStore, WindowEntry, MatchGenerator, repcode_candidate, add_data, next_sequence, add_suffixes_till, insert_suffix_if_absent, add_suffixes_interleaved_fast, offset_match_len, reserve — all gone with the wiring. - simple/fast_matcher.rs is the new active matcher: full inherent surface (accept_data, start_matching, skip_matching_with_hint, trim_to_window, last_committed_space, reset, prime_offset_history, take_recycled_space). - MatchGeneratorDriver Simple-arm wiring: commit_space → m.accept_data(space) with eager pre-commit eviction; start_matching::<Fast> → m.start_matching(handler); skip_matching_with_hint → m.skip_matching_with_hint(hint); reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window(); prime_with_dictionary → m.prime_offset_history(offset_hist). - Per-block input Vec recycled via take_recycled_space() → vec_pool (zero zero-fill cost — buffer pushed with len=0, get_next_space resizes on pop). ## Invariants - prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The first byte of history is a reserved dummy (sentinel-0 guard); real input data starts at history[1]. Donor C zstd achieves the same effect via a virtual base pointer; the flat Vec<u8> model here pays one byte of memory overhead for the same correctness property (no missed matches at segment boundaries). - history.len() bounded by 2 × max_window_size post-append, even for oversize committed blocks (retain_real = cap.saturating_sub( space.len()).min(max_window_size)). - Eviction preserves the dummy AND rebases prefix_start_index back to RESERVED on every drain — cumulative growth would push the filter past every valid history index and reject all match candidates wholesale. - Hash table rehashed after drain so retained tail bytes stay matchable. Amortised O(1) per byte of input. - rep[0..2] ↔ offset_hist[0..2] in lockstep on the common (lit_len > 0) path. Known divergence on back-to-back repcode matches (lit_len == 0 emits): kernel's rep unchanged, wire encoder per RFC 8878 §3.1.2.5 remaps codes and rotates offset_hist — marginal compression hit, output still correct. Phase 3 collapses these at the kernel level. - prime_offset_history seeds BOTH rep[0..2] and offset_hist atomically from a dictionary load. ## Defensive validation - MatchGeneratorDriver::new asserts slice_size > 0, max_slices_in_window > 0, checked_mul for the product, and checked_next_power_of_two for window_log_init derivation — catches all four overflow / degenerate paths with a clear domain-specific panic instead of a deep matcher-internal failure. - FastHashTable construction-time mls / hash_log validation unchanged from phase 1a. ## Tests 573/573 pass on the full workspace nextest suite: - 21 unit tests on FastKernelMatcher (lifecycle, accept + start, skip flavors, dict-prime hash population, eviction, boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync, prefix-eviction during dict-priming, drain prefix_start_index runaway, trim_to_window/last_block_start drift, oversize-block eviction bound). - 32 unit tests on the underlying kernel (donor-formula parity, prefix-filter, repcode backward extension three-piece proof, short-input early-return uniformity). - All frame_compressor integration tests (raw-block detection, hinted source-size matrix, level-1 round-trips through both the in-tree decoder and FFI decode). - All cross_validation Rust-encoded → FFI-decoded round-trips (every level 1..=22, dict + no-dict, encoded by Rust then read by the C reference decoder verbatim). 18 legacy tests in match_generator.rs that exercised SuffixStore-specific behavior or required block.len() < HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory comments — their substance either has equivalent coverage in the new tests or relied on algorithm-specific quirks the donor-shape kernel doesn't reproduce by design. ## Benchmark (i9-9900K) cargo bench deltas vs main (fc63464) on compress/level_1_fast/*/matrix/pure_rust: | Scenario | Δ time | Throughput | Note | |----------|-------:|-----------:|------| | low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster | | decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster | | high-entropy-1m | -67.3% | 633 MiB/s | 3× faster | | small-10k-random | -36.0% | 768 MiB/s | 1.6× faster | | small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster | | small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead | | large-log-stream | +122.3% | 235 MiB/s | regression — see below | Large-log-stream regression is expected at this phase: legacy SuffixStore used effectively window_log-sized hash slots (512K for level-1 window_log=19), while phase 1b uses donor-parity hash_log=14 (16K slots). 25 MiB dense-compressible log content hits ~1500 collisions/slot. Donor C zstd shows the same trade-off (~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining + cmov) closes the gap per the documented roadmap (#198 items 2/3/5). ## What's NOT in this PR - Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead, cmov match-found variant, per-level mls dispatch, kSearchStrength acceleration gradient for negative Fast levels. - LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes donor level-1 defaults (14/7) today. - hash_fill_step stride for dict-priming — still hard-coded to 1 (LevelParams field is wired in but the Fast matcher always strides at 1). Closes #198 phase 1b. Related: #178 (umbrella regression issue), #215 (phase 1a — kernel modules, merged in fc63464). * fix(match_generator,fast_matcher): correctness + doc accuracy fixes (#217 review round 1) Four findings from CodeRabbit + Copilot's first review pass on the squashed PR #217. All four touch comment / metadata accuracy or a single-line correctness issue; no behavioural change beyond the window_size sync. **CR outside-diff (match_generator.rs:598-613)** — correctness: `reported_window_size` was using the unrounded `max_window_size` while the matcher itself was constructed from `next_pow2` (rounded up). For non-power-of-two constructor inputs (e.g. `slice_size * max_slices_in_window = 100_000`), `window_size()` would report 65_536 (un-rounded floor) while the active backend actually carried 131_072 (rounded-up next_pow2). The drift held until the first `reset()` overwrote both sides from LevelParams. Fix: report `next_pow2` so the two stay in lockstep at construction time. **CR #1 (match_generator.rs:944)** — declined, deferred to phase 3: Fast levels (Uncompressed, Fastest, Level(-7..=1)) all hard-code donor level-1 cParams. The acceleration gradient between negative- level fast modes and Level(1) lands when phase 3 ports donor's 4-cursor lookahead + cmov match-found + per-level kSearchStrength dispatch (issue #198 items 2/3/5). Updated the inline code comment to scrub the closed-PR reference and frame the deferral against phase 3 directly. **Copilot #2 (match_generator.rs:572)** — doc: the validation-guard comment described `next_power_of_two` returning 0 on overflow, which was old-Rust behaviour. Modern Rust panics; we now use `checked_next_power_of_two` (commit landed in the squash). Rewrote the comment to enumerate the three actual failure modes (zero args, mul overflow, next-pow2 overflow) and the three guards that catch them. **Copilot #3 (fast_matcher.rs:303)** — doc: `last_committed_space`'s pre-`accept_data` state description claimed `last_block_start = 0 / history.len() = 0`, but post-RESERVED_PREFIX_BYTES-seed construction (#216 / phase 1b) leaves both at `RESERVED_PREFIX_BYTES`. The returned slice is still empty (the `history[last_block_start..]` range is empty), just for a different reason. Updated the doc to reflect the seeded-dummy invariant. 573/573 tests pass; clippy clean. * docs(fast_matcher): compact last_committed_space docstring (apply new compactness rule) Doc-only change: collapse the verbose multi-paragraph last_committed_space docstring (added during PR #216 review rounds) into a 6-line three-bullet form. Same semantic content, zero narrative. Apply the new docstring compactness rule (one-two phrases default, multi-paragraph only for non-obvious invariants). 573/573 tests pass; clippy clean. * docs(fast_matcher): align dict-prime boundary test comments with RESERVED_PREFIX_BYTES seed (#217 Copilot #4, #5) Doc-only update aligning the dict-prime boundary test comment + inline note with the RESERVED_PREFIX_BYTES seed (post-phase-1b range is [RESERVED..=RESERVED], not [0..=0]). Compactness rule applied — concise two-block form. 573/573 tests pass; clippy clean. * docs(fast_matcher): correct prefix_start_index wording — rebases, not bumps (#217 Copilot #6, #7) Two doc comments described prefix_start_index as 'bumped forward' or 'advances' as history is evicted, implying a monotonic absolute index. Actual code (drain_real_prefix) rebases it back to RESERVED_PREFIX_BYTES on every drain — the retained tail is re-indexed in the new coordinate space. Updated both sites (struct field doc + trim_to_window header) to match. Compactness rule applied — trim_to_window header collapsed from 10 lines to 5. 573/573 tests pass; clippy clean. * docs(fast_matcher): drop stale hash_log scaling claim + add lit_len=0 TODO marker (#217 Copilot #8, #9) #8 — header for FAST_LEVEL_1_HASH_LOG said 'reset path rebinds hash_log proportionally on source-size hint'. Untrue today: driver passes only window_log per-level, hash_log + mls hard-coded. Pin per-level scaling to phase 3. #9 — same lit_len=0 / back-to-back-rep1 concern as the prior PR's #21. Inline + module docs already explain, but verbose prose isn't anchored. Collapsed inline comment to a single '// TODO(#198 phase 3):' line so the deferral marker is unambiguous. 573/573 tests pass; clippy clean. * fix(fast_matcher): assert hard preconditions + skip lit_len=0 rotation + hoist mls dispatch (#217 round 5) Four findings from the latest CR + Copilot review pass. **CR outside-diff (327-331) — assert!, not debug_assert!:** the duplicate-pending guard in accept_data was debug_assert!, so in release builds a double-commit would silently overwrite pending and drop a committed block. Hard-fail instead. **CR outside-diff (358-363) — block_size <= 2 × max_window_size:** the eviction math computes retain_real via saturating_sub, which silently collapses to 0 if space.len() > cap. The full block is still appended afterwards, violating the documented 'history bounded by 2 × max_window_size' invariant. Added an assert! precondition so callers see a clear panic at the boundary instead of an invisible invariant break. **CR #10 — lit_len == 0 offset_hist rotation (third raise):** the prior round added a TODO marker explaining the divergence, but CR correctly pushed back that documentation isn't a fix. Skip encode_offset_with_history when literals.is_empty() so offset_hist stays in lockstep with the kernel's unchanged rep — no divergence on the back-to-back rep1 path. Wire encoder downstream still sees the Triple with raw offset; its own encoding stays correct (lit_len-0 absolute encoding per RFC 8878 §3.1.2.5). Module docstring 'Known divergence' section collapsed accordingly. **Copilot #11 — hoist mls dispatch outside prime_hash loop:** moved the per-MLS match arm OUTSIDE the per-position loop. New prime_hash_table_impl<const MLS: u32> is monomorphised per matcher instance; the hot path is branch-free on mls. 573/573 tests pass; clippy clean. * fix(fast_matcher): call encode_offset_with_history unconditionally (#217 Copilot #12, revert round 5 #10) Round 5 skipped encode_offset_with_history when literals.is_empty() to keep matcher.offset_hist 'in lockstep' with kernel rep. Copilot correctly pushed back: Dfast / Row / HashChain matchers all call it unconditionally (passing lit_len = 0 when applicable), and the 'lockstep' framing was wrong — matcher.offset_hist tracks the WIRE ENCODER's history while matcher.rep tracks the KERNEL's state. They're not supposed to be the same. Revert the skip. Module docstring rewritten to call out that the two fields reflect DIFFERENT state and may diverge on lit_len = 0 emits per RFC 8878 §3.1.2.5 — both halves stay self-consistent within their own domain. Round 5 #10 reply to CR was based on the wrong mental model; that's reflected in this fix by reverting to the donor / other-matchers behaviour. 573/573 tests pass; clippy clean.

Copilot AI review requested due to automatic review settings May 20, 2026 16:28

Copilot started reviewing on behalf of polaz May 20, 2026 16:29 View session

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs

Copilot AI reviewed May 20, 2026

View reviewed changes

polaz mentioned this pull request May 20, 2026

perf(fast): port donor ZSTD_compressBlock_fast — 22× regression on Fast strategy #198

Open

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated

Copilot AI review requested due to automatic review settings May 20, 2026 16:59

Copilot started reviewing on behalf of polaz May 20, 2026 16:59 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated

Comment thread zstd/src/encoding/simple/fast_kernel/count.rs Outdated

Comment thread zstd/src/encoding/simple/fast_kernel/mod.rs Outdated

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated

polaz added 2 commits May 20, 2026 20:25

Copilot AI review requested due to automatic review settings May 20, 2026 17:34

Copilot started reviewing on behalf of polaz May 20, 2026 17:34 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated

Comment thread zstd/src/encoding/simple/fast_kernel/count.rs Outdated

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs

polaz requested a review from Copilot May 20, 2026 17:51

Copilot started reviewing on behalf of polaz May 20, 2026 17:51 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs

polaz added 2 commits May 20, 2026 21:44

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated

Copilot AI review requested due to automatic review settings May 20, 2026 18:59

Copilot started reviewing on behalf of polaz May 20, 2026 18:59 View session

Copilot AI review requested due to automatic review settings May 20, 2026 20:43

Copilot started reviewing on behalf of polaz May 20, 2026 20:44 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated

polaz requested a review from Copilot May 20, 2026 21:01

Copilot started reviewing on behalf of polaz May 20, 2026 21:01 View session

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/count.rs

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated

polaz added 2 commits May 21, 2026 00:09

polaz requested a review from Copilot May 20, 2026 21:18

Copilot started reviewing on behalf of polaz May 20, 2026 21:19 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

polaz merged commit fc63464 into main May 20, 2026
23 checks passed

polaz deleted the perf/#198-fast-matcher-port-phase1 branch May 20, 2026 21:25

sw-release-bot Bot mentioned this pull request May 20, 2026

chore: release v0.0.23 #203

Open

polaz mentioned this pull request May 21, 2026

perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b) #216

Closed

polaz mentioned this pull request May 21, 2026

perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b) #217

Merged

Conversation

polaz commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this PR

What's NOT in this PR (lands in follow-up)

Tests

Module gate

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Poem

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

polaz commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

polaz commented May 20, 2026

polaz commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

codecov Bot commented May 20, 2026 •

edited

Loading