Skip to content

perf(encoder): donor-shape Fast kernel modules (#198 phase 1a)#215

Merged
polaz merged 17 commits into
mainfrom
perf/#198-fast-matcher-port-phase1
May 20, 2026
Merged

perf(encoder): donor-shape Fast kernel modules (#198 phase 1a)#215
polaz merged 17 commits into
mainfrom
perf/#198-fast-matcher-port-phase1

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 20, 2026

Summary

Phase 1a of the Fast strategy donor port (issue #198 — 22× regression
vs C zstd on Fast levels). This PR lands the donor-shape kernel
modules with full unit-test coverage; the wiring into
MatchGeneratorDriver / MatcherStorage and the i9-9900K bench land
in the follow-up PR on the next branch.

What's in this PR

Three new modules under zstd/src/encoding/simple/fast_kernel/:

  • hash_table.rsFastHashTable
    Flat Vec<u32> indexed by donor-parity ZSTD_hash{4,5,6,7,8}Ptr
    (multiply-shift on the first mls bytes of the suffix at ptr,
    output reduced to hash_log bits). Constants bit-identical to
    lib/compress/zstd_compress_internal.h. Monomorphised over MLS
    via const-generic so each mls instantiation compiles to a
    dedicated body with the prime / shift baked in.

  • count.rscount_forward
    Port of ZSTD_count from lib/compress/zstd_compress_internal.h.
    8-byte chunked XOR loop with trailing_zeros() / 8 to locate the
    diverging byte, then u32 / u16 / u8 tail fall-throughs
    matching donor's exact progression. Replaces the byte-by-byte
    common_prefix_len used by the current SuffixStore-based Fast
    matcher.

  • kernel.rscompress_block_fast<MLS>
    Port of ZSTD_compressBlock_fast_noDict_generic from
    lib/compress/zstd_fast.c. Phase 1 keeps the single-ip0 cursor
    with raw 4-byte match probe (MEM_read32(ip) == MEM_read32(base + matchIdx)), step-based skip acceleration tied to
    kSearchStrength = 6, repcode-at-ip check, backward extension
    bounded by anchor / prefix_start, and count_forward for
    forward extension. The 4-cursor pipelining (ip0 / ip1 / ip2 / ip3) and cmov match-found variant from donor's full generic
    body are deferred to phase 3 per the issue's staged plan — phase 1
    exists to validate the data structures and close the bulk of the
    22× gap before adding pipelining complexity.

The kernel's compress_block_fast is monomorphised over const MLS: u32 (4..=8), invokes handle_sequence per emitted sequence
(replacing donor's ZSTD_storeSeq), and returns
FastBlockResult { rep, tail_literals_len } so the caller can
thread repcode history across blocks and flush trailing literals.

What's NOT in this PR (lands in follow-up)

  • Wiring into MatchGeneratorDriver::compress_block::<Fast> — phase
    1b. Requires:
    • Adding FastHashTable field to MatchGenerator (alongside the
      existing SuffixStore during transition).
    • Flat-history-buffer materialisation (concat Vec<WindowEntry>
      into a single Vec<u8> per start_matching call, or replace
      the multi-window layout with a single ring buffer).
    • Hash-table alloc / clear in reset keyed off the resolved
      LevelParams (hash_log, mls) — donor uses hash_log = 14,
      mls = 7 for default level 1.
  • Encoder/decoder roundtrip validation (current cargo nextest run --workspace covers existing SuffixStore path; the kernel hasn't
    been wired yet so it can't break the e2e suite from this PR).
  • i9-9900K bench on compare_ffi level_1_fast — measures the actual
    perf delta vs main and against C donor. Phase 1b commit will land
    the numbers in its PR body.

Phase 3 (separate later branch per the issue):

  • 4-cursor ip0/ip1/ip2/ip3 pipelining with software-prefetch hash
    reads — gives the bulk of donor's noDict_generic throughput on
    modern superscalar cores.
  • cmov match-found variant gated on windowLog < 19.
  • Cross-block matches via prefix history (phase 1a's kernel is
    single-block; cross-block matching at the window boundary is part
    of phase 1b's flat-buffer work).

Tests

16 new unit tests covering:

  • hash_table::tests: donor formula parity on known inputs for both
    mls=4 and mls=5, get/put round-trip, clear() resets every
    entry to sentinel.
  • count::tests: empty / full / chunk-boundary / tail-fall-through
    divergence at every transition point (u32 → u16 → u8), thousand-
    byte long match.
  • kernel::tests: short-input → literals-only callback path, long
    repeat pattern → at least one Triple emitted with match_len ≥ MIN_MATCH (4), plus an accounting invariant
    (sum(literals + matches) + tail_literals_len == input.len())
    that catches off-by-ones in any future refactor.

All 16 pass; full workspace nextest run remains green at 555/555.

Module gate

fast_kernel/mod.rs opens with a narrowly-scoped
#![allow(dead_code)] (without unused_imports) for this commit
only — every item is consumed by the phase 1b wiring commit and
the allow is removed there. The unused_imports lint stays ACTIVE
across the module so any accidentally-unused use in sibling code
surfaces immediately. Re-exports of the kernel's public surface
(FastHashTable, FastBlockResult, compress_block_fast) are
deferred to the wiring commit since adding them now would force
the wider lint relaxation. Without the dead_code allow, baseline
cargo clippy -D warnings would flag the unused kernel-internal
items; splitting the kernel and the wiring into separate PRs is
the right tradeoff for review clarity, but the cost is the
temporary dead_code mute on this module.

Related: #198 (Fast strategy 22× regression — multi-phase port).

Summary by CodeRabbit

  • New Features

    • Added a donor-style "Fast" block compressor to the simple encoder for improved match finding, rep/offset handling, and faster compression.
    • Added an optimized forward match counter and a compact power-of-two hash table to improve match-extension accuracy and throughput.
  • Tests

    • Added unit tests covering matching behavior, hash-table operations, compressor outcomes, and edge cases (empty inputs, early mismatches, chunk/tail boundaries, repcode behavior, long full-match).

Review Change Stack

…ount)

First commit of the #198 Fast strategy port (22× regression vs C zstd
on Fast levels). The kernel itself is implemented and unit-tested in
isolation; wiring into MatchGeneratorDriver / MatcherStorage lands in
the next commit on this branch.

What's in this commit:

- zstd/src/encoding/simple/fast_kernel/hash_table.rs:
  FastHashTable — flat Vec<u32> indexed by donor-parity
  ZSTD_hash{4,5,6,7,8}Ptr (multiply-shift on the first `mls` bytes of
  the suffix at `ptr`, output reduced to `hash_log` bits). Constants
  bit-identical to lib/compress/zstd_compress_internal.h.

- zstd/src/encoding/simple/fast_kernel/count.rs:
  count_forward — port of ZSTD_count from
  lib/compress/zstd_compress_internal.h. 8-byte chunked XOR loop with
  trailing_zeros / 8 to locate the diverging byte, then u32 / u16 /
  u8 tail fall-throughs matching donor's exact progression.

- zstd/src/encoding/simple/fast_kernel/kernel.rs:
  compress_block_fast<MLS> — port of ZSTD_compressBlock_fast_noDict_generic
  from lib/compress/zstd_fast.c. Phase 1 keeps the single-ip0 cursor
  with raw 4-byte match probe, step-based skip acceleration, repcode-
  at-ip check, backward extension, and ZSTD_count for forward
  extension. The 4-cursor (ip0/ip1/ip2/ip3) pipelining and cmov
  match-found variant from donor's full generic body are deferred to
  phase 3; this commit exists to validate the data structures and
  close the bulk of the 22× gap before adding pipelining complexity.

Tests: 16 new unit tests cover the hash function donor parity, the
count_forward chunk/tail progression, and the kernel's accounting
invariant (sum(literals + matches) + tail_literals_len == input.len()).
All pass.

The module is gated with `#![allow(dead_code, unused_imports)]` for
this commit only — every item is used in the wiring commit. The
allow is removed there.

Related: #198 (Fast strategy 22× regression).
Copilot AI review requested due to automatic review settings May 20, 2026 16:28
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4692f8aa-a289-4b41-8468-d35f3fafc0b5

📥 Commits

Reviewing files that changed from the base of the PR and between 12e6751 and aa3faf5.

📒 Files selected for processing (2)
  • zstd/src/encoding/simple/fast_kernel/count.rs
  • zstd/src/encoding/simple/fast_kernel/hash_table.rs

📝 Walkthrough

Walkthrough

Adds a donor-compatible Fast compression kernel: count_forward for forward match-length counting, FastHashTable for multiply-shift hashing and position storage, and compress_block_fast which probes the table, extends matches, emits sequences, and reports trailing literals. Includes unit tests for each component.

Changes

Fast Compression Kernel Implementation

Layer / File(s) Summary
Module structure and integration
zstd/src/encoding/simple/mod.rs, zstd/src/encoding/simple/fast_kernel/mod.rs
Adds fast_kernel submodule, declares count, hash_table, and kernel submodules, and applies crate-local #![allow(dead_code)].
Match-length counting primitive
zstd/src/encoding/simple/fast_kernel/count.rs
count_forward scans forward using usize-sized XOR chunks and endian-specific zero-counting to locate first mismatch, then checks u32 (on 64-bit), u16, and u8 tails. Returns matched byte length and includes comprehensive unit tests.
Hash table structure and hashing
zstd/src/encoding/simple/fast_kernel/hash_table.rs
FastHashTable is a flat Vec<u32> sized by 1 << hash_log with 0 sentinel, validates parameters, provides clear, const-generic hash_ptr::<MLS> for MLS=4..=8 using donor primes and unaligned little-endian loads, and unsafe get/put. Tests verify hashing, get/put/clear, and constructor bounds.
Fast block compression kernel
zstd/src/encoding/simple/fast_kernel/kernel.rs
compress_block_fast implements donor-style fast block compression: 4-byte probing, match_found hardening, repcode-at-ip and explicit-match paths, backward/forward extension via count_forward, repcode save/restore, sequence emission via handler, and unit tests covering accounting and hardening.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant compress_block_fast
  participant FastHashTable
  participant count_forward
  participant SequenceHandler
  Caller->>compress_block_fast: run_block(data, block_start, prefix_start_index, hash_table, rep, handle_sequence)
  compress_block_fast->>FastHashTable: get(hash) / put(hash, pos)
  compress_block_fast->>count_forward: count_forward(ip, match_ptr, iend)
  compress_block_fast->>SequenceHandler: handle_sequence(Sequence::Triple / Sequence::Literals)
  compress_block_fast-->>Caller: FastBlockResult(rep, tail_literals_len)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Poem

A rabbit hops through hash tables bright, 🐇
Counting forward, byte by byte at night,
XOR and CTZ find where the bytes part,
Sequences stitched while repcodes keep heart,
Tail literals nap as the kernel takes flight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding donor-shaped Fast kernel modules to the encoder, with clear reference to the phase (1a) and ticket number (#198).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#198-fast-matcher-port-phase1

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 93.84359% with 37 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/encoding/simple/fast_kernel/kernel.rs 91.57% 30 Missing ⚠️
zstd/src/encoding/simple/fast_kernel/hash_table.rs 94.92% 7 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 122-133: The early-return branch in the fast kernel currently
calls handle_sequence(Sequence::Literals { literals: &data[block_start..] }) but
then returns FastBlockResult { rep, tail_literals_len: 0 }, which is
inconsistent with the normal path; change this branch to NOT call
handle_sequence and instead return FastBlockResult { rep, tail_literals_len:
data.len() - block_start } so the caller is responsible for emitting the tail
literals (preserve symbols: data, block_start, HASH_READ_SIZE, handle_sequence,
FastBlockResult, tail_literals_len).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a014b89a-2926-44d6-9245-df8effda3c58

📥 Commits

Reviewing files that changed from the base of the PR and between 969cf64 and b69b48e.

📒 Files selected for processing (5)
  • zstd/src/encoding/simple/fast_kernel/count.rs
  • zstd/src/encoding/simple/fast_kernel/hash_table.rs
  • zstd/src/encoding/simple/fast_kernel/kernel.rs
  • zstd/src/encoding/simple/fast_kernel/mod.rs
  • zstd/src/encoding/simple/mod.rs

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the “donor-shape” Fast strategy kernel building blocks (flat u32 hash table, ZSTD_count-style forward match counter, and a phase-1 single-cursor fast block compressor) under zstd/src/encoding/simple/fast_kernel/. This is a preparatory step toward wiring a faster Fast-strategy matcher into the encoder in a follow-up PR.

Changes:

  • Introduce encoding/simple/fast_kernel module and export the new kernel/hash-table APIs.
  • Implement FastHashTable with donor-parity multiply-shift hashing for MLS 4..=8.
  • Implement count_forward and a phase-1 compress_block_fast<MLS> plus unit tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
zstd/src/encoding/simple/mod.rs Exposes the new fast_kernel submodule under the Simple encoder.
zstd/src/encoding/simple/fast_kernel/mod.rs Adds the new fast-kernel module umbrella and re-exports key types/functions.
zstd/src/encoding/simple/fast_kernel/hash_table.rs Implements the flat Vec<u32> hash table and donor-parity hash functions.
zstd/src/encoding/simple/fast_kernel/count.rs Implements the donor-style forward byte-match counter (ZSTD_count port).
zstd/src/encoding/simple/fast_kernel/kernel.rs Implements the phase-1 fast block compression loop and unit tests.

Comment thread zstd/src/encoding/simple/fast_kernel/count.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated
…tion

Four reviewer findings collapsed into one commit because every fix
lives in the kernel modules already introduced by the previous baseline
commit, and they're not strictly independent (the short-input contract
update changes how the test helper observes accounting).

1. Short-input early-return path (kernel.rs:122-133, threads #1 + #5):
   Previously called handle_sequence(Sequence::Literals { .. }) and
   then returned tail_literals_len = 0, conflicting with the main-loop
   exit which never emits the terminal Literals and returns the tail
   count. Caller wrapping the kernel had no consistent rule for
   whether to emit the trailing Literals or not. Unify both branches:
   the kernel NEVER emits the terminal Literals; the caller emits it
   from tail_literals_len. Updated the function doc to spell out the
   sequence-emission contract.

2. count_forward endianness (count.rs:68, thread #2):
   The 8-byte chunked XOR loop used diff.trailing_zeros() / 8
   unconditionally. That matches donor's ZSTD_NbCommonBytes only on
   little-endian (where the first byte of the input is in the low
   byte of the u64); on big-endian the first byte is in the high
   byte and the common-bytes count needs leading_zeros instead.
   Gate via cfg(target_endian) so BE targets get the correct match
   lengths.

3. hash_log validation (hash_table.rs:50, threads #3 + #4):
   FastHashTable::new previously only debug_asserted hash_log <= 32.
   In release, hash_log == 0 would make hash_ptr shift by the full
   word width (32 for mls=4, 64 for mls>=5) which is panic/UB; and
   1usize << hash_log overflows on 32-bit targets for hash_log >= 32.
   Replace with real asserts: hash_log > 0 AND hash_log < usize::BITS.
   Donor's ZSTD_HASHLOG_MAX is 30 so sane callers stay well inside
   this band; the assertion catches outright misuse.

4. Test helper run_block (kernel.rs:339-405, thread #6):
   The previous helper accumulated tuples but then dropped them and
   returned an empty seqs Vec, so finds_long_repeat_in_simple_pattern
   only verified the accounting invariant — never that the kernel
   actually produced a match. Rewrote the helper to return the captured
   (tuples, FastBlockResult) tuple; the long-repeat test now asserts
   at least one Triple with match_len >= 4 and a non-zero offset.
   short_input_reports_tail_without_emission updated to reflect the
   new contract from item 1: the kernel emits zero sequences and
   returns the full input length as tail_literals_len.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 46-63: The constructor FastHashTable::new currently allows
hash_log values up to usize::BITS-1 which still permits invalid values for the
MLS=4 path; change the validation in FastHashTable::new to enforce the donor cap
by requiring hash_log in 1..=ZSTD_HASHLOG_MAX (e.g. <= 30) and keep the existing
check that mls is in 4..=8 so calls like hash_ptr::<4> cannot later overflow or
panic; update the panic message to mention ZSTD_HASHLOG_MAX for clarity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80bd89b2-9cef-4361-b50d-45e98d8de065

📥 Commits

Reviewing files that changed from the base of the PR and between b69b48e and f841416.

📒 Files selected for processing (3)
  • zstd/src/encoding/simple/fast_kernel/count.rs
  • zstd/src/encoding/simple/fast_kernel/hash_table.rs
  • zstd/src/encoding/simple/fast_kernel/kernel.rs

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated
Previous `hash_log < usize::BITS` check still admitted invalid bands:
on 64-bit `hash_log \in 33..=63` passed the constructor but later
caused `hash_ptr::<4>` to shift by `32 - hash_log` (panics for
`hash_log >= 32`). Pin to donor's `ZSTD_HASHLOG_MAX = 30` so all
four MLS instantiations are safe by construction; the per-level
table-driven caller (donor `ZSTD_defaultCParameters`) never goes
above 14 for Fast strategy so this assertion only catches outright
misuse.
Copilot AI review requested due to automatic review settings May 20, 2026 16:59
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/count.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/mod.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated
polaz added 2 commits May 20, 2026 20:25
Four findings from review round 3 — all in fast_kernel, all about
correctness contracts (no functional changes to the algorithm yet,
just hardening + docs).

1. hash_ptr safety contract (hash_table.rs:113):
   The doc said "≥ mls readable bytes" but mls=5/6/7 paths perform
   an unaligned u64 load (8 bytes), shifting off the unused top
   bits. Following the doc literally for mls=5/6/7 with only 5/6/7
   readable bytes would read past the caller's range — UB. Updated
   to require ≥4 bytes when MLS=4 and ≥8 bytes when MLS≥5, matching
   what the implementation actually does. The kernel's
   ilimit = iend - HASH_READ_SIZE (8) cap already satisfies this
   uniformly.

2. count_forward safety contract (count.rs:21):
   The previous doc said match_ptr needs "ip_len + 7 readable bytes"
   — overspecified. The implementation only reads from match_ptr in
   chunks for the same byte ranges it reads from ip, and bails on
   the first mismatching byte. The real requirement is that match_ptr
   has at least as many readable bytes as ip up to iend, naturally
   satisfied when match_ptr <= ip within the same buffer (a backward
   match into the encoder's history). Reworded.

3. Module-level lint allow scope (mod.rs:14):
   #![allow(dead_code, unused_imports)] disabled unused_imports
   crate-wide for fast_kernel, masking real unused-import
   regressions in sibling code. Narrowed to #![allow(dead_code)]
   only, since the kernel-internal items will be reached in one
   shot by the wiring commit. Removed the unused pub(crate) use
   re-exports (FastHashTable, FastBlockResult, compress_block_fast)
   — they'll be re-added by the wiring commit when there's an
   actual caller; adding them now would force the wider lint
   relaxation we're trying to avoid.

4. match_found bounds hardening (kernel.rs:63):
   The previous filter only rejected match_idx < prefix_start_index,
   which is necessary but not sufficient. Two additional bands of
   stale-entry exposure exist if FastHashTable is reused across
   calls with a shrunk data slice (the kernel itself doesn't do
   this today, but future cross-block / shared-table call shapes
   would):

   - (match_idx + 4) > data_len: 4-byte read at base + match_idx
     past the buffer end → OOB / UB.
   - match_idx >= ip_pos: stale forward-pointing entry would make
     offset = ip_pos - match_idx underflow and produce an invalid
     offset code downstream.

   Added explicit branches for both. Strongly biased "not taken" on
   the well-behaved single-block path (every entry was written by
   this same scan, indices monotonically below ip0), so branch
   predictor amortises them to zero cost. Tests: 16/16 pass; the
   only call site (the explicit-match probe in the kernel main
   loop) updated to pass the new ip_pos / data_len arguments.
…entries, boundaries

Phase 1a follow-up addressing previously-undertested branches. Closes
the gap surfaced by an internal audit ("happy path only, no edge
cases, no regression test for round-3 hardening").

Adds 16 new unit tests (32 total across the module). Coverage now
reaches every behavioural branch in the kernel that does not require
wired-up integration:

kernel.rs (+8 tests):

- repcode_match_emits_with_rep_offset_one — exercises the rep_check
  branch on the very first loop iteration, asserts a Triple at
  offset=1 with substantial match_len on uniform data.
- explicit_match_backward_extension_extends_by_marker_byte —
  constructs an "XAAAA … XAAAA" pattern so the explicit-match
  branch's backward-extension while-loop walks back across the 'X'
  marker; asserts match_len ≥ 5 and that the literal run does not
  end with 'X' (proof the backward extension absorbed it).
- prefix_start_index_filter_rejects_below_window — pre-populates
  the hash slot with index 0, runs with prefix_start_index=5, and
  asserts the kernel never emits a Triple referencing the
  out-of-window position.
- match_found_rejects_stale_forward_entry — REGRESSION TEST for
  round 3 finding #11. Engineered scenario: uniform data + stale
  hash entry pointing forward of ip0 (idx=150 when ip0 starts at 1).
  Without the `match_pos < ip_pos` filter the first iteration
  would compute offset = 1 - 150 → underflow → huge wrap → emit a
  bogus Triple. The test asserts every emitted offset stays inside
  the buffer; before the round 3 hardening this assertion would
  fail. (The companion data-len bounds check is not directly
  observable in stable Rust — without it the OOB read returns
  garbage that almost always fails the raw cmp; Miri / sanitizers
  would catch the underlying UB.)
- block_exactly_hash_read_size_emits_no_sequences — boundary where
  ilimit = data.len() - HASH_READ_SIZE = 0, the main loop runs zero
  iterations, the kernel returns the whole input as tail.
- block_just_below_hash_read_size_emits_no_sequences — the
  short-input early-return branch fires; verifies the round-1
  uniform tail contract (kernel never emits the terminal Literals,
  caller emits from tail_literals_len).
- rep_offset_save_restore_when_out_of_range — out-of-range rep
  values (huge=9999, secondary=7) on a 64-byte block must be
  stashed to offset_saved{1,2} on entry and restored verbatim into
  result.rep by the _cleanup path. Asserts both slots round-trip.

hash_table.rs (+8 tests):

- hash6/hash7/hash8 donor-formula parity tests, matching the
  existing hash4/hash5 pattern (each instantiates the relevant
  MLS, computes the donor multiply-shift by hand, asserts equality).
- hash_log_minimum_one_constructs_two_entry_table — boundary at
  the lower end (1 is the smallest accepted hash_log; bit-width
  asserted via the produced hash being < 2).
- hash_log_maximum_thirty_is_accepted_by_constructor — boundary at
  the upper end (ZSTD_HASHLOG_MAX). Allocates the full ~4 GiB
  table once; if the host can't allocate it the test OOM-skips
  rather than misreports.
- Four `#[should_panic]` tests covering each branch of the round 2
  constructor validation: hash_log=0, hash_log=31 (> cap),
  mls=3 (< 4), mls=9 (> 8). The panic messages are matched as
  substrings so the asserts catch a wrong-message refactor.

All 32 fast_kernel tests pass. Full workspace nextest run remains
green at 571/571. Clippy clean.
Copilot AI review requested due to automatic review settings May 20, 2026 17:34
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/count.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 321-329: The test
hash_log_maximum_thirty_is_accepted_by_constructor() currently triggers a 2^30
allocation via FastHashTable::new; instead add (or expose) a non-allocating
validation function (e.g., FastHashTable::validate_params(hash_log: usize,
entry_size: usize) -> Result<(), Error> or a private
FastHashTable::check_hash_log) that performs the same parameter assertions
without allocating, update the test to call that helper with 30 and 4 to verify
the accept path, and keep FastHashTable::new unchanged to perform real
allocation only in normal code paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7dcfbd0f-d1d8-4707-b28b-423b436ff3ef

📥 Commits

Reviewing files that changed from the base of the PR and between 509374f and 9813007.

📒 Files selected for processing (2)
  • zstd/src/encoding/simple/fast_kernel/hash_table.rs
  • zstd/src/encoding/simple/fast_kernel/kernel.rs

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs
…condition rename

Three review findings, one commit (all in the same module, all
non-functional / docs / test refactor — no algorithm change):

1. hash_log_maximum_thirty_is_accepted_by_constructor (round 4
   findings #12 and #15, hash_table.rs):
   The previous test instantiated FastHashTable::new(30, 4) just to
   prove the validation accepts the cap value. That allocates a
   1<<30 entries × 4 bytes = ~4 GiB Vec<u32> on every test run —
   above per-test budgets on most CI runners. Extracted
   FastHashTable::new's two asserts into a private validate_params(
   hash_log, mls) helper; the test now exercises the same accept
   path against the helper directly without touching the allocator
   (validate_params(ZSTD_HASHLOG_MAX, 4) + ...= 8)). FastHashTable::
   new delegates to validate_params so the actual constructor still
   panics identically on bad inputs.

2. count_forward chunk type (round 4 finding #13, count.rs):
   The previous loop unconditionally loaded u64 chunks. Donor's
   ZSTD_count uses MEM_readST which is sizeof(size_t) — so on 32-bit
   targets the donor uses u32 chunks and our u64 loads would
   degrade to a pair of 32-bit memory ops + a 64-bit XOR.
   Switched to usize-typed loads with CHUNK_SIZE = size_of::<usize>()
   (8 on 64-bit, 4 on 32-bit). On 32-bit the u32 tail block becomes
   dead code (the chunk loop already strides in 4-byte chunks),
   so cfg(target_pointer_width = "64") gates that block to mirror
   donor's MEM_64bits() guard. trailing_zeros() / leading_zeros()
   are already endian-cfg-gated from round 3 and work correctly on
   either chunk width. 32/32 fast_kernel tests pass; the
   functional behaviour (match length per input) is unchanged on
   64-bit hosts.

3. compress_block_fast "# Safety contract" rename (round 4
   finding #14, kernel.rs):
   compress_block_fast is a safe (non-unsafe) fn, but its rustdoc
   was titled "# Safety contract" with a MUST requirement. Safe
   APIs must not put memory-safety preconditions on callers (that's
   what unsafe is for). The contract is actually about algorithmic
   correctness: a too-short input is well-defined Rust (falls into
   the short-input early-return), it just won't produce match
   sequences. Renamed the section to "# Preconditions / algorithm
   invariants" and clarified the distinction in the prose: the
   kernel is safe for every input; the constraint is what the
   caller usually wants from a Fast strategy compressor.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs
polaz added 2 commits May 20, 2026 21:44
…nsion

Adds the `(new_ip - 1 - rep_off) >= prefix_start_index` guard to the
rep-path single-byte backward extension, matching the explicit-match
path's symmetric bound. In the current single-block kernel the guard
is provably redundant — the block-entry save/restore zeroes
`rep_offset1` whenever `rep_offset1 > ip0 - prefix_start` at block
start, and `anchor <= new_ip` keeps `new_ip - 1 >= block_start >=
prefix_start`, so the byte the extension would dereference is always
inside the window. Future call shapes (cross-block reuse, shared hash
table across resets) could legitimately probe at the
`ip0 == prefix_start` boundary where the single extension byte would
otherwise reach below the window; the explicit check makes the kernel
correct under those broader invariants without changing observable
behaviour today. Costs a predicted-not-taken branch on the hot rep
path; the inline code comment documents the implicit invariants and
the symmetry rationale.

No regression test lands with this commit: per the bug-reports-get-a-
regression-test-first protocol, a test must FAIL on the unguarded
code to prove the defect exists. I attempted such a test in a prior
session (rep_backward_extension_respects_prefix_start_index) and
confirmed by running it that it cannot fail with the current
save/restore + anchor invariants — i.e., the unguarded code is
already correct in practice. That makes the finding a defensive-
hardening / code-symmetry issue, not a logic bug, so the test-first
gate does not apply and the no-longer-relevant vacuous test was
removed alongside this change.
…refix guard

Replaces the redundant prefix_start guard added in the previous
commit with a comprehensive doc comment explaining donor parity and
the invariant chain that makes the guard unnecessary.

Donor parity check: noDict rep path at zstd_fast.c:278 omits any
prefix_start bound — just `ip0[-1] == match0[-1]`. Adding a guard
here was unjustified divergence from donor.

Invariant chain (now spelled out in the inline comment):

1. Block-entry save/restore zeroes rep_offset1 whenever the incoming
   value exceeds ip0 - prefix_start_index at block start, so any
   surviving non-zero rep satisfies ip0_start - rep_offset1 >=
   prefix_start.
2. The explicit-match path's backward extension uses a STRICT
   `match_pos > prefix_start_index` bound when it promotes a fresh
   offset into rep_offset1, so new_rep = ip0_promote - match_pos
   < ip0_promote - prefix_start (strict). At any later iteration
   ip0' > ip0_promote and rep_offset1 is unchanged, so
   ip0' - rep_offset1 >= prefix_start + 1, i.e.
   (ip0' - 1) - rep_offset1 >= prefix_start.

Combined: every loop iteration that enters the rep branch already
satisfies the bound the dropped guard would have checked. Adding it
costs a predicted-not-taken branch on the hot rep path with zero
behavioural benefit under the current call shape and donor-parity
semantics.

The inline comment also names the symmetric bound in the
explicit-match path (`match_pos > prefix_start_index`) so a future
maintainer changing either invariant has an explicit pointer to
the structurally identical guard.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 261-290: Update the long comment explaining why the explicit
prefix-window guard is omitted to also account for the first-iteration "anchor"
gate: mention that offset_saved1 (rep_offset1) alone can only guarantee
ip0_start - rep_offset1 >= prefix_start_index, which may be equal, and therefore
the first loop iteration still relies on the block-entry anchor check (new_ip >
anchor) to avoid a below-window read; either fold this anchor condition into the
invariants or soften the universal claim that every branch entry satisfies
(new_ip - 1 - rep_off) >= prefix_start_index, and point to the related
explicit-match promotion guard (match_pos > prefix_start_index) and the
anchor/offset_saved1 interaction (ip0_start, rep_offset1, new_ip, anchor) as the
structural reasons this is safe today.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9bd5f84c-ffb8-4c5e-88d0-82e6aa67b5a8

📥 Commits

Reviewing files that changed from the base of the PR and between 7ff93ea and b9c5c4b.

📒 Files selected for processing (1)
  • zstd/src/encoding/simple/fast_kernel/kernel.rs

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated
CodeRabbit (round 5 review) caught a hole in the previous comment:
the save/restore at block entry guarantees only NON-STRICT
`ip0_start - rep_offset1 >= prefix_start`. When that holds with
equality, backward extension by 1 byte would dereference
`prefix_start - 1`. The explicit-match path's strict bound
(`match_pos > prefix_start_index`) covers iterations after a rep
promotion, but does NOT cover the very first loop iteration before
any promotion.

What actually keeps the first iteration safe is the runtime
`new_ip > anchor` gate: at block entry `ip0 == block_start ==
anchor`, so the gate fails and the backward extension is skipped
entirely. From the second iteration onward (after the first emitted
sequence), anchor has moved past block_start AND/OR the
explicit-match promotion bound from (2) gives the strict version of
the invariant.

Rewrote the inline comment to make this three-piece argument
explicit: (1) non-strict bound from save/restore, (2) strict bound
from promotion (kicks in after first promote), (3) anchor gate
(covers the boundary iteration). Added an explicit note that a
future call shape weakening any of the three (e.g. custom anchor
initialisation) requires the explicit guard to be re-added.

Pure doc change — no behavioural diff. 32/32 kernel tests pass.
Copilot AI review requested due to automatic review settings May 20, 2026 18:59
Copilot AI review requested due to automatic review settings May 20, 2026 20:43
CodeRabbit (outside-diff finding): compress_block_fast is a safe pub
entry point but only used debug_assert_eq! to check MLS matches the
hash_table's mls. In release builds debug_assert! is stripped, so a
mismatched call like:

  compress_block_fast::<5>(..., &mut FastHashTable::new(_, 4), ...)

routes to the kernel's mls=5 monomorphisation (different hash
formula, different prime constant, different shift count) but reads
entries indexed by mls=4 — garbage match candidates, miscompression
instead of a clean panic. The const-generic MLS prevents compile-
time misuse for nonexistent values but cannot catch this runtime
table/kernel mismatch.

Replaced the debug_assert_eq! with two real assert!s:

1. Range check (4..=8).contains(&MLS) — logically redundant with the
    arm in FastHashTable::hash_ptr, but
   surfaces the contract at the entry point with a clearer message
   than the hash-table-internal one.
2. assert_eq!(MLS, hash_table.mls()) — fails immediately on
   table/kernel mismatch, with both values in the panic message for
   diagnostics.

All 32 fast_kernel tests pass (they instantiate matching MLS/table
pairs in run_block; no test exercised the mismatch path).
@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 20, 2026

@coderabbitai Fixed — replaced debug_assert_eq!(MLS, hash_table.mls()) with two real assert!s in compress_block_fast: a range check (4..=8).contains(&MLS) and a strict assert_eq! matching MLS against the table's mls. Both run in every build now, so a release-mode compress_block_fast::<5>(..., FastHashTable::new(_, 4), ...) panics with a clear diagnostic instead of silently miscompressing.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Tip

For best results, initiate chat on the files or code changes.

@polaz: That's exactly the right call. Promoting those to unconditional assert!s means the mismatched-MLS footgun is caught loudly in release builds too — the (4..=8).contains(&MLS) guard covers the const-generic validity, and the strict assert_eq! against hash_table.mls() catches the caller pairing the wrong instantiation with a mis-configured table. Clear, cheap, and impossible to miss. Thanks for the fix!

(ノ◕ヮ◕)ノ*:・゚✧

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated
Round 11 Copilot findings — both doc-only:

1. kernel.rs:149 (#30): compress_block_fast docs claimed
   memory-safety for every input and didn't enumerate the
   entry-time assert!s introduced in round 7 + round 10.
   Added a # Panics section listing all four assertions
   (block_start out of range, data.len() > u32::MAX, MLS out of
   range, MLS != hash_table.mls()), each with a one-line
   explanation of what goes wrong without the check. Also
   reworded the preceding 'SAFE function' paragraph to qualify
   it with 'for every input that doesn't trigger one of the
   entry-time asserts'.

2. hash_table.rs:73 (#31): FastHashTable::new's # Panics
   section listed only the parameter-range failures (hash_log
   out of band, mls out of band). The round 8 alloc guard
   added checked_shl + checked_mul that can panic on 32-bit
   targets, and the underlying allocator can fail too — both
   undocumented. Restructured the section into two groups:
   parameter-range failures (deterministic, depend only on
   inputs) and target-size/allocation failures (depend on host
   architecture and runtime memory state). Three new bullets
   cover usize overflow on shift, usize overflow on multiply,
   and global allocator failure.

Pure documentation — no behaviour change. 32/32 tests pass.
clippy clean.
@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 20, 2026

@coderabbitai Re: outside-diff MLS contract finding at kernel.rs:188-220 — this is STALE, already fixed in commit 183c36f (round 10). The kernel now does assert!((4..=8).contains(&MLS)) + assert_eq!(MLS, hash_table.mls()) instead of debug_assert_eq!. Both run in every build and panic with diagnostic messages on mismatch.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find and resolve the stale review comment about MLS contract at kernel.rs:188-220
gh pr view 215 --repo structured-world/structured-zstd --json reviews,comments

# List review comments made by coderabbitai on the PR
gh api repos/structured-world/structured-zstd/pulls/215/comments --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: .body[0:120], path: .path, line: .line}'

Length of output: 51202


@polaz: Confirmed — that outside-diff finding is stale. The round-10 review on commit 183c36f3 already came back clean with no actionable comments on kernel.rs, which means the two unconditional assert!s closed that gap. Since the original finding landed in the "outside diff range" section of the review body rather than as a standalone resolvable inline comment, there's no thread to dismiss on the GitHub side — but the substance is fully addressed.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 89-105: Update the comment to clarify that the earlier checked
multiplication guard (the `entries * size_of::<u32>()` `checked_mul`) already
rejects `hash_log = 30` on 32-bit, so the documented “borderline” / OOM path for
`hash_log = 30` no longer applies; explicitly mention `checked_mul` prevents
that overflow/OOM and therefore the first two failure modes are deterministic
and caught before the `vec![]` allocation while only the global allocator
failure during `Vec::with_capacity` (or `vec![]`) remains dependent on runtime
memory state.

In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 157-159: Update the comment in kernel.rs to replace the narrow
"block_start > data.len()" wraparound explanation with a description of the
broader invalid-range failure: when data.len() < block_start + HASH_READ_SIZE
the early-return branch runs and computing tail_literals_len = data.len() -
block_start underflows (causing panic); reference the same symbols used in this
file (block_start, data.len(), HASH_READ_SIZE, tail_literals_len) so readers
understand the real panic cause rather than only the wraparound corner case.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 027d11cf-e9ec-4bc0-bde2-f9566378cda3

📥 Commits

Reviewing files that changed from the base of the PR and between 183c36f and 0ebe7d9.

📒 Files selected for processing (2)
  • zstd/src/encoding/simple/fast_kernel/hash_table.rs
  • zstd/src/encoding/simple/fast_kernel/kernel.rs

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated
Comment thread zstd/src/encoding/simple/fast_kernel/kernel.rs Outdated
…ock_start

Round 12 CodeRabbit follow-ups on the round 11 Panics sections — both
pure documentation fixes:

1. hash_table.rs #32: round 11 said hash_log=30 was 'borderline' and
   'would OOM the allocator' on 32-bit. With the checked_mul guard
   added in round 8, hash_log=30 is now caught deterministically
   BEFORE vec! is called (1 << 30 entries * 4 bytes = 4 GiB
   overflows the usize multiply). Rewrote the bullet to spell out
   the deterministic-failure framing: checked_shl tripwire is
   unreachable today (validate_params pins hash_log <= 30, so
   1<<30 fits usize), checked_mul is the actual 32-bit
   hash_log=30 rejection point, and only allocator failure
   remains runtime-state-dependent.

2. kernel.rs #33: the block_start>data.len() panic rationale only
   described the main-loop wrap case (block_start +
   HASH_READ_SIZE wrap → skipped short-input guard → OOB read).
   It missed the more common path: when data.len() <
   block_start + HASH_READ_SIZE the short-input branch IS taken
   and computes tail_literals_len = data.len() - block_start
   which underflows. Reworded to cover both paths explicitly.

Pure documentation — no behaviour change. 32/32 tests pass.
clippy clean.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/simple/fast_kernel/count.rs
Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 107-110: The comment wrongly states all three guards fire before
reaching vec![]; update the wording in the comment near the three-guard
explanation so it clarifies that only the first two checks are deterministic
pre-allocation guards, while the third bullet is the actual vec![] allocation
and therefore the only panic that depends on runtime memory/allocator state
(i.e., allocator failure occurs at the vec![] allocation). Keep mention of the
"three guards" and "vec![]" to make the distinction explicit.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8ae0e07f-4d89-4a55-aa51-4f0f73c965f9

📥 Commits

Reviewing files that changed from the base of the PR and between 0ebe7d9 and 12e6751.

📒 Files selected for processing (2)
  • zstd/src/encoding/simple/fast_kernel/hash_table.rs
  • zstd/src/encoding/simple/fast_kernel/kernel.rs

Comment thread zstd/src/encoding/simple/fast_kernel/hash_table.rs Outdated
polaz added 2 commits May 21, 2026 00:09
…MLS contract

Two new Copilot findings from round 13, both legitimate:

1. count.rs:93 (#34): count_forward's two `offset_from(p_start) as
   usize` calls return `isize`, which is UB when the byte distance
   exceeds isize::MAX. On 32-bit hosts isize::MAX = 2 GiB - 1, but
   the kernel's u32-truncation guard allows data.len() up to
   u32::MAX (4 GiB). A long match spanning >2 GiB would hit the UB
   path. Switched both call sites to plain `as usize` pointer
   arithmetic (`(ip as usize) - (p_start as usize)`) — well-defined
   integer subtraction regardless of distance, no `isize` round
   trip. Comment explains the rationale at the in-loop site and
   refers back from the function-tail site.

2. hash_table.rs:167 (#35): hash_ptr's # Safety section only
   covered the readable-bytes promise on `ptr` — it didn't mention
   that the const-generic MLS must equal self.mls() and be in
   4..=8. Today only debug_assert_eq! checks the equality, so a
   release-build mismatched call routes to the wrong hash formula
   (different multiply prime, different shift) and probes a table
   indexed by a different formula. Extended the # Safety section
   with the MLS contract, called out that compress_block_fast
   enforces both invariants with real assert!s before any
   hash_ptr call (so kernel-mediated callers are safe by
   construction), and noted that direct callers (tests, future
   helpers) are responsible for upholding the contract themselves.

All 32 fast_kernel tests pass; clippy clean. count.rs change is
behaviour-preserving (subtraction result identical to offset_from
when distance ≤ isize::MAX, and now also defined for larger
distances). hash_table.rs change is doc-only.
Round 13 CodeRabbit follow-up: my previous wording 'All three guards
fire BEFORE control reaches vec![]' was self-contradictory because
the third bullet IS the vec![] allocation itself. Reworded to
distinguish the two pre-allocation deterministic guards from the
in-allocation runtime-state-dependent one.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@polaz polaz merged commit fc63464 into main May 20, 2026
23 checks passed
@polaz polaz deleted the perf/#198-fast-matcher-port-phase1 branch May 20, 2026 21:25
polaz added a commit that referenced this pull request May 21, 2026
…#198 phase 1b)

Phase 1b of #198 — wires the donor-shape kernel modules from
phase 1a (#215, merged in fc63464) into the production hot path.
The legacy SuffixStore-based MatchGenerator in simple/mod.rs is
fully removed; MatcherStorage::Simple now holds a
FastKernelMatcher that drives compress_block_fast<MLS> once per
block.

Selected for every Fast-strategy level — CompressionLevel::Uncompressed,
CompressionLevel::Fastest, CompressionLevel::Level(1), and the
negative CompressionLevel::Level(-7..=-1) variants. All Fast
levels currently resolve to the same matcher with donor level-1
hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength
dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found)
land in phase 3.

## What changes

- simple/mod.rs collapses from 496 → 14 lines (only module
  declarations + docstring remain). SuffixStore, WindowEntry,
  MatchGenerator, repcode_candidate, add_data, next_sequence,
  add_suffixes_till, insert_suffix_if_absent,
  add_suffixes_interleaved_fast, offset_match_len, reserve — all
  gone with the wiring.
- simple/fast_matcher.rs is the new active matcher: full inherent
  surface (accept_data, start_matching, skip_matching_with_hint,
  trim_to_window, last_committed_space, reset,
  prime_offset_history, take_recycled_space).
- MatchGeneratorDriver Simple-arm wiring: commit_space →
  m.accept_data(space) with eager pre-commit eviction;
  start_matching::<Fast> → m.start_matching(handler);
  skip_matching_with_hint → m.skip_matching_with_hint(hint);
  reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG,
  FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window();
  prime_with_dictionary → m.prime_offset_history(offset_hist).
- Per-block input Vec recycled via take_recycled_space() →
  vec_pool (zero zero-fill cost — buffer pushed with len=0,
  get_next_space resizes on pop).

## Invariants

- prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The
  first byte of history is a reserved dummy (sentinel-0 guard);
  real input data starts at history[1]. Donor C zstd achieves
  the same effect via a virtual base pointer; the flat Vec<u8>
  model here pays one byte of memory overhead for the same
  correctness property (no missed matches at segment boundaries).
- history.len() bounded by 2 × max_window_size post-append, even
  for oversize committed blocks (retain_real = cap.saturating_sub(
  space.len()).min(max_window_size)).
- Eviction preserves the dummy AND rebases prefix_start_index back
  to RESERVED on every drain — cumulative growth would push the
  filter past every valid history index and reject all match
  candidates wholesale.
- Hash table rehashed after drain so retained tail bytes stay
  matchable. Amortised O(1) per byte of input.
- rep[0..2] ↔ offset_hist[0..2] in lockstep on the common
  (lit_len > 0) path. Known divergence on back-to-back repcode
  matches (lit_len == 0 emits): kernel's rep unchanged, wire
  encoder per RFC 8878 §3.1.2.5 remaps codes and rotates
  offset_hist — marginal compression hit, output still correct.
  Phase 3 collapses these at the kernel level.
- prime_offset_history seeds BOTH rep[0..2] and offset_hist
  atomically from a dictionary load.

## Defensive validation

- MatchGeneratorDriver::new asserts slice_size > 0,
  max_slices_in_window > 0, checked_mul for the product, and
  checked_next_power_of_two for window_log_init derivation —
  catches all four overflow / degenerate paths with a clear
  domain-specific panic instead of a deep matcher-internal
  failure.
- FastHashTable construction-time mls / hash_log validation
  unchanged from phase 1a.

## Tests

573/573 pass on the full workspace nextest suite:

- 21 unit tests on FastKernelMatcher (lifecycle, accept + start,
  skip flavors, dict-prime hash population, eviction, boundary
  cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync,
  prefix-eviction during dict-priming, drain prefix_start_index
  runaway, trim_to_window/last_block_start drift, oversize-block
  eviction bound).
- 32 unit tests on the underlying kernel (donor-formula parity,
  prefix-filter, repcode backward extension three-piece proof,
  short-input early-return uniformity).
- All frame_compressor integration tests (raw-block detection,
  hinted source-size matrix, level-1 round-trips through both
  the in-tree decoder and FFI decode).
- All cross_validation Rust-encoded → FFI-decoded round-trips
  (every level 1..=22, dict + no-dict, encoded by Rust then
  read by the C reference decoder verbatim).

18 legacy tests in match_generator.rs that exercised
SuffixStore-specific behavior or required block.len() <
HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory
comments — their substance either has equivalent coverage in the
new tests or relied on algorithm-specific quirks the donor-shape
kernel doesn't reproduce by design.

## Benchmark (i9-9900K)

cargo bench deltas vs main (fc63464) on
compress/level_1_fast/*/matrix/pure_rust:

| Scenario | Δ time | Throughput | Note |
|----------|-------:|-----------:|------|
| low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster |
| decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster |
| high-entropy-1m | -67.3% | 633 MiB/s | 3× faster |
| small-10k-random | -36.0% | 768 MiB/s | 1.6× faster |
| small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster |
| small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead |
| large-log-stream | +122.3% | 235 MiB/s | regression — see below |

Large-log-stream regression is expected at this phase: legacy
SuffixStore used effectively window_log-sized hash slots (512K
for level-1 window_log=19), while phase 1b uses donor-parity
hash_log=14 (16K slots). 25 MiB dense-compressible log content
hits ~1500 collisions/slot. Donor C zstd shows the same trade-off
(~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining +
cmov) closes the gap per the documented roadmap (#198 items 2/3/5).

## What's NOT in this PR

- Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead,
  cmov match-found variant, per-level mls dispatch,
  kSearchStrength acceleration gradient for negative Fast levels.
- LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes
  donor level-1 defaults (14/7) today.
- hash_fill_step stride for dict-priming — still hard-coded to 1
  (LevelParams field is wired in but the Fast matcher always
  strides at 1).

Closes #198 phase 1b.

Related: #178 (umbrella regression issue), #215 (phase 1a — kernel
modules, merged in fc63464).
polaz added a commit that referenced this pull request May 21, 2026
…#198 phase 1b) (#217)

* perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b)

Phase 1b of #198 — wires the donor-shape kernel modules from
phase 1a (#215, merged in fc63464) into the production hot path.
The legacy SuffixStore-based MatchGenerator in simple/mod.rs is
fully removed; MatcherStorage::Simple now holds a
FastKernelMatcher that drives compress_block_fast<MLS> once per
block.

Selected for every Fast-strategy level — CompressionLevel::Uncompressed,
CompressionLevel::Fastest, CompressionLevel::Level(1), and the
negative CompressionLevel::Level(-7..=-1) variants. All Fast
levels currently resolve to the same matcher with donor level-1
hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength
dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found)
land in phase 3.

## What changes

- simple/mod.rs collapses from 496 → 14 lines (only module
  declarations + docstring remain). SuffixStore, WindowEntry,
  MatchGenerator, repcode_candidate, add_data, next_sequence,
  add_suffixes_till, insert_suffix_if_absent,
  add_suffixes_interleaved_fast, offset_match_len, reserve — all
  gone with the wiring.
- simple/fast_matcher.rs is the new active matcher: full inherent
  surface (accept_data, start_matching, skip_matching_with_hint,
  trim_to_window, last_committed_space, reset,
  prime_offset_history, take_recycled_space).
- MatchGeneratorDriver Simple-arm wiring: commit_space →
  m.accept_data(space) with eager pre-commit eviction;
  start_matching::<Fast> → m.start_matching(handler);
  skip_matching_with_hint → m.skip_matching_with_hint(hint);
  reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG,
  FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window();
  prime_with_dictionary → m.prime_offset_history(offset_hist).
- Per-block input Vec recycled via take_recycled_space() →
  vec_pool (zero zero-fill cost — buffer pushed with len=0,
  get_next_space resizes on pop).

## Invariants

- prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The
  first byte of history is a reserved dummy (sentinel-0 guard);
  real input data starts at history[1]. Donor C zstd achieves
  the same effect via a virtual base pointer; the flat Vec<u8>
  model here pays one byte of memory overhead for the same
  correctness property (no missed matches at segment boundaries).
- history.len() bounded by 2 × max_window_size post-append, even
  for oversize committed blocks (retain_real = cap.saturating_sub(
  space.len()).min(max_window_size)).
- Eviction preserves the dummy AND rebases prefix_start_index back
  to RESERVED on every drain — cumulative growth would push the
  filter past every valid history index and reject all match
  candidates wholesale.
- Hash table rehashed after drain so retained tail bytes stay
  matchable. Amortised O(1) per byte of input.
- rep[0..2] ↔ offset_hist[0..2] in lockstep on the common
  (lit_len > 0) path. Known divergence on back-to-back repcode
  matches (lit_len == 0 emits): kernel's rep unchanged, wire
  encoder per RFC 8878 §3.1.2.5 remaps codes and rotates
  offset_hist — marginal compression hit, output still correct.
  Phase 3 collapses these at the kernel level.
- prime_offset_history seeds BOTH rep[0..2] and offset_hist
  atomically from a dictionary load.

## Defensive validation

- MatchGeneratorDriver::new asserts slice_size > 0,
  max_slices_in_window > 0, checked_mul for the product, and
  checked_next_power_of_two for window_log_init derivation —
  catches all four overflow / degenerate paths with a clear
  domain-specific panic instead of a deep matcher-internal
  failure.
- FastHashTable construction-time mls / hash_log validation
  unchanged from phase 1a.

## Tests

573/573 pass on the full workspace nextest suite:

- 21 unit tests on FastKernelMatcher (lifecycle, accept + start,
  skip flavors, dict-prime hash population, eviction, boundary
  cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync,
  prefix-eviction during dict-priming, drain prefix_start_index
  runaway, trim_to_window/last_block_start drift, oversize-block
  eviction bound).
- 32 unit tests on the underlying kernel (donor-formula parity,
  prefix-filter, repcode backward extension three-piece proof,
  short-input early-return uniformity).
- All frame_compressor integration tests (raw-block detection,
  hinted source-size matrix, level-1 round-trips through both
  the in-tree decoder and FFI decode).
- All cross_validation Rust-encoded → FFI-decoded round-trips
  (every level 1..=22, dict + no-dict, encoded by Rust then
  read by the C reference decoder verbatim).

18 legacy tests in match_generator.rs that exercised
SuffixStore-specific behavior or required block.len() <
HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory
comments — their substance either has equivalent coverage in the
new tests or relied on algorithm-specific quirks the donor-shape
kernel doesn't reproduce by design.

## Benchmark (i9-9900K)

cargo bench deltas vs main (fc63464) on
compress/level_1_fast/*/matrix/pure_rust:

| Scenario | Δ time | Throughput | Note |
|----------|-------:|-----------:|------|
| low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster |
| decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster |
| high-entropy-1m | -67.3% | 633 MiB/s | 3× faster |
| small-10k-random | -36.0% | 768 MiB/s | 1.6× faster |
| small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster |
| small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead |
| large-log-stream | +122.3% | 235 MiB/s | regression — see below |

Large-log-stream regression is expected at this phase: legacy
SuffixStore used effectively window_log-sized hash slots (512K
for level-1 window_log=19), while phase 1b uses donor-parity
hash_log=14 (16K slots). 25 MiB dense-compressible log content
hits ~1500 collisions/slot. Donor C zstd shows the same trade-off
(~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining +
cmov) closes the gap per the documented roadmap (#198 items 2/3/5).

## What's NOT in this PR

- Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead,
  cmov match-found variant, per-level mls dispatch,
  kSearchStrength acceleration gradient for negative Fast levels.
- LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes
  donor level-1 defaults (14/7) today.
- hash_fill_step stride for dict-priming — still hard-coded to 1
  (LevelParams field is wired in but the Fast matcher always
  strides at 1).

Closes #198 phase 1b.

Related: #178 (umbrella regression issue), #215 (phase 1a — kernel
modules, merged in fc63464).

* fix(match_generator,fast_matcher): correctness + doc accuracy fixes (#217 review round 1)

Four findings from CodeRabbit + Copilot's first review pass on the
squashed PR #217. All four touch comment / metadata accuracy or
a single-line correctness issue; no behavioural change beyond the
window_size sync.

**CR outside-diff (match_generator.rs:598-613)** — correctness:
`reported_window_size` was using the unrounded `max_window_size`
while the matcher itself was constructed from `next_pow2` (rounded
up). For non-power-of-two constructor inputs (e.g. `slice_size *
max_slices_in_window = 100_000`), `window_size()` would report
65_536 (un-rounded floor) while the active backend actually carried
131_072 (rounded-up next_pow2). The drift held until the first
`reset()` overwrote both sides from LevelParams. Fix: report
`next_pow2` so the two stay in lockstep at construction time.

**CR #1 (match_generator.rs:944)** — declined, deferred to phase 3:
Fast levels (Uncompressed, Fastest, Level(-7..=1)) all hard-code
donor level-1 cParams. The acceleration gradient between negative-
level fast modes and Level(1) lands when phase 3 ports donor's
4-cursor lookahead + cmov match-found + per-level kSearchStrength
dispatch (issue #198 items 2/3/5). Updated the inline code
comment to scrub the closed-PR reference and frame the deferral
against phase 3 directly.

**Copilot #2 (match_generator.rs:572)** — doc: the validation-guard
comment described `next_power_of_two` returning 0 on overflow,
which was old-Rust behaviour. Modern Rust panics; we now use
`checked_next_power_of_two` (commit landed in the squash). Rewrote
the comment to enumerate the three actual failure modes (zero
args, mul overflow, next-pow2 overflow) and the three guards that
catch them.

**Copilot #3 (fast_matcher.rs:303)** — doc: `last_committed_space`'s
pre-`accept_data` state description claimed `last_block_start = 0
/ history.len() = 0`, but post-RESERVED_PREFIX_BYTES-seed
construction (#216 / phase 1b) leaves both at
`RESERVED_PREFIX_BYTES`. The returned slice is still empty (the
`history[last_block_start..]` range is empty), just for a
different reason. Updated the doc to reflect the seeded-dummy
invariant.

573/573 tests pass; clippy clean.

* docs(fast_matcher): compact last_committed_space docstring (apply new compactness rule)

Doc-only change: collapse the verbose multi-paragraph
last_committed_space docstring (added during PR #216 review
rounds) into a 6-line three-bullet form. Same semantic content,
zero narrative.

Apply the new docstring compactness rule (one-two phrases default,
multi-paragraph only for non-obvious invariants).

573/573 tests pass; clippy clean.

* docs(fast_matcher): align dict-prime boundary test comments with RESERVED_PREFIX_BYTES seed (#217 Copilot #4, #5)

Doc-only update aligning the dict-prime boundary test
comment + inline note with the RESERVED_PREFIX_BYTES seed
(post-phase-1b range is [RESERVED..=RESERVED], not [0..=0]).
Compactness rule applied — concise two-block form.

573/573 tests pass; clippy clean.

* docs(fast_matcher): correct prefix_start_index wording — rebases, not bumps (#217 Copilot #6, #7)

Two doc comments described prefix_start_index as 'bumped forward'
or 'advances' as history is evicted, implying a monotonic absolute
index. Actual code (drain_real_prefix) rebases it back to
RESERVED_PREFIX_BYTES on every drain — the retained tail is
re-indexed in the new coordinate space.

Updated both sites (struct field doc + trim_to_window header) to
match. Compactness rule applied — trim_to_window header collapsed
from 10 lines to 5.

573/573 tests pass; clippy clean.

* docs(fast_matcher): drop stale hash_log scaling claim + add lit_len=0 TODO marker (#217 Copilot #8, #9)

#8 — header for FAST_LEVEL_1_HASH_LOG said 'reset path rebinds
hash_log proportionally on source-size hint'. Untrue today: driver
passes only window_log per-level, hash_log + mls hard-coded. Pin
per-level scaling to phase 3.

#9 — same lit_len=0 / back-to-back-rep1 concern as the prior PR's
#21. Inline + module docs already explain, but verbose prose isn't
anchored. Collapsed inline comment to a single
'// TODO(#198 phase 3):' line so the deferral marker is
unambiguous.

573/573 tests pass; clippy clean.

* fix(fast_matcher): assert hard preconditions + skip lit_len=0 rotation + hoist mls dispatch (#217 round 5)

Four findings from the latest CR + Copilot review pass.

**CR outside-diff (327-331) — assert!, not debug_assert!:** the
duplicate-pending guard in accept_data was debug_assert!, so in
release builds a double-commit would silently overwrite pending
and drop a committed block. Hard-fail instead.

**CR outside-diff (358-363) — block_size <= 2 × max_window_size:**
the eviction math computes retain_real via saturating_sub, which
silently collapses to 0 if space.len() > cap. The full block is
still appended afterwards, violating the documented
'history bounded by 2 × max_window_size' invariant. Added an
assert! precondition so callers see a clear panic at the boundary
instead of an invisible invariant break.

**CR #10 — lit_len == 0 offset_hist rotation (third raise):** the
prior round added a TODO marker explaining the divergence, but CR
correctly pushed back that documentation isn't a fix. Skip
encode_offset_with_history when literals.is_empty() so
offset_hist stays in lockstep with the kernel's unchanged rep —
no divergence on the back-to-back rep1 path. Wire encoder
downstream still sees the Triple with raw offset; its own
encoding stays correct (lit_len-0 absolute encoding per RFC
8878 §3.1.2.5). Module docstring 'Known divergence' section
collapsed accordingly.

**Copilot #11 — hoist mls dispatch outside prime_hash loop:**
moved the per-MLS match arm OUTSIDE the per-position loop. New
prime_hash_table_impl<const MLS: u32> is monomorphised per
matcher instance; the hot path is branch-free on mls.

573/573 tests pass; clippy clean.

* fix(fast_matcher): call encode_offset_with_history unconditionally (#217 Copilot #12, revert round 5 #10)

Round 5 skipped encode_offset_with_history when literals.is_empty()
to keep matcher.offset_hist 'in lockstep' with kernel rep. Copilot
correctly pushed back: Dfast / Row / HashChain matchers all call
it unconditionally (passing lit_len = 0 when applicable), and the
'lockstep' framing was wrong — matcher.offset_hist tracks the
WIRE ENCODER's history while matcher.rep tracks the KERNEL's
state. They're not supposed to be the same.

Revert the skip. Module docstring rewritten to call out that the
two fields reflect DIFFERENT state and may diverge on lit_len = 0
emits per RFC 8878 §3.1.2.5 — both halves stay self-consistent
within their own domain.

Round 5 #10 reply to CR was based on the wrong mental model;
that's reflected in this fix by reverting to the donor /
other-matchers behaviour.

573/573 tests pass; clippy clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants