perf(encoder): donor-shape Fast kernel modules (#198 phase 1a)#215
Conversation
…ount) First commit of the #198 Fast strategy port (22× regression vs C zstd on Fast levels). The kernel itself is implemented and unit-tested in isolation; wiring into MatchGeneratorDriver / MatcherStorage lands in the next commit on this branch. What's in this commit: - zstd/src/encoding/simple/fast_kernel/hash_table.rs: FastHashTable — flat Vec<u32> indexed by donor-parity ZSTD_hash{4,5,6,7,8}Ptr (multiply-shift on the first `mls` bytes of the suffix at `ptr`, output reduced to `hash_log` bits). Constants bit-identical to lib/compress/zstd_compress_internal.h. - zstd/src/encoding/simple/fast_kernel/count.rs: count_forward — port of ZSTD_count from lib/compress/zstd_compress_internal.h. 8-byte chunked XOR loop with trailing_zeros / 8 to locate the diverging byte, then u32 / u16 / u8 tail fall-throughs matching donor's exact progression. - zstd/src/encoding/simple/fast_kernel/kernel.rs: compress_block_fast<MLS> — port of ZSTD_compressBlock_fast_noDict_generic from lib/compress/zstd_fast.c. Phase 1 keeps the single-ip0 cursor with raw 4-byte match probe, step-based skip acceleration, repcode- at-ip check, backward extension, and ZSTD_count for forward extension. The 4-cursor (ip0/ip1/ip2/ip3) pipelining and cmov match-found variant from donor's full generic body are deferred to phase 3; this commit exists to validate the data structures and close the bulk of the 22× gap before adding pipelining complexity. Tests: 16 new unit tests cover the hash function donor parity, the count_forward chunk/tail progression, and the kernel's accounting invariant (sum(literals + matches) + tail_literals_len == input.len()). All pass. The module is gated with `#![allow(dead_code, unused_imports)]` for this commit only — every item is used in the wiring commit. The allow is removed there. Related: #198 (Fast strategy 22× regression).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds a donor-compatible Fast compression kernel: ChangesFast Compression Kernel Implementation
Sequence Diagram(s)sequenceDiagram
participant Caller
participant compress_block_fast
participant FastHashTable
participant count_forward
participant SequenceHandler
Caller->>compress_block_fast: run_block(data, block_start, prefix_start_index, hash_table, rep, handle_sequence)
compress_block_fast->>FastHashTable: get(hash) / put(hash, pos)
compress_block_fast->>count_forward: count_forward(ip, match_ptr, iend)
compress_block_fast->>SequenceHandler: handle_sequence(Sequence::Triple / Sequence::Literals)
compress_block_fast-->>Caller: FastBlockResult(rep, tail_literals_len)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 122-133: The early-return branch in the fast kernel currently
calls handle_sequence(Sequence::Literals { literals: &data[block_start..] }) but
then returns FastBlockResult { rep, tail_literals_len: 0 }, which is
inconsistent with the normal path; change this branch to NOT call
handle_sequence and instead return FastBlockResult { rep, tail_literals_len:
data.len() - block_start } so the caller is responsible for emitting the tail
literals (preserve symbols: data, block_start, HASH_READ_SIZE, handle_sequence,
FastBlockResult, tail_literals_len).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a014b89a-2926-44d6-9245-df8effda3c58
📒 Files selected for processing (5)
zstd/src/encoding/simple/fast_kernel/count.rszstd/src/encoding/simple/fast_kernel/hash_table.rszstd/src/encoding/simple/fast_kernel/kernel.rszstd/src/encoding/simple/fast_kernel/mod.rszstd/src/encoding/simple/mod.rs
There was a problem hiding this comment.
Pull request overview
Adds the “donor-shape” Fast strategy kernel building blocks (flat u32 hash table, ZSTD_count-style forward match counter, and a phase-1 single-cursor fast block compressor) under zstd/src/encoding/simple/fast_kernel/. This is a preparatory step toward wiring a faster Fast-strategy matcher into the encoder in a follow-up PR.
Changes:
- Introduce
encoding/simple/fast_kernelmodule and export the new kernel/hash-table APIs. - Implement
FastHashTablewith donor-parity multiply-shift hashing for MLS 4..=8. - Implement
count_forwardand a phase-1compress_block_fast<MLS>plus unit tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| zstd/src/encoding/simple/mod.rs | Exposes the new fast_kernel submodule under the Simple encoder. |
| zstd/src/encoding/simple/fast_kernel/mod.rs | Adds the new fast-kernel module umbrella and re-exports key types/functions. |
| zstd/src/encoding/simple/fast_kernel/hash_table.rs | Implements the flat Vec<u32> hash table and donor-parity hash functions. |
| zstd/src/encoding/simple/fast_kernel/count.rs | Implements the donor-style forward byte-match counter (ZSTD_count port). |
| zstd/src/encoding/simple/fast_kernel/kernel.rs | Implements the phase-1 fast block compression loop and unit tests. |
…tion Four reviewer findings collapsed into one commit because every fix lives in the kernel modules already introduced by the previous baseline commit, and they're not strictly independent (the short-input contract update changes how the test helper observes accounting). 1. Short-input early-return path (kernel.rs:122-133, threads #1 + #5): Previously called handle_sequence(Sequence::Literals { .. }) and then returned tail_literals_len = 0, conflicting with the main-loop exit which never emits the terminal Literals and returns the tail count. Caller wrapping the kernel had no consistent rule for whether to emit the trailing Literals or not. Unify both branches: the kernel NEVER emits the terminal Literals; the caller emits it from tail_literals_len. Updated the function doc to spell out the sequence-emission contract. 2. count_forward endianness (count.rs:68, thread #2): The 8-byte chunked XOR loop used diff.trailing_zeros() / 8 unconditionally. That matches donor's ZSTD_NbCommonBytes only on little-endian (where the first byte of the input is in the low byte of the u64); on big-endian the first byte is in the high byte and the common-bytes count needs leading_zeros instead. Gate via cfg(target_endian) so BE targets get the correct match lengths. 3. hash_log validation (hash_table.rs:50, threads #3 + #4): FastHashTable::new previously only debug_asserted hash_log <= 32. In release, hash_log == 0 would make hash_ptr shift by the full word width (32 for mls=4, 64 for mls>=5) which is panic/UB; and 1usize << hash_log overflows on 32-bit targets for hash_log >= 32. Replace with real asserts: hash_log > 0 AND hash_log < usize::BITS. Donor's ZSTD_HASHLOG_MAX is 30 so sane callers stay well inside this band; the assertion catches outright misuse. 4. Test helper run_block (kernel.rs:339-405, thread #6): The previous helper accumulated tuples but then dropped them and returned an empty seqs Vec, so finds_long_repeat_in_simple_pattern only verified the accounting invariant — never that the kernel actually produced a match. Rewrote the helper to return the captured (tuples, FastBlockResult) tuple; the long-repeat test now asserts at least one Triple with match_len >= 4 and a non-zero offset. short_input_reports_tail_without_emission updated to reflect the new contract from item 1: the kernel emits zero sequences and returns the full input length as tail_literals_len.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 46-63: The constructor FastHashTable::new currently allows
hash_log values up to usize::BITS-1 which still permits invalid values for the
MLS=4 path; change the validation in FastHashTable::new to enforce the donor cap
by requiring hash_log in 1..=ZSTD_HASHLOG_MAX (e.g. <= 30) and keep the existing
check that mls is in 4..=8 so calls like hash_ptr::<4> cannot later overflow or
panic; update the panic message to mention ZSTD_HASHLOG_MAX for clarity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 80bd89b2-9cef-4361-b50d-45e98d8de065
📒 Files selected for processing (3)
zstd/src/encoding/simple/fast_kernel/count.rszstd/src/encoding/simple/fast_kernel/hash_table.rszstd/src/encoding/simple/fast_kernel/kernel.rs
Previous `hash_log < usize::BITS` check still admitted invalid bands: on 64-bit `hash_log \in 33..=63` passed the constructor but later caused `hash_ptr::<4>` to shift by `32 - hash_log` (panics for `hash_log >= 32`). Pin to donor's `ZSTD_HASHLOG_MAX = 30` so all four MLS instantiations are safe by construction; the per-level table-driven caller (donor `ZSTD_defaultCParameters`) never goes above 14 for Fast strategy so this assertion only catches outright misuse.
Four findings from review round 3 — all in fast_kernel, all about
correctness contracts (no functional changes to the algorithm yet,
just hardening + docs).
1. hash_ptr safety contract (hash_table.rs:113):
The doc said "≥ mls readable bytes" but mls=5/6/7 paths perform
an unaligned u64 load (8 bytes), shifting off the unused top
bits. Following the doc literally for mls=5/6/7 with only 5/6/7
readable bytes would read past the caller's range — UB. Updated
to require ≥4 bytes when MLS=4 and ≥8 bytes when MLS≥5, matching
what the implementation actually does. The kernel's
ilimit = iend - HASH_READ_SIZE (8) cap already satisfies this
uniformly.
2. count_forward safety contract (count.rs:21):
The previous doc said match_ptr needs "ip_len + 7 readable bytes"
— overspecified. The implementation only reads from match_ptr in
chunks for the same byte ranges it reads from ip, and bails on
the first mismatching byte. The real requirement is that match_ptr
has at least as many readable bytes as ip up to iend, naturally
satisfied when match_ptr <= ip within the same buffer (a backward
match into the encoder's history). Reworded.
3. Module-level lint allow scope (mod.rs:14):
#![allow(dead_code, unused_imports)] disabled unused_imports
crate-wide for fast_kernel, masking real unused-import
regressions in sibling code. Narrowed to #![allow(dead_code)]
only, since the kernel-internal items will be reached in one
shot by the wiring commit. Removed the unused pub(crate) use
re-exports (FastHashTable, FastBlockResult, compress_block_fast)
— they'll be re-added by the wiring commit when there's an
actual caller; adding them now would force the wider lint
relaxation we're trying to avoid.
4. match_found bounds hardening (kernel.rs:63):
The previous filter only rejected match_idx < prefix_start_index,
which is necessary but not sufficient. Two additional bands of
stale-entry exposure exist if FastHashTable is reused across
calls with a shrunk data slice (the kernel itself doesn't do
this today, but future cross-block / shared-table call shapes
would):
- (match_idx + 4) > data_len: 4-byte read at base + match_idx
past the buffer end → OOB / UB.
- match_idx >= ip_pos: stale forward-pointing entry would make
offset = ip_pos - match_idx underflow and produce an invalid
offset code downstream.
Added explicit branches for both. Strongly biased "not taken" on
the well-behaved single-block path (every entry was written by
this same scan, indices monotonically below ip0), so branch
predictor amortises them to zero cost. Tests: 16/16 pass; the
only call site (the explicit-match probe in the kernel main
loop) updated to pass the new ip_pos / data_len arguments.
…entries, boundaries
Phase 1a follow-up addressing previously-undertested branches. Closes
the gap surfaced by an internal audit ("happy path only, no edge
cases, no regression test for round-3 hardening").
Adds 16 new unit tests (32 total across the module). Coverage now
reaches every behavioural branch in the kernel that does not require
wired-up integration:
kernel.rs (+8 tests):
- repcode_match_emits_with_rep_offset_one — exercises the rep_check
branch on the very first loop iteration, asserts a Triple at
offset=1 with substantial match_len on uniform data.
- explicit_match_backward_extension_extends_by_marker_byte —
constructs an "XAAAA … XAAAA" pattern so the explicit-match
branch's backward-extension while-loop walks back across the 'X'
marker; asserts match_len ≥ 5 and that the literal run does not
end with 'X' (proof the backward extension absorbed it).
- prefix_start_index_filter_rejects_below_window — pre-populates
the hash slot with index 0, runs with prefix_start_index=5, and
asserts the kernel never emits a Triple referencing the
out-of-window position.
- match_found_rejects_stale_forward_entry — REGRESSION TEST for
round 3 finding #11. Engineered scenario: uniform data + stale
hash entry pointing forward of ip0 (idx=150 when ip0 starts at 1).
Without the `match_pos < ip_pos` filter the first iteration
would compute offset = 1 - 150 → underflow → huge wrap → emit a
bogus Triple. The test asserts every emitted offset stays inside
the buffer; before the round 3 hardening this assertion would
fail. (The companion data-len bounds check is not directly
observable in stable Rust — without it the OOB read returns
garbage that almost always fails the raw cmp; Miri / sanitizers
would catch the underlying UB.)
- block_exactly_hash_read_size_emits_no_sequences — boundary where
ilimit = data.len() - HASH_READ_SIZE = 0, the main loop runs zero
iterations, the kernel returns the whole input as tail.
- block_just_below_hash_read_size_emits_no_sequences — the
short-input early-return branch fires; verifies the round-1
uniform tail contract (kernel never emits the terminal Literals,
caller emits from tail_literals_len).
- rep_offset_save_restore_when_out_of_range — out-of-range rep
values (huge=9999, secondary=7) on a 64-byte block must be
stashed to offset_saved{1,2} on entry and restored verbatim into
result.rep by the _cleanup path. Asserts both slots round-trip.
hash_table.rs (+8 tests):
- hash6/hash7/hash8 donor-formula parity tests, matching the
existing hash4/hash5 pattern (each instantiates the relevant
MLS, computes the donor multiply-shift by hand, asserts equality).
- hash_log_minimum_one_constructs_two_entry_table — boundary at
the lower end (1 is the smallest accepted hash_log; bit-width
asserted via the produced hash being < 2).
- hash_log_maximum_thirty_is_accepted_by_constructor — boundary at
the upper end (ZSTD_HASHLOG_MAX). Allocates the full ~4 GiB
table once; if the host can't allocate it the test OOM-skips
rather than misreports.
- Four `#[should_panic]` tests covering each branch of the round 2
constructor validation: hash_log=0, hash_log=31 (> cap),
mls=3 (< 4), mls=9 (> 8). The panic messages are matched as
substrings so the asserts catch a wrong-message refactor.
All 32 fast_kernel tests pass. Full workspace nextest run remains
green at 571/571. Clippy clean.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 321-329: The test
hash_log_maximum_thirty_is_accepted_by_constructor() currently triggers a 2^30
allocation via FastHashTable::new; instead add (or expose) a non-allocating
validation function (e.g., FastHashTable::validate_params(hash_log: usize,
entry_size: usize) -> Result<(), Error> or a private
FastHashTable::check_hash_log) that performs the same parameter assertions
without allocating, update the test to call that helper with 30 and 4 to verify
the accept path, and keep FastHashTable::new unchanged to perform real
allocation only in normal code paths.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7dcfbd0f-d1d8-4707-b28b-423b436ff3ef
📒 Files selected for processing (2)
zstd/src/encoding/simple/fast_kernel/hash_table.rszstd/src/encoding/simple/fast_kernel/kernel.rs
…condition rename Three review findings, one commit (all in the same module, all non-functional / docs / test refactor — no algorithm change): 1. hash_log_maximum_thirty_is_accepted_by_constructor (round 4 findings #12 and #15, hash_table.rs): The previous test instantiated FastHashTable::new(30, 4) just to prove the validation accepts the cap value. That allocates a 1<<30 entries × 4 bytes = ~4 GiB Vec<u32> on every test run — above per-test budgets on most CI runners. Extracted FastHashTable::new's two asserts into a private validate_params( hash_log, mls) helper; the test now exercises the same accept path against the helper directly without touching the allocator (validate_params(ZSTD_HASHLOG_MAX, 4) + ...= 8)). FastHashTable:: new delegates to validate_params so the actual constructor still panics identically on bad inputs. 2. count_forward chunk type (round 4 finding #13, count.rs): The previous loop unconditionally loaded u64 chunks. Donor's ZSTD_count uses MEM_readST which is sizeof(size_t) — so on 32-bit targets the donor uses u32 chunks and our u64 loads would degrade to a pair of 32-bit memory ops + a 64-bit XOR. Switched to usize-typed loads with CHUNK_SIZE = size_of::<usize>() (8 on 64-bit, 4 on 32-bit). On 32-bit the u32 tail block becomes dead code (the chunk loop already strides in 4-byte chunks), so cfg(target_pointer_width = "64") gates that block to mirror donor's MEM_64bits() guard. trailing_zeros() / leading_zeros() are already endian-cfg-gated from round 3 and work correctly on either chunk width. 32/32 fast_kernel tests pass; the functional behaviour (match length per input) is unchanged on 64-bit hosts. 3. compress_block_fast "# Safety contract" rename (round 4 finding #14, kernel.rs): compress_block_fast is a safe (non-unsafe) fn, but its rustdoc was titled "# Safety contract" with a MUST requirement. Safe APIs must not put memory-safety preconditions on callers (that's what unsafe is for). The contract is actually about algorithmic correctness: a too-short input is well-defined Rust (falls into the short-input early-return), it just won't produce match sequences. Renamed the section to "# Preconditions / algorithm invariants" and clarified the distinction in the prose: the kernel is safe for every input; the constraint is what the caller usually wants from a Fast strategy compressor.
…nsion Adds the `(new_ip - 1 - rep_off) >= prefix_start_index` guard to the rep-path single-byte backward extension, matching the explicit-match path's symmetric bound. In the current single-block kernel the guard is provably redundant — the block-entry save/restore zeroes `rep_offset1` whenever `rep_offset1 > ip0 - prefix_start` at block start, and `anchor <= new_ip` keeps `new_ip - 1 >= block_start >= prefix_start`, so the byte the extension would dereference is always inside the window. Future call shapes (cross-block reuse, shared hash table across resets) could legitimately probe at the `ip0 == prefix_start` boundary where the single extension byte would otherwise reach below the window; the explicit check makes the kernel correct under those broader invariants without changing observable behaviour today. Costs a predicted-not-taken branch on the hot rep path; the inline code comment documents the implicit invariants and the symmetry rationale. No regression test lands with this commit: per the bug-reports-get-a- regression-test-first protocol, a test must FAIL on the unguarded code to prove the defect exists. I attempted such a test in a prior session (rep_backward_extension_respects_prefix_start_index) and confirmed by running it that it cannot fail with the current save/restore + anchor invariants — i.e., the unguarded code is already correct in practice. That makes the finding a defensive- hardening / code-symmetry issue, not a logic bug, so the test-first gate does not apply and the no-longer-relevant vacuous test was removed alongside this change.
…refix guard Replaces the redundant prefix_start guard added in the previous commit with a comprehensive doc comment explaining donor parity and the invariant chain that makes the guard unnecessary. Donor parity check: noDict rep path at zstd_fast.c:278 omits any prefix_start bound — just `ip0[-1] == match0[-1]`. Adding a guard here was unjustified divergence from donor. Invariant chain (now spelled out in the inline comment): 1. Block-entry save/restore zeroes rep_offset1 whenever the incoming value exceeds ip0 - prefix_start_index at block start, so any surviving non-zero rep satisfies ip0_start - rep_offset1 >= prefix_start. 2. The explicit-match path's backward extension uses a STRICT `match_pos > prefix_start_index` bound when it promotes a fresh offset into rep_offset1, so new_rep = ip0_promote - match_pos < ip0_promote - prefix_start (strict). At any later iteration ip0' > ip0_promote and rep_offset1 is unchanged, so ip0' - rep_offset1 >= prefix_start + 1, i.e. (ip0' - 1) - rep_offset1 >= prefix_start. Combined: every loop iteration that enters the rep branch already satisfies the bound the dropped guard would have checked. Adding it costs a predicted-not-taken branch on the hot rep path with zero behavioural benefit under the current call shape and donor-parity semantics. The inline comment also names the symmetric bound in the explicit-match path (`match_pos > prefix_start_index`) so a future maintainer changing either invariant has an explicit pointer to the structurally identical guard.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 261-290: Update the long comment explaining why the explicit
prefix-window guard is omitted to also account for the first-iteration "anchor"
gate: mention that offset_saved1 (rep_offset1) alone can only guarantee
ip0_start - rep_offset1 >= prefix_start_index, which may be equal, and therefore
the first loop iteration still relies on the block-entry anchor check (new_ip >
anchor) to avoid a below-window read; either fold this anchor condition into the
invariants or soften the universal claim that every branch entry satisfies
(new_ip - 1 - rep_off) >= prefix_start_index, and point to the related
explicit-match promotion guard (match_pos > prefix_start_index) and the
anchor/offset_saved1 interaction (ip0_start, rep_offset1, new_ip, anchor) as the
structural reasons this is safe today.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9bd5f84c-ffb8-4c5e-88d0-82e6aa67b5a8
📒 Files selected for processing (1)
zstd/src/encoding/simple/fast_kernel/kernel.rs
CodeRabbit (round 5 review) caught a hole in the previous comment: the save/restore at block entry guarantees only NON-STRICT `ip0_start - rep_offset1 >= prefix_start`. When that holds with equality, backward extension by 1 byte would dereference `prefix_start - 1`. The explicit-match path's strict bound (`match_pos > prefix_start_index`) covers iterations after a rep promotion, but does NOT cover the very first loop iteration before any promotion. What actually keeps the first iteration safe is the runtime `new_ip > anchor` gate: at block entry `ip0 == block_start == anchor`, so the gate fails and the backward extension is skipped entirely. From the second iteration onward (after the first emitted sequence), anchor has moved past block_start AND/OR the explicit-match promotion bound from (2) gives the strict version of the invariant. Rewrote the inline comment to make this three-piece argument explicit: (1) non-strict bound from save/restore, (2) strict bound from promotion (kicks in after first promote), (3) anchor gate (covers the boundary iteration). Added an explicit note that a future call shape weakening any of the three (e.g. custom anchor initialisation) requires the explicit guard to be re-added. Pure doc change — no behavioural diff. 32/32 kernel tests pass.
CodeRabbit (outside-diff finding): compress_block_fast is a safe pub
entry point but only used debug_assert_eq! to check MLS matches the
hash_table's mls. In release builds debug_assert! is stripped, so a
mismatched call like:
compress_block_fast::<5>(..., &mut FastHashTable::new(_, 4), ...)
routes to the kernel's mls=5 monomorphisation (different hash
formula, different prime constant, different shift count) but reads
entries indexed by mls=4 — garbage match candidates, miscompression
instead of a clean panic. The const-generic MLS prevents compile-
time misuse for nonexistent values but cannot catch this runtime
table/kernel mismatch.
Replaced the debug_assert_eq! with two real assert!s:
1. Range check (4..=8).contains(&MLS) — logically redundant with the
arm in FastHashTable::hash_ptr, but
surfaces the contract at the entry point with a clearer message
than the hash-table-internal one.
2. assert_eq!(MLS, hash_table.mls()) — fails immediately on
table/kernel mismatch, with both values in the panic message for
diagnostics.
All 32 fast_kernel tests pass (they instantiate matching MLS/table
pairs in run_block; no test exercised the mismatch path).
|
@coderabbitai Fixed — replaced |
|
Tip For best results, initiate chat on the files or code changes.
(ノ◕ヮ◕)ノ*:・゚✧ |
Round 11 Copilot findings — both doc-only: 1. kernel.rs:149 (#30): compress_block_fast docs claimed memory-safety for every input and didn't enumerate the entry-time assert!s introduced in round 7 + round 10. Added a # Panics section listing all four assertions (block_start out of range, data.len() > u32::MAX, MLS out of range, MLS != hash_table.mls()), each with a one-line explanation of what goes wrong without the check. Also reworded the preceding 'SAFE function' paragraph to qualify it with 'for every input that doesn't trigger one of the entry-time asserts'. 2. hash_table.rs:73 (#31): FastHashTable::new's # Panics section listed only the parameter-range failures (hash_log out of band, mls out of band). The round 8 alloc guard added checked_shl + checked_mul that can panic on 32-bit targets, and the underlying allocator can fail too — both undocumented. Restructured the section into two groups: parameter-range failures (deterministic, depend only on inputs) and target-size/allocation failures (depend on host architecture and runtime memory state). Three new bullets cover usize overflow on shift, usize overflow on multiply, and global allocator failure. Pure documentation — no behaviour change. 32/32 tests pass. clippy clean.
|
@coderabbitai Re: outside-diff MLS contract finding at kernel.rs:188-220 — this is STALE, already fixed in commit 183c36f (round 10). The kernel now does |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 51202
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 89-105: Update the comment to clarify that the earlier checked
multiplication guard (the `entries * size_of::<u32>()` `checked_mul`) already
rejects `hash_log = 30` on 32-bit, so the documented “borderline” / OOM path for
`hash_log = 30` no longer applies; explicitly mention `checked_mul` prevents
that overflow/OOM and therefore the first two failure modes are deterministic
and caught before the `vec![]` allocation while only the global allocator
failure during `Vec::with_capacity` (or `vec![]`) remains dependent on runtime
memory state.
In `@zstd/src/encoding/simple/fast_kernel/kernel.rs`:
- Around line 157-159: Update the comment in kernel.rs to replace the narrow
"block_start > data.len()" wraparound explanation with a description of the
broader invalid-range failure: when data.len() < block_start + HASH_READ_SIZE
the early-return branch runs and computing tail_literals_len = data.len() -
block_start underflows (causing panic); reference the same symbols used in this
file (block_start, data.len(), HASH_READ_SIZE, tail_literals_len) so readers
understand the real panic cause rather than only the wraparound corner case.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 027d11cf-e9ec-4bc0-bde2-f9566378cda3
📒 Files selected for processing (2)
zstd/src/encoding/simple/fast_kernel/hash_table.rszstd/src/encoding/simple/fast_kernel/kernel.rs
…ock_start Round 12 CodeRabbit follow-ups on the round 11 Panics sections — both pure documentation fixes: 1. hash_table.rs #32: round 11 said hash_log=30 was 'borderline' and 'would OOM the allocator' on 32-bit. With the checked_mul guard added in round 8, hash_log=30 is now caught deterministically BEFORE vec! is called (1 << 30 entries * 4 bytes = 4 GiB overflows the usize multiply). Rewrote the bullet to spell out the deterministic-failure framing: checked_shl tripwire is unreachable today (validate_params pins hash_log <= 30, so 1<<30 fits usize), checked_mul is the actual 32-bit hash_log=30 rejection point, and only allocator failure remains runtime-state-dependent. 2. kernel.rs #33: the block_start>data.len() panic rationale only described the main-loop wrap case (block_start + HASH_READ_SIZE wrap → skipped short-input guard → OOB read). It missed the more common path: when data.len() < block_start + HASH_READ_SIZE the short-input branch IS taken and computes tail_literals_len = data.len() - block_start which underflows. Reworded to cover both paths explicitly. Pure documentation — no behaviour change. 32/32 tests pass. clippy clean.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/simple/fast_kernel/hash_table.rs`:
- Around line 107-110: The comment wrongly states all three guards fire before
reaching vec![]; update the wording in the comment near the three-guard
explanation so it clarifies that only the first two checks are deterministic
pre-allocation guards, while the third bullet is the actual vec![] allocation
and therefore the only panic that depends on runtime memory/allocator state
(i.e., allocator failure occurs at the vec![] allocation). Keep mention of the
"three guards" and "vec![]" to make the distinction explicit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8ae0e07f-4d89-4a55-aa51-4f0f73c965f9
📒 Files selected for processing (2)
zstd/src/encoding/simple/fast_kernel/hash_table.rszstd/src/encoding/simple/fast_kernel/kernel.rs
…MLS contract Two new Copilot findings from round 13, both legitimate: 1. count.rs:93 (#34): count_forward's two `offset_from(p_start) as usize` calls return `isize`, which is UB when the byte distance exceeds isize::MAX. On 32-bit hosts isize::MAX = 2 GiB - 1, but the kernel's u32-truncation guard allows data.len() up to u32::MAX (4 GiB). A long match spanning >2 GiB would hit the UB path. Switched both call sites to plain `as usize` pointer arithmetic (`(ip as usize) - (p_start as usize)`) — well-defined integer subtraction regardless of distance, no `isize` round trip. Comment explains the rationale at the in-loop site and refers back from the function-tail site. 2. hash_table.rs:167 (#35): hash_ptr's # Safety section only covered the readable-bytes promise on `ptr` — it didn't mention that the const-generic MLS must equal self.mls() and be in 4..=8. Today only debug_assert_eq! checks the equality, so a release-build mismatched call routes to the wrong hash formula (different multiply prime, different shift) and probes a table indexed by a different formula. Extended the # Safety section with the MLS contract, called out that compress_block_fast enforces both invariants with real assert!s before any hash_ptr call (so kernel-mediated callers are safe by construction), and noted that direct callers (tests, future helpers) are responsible for upholding the contract themselves. All 32 fast_kernel tests pass; clippy clean. count.rs change is behaviour-preserving (subtraction result identical to offset_from when distance ≤ isize::MAX, and now also defined for larger distances). hash_table.rs change is doc-only.
Round 13 CodeRabbit follow-up: my previous wording 'All three guards fire BEFORE control reaches vec![]' was self-contradictory because the third bullet IS the vec![] allocation itself. Reworded to distinguish the two pre-allocation deterministic guards from the in-allocation runtime-state-dependent one.
…#198 phase 1b) Phase 1b of #198 — wires the donor-shape kernel modules from phase 1a (#215, merged in fc63464) into the production hot path. The legacy SuffixStore-based MatchGenerator in simple/mod.rs is fully removed; MatcherStorage::Simple now holds a FastKernelMatcher that drives compress_block_fast<MLS> once per block. Selected for every Fast-strategy level — CompressionLevel::Uncompressed, CompressionLevel::Fastest, CompressionLevel::Level(1), and the negative CompressionLevel::Level(-7..=-1) variants. All Fast levels currently resolve to the same matcher with donor level-1 hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found) land in phase 3. ## What changes - simple/mod.rs collapses from 496 → 14 lines (only module declarations + docstring remain). SuffixStore, WindowEntry, MatchGenerator, repcode_candidate, add_data, next_sequence, add_suffixes_till, insert_suffix_if_absent, add_suffixes_interleaved_fast, offset_match_len, reserve — all gone with the wiring. - simple/fast_matcher.rs is the new active matcher: full inherent surface (accept_data, start_matching, skip_matching_with_hint, trim_to_window, last_committed_space, reset, prime_offset_history, take_recycled_space). - MatchGeneratorDriver Simple-arm wiring: commit_space → m.accept_data(space) with eager pre-commit eviction; start_matching::<Fast> → m.start_matching(handler); skip_matching_with_hint → m.skip_matching_with_hint(hint); reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window(); prime_with_dictionary → m.prime_offset_history(offset_hist). - Per-block input Vec recycled via take_recycled_space() → vec_pool (zero zero-fill cost — buffer pushed with len=0, get_next_space resizes on pop). ## Invariants - prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The first byte of history is a reserved dummy (sentinel-0 guard); real input data starts at history[1]. Donor C zstd achieves the same effect via a virtual base pointer; the flat Vec<u8> model here pays one byte of memory overhead for the same correctness property (no missed matches at segment boundaries). - history.len() bounded by 2 × max_window_size post-append, even for oversize committed blocks (retain_real = cap.saturating_sub( space.len()).min(max_window_size)). - Eviction preserves the dummy AND rebases prefix_start_index back to RESERVED on every drain — cumulative growth would push the filter past every valid history index and reject all match candidates wholesale. - Hash table rehashed after drain so retained tail bytes stay matchable. Amortised O(1) per byte of input. - rep[0..2] ↔ offset_hist[0..2] in lockstep on the common (lit_len > 0) path. Known divergence on back-to-back repcode matches (lit_len == 0 emits): kernel's rep unchanged, wire encoder per RFC 8878 §3.1.2.5 remaps codes and rotates offset_hist — marginal compression hit, output still correct. Phase 3 collapses these at the kernel level. - prime_offset_history seeds BOTH rep[0..2] and offset_hist atomically from a dictionary load. ## Defensive validation - MatchGeneratorDriver::new asserts slice_size > 0, max_slices_in_window > 0, checked_mul for the product, and checked_next_power_of_two for window_log_init derivation — catches all four overflow / degenerate paths with a clear domain-specific panic instead of a deep matcher-internal failure. - FastHashTable construction-time mls / hash_log validation unchanged from phase 1a. ## Tests 573/573 pass on the full workspace nextest suite: - 21 unit tests on FastKernelMatcher (lifecycle, accept + start, skip flavors, dict-prime hash population, eviction, boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync, prefix-eviction during dict-priming, drain prefix_start_index runaway, trim_to_window/last_block_start drift, oversize-block eviction bound). - 32 unit tests on the underlying kernel (donor-formula parity, prefix-filter, repcode backward extension three-piece proof, short-input early-return uniformity). - All frame_compressor integration tests (raw-block detection, hinted source-size matrix, level-1 round-trips through both the in-tree decoder and FFI decode). - All cross_validation Rust-encoded → FFI-decoded round-trips (every level 1..=22, dict + no-dict, encoded by Rust then read by the C reference decoder verbatim). 18 legacy tests in match_generator.rs that exercised SuffixStore-specific behavior or required block.len() < HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory comments — their substance either has equivalent coverage in the new tests or relied on algorithm-specific quirks the donor-shape kernel doesn't reproduce by design. ## Benchmark (i9-9900K) cargo bench deltas vs main (fc63464) on compress/level_1_fast/*/matrix/pure_rust: | Scenario | Δ time | Throughput | Note | |----------|-------:|-----------:|------| | low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster | | decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster | | high-entropy-1m | -67.3% | 633 MiB/s | 3× faster | | small-10k-random | -36.0% | 768 MiB/s | 1.6× faster | | small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster | | small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead | | large-log-stream | +122.3% | 235 MiB/s | regression — see below | Large-log-stream regression is expected at this phase: legacy SuffixStore used effectively window_log-sized hash slots (512K for level-1 window_log=19), while phase 1b uses donor-parity hash_log=14 (16K slots). 25 MiB dense-compressible log content hits ~1500 collisions/slot. Donor C zstd shows the same trade-off (~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining + cmov) closes the gap per the documented roadmap (#198 items 2/3/5). ## What's NOT in this PR - Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead, cmov match-found variant, per-level mls dispatch, kSearchStrength acceleration gradient for negative Fast levels. - LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes donor level-1 defaults (14/7) today. - hash_fill_step stride for dict-priming — still hard-coded to 1 (LevelParams field is wired in but the Fast matcher always strides at 1). Closes #198 phase 1b. Related: #178 (umbrella regression issue), #215 (phase 1a — kernel modules, merged in fc63464).
…#198 phase 1b) (#217) * perf(encoder): wire donor-shape Fast kernel into MatchGeneratorDriver (#198 phase 1b) Phase 1b of #198 — wires the donor-shape kernel modules from phase 1a (#215, merged in fc63464) into the production hot path. The legacy SuffixStore-based MatchGenerator in simple/mod.rs is fully removed; MatcherStorage::Simple now holds a FastKernelMatcher that drives compress_block_fast<MLS> once per block. Selected for every Fast-strategy level — CompressionLevel::Uncompressed, CompressionLevel::Fastest, CompressionLevel::Level(1), and the negative CompressionLevel::Level(-7..=-1) variants. All Fast levels currently resolve to the same matcher with donor level-1 hash_log=14, mls=7; per-level acceleration knobs (kSearchStrength dispatch, 4-cursor ip0/ip1/ip2/ip3 pipelining, cmov match-found) land in phase 3. ## What changes - simple/mod.rs collapses from 496 → 14 lines (only module declarations + docstring remain). SuffixStore, WindowEntry, MatchGenerator, repcode_candidate, add_data, next_sequence, add_suffixes_till, insert_suffix_if_absent, add_suffixes_interleaved_fast, offset_match_len, reserve — all gone with the wiring. - simple/fast_matcher.rs is the new active matcher: full inherent surface (accept_data, start_matching, skip_matching_with_hint, trim_to_window, last_committed_space, reset, prime_offset_history, take_recycled_space). - MatchGeneratorDriver Simple-arm wiring: commit_space → m.accept_data(space) with eager pre-commit eviction; start_matching::<Fast> → m.start_matching(handler); skip_matching_with_hint → m.skip_matching_with_hint(hint); reset → m.reset(window_log, FAST_LEVEL_1_HASH_LOG, FAST_LEVEL_1_MLS); trim_after_budget_retire → m.trim_to_window(); prime_with_dictionary → m.prime_offset_history(offset_hist). - Per-block input Vec recycled via take_recycled_space() → vec_pool (zero zero-fill cost — buffer pushed with len=0, get_next_space resizes on pop). ## Invariants - prefix_start_index = RESERVED_PREFIX_BYTES (= 1) baseline. The first byte of history is a reserved dummy (sentinel-0 guard); real input data starts at history[1]. Donor C zstd achieves the same effect via a virtual base pointer; the flat Vec<u8> model here pays one byte of memory overhead for the same correctness property (no missed matches at segment boundaries). - history.len() bounded by 2 × max_window_size post-append, even for oversize committed blocks (retain_real = cap.saturating_sub( space.len()).min(max_window_size)). - Eviction preserves the dummy AND rebases prefix_start_index back to RESERVED on every drain — cumulative growth would push the filter past every valid history index and reject all match candidates wholesale. - Hash table rehashed after drain so retained tail bytes stay matchable. Amortised O(1) per byte of input. - rep[0..2] ↔ offset_hist[0..2] in lockstep on the common (lit_len > 0) path. Known divergence on back-to-back repcode matches (lit_len == 0 emits): kernel's rep unchanged, wire encoder per RFC 8878 §3.1.2.5 remaps codes and rotates offset_hist — marginal compression hit, output still correct. Phase 3 collapses these at the kernel level. - prime_offset_history seeds BOTH rep[0..2] and offset_hist atomically from a dictionary load. ## Defensive validation - MatchGeneratorDriver::new asserts slice_size > 0, max_slices_in_window > 0, checked_mul for the product, and checked_next_power_of_two for window_log_init derivation — catches all four overflow / degenerate paths with a clear domain-specific panic instead of a deep matcher-internal failure. - FastHashTable construction-time mls / hash_log validation unchanged from phase 1a. ## Tests 573/573 pass on the full workspace nextest suite: - 21 unit tests on FastKernelMatcher (lifecycle, accept + start, skip flavors, dict-prime hash population, eviction, boundary cases at HASH_READ_SIZE = 8, rep ↔ offset_hist sync, prefix-eviction during dict-priming, drain prefix_start_index runaway, trim_to_window/last_block_start drift, oversize-block eviction bound). - 32 unit tests on the underlying kernel (donor-formula parity, prefix-filter, repcode backward extension three-piece proof, short-input early-return uniformity). - All frame_compressor integration tests (raw-block detection, hinted source-size matrix, level-1 round-trips through both the in-tree decoder and FFI decode). - All cross_validation Rust-encoded → FFI-decoded round-trips (every level 1..=22, dict + no-dict, encoded by Rust then read by the C reference decoder verbatim). 18 legacy tests in match_generator.rs that exercised SuffixStore-specific behavior or required block.len() < HASH_READ_SIZE matching are #[cfg(any())]-gated with explanatory comments — their substance either has equivalent coverage in the new tests or relied on algorithm-specific quirks the donor-shape kernel doesn't reproduce by design. ## Benchmark (i9-9900K) cargo bench deltas vs main (fc63464) on compress/level_1_fast/*/matrix/pure_rust: | Scenario | Δ time | Throughput | Note | |----------|-------:|-----------:|------| | low-entropy-1m | -90.7% | 3.3 GiB/s | 10× faster | | decodecorpus-z000033 | -83.4% | 156 MiB/s | 6× faster | | high-entropy-1m | -67.3% | 633 MiB/s | 3× faster | | small-10k-random | -36.0% | 768 MiB/s | 1.6× faster | | small-4k-log-lines | -15.4% | 154 MiB/s | 1.2× faster | | small-1k-random | +42.2% | 130 MiB/s | tiny-block overhead | | large-log-stream | +122.3% | 235 MiB/s | regression — see below | Large-log-stream regression is expected at this phase: legacy SuffixStore used effectively window_log-sized hash slots (512K for level-1 window_log=19), while phase 1b uses donor-parity hash_log=14 (16K slots). 25 MiB dense-compressible log content hits ~1500 collisions/slot. Donor C zstd shows the same trade-off (~120 MB/s on similar workloads). Phase 3 (4-cursor pipelining + cmov) closes the gap per the documented roadmap (#198 items 2/3/5). ## What's NOT in this PR - Phase 3 (#198 items 2/3/5): 4-cursor ip0/ip1/ip2/ip3 lookahead, cmov match-found variant, per-level mls dispatch, kSearchStrength acceleration gradient for negative Fast levels. - LevelParams.hash_log / LevelParams.mls fields — Fast hard-codes donor level-1 defaults (14/7) today. - hash_fill_step stride for dict-priming — still hard-coded to 1 (LevelParams field is wired in but the Fast matcher always strides at 1). Closes #198 phase 1b. Related: #178 (umbrella regression issue), #215 (phase 1a — kernel modules, merged in fc63464). * fix(match_generator,fast_matcher): correctness + doc accuracy fixes (#217 review round 1) Four findings from CodeRabbit + Copilot's first review pass on the squashed PR #217. All four touch comment / metadata accuracy or a single-line correctness issue; no behavioural change beyond the window_size sync. **CR outside-diff (match_generator.rs:598-613)** — correctness: `reported_window_size` was using the unrounded `max_window_size` while the matcher itself was constructed from `next_pow2` (rounded up). For non-power-of-two constructor inputs (e.g. `slice_size * max_slices_in_window = 100_000`), `window_size()` would report 65_536 (un-rounded floor) while the active backend actually carried 131_072 (rounded-up next_pow2). The drift held until the first `reset()` overwrote both sides from LevelParams. Fix: report `next_pow2` so the two stay in lockstep at construction time. **CR #1 (match_generator.rs:944)** — declined, deferred to phase 3: Fast levels (Uncompressed, Fastest, Level(-7..=1)) all hard-code donor level-1 cParams. The acceleration gradient between negative- level fast modes and Level(1) lands when phase 3 ports donor's 4-cursor lookahead + cmov match-found + per-level kSearchStrength dispatch (issue #198 items 2/3/5). Updated the inline code comment to scrub the closed-PR reference and frame the deferral against phase 3 directly. **Copilot #2 (match_generator.rs:572)** — doc: the validation-guard comment described `next_power_of_two` returning 0 on overflow, which was old-Rust behaviour. Modern Rust panics; we now use `checked_next_power_of_two` (commit landed in the squash). Rewrote the comment to enumerate the three actual failure modes (zero args, mul overflow, next-pow2 overflow) and the three guards that catch them. **Copilot #3 (fast_matcher.rs:303)** — doc: `last_committed_space`'s pre-`accept_data` state description claimed `last_block_start = 0 / history.len() = 0`, but post-RESERVED_PREFIX_BYTES-seed construction (#216 / phase 1b) leaves both at `RESERVED_PREFIX_BYTES`. The returned slice is still empty (the `history[last_block_start..]` range is empty), just for a different reason. Updated the doc to reflect the seeded-dummy invariant. 573/573 tests pass; clippy clean. * docs(fast_matcher): compact last_committed_space docstring (apply new compactness rule) Doc-only change: collapse the verbose multi-paragraph last_committed_space docstring (added during PR #216 review rounds) into a 6-line three-bullet form. Same semantic content, zero narrative. Apply the new docstring compactness rule (one-two phrases default, multi-paragraph only for non-obvious invariants). 573/573 tests pass; clippy clean. * docs(fast_matcher): align dict-prime boundary test comments with RESERVED_PREFIX_BYTES seed (#217 Copilot #4, #5) Doc-only update aligning the dict-prime boundary test comment + inline note with the RESERVED_PREFIX_BYTES seed (post-phase-1b range is [RESERVED..=RESERVED], not [0..=0]). Compactness rule applied — concise two-block form. 573/573 tests pass; clippy clean. * docs(fast_matcher): correct prefix_start_index wording — rebases, not bumps (#217 Copilot #6, #7) Two doc comments described prefix_start_index as 'bumped forward' or 'advances' as history is evicted, implying a monotonic absolute index. Actual code (drain_real_prefix) rebases it back to RESERVED_PREFIX_BYTES on every drain — the retained tail is re-indexed in the new coordinate space. Updated both sites (struct field doc + trim_to_window header) to match. Compactness rule applied — trim_to_window header collapsed from 10 lines to 5. 573/573 tests pass; clippy clean. * docs(fast_matcher): drop stale hash_log scaling claim + add lit_len=0 TODO marker (#217 Copilot #8, #9) #8 — header for FAST_LEVEL_1_HASH_LOG said 'reset path rebinds hash_log proportionally on source-size hint'. Untrue today: driver passes only window_log per-level, hash_log + mls hard-coded. Pin per-level scaling to phase 3. #9 — same lit_len=0 / back-to-back-rep1 concern as the prior PR's #21. Inline + module docs already explain, but verbose prose isn't anchored. Collapsed inline comment to a single '// TODO(#198 phase 3):' line so the deferral marker is unambiguous. 573/573 tests pass; clippy clean. * fix(fast_matcher): assert hard preconditions + skip lit_len=0 rotation + hoist mls dispatch (#217 round 5) Four findings from the latest CR + Copilot review pass. **CR outside-diff (327-331) — assert!, not debug_assert!:** the duplicate-pending guard in accept_data was debug_assert!, so in release builds a double-commit would silently overwrite pending and drop a committed block. Hard-fail instead. **CR outside-diff (358-363) — block_size <= 2 × max_window_size:** the eviction math computes retain_real via saturating_sub, which silently collapses to 0 if space.len() > cap. The full block is still appended afterwards, violating the documented 'history bounded by 2 × max_window_size' invariant. Added an assert! precondition so callers see a clear panic at the boundary instead of an invisible invariant break. **CR #10 — lit_len == 0 offset_hist rotation (third raise):** the prior round added a TODO marker explaining the divergence, but CR correctly pushed back that documentation isn't a fix. Skip encode_offset_with_history when literals.is_empty() so offset_hist stays in lockstep with the kernel's unchanged rep — no divergence on the back-to-back rep1 path. Wire encoder downstream still sees the Triple with raw offset; its own encoding stays correct (lit_len-0 absolute encoding per RFC 8878 §3.1.2.5). Module docstring 'Known divergence' section collapsed accordingly. **Copilot #11 — hoist mls dispatch outside prime_hash loop:** moved the per-MLS match arm OUTSIDE the per-position loop. New prime_hash_table_impl<const MLS: u32> is monomorphised per matcher instance; the hot path is branch-free on mls. 573/573 tests pass; clippy clean. * fix(fast_matcher): call encode_offset_with_history unconditionally (#217 Copilot #12, revert round 5 #10) Round 5 skipped encode_offset_with_history when literals.is_empty() to keep matcher.offset_hist 'in lockstep' with kernel rep. Copilot correctly pushed back: Dfast / Row / HashChain matchers all call it unconditionally (passing lit_len = 0 when applicable), and the 'lockstep' framing was wrong — matcher.offset_hist tracks the WIRE ENCODER's history while matcher.rep tracks the KERNEL's state. They're not supposed to be the same. Revert the skip. Module docstring rewritten to call out that the two fields reflect DIFFERENT state and may diverge on lit_len = 0 emits per RFC 8878 §3.1.2.5 — both halves stay self-consistent within their own domain. Round 5 #10 reply to CR was based on the wrong mental model; that's reflected in this fix by reverting to the donor / other-matchers behaviour. 573/573 tests pass; clippy clean.
Summary
Phase 1a of the Fast strategy donor port (issue #198 — 22× regression
vs C zstd on Fast levels). This PR lands the donor-shape kernel
modules with full unit-test coverage; the wiring into
MatchGeneratorDriver/MatcherStorageand the i9-9900K bench landin the follow-up PR on the next branch.
What's in this PR
Three new modules under
zstd/src/encoding/simple/fast_kernel/:hash_table.rs—FastHashTableFlat
Vec<u32>indexed by donor-parityZSTD_hash{4,5,6,7,8}Ptr(multiply-shift on the first
mlsbytes of the suffix atptr,output reduced to
hash_logbits). Constants bit-identical tolib/compress/zstd_compress_internal.h. Monomorphised overMLSvia const-generic so each
mlsinstantiation compiles to adedicated body with the prime / shift baked in.
count.rs—count_forwardPort of
ZSTD_countfromlib/compress/zstd_compress_internal.h.8-byte chunked XOR loop with
trailing_zeros() / 8to locate thediverging byte, then
u32/u16/u8tail fall-throughsmatching donor's exact progression. Replaces the byte-by-byte
common_prefix_lenused by the current SuffixStore-based Fastmatcher.
kernel.rs—compress_block_fast<MLS>Port of
ZSTD_compressBlock_fast_noDict_genericfromlib/compress/zstd_fast.c. Phase 1 keeps the single-ip0cursorwith raw 4-byte match probe (
MEM_read32(ip) == MEM_read32(base + matchIdx)), step-based skip acceleration tied tokSearchStrength = 6, repcode-at-ip check, backward extensionbounded by
anchor/prefix_start, andcount_forwardforforward extension. The 4-cursor pipelining (
ip0 / ip1 / ip2 / ip3) andcmovmatch-found variant from donor's full genericbody are deferred to phase 3 per the issue's staged plan — phase 1
exists to validate the data structures and close the bulk of the
22× gap before adding pipelining complexity.
The kernel's
compress_block_fastis monomorphised overconst MLS: u32(4..=8), invokeshandle_sequenceper emitted sequence(replacing donor's
ZSTD_storeSeq), and returnsFastBlockResult { rep, tail_literals_len }so the caller canthread repcode history across blocks and flush trailing literals.
What's NOT in this PR (lands in follow-up)
MatchGeneratorDriver::compress_block::<Fast>— phase1b. Requires:
FastHashTablefield toMatchGenerator(alongside theexisting
SuffixStoreduring transition).Vec<WindowEntry>into a single
Vec<u8>perstart_matchingcall, or replacethe multi-window layout with a single ring buffer).
resetkeyed off the resolvedLevelParams(hash_log, mls)— donor useshash_log = 14,mls = 7for default level 1.cargo nextest run --workspacecovers existing SuffixStore path; the kernel hasn'tbeen wired yet so it can't break the e2e suite from this PR).
compare_ffi level_1_fast— measures the actualperf delta vs main and against C donor. Phase 1b commit will land
the numbers in its PR body.
Phase 3 (separate later branch per the issue):
ip0/ip1/ip2/ip3pipelining with software-prefetch hashreads — gives the bulk of donor's noDict_generic throughput on
modern superscalar cores.
cmovmatch-found variant gated onwindowLog < 19.single-block; cross-block matching at the window boundary is part
of phase 1b's flat-buffer work).
Tests
16 new unit tests covering:
hash_table::tests: donor formula parity on known inputs for bothmls=4andmls=5, get/put round-trip,clear()resets everyentry to sentinel.
count::tests: empty / full / chunk-boundary / tail-fall-throughdivergence at every transition point (u32 → u16 → u8), thousand-
byte long match.
kernel::tests: short-input → literals-only callback path, longrepeat pattern → at least one
Tripleemitted withmatch_len ≥ MIN_MATCH (4), plus an accounting invariant(
sum(literals + matches) + tail_literals_len == input.len())that catches off-by-ones in any future refactor.
All 16 pass; full workspace nextest run remains green at 555/555.
Module gate
fast_kernel/mod.rsopens with a narrowly-scoped# for this commitonly — every item is consumed by the phase 1b wiring commit and
the allow is removed there. The
unused_importslint stays ACTIVEacross the module so any accidentally-unused
usein sibling codesurfaces immediately. Re-exports of the kernel's public surface
(
FastHashTable,FastBlockResult,compress_block_fast) aredeferred to the wiring commit since adding them now would force
the wider lint relaxation. Without the
dead_codeallow, baselinecargo clippy -D warningswould flag the unused kernel-internalitems; splitting the kernel and the wiring into separate PRs is
the right tradeoff for review clarity, but the cost is the
temporary
dead_codemute on this module.Related: #198 (Fast strategy 22× regression — multi-phase port).
Summary by CodeRabbit
New Features
Tests