perf(encoding): #18 Phase 5 — LDM producer (gear hash + bucket table + search/emit) #139
Conversation
Lands the foundational primitives of the donor LDM producer (`lib/compress/zstd_ldm.c` v1.5.7) under a new `encoding/ldm/` module. The verify + extend + emit half of the pipeline plus the `window_log >= 27` activation gate follow in a second Phase 5 commit; by default the producer stays `None` on every `BtMatcher`, preserving exact pre-Phase-5 behaviour (ratio gate `level22_sequences_match_donor_on_corpus_proxy` PASS).

What ships:

* `ldm/gear_hash.rs` — 64-bit gear rolling hash with the donor's 256-entry permutation table (`zstd_ldm_geartab.h`) reproduced verbatim; a full unit cross-check confirms zero mismatches against the parsed donor header. `init` / `reset` / `feed` match the donor recurrence `hash = (hash << 1) + GEAR_TAB[byte]` with the same 4-wide unroll and `LDM_BATCH_SIZE = 64` cap.
* `ldm/params.rs` — `LdmParams` + `adjust_for(window_log, strategy)` is a direct port of `ZSTD_ldm_adjustParameters` (`zstd_ldm.c:135`), including the `7 - strategy/3` hash-rate-log mapping, the `BOUNDED(6, window_log - hash_rate_log, 30)` hash-log clamp, the `minMatchLength /= 2` halving at strategies >= btultra (8), and the `BOUNDED(LDM_BUCKET_SIZE_LOG, strategy, 8)` bucket clamp.
* `ldm/table.rs` — `LdmHashTable` with `entries: Vec<LdmEntry>` of length `1 << hash_log` and `bucket_offsets: Vec<u8>` for the per-bucket round-robin write cursor (donor `ZSTD_ldm_insertEntry`, `zstd_ldm.c:198-207`); the silent `MIN(bucket_size_log, hash_log)` clamp matches `zstd_ldm.c:176`.
* `ldm/mod.rs` — `LdmProducer` aggregator owns the hash state, the bucket table, and a `LDM_BATCH_SIZE` splits scratch buffer. The fill path (`generate_into`) walks the input through the gear hash, XXH64s each `min_match_length`-byte window (donor `zstd_ldm.c:315`, seed 0), masks the low bits for the bucket id (`hash_log - bucket_size_log` bits), and stores the high 32 bits as the entry checksum.
* `bt/mod.rs` carries a new `ldm_producer: Option<LdmProducer>` field initialised to `None`; `BtMatcher::reset` propagates `clear` into the producer when present. `prepare_ldm_candidates` now takes the per-frame `history` slice and delegates to the producer when active, otherwise preserving the original defensive `ldm_sequences.clear()` no-op.
* `match_generator.rs` callsite threads `self.table.history` into `prepare_ldm_candidates` (a disjoint-field split borrow keeps the borrow checker happy).

Tests:

* 24 new unit tests (9 gear_hash, 5 params, 7 table, 3 producer) with donor `file:line` citations on every primitive.
* Anchor entries `GEAR_TAB[0,42,80,255]` cross-checked against donor `zstd_ldm_geartab.h`.
* Recurrence regression test compares `reset` vs `feed(mask = u64::MAX)` to guard the 4-wide unroll.
* Round-robin insertion, adjacent-bucket isolation, and post-clear re-init covered for `LdmHashTable`.
* `LdmProducer` constructs with donor btultra2/window=27 defaults (min_match=32, hash_rate_log=4, bucket_size_log=8); `clear` rewinds the rolling hash to `GEAR_HASH_INIT`.

Build + suite:

* 451 / 451 lib tests pass (full suite, no skips).
* `cargo clippy --lib --tests` clean.
* Ratio gate `level22_sequences_match_donor_on_corpus_proxy` PASS.
* No new warnings; `ldm/mod.rs` carries a Phase-1-style `#![allow(dead_code)]` marker (same precedent as `bt/mod.rs`) for the bucket-lookup helpers consumed in the follow-up emit commit.
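The `adjust_for` derivation described above can be condensed into a sketch. This is a hedged reconstruction from the bullet points, not the crate's code; the base `min_match_length = 64` and `LDM_BUCKET_SIZE_LOG = 3` are assumed donor defaults.

```rust
/// Clamp helper mirroring the donor's BOUNDED(lo, x, hi) idiom.
fn bounded(lo: u32, x: u32, hi: u32) -> u32 {
    x.clamp(lo, hi)
}

/// Sketch of `adjust_for(window_log, strategy)`; returns
/// (hash_rate_log, hash_log, min_match_length, bucket_size_log).
fn adjust_for(window_log: u32, strategy: u32) -> (u32, u32, u32, u32) {
    const LDM_BUCKET_SIZE_LOG: u32 = 3; // assumed donor default

    // `7 - strategy/3` hash-rate-log mapping.
    let hash_rate_log = 7 - strategy / 3;
    // BOUNDED(6, window_log - hash_rate_log, 30) hash-log clamp.
    let hash_log = bounded(6, window_log.saturating_sub(hash_rate_log), 30);
    // Base 64 halves once at btultra (8) and above.
    let mut min_match_length = 64;
    if strategy >= 8 {
        min_match_length /= 2;
    }
    // BOUNDED(LDM_BUCKET_SIZE_LOG, strategy, 8) bucket clamp.
    let bucket_size_log = bounded(LDM_BUCKET_SIZE_LOG, strategy, 8);
    (hash_rate_log, hash_log, min_match_length, bucket_size_log)
}
```

For btultra2 (strategy 9) at `window_log = 27` this reproduces the defaults quoted in the tests above: `(4, 23, 32, 8)`.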
Completes the donor LDM producer pipeline by adding the bucket lookup + forward/backward extension half. With this commit `LdmProducer::generate_into` is functionally equivalent to donor `ZSTD_ldm_generateSequences_internal` on the prefix-only path: when a `BtMatcher` carries a `Some(LdmProducer)`, it emits real `HcRawSeq` entries that the existing optimal-parser consumer (plumbed in #119) walks at block compress time.

What ships:

* `ldm/search.rs` — `count_backwards_match` mirrors donor `ZSTD_ldm_countBackwardsMatch` (`zstd_ldm.c:214`); it honours both the `anchor` and `match_base` lower bounds, capping the backward walk at the tighter of `min(p_in - anchor, p_match - match_base)`. `find_best_match` walks every slot of the bucket associated with `hash_id`, applies the donor staleness + checksum filter (`zstd_ldm.c:431`), runs the forward `common_prefix_len` and backward count, and returns the candidate with the largest combined `forward + backward`. Inputs are bundled into `FindBestMatchInputs` to keep `clippy::too_many_arguments` happy without flattening the donor citations.
* `ldm/mod.rs` — `generate_into` re-implemented end-to-end: it re-seeds the rolling hash and primes it with the first `min_match_length` bytes of the block (donor `zstd_ldm.c:381-383`); maintains absolute `anchor` and `ip` cursors; and per split branches on `split < anchor` (insert-only), no-match (insert-only), or match-found (emit `HcRawSeq` → insert the new entry after the lookup to avoid clobbering `bestEntry`, donor `zstd_ldm.c:490-492` → advance `anchor` → re-reset the hash when the emitted match overruns the hashed window, donor `zstd_ldm.c:497-508`).
* Module-level docs updated to spell out the activation policy explicitly: `LdmProducer` is **never activated automatically by any `CompressionLevel` preset** — distro-grade parity with upstream `libzstd.so.1`, where `ZSTD_compress(..., level)` keeps LDM off at every standard level (1..22).
The opt-in surface lands separately: #27 plugs the Rust parameter API into `BtMatcher::ldm_producer`, #126/#127 wire `ZSTD_c_enableLongDistanceMatching` through the C ABI, and #128 exposes the `zstd --long[=N]` CLI flag.

Tests:

* 9 new search tests (4 `count_backwards_match` invariants, 5 `find_best_match` paths: stale-rejection, checksum filtering, longest-combined selection, backward extension, short-forward rejection). Every donor-staleness fixture uses `offset > 0` to satisfy the reserved "empty-slot" sentinel.
* End-to-end producer test (`generate_into_emits_long_range_match_on_repeated_payload`): a 4 KiB deterministic LCG payload + a 64 KiB unique-byte gap + a 4 KiB repeat of the payload; asserts the producer emits at least one `HcRawSeq` whose offset >= the gap distance and whose match length meets the `min_match_length` floor.

Build + suite:

* 460 / 460 lib tests pass (+9 vs the foundations commit: 8 search + 1 e2e). Phase-5 LDM tests now total 33.
* `cargo clippy --lib --tests` clean.
* Ratio gate `level22_sequences_match_donor_on_corpus_proxy` PASS — `ldm_producer = None` on every `CompressionLevel` preset preserves byte-parity with upstream.
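The backward-extension capping rule above can be illustrated with a standalone sketch. The signature here is simplified (plain slice indices, no bundled inputs struct) and is not the crate's actual API.

```rust
/// Sketch of backward match extension: walk `data[p_in]` and
/// `data[p_match]` backwards while bytes agree, never crossing the
/// `anchor` (unemitted-literal boundary) or `match_base` lower bounds.
fn count_backwards_match(
    data: &[u8],
    p_in: usize,
    anchor: usize,
    p_match: usize,
    match_base: usize,
) -> usize {
    // The walk is capped at the tighter of the two lower bounds.
    let max_back = (p_in - anchor).min(p_match - match_base);
    let mut back = 0;
    while back < max_back && data[p_in - back - 1] == data[p_match - back - 1] {
        back += 1;
    }
    back
}
```

With `b"0123ab0123ab"`, a split at index 10 matching index 4 extends backwards over `"0123"` until the `anchor` bound cuts it off.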
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds a complete Long Distance Matching (LDM) subsystem (params, gear hash, table, search, producer) and wires an optional `LdmProducer` into the BT optimal matcher, updating `prepare_ldm_candidates` to accept live history plus absolute coordinates.

Changes: Long Distance Matching (LDM) Producer Implementation
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant MatchGen as HcMatchGenerator
    participant Bt as BtMatcher
    participant Prod as LdmProducer
    participant Gear as GearHashState
    participant Table as LdmHashTable
    participant Search as find_best_match
    MatchGen->>Bt: prepare_ldm_candidates(live_history, history_abs_start, current_abs_start, len)
    Bt->>Prod: generate_into(live_history, history_abs_start, block_start_abs, block_end_abs, out)
    Prod->>Gear: reset(prefix_bytes)
    loop feed batches
        Prod->>Gear: feed(data_chunk, &mut splits)
        Gear-->>Prod: split offsets
        loop per split
            Prod->>Search: find_best_match(table, hash_id, checksum, inputs)
            Search-->>Prod: LdmMatch or None
            alt match found
                Prod-->>Bt: append HcRawSeq (litlen, offset, matchlen)
            end
            Prod->>Table: insert_absolute(hash_id, abs_pos, checksum)
        end
    end
```
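The `Gear` participant's feed step can be sketched as a minimal rolling hash. The 256-entry table here is seeded from a toy mixer rather than the donor's `zstd_ldm_geartab.h` constants, and the 4-wide unroll and `LDM_BATCH_SIZE` cap are omitted, so this is an illustration of the shape only.

```rust
/// Minimal sketch of a gear rolling hash with split detection.
struct GearHash {
    hash: u64,
    table: [u64; 256],
    stop_mask: u64, // a split fires when (hash & stop_mask) == 0
}

impl GearHash {
    fn new(hash_rate_log: u32) -> Self {
        // Toy table: any fixed pseudo-random constants work for a sketch.
        let mut table = [0u64; 256];
        let mut x = 0x9E37_79B9_7F4A_7C15u64;
        for slot in table.iter_mut() {
            x = x.wrapping_mul(0xBF58_476D_1CE4_E5B9).rotate_left(31);
            *slot = x;
        }
        GearHash {
            hash: 0,
            table,
            // hash_rate_log mask bits give an expected split rate of
            // roughly one per 2^hash_rate_log bytes.
            stop_mask: ((1u64 << hash_rate_log.min(63)) - 1) << 1,
        }
    }

    /// Donor recurrence: hash = (hash << 1) + GEAR_TAB[byte].
    fn feed(&mut self, data: &[u8]) -> Vec<usize> {
        let mut splits = Vec::new();
        for (i, &b) in data.iter().enumerate() {
            self.hash = (self.hash << 1).wrapping_add(self.table[b as usize]);
            if self.hash & self.stop_mask == 0 {
                splits.push(i + 1); // position just past the split byte
            }
        }
        splits
    }
}
```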
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Pull request overview
Adds a new long-distance matching (LDM) producer implementation (gear rolling hash + bucketed hash table + verify/extend search + sequence emission) under `encoding/ldm/`, and plumbs it into the BT optimal path behind an opt-in `Option<LdmProducer>` so default compression presets remain LDM-off (upstream parity).

Changes:
- Introduce an `encoding/ldm/` module implementing the donor-style LDM producer pipeline (gear hash, params, table, search, and aggregator/driver).
- Extend `BtMatcher` with an optional `ldm_producer` and delegate `prepare_ldm_candidates()` to it when present.
- Wire the optimal matcher driver to call `prepare_ldm_candidates()` and register the new `ldm` module in `encoding/mod.rs`.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| zstd/src/encoding/mod.rs | Exposes the new encoding::ldm module. |
| zstd/src/encoding/match_generator.rs | Calls into BT to prepare LDM candidates before optimal parsing. |
| zstd/src/encoding/bt/mod.rs | Adds ldm_producer: Option<LdmProducer> and delegates candidate generation when enabled. |
| zstd/src/encoding/ldm/mod.rs | Aggregates LDM producer state and implements the end-to-end generate_into pipeline. |
| zstd/src/encoding/ldm/gear_hash.rs | Implements donor-parity gear rolling hash + split-point detection. |
| zstd/src/encoding/ldm/params.rs | Implements donor-parity LDM parameter derivation and bounds. |
| zstd/src/encoding/ldm/table.rs | Implements the bucketed LDM hash table and insertion policy. |
| zstd/src/encoding/ldm/search.rs | Implements bucket lookup, checksum filtering, and forward/backward match extension. |
…mask

1. `prepare_ldm_candidates` translates absolute stream positions to indices into the supplied `history` slice. Before, the absolute `current_abs_start` (= `history_abs_start + window_size − current_len`) was handed to `LdmProducer::generate_into` as if it were a slice index, so as soon as `history_abs_start != 0` after window eviction the producer would index out of bounds in `history[block_start..block_end]`. The method now takes an explicit `history_abs_start` parameter; the call site in `start_matching_optimal` passes `self.table.history_abs_start` alongside the `history` slice. This matches the coordinate convention every other matcher backend uses (`idx = abs_pos − history_abs_start`, see `match_table/storage.rs:317-318`).
2. `LdmHashTable::new` asserts `effective_bucket_log <= 8`. The bucket round-robin cursor is a `u8` (mirroring donor `BYTE bucketOffsets[]`, `zstd_ldm.c:202`); above 256 slots the cursor would silently truncate, dropping new inserts on the floor. `LdmParams::adjust_for` already clamps to `LDM_BUCKETSIZELOG_MAX = 8`, so the producer's own path is unaffected — the assertion guards callers that bypass the params helper.
3. `LdmProducer::generate_into` derives `hash_id_mask` from `self.hash_table.bucket_mask()` instead of recomputing it from `params.hash_log / params.bucket_size_log`. The table applies a `min(bucket_size_log, hash_log)` clamp (`zstd_ldm.c:176`) which the producer-side recomputation did not; using the table's authoritative mask eliminates that class of drift.

Tests: a new `prepare_ldm_candidates_translates_absolute_positions_to_slice_indices` regression in `bt/mod.rs` (panics pre-fix on `history[1024..1280]`); a new `new_rejects_bucket_size_log_above_donor_cap` in `table.rs` (panics with the `ZSTD_LDM_BUCKETSIZELOG_MAX` assertion on `LdmHashTable::new(12, 9)`). 462 / 462 lib tests pass. `cargo clippy --lib --tests` clean. `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS.
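The coordinate convention in fix 1 can be sketched in isolation; `abs_to_slice_idx` and `byte_at` are hypothetical helper names, not crate APIs.

```rust
/// Sketch of the shared coordinate convention: matcher backends store
/// absolute stream positions and translate to slice indices only when
/// touching the live history buffer.
fn abs_to_slice_idx(abs_pos: usize, history_abs_start: usize) -> Option<usize> {
    // Positions before the window start were evicted; callers must
    // treat them as stale rather than index out of bounds.
    abs_pos.checked_sub(history_abs_start)
}

fn byte_at(live_history: &[u8], history_abs_start: usize, abs_pos: usize) -> Option<u8> {
    abs_to_slice_idx(abs_pos, history_abs_start)
        .and_then(|idx| live_history.get(idx).copied())
}
```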
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/bt/mod.rs`:
- Around line 353-362: The producer's persisted positions must be kept in
absolute coordinates: avoid rebasing to slice-relative indexes before passing
positions into ldm_producer; instead thread history_abs_start into the
producer/search APIs so the producer and its hash table continue to store
absolute positions, and only subtract history_abs_start when performing the
final history[..] slice indexing. Update the call site around
ldm_producer.generate_into (and the ldm_sequences usage) to pass absolute
start/end (e.g., use current_abs_start and current_abs_start + current_len or
analogous absolute bounds) or add a parameter history_abs_start to the
producer/search path, and change the producer/search implementations
(generate_into or its new/overloaded variant) to convert to slice indices only
when accessing history. Ensure any internal hash-table insertions/searches use
absolute positions so entries remain valid across evictions.
ℹ️ Review info: Configuration used: Path: .coderabbit.yaml · Review profile: ASSERTIVE · Plan: Pro · Run ID: e81a8163-2625-47d7-a3ee-d3b58e7ad32e
📒 Files selected for processing (4): zstd/src/encoding/bt/mod.rs, zstd/src/encoding/ldm/mod.rs, zstd/src/encoding/ldm/table.rs, zstd/src/encoding/match_generator.rs
The producer's bucket table survives across blocks within a frame; storing slice-relative offsets in `LdmEntry.offset` broke correctness as soon as `history_abs_start` advanced past zero under a window slide — old entries pointed at the wrong bytes, and `offset = split_abs − stale_match_pos` could underflow `u32` and emit invalid back-references. This commit moves every cross-block invariant onto absolute stream coordinates:

* `LdmProducer::generate_into` now takes `(live_history, history_abs_start, block_start_abs, block_end_abs, out)`. `live_history[0]` corresponds to absolute `history_abs_start` (the byte that other matcher backends call `base + dictLimit`); the abs→slice translation happens only at the moment of `live_history[..]` indexing.
* `LdmEntry.offset` stores absolute stream positions. Entries that fall out of the current window after a slide are filtered by `find_best_match`'s `entry.offset <= lowest_index_abs` staleness check, where the producer now passes the current `history_abs_start` as `lowest_index_abs`.
* `FindBestMatchInputs` gains a `history_abs_start` field; the search-side `live_history`, `split_abs`, `anchor_abs`, and `lowest_index_abs` are all interpreted as absolute coordinates. The internal abs→slice translation lives in a single place.
* `BtMatcher::prepare_ldm_candidates` forwards the four absolute coordinates straight to the producer (no translation in between).
* `MatchGeneratorDriver::start_matching_optimal` passes `self.table.live_history()` (the post-`history_start` slice) instead of the full `history` Vec; `history_abs_start` corresponds to `live_history()[0]`, not `history[0]`, so the earlier code would have indexed into the dead prefix once eviction began.
* `gear_hash.rs` anchor-entry test comment fix: `GEAR_TAB[80]` is donor row 26 col 2, not row 21 col 2 (`80 / 3 = 26 r 2`).
Tests:

* New `generate_into_preserves_bucket_entries_across_history_slide` in `ldm/mod.rs`: builds a 2×4 KiB payload + 2×32 KiB gap frame; invokes `generate_into` first with `history_abs_start = 0` (populating the table with entries inside payload #1), then again with `history_abs_start = 2048` (simulating window eviction that drops the leading half of payload #1) targeting payload #2. Asserts (a) the producer doesn't panic, (b) every emitted offset stays inside the post-slide live history, and (c) at least one emitted offset >= GAP_A — the long-range win the bucket entries from call 1 are supposed to produce. Pre-fix, slice-relative offsets would either miss the match entirely or emit a corrupt larger-than-live-history offset.
* The `prepare_ldm_candidates_translates_absolute_positions_to_slice_indices` regression from round 1 still passes — the abs-coords producer makes the original OOB scenario impossible by construction.
* All `search.rs` / `generate_into` unit tests migrated to the new signature (added `history_abs_start = 0` argument).

Build + suite:

* 463 / 463 lib tests pass (+1 cross-slide regression).
* `cargo clippy --lib --tests` clean.
* `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS — the `ldm_producer = None` default still preserves byte-parity with upstream.
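The staleness filter also protects the `split_abs - stale_match_pos` subtraction mentioned above; a minimal sketch with a hypothetical helper:

```rust
/// Offsets are computed in absolute coordinates; a stale (evicted)
/// entry is filtered before the subtraction, so the back-reference
/// distance can never underflow.
fn offset_for(split_abs: usize, match_abs: usize, lowest_index_abs: usize) -> Option<u32> {
    if match_abs < lowest_index_abs || match_abs >= split_abs {
        return None; // evicted by a window slide, or not behind the split
    }
    u32::try_from(split_abs - match_abs).ok()
}
```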
Actionable comments posted: 2
Inline comments:
In `@zstd/src/encoding/ldm/gear_hash.rs`:
- Around line 101-116: The degenerate branch in GearHashState::new currently
computes (1u64 << hash_rate_log) which can overflow if hash_rate_log >= 64; make
it defensive by clamping the shift or returning a full-ones mask when
hash_rate_log is out of range: e.g., compute let shift = hash_rate_log.min(63);
if hash_rate_log >= 64 then set stop_mask = u64::MAX else use (1u64 << shift) -
1 (or equivalent) so stop_mask construction in new never shifts by >=64;
reference GearHashState::new, stop_mask and hash_rate_log when applying the fix.
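The defensive construction this comment asks for can be sketched as follows; the exact mask shape the crate uses is an assumption here, the shift clamp is the point.

```rust
/// Defensive stop-mask construction: never shift a u64 by >= 64,
/// which would panic in debug and be UB-adjacent wrapping in release.
fn stop_mask(hash_rate_log: u32) -> u64 {
    if hash_rate_log >= 64 {
        u64::MAX // degenerate input: every hash value passes the filter
    } else {
        (1u64 << hash_rate_log) - 1
    }
}
```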
In `@zstd/src/encoding/ldm/search.rs`:
- Around line 49-68: The docstring for count_backwards_match incorrectly calls
parameters "absolute byte positions" while callers like find_best_match pass
slice-relative indices (after subtracting history_abs_start); update the
documentation of count_backwards_match to state that p_in_abs, anchor_abs,
p_match_abs, and match_base_abs are indices relative to the provided history
slice (or, alternatively, change callers to pass true absolute positions), and
ensure the wording and parameter names reflect "slice indices" rather than
"absolute byte positions" so the contract matches the implementation and
callers.
Wraps the LDM bucket table in donor's rebase scheme so streams
beyond `u32::MAX` (4 GiB) encode without truncation; tightens
small-detail review feedback alongside.
1. **Rebase scheme (donor `ZSTD_ldm_reduceTable`, zstd_ldm.c:520).**
`LdmEntry.offset` is no longer an absolute stream position
stored as `u32` (would truncate above 4 GiB). It is now a
`u32` *relative* to `LdmHashTable::position_base` with a +1
bias so the default-zeroed value remains the empty-slot
sentinel. New helpers:
- `insert_absolute(hash_id, abs_pos, checksum)` — translates
`abs_pos − position_base + 1` before storing.
- `resolve(&entry) -> Option<usize>` — inverse translation;
returns `None` for the sentinel.
- `ensure_room_for(abs_pos)` — checks whether the relative
offset would exceed `u32::MAX − REBASE_GUARD_BAND` (2^30
headroom); if so, calls `reduce`.
- `reduce(amount)` — subtracts `amount` from every entry's
relative offset (saturating at 0 = empty), advances
`position_base`. Matches donor `ZSTD_ldm_reduceTable` shape.
The producer calls `ensure_room_for` before every insert and
uses `insert_absolute`; `find_best_match` uses `table.resolve`
to translate entries back to absolute positions before the
staleness / window-bounds checks. Same scheme `MatchTable`
uses for the BT/HC chain table.
2. **Wider position types in the search-side surface.**
`LdmMatch.match_pos` and `FindBestMatchInputs.lowest_index_abs`
widened from `u32` to `usize` so values past 4 GiB stay
representable end-to-end. Internal storage remains `u32`
(the rebase scheme keeps the value in range); only the
external API uses `usize`.
3. **Defensive `hash_rate_log.min(63)` clamp** in
`GearHashState::new` (gear_hash.rs:116). `LdmParams::adjust_for`
already produces 4..7, but a future caller bypassing the
params helper could hit the `1u64 << hash_rate_log` shift
panic — donor calls this path via internal-only entry points
too, but our `pub(crate)` surface invites a wider audience.
4. **`LdmHashTable::clear` uses `entries.fill(...)`** instead of
a manual loop. `LdmEntry` is `Copy` so this compiles to a
bulk memset — meaningful when `hash_log` is large and clear
runs at every frame boundary.
5. **`count_backwards_match` parameter rename `*_abs → *_idx`**.
After the absolute-coords refactor, callers translate to
slice indices via `abs − history_abs_start` before invoking
the function; the `*_abs` names misled readers (and the
deferred extDict path) into expecting absolute coordinates.
6. **Test memory.** Three tests that exercised the producer at
`with_window_and_strategy(27, 9)` allocated a `1 << 23 = 8M`
entry table (~64 MiB) under parallel nextest, risking OOM on
32-bit shards. They now use a hand-tuned `LdmParams { hash_log:
10, .. }` via a `test_params()` helper (~8 KiB allocation)
without sacrificing test coverage. `producer_constructs_with_donor_default_params` was split — the default-derivation
check now uses `LdmParams::adjust_for(27, 9)` directly so the
~64 MiB producer isn't instantiated at all.
Tests:
* +4 new rebase coverage in `table.rs`:
- `insert_absolute_round_trips_through_resolve`
- `resolve_returns_none_for_empty_slot`
- `reduce_preserves_resolved_absolute_positions` (verifies the
donor "subtract reducer, saturate at 0" semantics and
`position_base` advance preserve resolved positions for
surviving entries)
- `ensure_room_for_rebases_above_guard_band` (triggers a
rebase at `u32::MAX − 2^30 + 1` and confirms a fresh insert
at the trigger position round-trips through `resolve`).
* All search-side tests migrated to `insert_absolute` so the
+1 bias is exercised end-to-end.
Build + suite:
* 467 / 467 lib tests pass (+4 rebase, no regressions).
* `cargo clippy --lib --tests` clean.
* `level22_sequences_match_donor_on_corpus_proxy` ratio gate
PASS — `ldm_producer = None` default still preserves
byte-parity with upstream.
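The +1-biased rebase scheme in point 1 can be condensed into a sketch. Field and method names follow the commit text; the checksum and bucket round-robin machinery are omitted, so this is an illustration rather than the crate's implementation.

```rust
/// Sketch of the rebase scheme: entries store u32 offsets relative to
/// `position_base`, with a +1 bias so the zeroed default stays the
/// empty-slot sentinel.
struct Table {
    position_base: usize,
    slots: Vec<u32>, // 0 = empty
}

impl Table {
    fn insert_absolute(&mut self, slot: usize, abs_pos: usize) {
        assert!(abs_pos >= self.position_base);
        // +1 bias keeps a real entry distinct from the sentinel.
        self.slots[slot] = (abs_pos - self.position_base + 1) as u32;
    }

    fn resolve(&self, slot: usize) -> Option<usize> {
        match self.slots[slot] {
            0 => None, // empty-slot sentinel
            rel => Some(self.position_base + rel as usize - 1),
        }
    }

    /// Donor-style reduce: subtract from every relative offset,
    /// saturating at 0 (= empty), and advance the base by the same
    /// amount so surviving entries resolve to the same absolute pos.
    fn reduce(&mut self, amount: u32) {
        for s in self.slots.iter_mut() {
            *s = s.saturating_sub(amount);
        }
        self.position_base += amount as usize;
    }
}
```

Entries whose relative offset falls at or below the reduce amount saturate to 0 and so become empty, which is exactly the "too old to matter" case.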
Actionable comments posted: 1
Inline comments:
In `@zstd/src/encoding/ldm/table.rs`:
- Around line 170-173: The clear() method currently zeroes entries and
bucket_offsets but leaves position_base unchanged; update LdmTable::clear (the
clear() impl) to also set self.position_base = 0 so subsequent calls like
insert_absolute(hash_id, 0, checksum) do not produce negative relative offsets
against an old base; ensure you modify the same clear() that fills
LdmEntry::default() and bucket_offsets.fill(0) to include resetting
position_base to 0.
`clear()` previously zeroed the entries and bucket cursors but left `position_base` at its last rebased value. After a `reduce()` shifted the base forward, `clear()` would leave the table in a state where a fresh-frame insert at absolute 0 hit the `abs_pos >= position_base` assertion in `insert_absolute`. Resetting `position_base = 0` keeps `clear()` semantically equivalent to constructing a fresh `LdmHashTable` — the per-frame producer reset path can now restart absolute positions from anywhere without surprise.

New `clear_resets_position_base` regression in `table.rs`: inserts at abs 1024, calls `reduce(1 << 20)` to shift the base, verifies `position_base > 0`, calls `clear()`, asserts `position_base == 0`, and then inserts at abs 0 (which would panic on the assertion if the reset didn't happen).

468 / 468 lib tests pass. `cargo clippy --lib --tests` clean.
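The fixed `clear()` contract in miniature, using a stand-in struct rather than the crate's `LdmHashTable`:

```rust
struct TableState {
    position_base: usize,
    entries: Vec<u32>, // 0 = empty-slot sentinel
}

impl TableState {
    /// clear() must be equivalent to constructing a fresh table:
    /// zero the entries AND rewind position_base. Leaving a
    /// post-reduce() base in place makes fresh-frame absolute
    /// positions (which restart at 0) violate `abs_pos >= position_base`.
    fn clear(&mut self) {
        self.entries.fill(0); // bulk memset; entries are Copy
        self.position_base = 0;
    }
}
```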
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: 6de5ef7 | Previous: 4bf13f0 | Ratio |
|---|---|---|---|
| compress/level22/small-4k-log-lines/matrix/c_ffi | 0.111 ms | 0.085 ms | 1.31 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
README:

* Lead with a value-prop tagline (pure Rust, dictionary handle, no FFI, no_std) instead of jumping straight into the benchmarks dashboard.
* Add a Quick Start section with `cargo add` and a `no_std` toggle.
* Reframe the status section as decoder / encoder / dictionary capability summaries instead of a TODO-style feature checklist.
* Collapse the internal compression-strategy backend table into a `<details>` block.
* Move the live benchmark dashboard link into a Performance section pointing at the GitHub Pages dashboard; methodology stays in BENCHMARKS.md, no headline numbers in README.
* Move the project-relationship paragraph to the bottom and shorten it.
* Drop the internal ADR-013 reference (replaced by a concrete no-cmake / no-FFI statement in the tagline).
* Switch the `LICENSE` link from a broken relative path to an absolute GitHub URL — the previous form rendered dead on docs.rs.

docs.rs landing:

* Replace the duplicated Decompression/Compression preamble in `lib.rs` with a feature-oriented overview that links into the per-module documentation, then re-include the README via `include_str!` (single source of truth for the landing page).
* Enable `#![cfg_attr(docsrs, feature(doc_cfg))]` so feature-gated items render their feature-flag badge on docs.rs.
* Add `[package.metadata.docs.rs] all-features = true, rustdoc-args = ["--cfg", "docsrs"]` so the published documentation builds with every optional surface visible.
* Expand `decoding/mod.rs` and `encoding/mod.rs` module preambles — the previous one-liners pointed users at the wrong entry point. The new docs walk through the three entry points each side exposes (one-shot / streaming / low-level) with the trade-off for each.

All 468 / 468 lib tests pass. `cargo clippy --lib --tests` clean. `cargo doc --lib --features dict_builder` builds with zero warnings.
The doc comment referenced a non-existent `LDM_INSERT_LOOKAHEAD` constant and claimed the function returned `abs_pos - shift_applied`. The function returns `()` and operates only on the supplied `abs_pos` — there is no per-call lookahead. The rewritten comment matches the actual signature and explains the single-compare common path plus the rebase trigger. Doc-only change; 468 / 468 lib tests still pass.
Actionable comments posted: 1
Inline comments:
In `@zstd/src/encoding/ldm/table.rs`:
- Around line 294-296: The rustdoc for ensure_room_for uses a line that begins
with "- " which rustdoc treats as a list item and triggers
clippy::doc-lazy-continuation; edit the doc comment for ensure_room_for (the
paragraph mentioning u32::MAX and REBASE_GUARD_BAND and Self::reduce) so the
sentence does not start with a leading hyphen — e.g., rephrase the fragment to
remove the leading "-" or wrap the symbol in inline code text so it reads like
"exceed `u32::MAX - REBASE_GUARD_BAND`, advance `position_base` by
`REBASE_GUARD_BAND` and run `Self::reduce` across all…" ensuring the line no
longer begins with "- " to satisfy clippy.
Clippy 1.95's `doc_lazy_continuation` lint parses a doc line starting with `- ` as the head of a markdown list item; the sentence continuation on the following line then triggers the indentation warning. CI runs with `-D warnings`, so the build fails. Rewording to `When the relative position would exceed u32::MAX − REBASE_GUARD_BAND` keeps the prose at column 0 with no leading dash.
Actionable comments posted: 1
Inline comments:
In `@zstd/src/encoding/ldm/table.rs`:
- Around line 218-237: insert() currently accepts an entry when the bucket
offset is 0, but 0 is the empty-slot sentinel so such inserts become invisible
to resolve(); add a runtime check in pub(crate) fn insert (in table.rs) to
reject or panic on attempts to insert when offset == 0 (use the computed offset
variable and include a clear message referencing the bucket/hash_id), and update
the round-robin test fixture that uses 0 as test data so it uses a non-zero slot
value (also apply the same guard where similar logic exists around the other
insert-like code span noted in the review).
The new `ensure_room_for_loops_across_multiple_guard_bands` test materialises an `abs_pos` of `5 * REBASE_GUARD_BAND` (= 5 GiB) as a `usize` literal, which overflows `usize` on i686 (`usize::MAX` ≈ 4 GiB) at compile time under `#[deny(arithmetic_overflow)]`. Gate the test to `target_pointer_width = "64"`; the scenario is unreachable on 32-bit anyway because `usize::MAX` caps addressable streams below the multi-rebase threshold the test exercises.
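A sketch of the gate; `REBASE_GUARD_BAND = 1 << 30` follows the earlier commit text, and the function is shown as a plain fn rather than a `#[test]` so it stays callable here.

```rust
// On 32-bit targets the literal `5 * (1 << 30)` (≈ 5 GiB) does not
// fit in usize, so the whole scenario is compiled out by the cfg gate
// instead of failing under #[deny(arithmetic_overflow)].
#[cfg(target_pointer_width = "64")]
fn multi_guard_band_scenario() -> bool {
    const REBASE_GUARD_BAND: usize = 1 << 30; // 2^30 headroom, per the commit
    let abs_pos = 5 * REBASE_GUARD_BAND;
    abs_pos > u32::MAX as usize // past the 4 GiB line, 64-bit only
}
```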
…insert panic docs

CodeRabbit Major: `insert()` accepted `entry.offset == 0`, which aliases the empty-slot sentinel — `resolve()` would treat the stored entry as empty and silently drop the candidate. Promote to a runtime `assert!` (not `debug_assert!`) so contract violations fail fast in release. Shift the round-robin fixture from `0..6` to `1..=6` so the test stops using the sentinel as data, and add a regression test for the new panic path.

Copilot: the prior panics-doc claim that release would 'corrupt the next bucket' was inaccurate — Vec OOB indexing panics in both debug and release. Rephrase: the `debug_assert!` only exists for an earlier, clearer error message; release falls back to Vec's own bounds check.
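The promoted guard can be sketched with a stand-in insert helper (not the crate's `insert()` signature):

```rust
/// Sketch of the fail-fast guard: offset 0 aliases the empty-slot
/// sentinel, so storing it would make the entry invisible to resolve().
fn insert_checked(slots: &mut [u32], slot: usize, offset: u32) {
    // Runtime assert! (not debug_assert!): contract violations should
    // fail fast in release builds too, not only under debug assertions.
    assert!(offset != 0, "offset 0 is reserved as the empty-slot sentinel");
    slots[slot] = offset;
}
```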
`compress` streams from `Read` incrementally and buffers only compressed blocks until the frame end (to write Frame Content Size in the header) — peak ≈ output_size, independent of input length. `compress_to_vec` eagerly `read_to_end`s the input into a Vec so the encoder can be handed `&[u8]` + an exact source-size hint — peak ≈ input + output. The previous combined wording conflated the two and overstated `compress`'s input buffering requirements.
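The two buffering strategies can be contrasted with a generic sketch. The byte transform is a stand-in for compression and neither function is this crate's API; only the buffering shape is the point.

```rust
use std::io::Read;

// "compress"-style: stream the input, buffer only the transformed
// output. Peak memory tracks the output buffer.
fn transform_streaming<R: Read>(mut src: R) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf = [0u8; 4096];
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // Stand-in for per-block compression of buf[..n].
        out.extend(buf[..n].iter().map(|b| b ^ 0xFF));
    }
    Ok(out)
}

// "compress_to_vec"-style: materialise the whole input first so the
// core sees &[u8] plus an exact size hint. Peak ≈ input + output.
fn transform_buffered<R: Read>(mut src: R) -> std::io::Result<Vec<u8>> {
    let mut input = Vec::new();
    src.read_to_end(&mut input)?;
    Ok(input.iter().map(|b| b ^ 0xFF).collect())
}
```

Both produce identical bytes; they differ only in what sits in memory while the transform runs.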
Inline comments:
In `@zstd/src/encoding/mod.rs`:
- Around line 3-25: The module docs incorrectly state "Three entry points" while
listing four (`compress`, `compress_to_vec`, `StreamingEncoder`,
`FrameCompressor`); update the opening sentence to reflect the correct count
(e.g., "Four entry points cover the common use cases") or rephrase it to avoid a
hard count, and ensure the updated text remains consistent with the subsequent
bullets for `compress`, `compress_to_vec`, `StreamingEncoder`, and
`FrameCompressor`.
…reference CodeRabbit + Copilot flagged 'Three entry points' enumerating four bullets (compress, compress_to_vec, StreamingEncoder, FrameCompressor). Bump to 'Four'. Copilot also noted StreamingEncoder implements crate::io::Write, not std::io::Write unconditionally — crate::io re-exports std::io when the std feature is on and falls back to a no_std-friendly trait otherwise. Update the bullet to reference crate::io::Write and call out the alias so no_std consumers are not misled.
The previous wording 'peak ≈ output_size, independent of input length' was misleading: output_size grows with input length, and the worst case for incompressible payloads is O(input_size) (compressed output cannot exceed input + small overhead). Rephrase to 'O(compressed_size) (worst-case O(input_size))' and call out explicitly that the win vs compress_to_vec is avoiding materialising the input alongside the output.
… customMem tracking Three coupled changes that together bring Dfast (default level) memory from 70 MB to ~9 MB on a 1 MiB input and make the FFI-side memory metric actually count the C heap:
1. **[u32; 4] slots + rebase scheme** — DfastMatchGenerator's short_hash / long_hash switch from Vec<[usize; 4]> to Vec<[u32; 4]>. A new position_base field anchors the +1-biased relative encoding (slot 0 = empty sentinel; slot N encodes abs_pos = position_base + N - 1). ensure_room_for() advances the base when the relative offset would exceed u32::MAX - GUARD_BAND, shifting every stored slot down via reduce(). Same scheme as encoding/ldm/table.rs from PR #139. Per-bucket memory: 32 -> 16 bytes (2x). Combined with item 2 the two-table footprint drops from 64 MiB to ~4 MiB on level 3 large input.
2. **DFAST_HASH_BITS 20 -> 17** — matches donor clevels.h:31 (level 3 large-input bucket, hashLog = 17). Caps the upper bound; dfast_hash_bits_for_window still clamps the runtime value to [MIN_WINDOW_LOG, DFAST_HASH_BITS].
3. **ZSTD_customMem hooks for FFI bench** — CR review of PR #143 flagged the original ffi_peak_alloc_bytes metric as misleading: #[global_allocator] only observes Rust allocations, so libzstd's CCtx + hashTable + chainTable + workspace allocations through libc malloc were invisible. Route the FFI path through ZSTD_createCCtx_advanced / ZSTD_createDCtx_advanced with a customMem shim that forwards alloc/free to alloc::alloc / alloc::dealloc (which the TrackingAllocator wraps). FFI memory numbers now reflect real C heap use.
Two new functions in compare_ffi.rs:
- ffi_encode_all_aligned: replaces zstd::stream::Encoder with raw ZSTD_compress2 on a customMem-aware CCtx
- ffi_decompress_via_custom_mem: same idea for the decode path
The header-prefix allocator (16-byte align, size stored at ptr-16) lets customFree recover the layout for symmetric tracking.
Local level_3_dfast measurements (macOS aarch64, 1 MiB decodecorpus-z000033):

| axis | before | after | win |
| --- | --- | --- | --- |
| peak alloc (rust) | 70 MiB | 9.7 MiB | 7.2x lower |
| peak alloc (ffi) | 1 MiB* | 2.3 MiB | *was misreported |
| ratio rust/ffi | 67x | 4.3x | genuine narrowing |
| throughput | 9.89 MB/s | 16.0 MB/s | 1.6x faster |
| compression | 0.5349 | 0.5349 | unchanged |

Ratio gate level22_sequences_match_donor_on_corpus_proxy green throughout. 478/478 lib tests pass. Part of #111 Phase 7. Closes #143 review threads (Rust-only allocator scope + heading rename).
Three Copilot review threads on PR #143: 1. compare_ffi.rs:226 — the customMem-instrumented FFI encode helper was also used by criterion's c_ffi timing benchmark, so libzstd latency numbers included header bookkeeping + atomic tracker overhead a real caller never pays. Split into two helpers: ffi_encode_all_aligned (uninstrumented, via zstd::stream::Encoder, used by the timing bench) and ffi_encode_via_custom_mem (customMem-aware, used only inside measure_peak_alloc). Throughput numbers now reflect what a real downstream caller observes. 2. dfast/mod.rs:156 — no targeted coverage for the new ensure_room_for -> reduce() path. Added dfast_ensure_room_for_rebases_above_guard_band, mirroring the LdmHashTable rebase tests from PR #139. The test seeds an early slot, fires ensure_room_for at u32::MAX - GUARD_BAND + 1 (gated to target_pointer_width = 64), and verifies (a) position_base advanced by DFAST_REBASE_GUARD_BAND, (b) the pre-rebase entry clamped to DFAST_EMPTY_SLOT, (c) a post-rebase pack/unpack round-trips the absolute position. 3. match_generator.rs:67 — PR description didn't surface that DFAST_HASH_BITS 20 -> 17 plus the [usize;4] -> [u32;4] storage change are encoder behavior changes, not just bench/reporting work. Rewrote the PR body to explicitly call out the scope: Dfast default sizing now matches donor clevels.h, slot representation is rebase-backed, compression ratio remains identical on every tested scenario (level_3_dfast smoke). 479/479 lib tests pass; level22 ratio gate green.
…ine (#143) * feat(bench): replace buffer-size estimate with real peak-alloc metric Bench previously emitted REPORT_MEM with rust_buffer_bytes_estimate = ffi_buffer_bytes_estimate = input.len() + output.len() — a static shape estimate that ignores matcher state, scratch buffers, and any per-call internal allocation. The number was identical for both columns, so it carried no comparative signal. Replace with a hand-rolled GlobalAlloc wrapper around System (~30 LoC) that tracks live-bytes and high-water mark via two relaxed atomics. measure_peak_alloc(closure) reset()s the peak before running the closure and reports the delta of live bytes above the baseline. bench_compress and bench_decompress each wrap one full encode/decode pass for Rust and FFI separately so the emitted rust_peak_alloc_bytes / ffi_peak_alloc_bytes reflect real per-call high-water marks rather than approximations. run-benchmarks.sh:
- Update the REPORT_MEM regex + field names in lockstep.
- Add peak_alloc_bytes records to benchmark-relative.json (third metric alongside compression_ratio + throughput_bytes_per_sec). delta_ratio = rust_peak / ffi_peak, same direction semantics as compression_ratio (>1 means Rust uses more). Stage normalised (compress / decompress-{source} -> stage + source) so the dashboard's per-series grouping picks them up under the existing filter logic.
- Markdown report header renamed to 'Rust/C peak alloc bytes'.
Baseline numbers from a local level_3_dfast shard (will land in CI on next bench-matrix run; numbers below are macOS aarch64 local):

| scenario | rust_kb | ffi_kb | ratio |
| --- | --- | --- | --- |
| small-1k-random | 1299 | 33 | 39x |
| small-10k-random | 1335 | 42 | 32x |
| small-4k-log-lines | 1330 | 32 | 41x |
| high-entropy-1m | 71709 | 2080 | 34x |
| decodecorpus-z000033 | 71160 | 1056 | 67x |
| low-entropy-1m | 69199 | 32 | 2142x |
| large-log-stream | 180823 | 51 | 3543x |

This is the first half of the Phase 7 baseline (#111).
Encode + decode latency and compression ratio numbers already land via the existing bench matrix; this commit closes the gap on the third dimension. Picking the worst (scenario, level) ratios from the next main snapshot will drive the target_feature / arena-allocator discussion in the second half of Phase 7.
* perf(dfast): donor-parity hash storage (u32 slots + rebase) + libzstd customMem tracking
* test(dfast): assert pack_slot never returns sentinel before bucket membership checks CR review of PR #143 flagged the three dfast_seed_remaining_hashable* tests as potentially vacuous: the bucket membership assertion `matcher.long_hash[long_hash].contains(&target_slot)` would still pass on a partially-empty bucket if `pack_slot(target_abs_pos)` ever regressed to `DFAST_EMPTY_SLOT` — the sentinel sits in unused entries, so a false positive could mask a real failure to seed the position. Guard each `pack_slot` -> `contains` pairing with an explicit `assert_ne!(target_slot, DFAST_EMPTY_SLOT)`. Three call sites in the dfast_seed_remaining_hashable_starts_* tests.
* ci(bench): strategy-grouped shards, PR-fast feedback, alert only on canonical levels Three coupled CI changes for Phase 7: 1. **Shard plan moved from "one level per shard" to "one strategy per shard"**. bench-matrix now emits a 'shards' output: each entry has an 'id' (fast / dfast / greedy / lazy / btopt / btultra / btultra2) and a CSV of levels (the level filter is comma-aware, so the bench binary iterates all levels in one process).
Shard count drops from 87 (3 targets × 29 levels) to 21 (3 × 7 strategies) on main, with the per-binary build cost amortised across each strategy's levels. 2. **PR runs only the canonical pair**. On a 'pull_request' event bench-matrix emits a single 'pr-canonical' shard whose levels are 'level_3_dfast,level_22_btultra2' — the donor default + max compression. Three shards total per PR (one per target) instead of 87. Reviewers see ratio + speed + memory deltas for the two levels we ship as the primary public guarantees within minutes; the full 29-level matrix still runs post-merge to keep the gh-pages snapshot complete. 3. **github-action-benchmark alert restricted to canonical levels**. ALERT_LEVELS = {level_3_dfast, level_22_btultra2} in run-benchmarks.sh. Every other level still lands in benchmark-relative.json / benchmark-report.md / the dashboard, but only the canonical pair drives regression alerts — experimental levels (fast tier, btopt) shouldn't fire false positives while they're being tuned. aggregate-bench-levels.py markdown header changes from '## Level: <level>' to '## Strategy group: <shard_id>' so the per-target combined report makes sense when each shard now bundles many levels. Smoke-tested locally: single shard with comma-separated filter 'level_3_dfast,level_22_btultra2' emits 98 relative records (49 per level), confirming both levels iterate in one binary invocation.
* fix(bench): unbias FFI latency + add Dfast rebase regression test
* fix(bench): gate tracking allocator, stream FFI customMem, skip non-canonical alert fallback Three coupled fixes from PR #143 review (3 CR + 2 Copilot): 1. **Tracking allocator gated by TRACKING_ENABLED flag** (CR #7 + Copilot #9). Until now TrackingAllocator updated atomics on every process-wide alloc/free/realloc/alloc_zeroed, including criterion's timing loops. That biased every Rust-side throughput sample by the atomic fetch_add/fetch_max + branch overhead while the FFI uninstrumented path paid nothing. Add a static AtomicBool TRACKING_ENABLED = false; flip to true only inside measure_peak_alloc via an RAII guard (TrackingGuard) so panic safety is automatic, then flip back to false on drop. All four allocator methods now check TRACKING_ENABLED before touching ALLOC_CURRENT/ALLOC_PEAK so the timing loops pay one branch + Relaxed atomic load per allocation instead of two atomic RMWs. 2.
**FFI customMem encode now streams via ZSTD_compressStream2** (Copilot #10). Previous one-shot via ZSTD_compress2 preallocated ZSTD_compressBound(input.len()) output Vec, while the throughput path used zstd::stream::Encoder writing into a growing Vec. That made ffi_peak_alloc_bytes inflate to the worst-case bound on every measurement — visible on large-log-stream where preallocation showed ~104 MiB FFI peak while real streaming peaks at ~3.7 MiB. Replace with a streaming loop over ZSTD_compressStream2 with ZSTD_e_continue/ZSTD_e_end, growing the output Vec on each flush. Both paths now allocate the same FFI encode shape. 3. **Non-canonical shards emit empty benchmark-results.json** (CR #6). Previously when regression_levels was empty (e.g. the 'fast' or 'btopt' strategy shard) the fallback repopulated benchmark_results with EVERY parsed timing, silently re-expanding alert surface to 29 levels. Split the fallback: when regression_levels still has members (canonical shard mis-scoped on scenarios), keep the fallback so the dashboard sees data; when regression_levels is empty (non-canonical strategy shard), emit an empty list so github-action-benchmark gets no rows to alert on. INFO log shows the empty alert-set + present-set for the affected shard. Also fixed the hash-table size math in DFAST_HASH_BITS rationale (CR #8): per-bucket vs per-slot mix-up — 1<<17 buckets × 16 bytes for [u32; 4] is 2 MiB per table (not 512 KiB as written). 479/479 lib tests pass. Smoke at level_3_dfast confirms FFI streaming encode emits comparable peak numbers (~3.2 MiB for 1 MiB input vs prior 2.3 MiB preallocated; large-log-stream drops from 104 MiB to 3.7 MiB). * test(dfast): cover long_hash in rebase regression + drop i686 gate Two CR nitpicks on dfast_ensure_room_for_rebases_above_guard_band: 1. The test only seeded short_hash before the rebase, so a future regression that cleared only short_hash would still pass even though reduce() is a shared contract for both tables. 
Seed long_hash[0][0] as well and assert it clamps to DFAST_EMPTY_SLOT after ensure_room_for runs. 2. The #[cfg(target_pointer_width = '64')] gate was prophylactic but unnecessary — trigger_abs = u32::MAX - DFAST_REBASE_GUARD_BAND + 1 = 0xC0000000 fits in usize on i686 (usize::MAX == u32::MAX there). Drop the gate so the packed-slot boundary path + the u32 ↔ usize round-trip is exercised on every target we ship. * fix(bench): balance dealloc counter via per-allocation header flag Both reviewers flagged the same correctness bug on the TRACKING_ENABLED gate: closures returned a Vec, the Vec dropped AFTER measure_peak_alloc flipped tracking back off, so dealloc skipped the fetch_sub and ALLOC_CURRENT drifted upward across every sample. On the i686 bench target the 32-bit AtomicUsize could wrap and corrupt later peak measurements. Switch to a header-prefix scheme: each TrackingAllocator allocation over-allocates max(align, 16) bytes, stores a u8 "counted at alloc time" flag at the header offset, and returns ptr+header to the user. dealloc reads the flag and always subtracts the size when set, even when TRACKING_ENABLED is now false. The fetch_add path on alloc still checks TRACKING_ENABLED so criterion timing loops outside measurement windows continue to skip the atomic RMW entirely. Drop the alloc_zeroed / realloc overrides — GlobalAlloc's defaults forward to our alloc/dealloc, which already do the right thing under header accounting. (System.realloc would invalidate the header, so the default alloc+copy+dealloc path is actually the correct one here.) Smoke at level_3_dfast keeps the same headline numbers (decodecorpus ~10 MiB Rust vs ~3.5 MiB FFI), confirming the fix doesn't shift the measurement scale — only that ALLOC_CURRENT is now balanced across the full bench process. 479/479 lib tests pass. 
* refactor(bench): RSS sampler + per-CCtx customMem, drop global allocator Memory measurement no longer biases timing samples and the FFI compare runs the same operation for both metrics: - Drop `#[global_allocator] TrackingAllocator`. Even with the per-call gate flag, every Rust allocation in the bench binary paid an atomic load + branch + extra 16-byte header — biasing criterion's timing loops against the FFI side which has no such wrapper. - Replace with OS RSS sampling for the Rust side: a background poller reads `mach_task_basic_info.resident_size` (macOS) or `/proc/self/statm` (Linux) every 250 µs and updates a peak counter. Hot-path overhead is zero — the kernel maintains RSS independently. - FFI side keeps `ZSTD_customMem` but routes through a per-CCtx `FfiMemTracker` (passed as `opaque`), not a process-wide static. No global state, no atomic ops; libzstd's single-threaded-per-context allocator path is the only writer. - Unify the encode and decode helpers: one `ffi_encode_to_vec` and one `FfiDCtxHandle::decompress_into`. Timing loops pass `mem = None` (default malloc, uninstrumented). Memory measurement passes `Some(&mut tracker)`. Same `ZSTD_compressStream2` / `ZSTD_decompressDCtx` call path in both cases — no aligned-vs-customMem split that could let the two metrics describe different operations. `FfiDCtxHandle` reuses one DCtx across decode iterations so the sample reflects steady-state (matches the pure-Rust `FrameDecoder` reuse pattern). CI gates added to lint job: - `compare_ffi` must not list `bench_internals` in `required-features` (would widen Rust-side visibility vs FFI, breaking parity). - `zstd/src/` must not reference any bench-instrumentation identifier (`TrackingAllocator`, `FfiMemTracker`, `PeakWindow`, `ZSTD_customMem`, `customMem(`). Bench instrumentation stays in `zstd/benches/`. 
Local verification on level_3_dfast / decodecorpus-z000033 (1 MiB): decompress c_ffi: 890 µs (no regression vs reusable DCtx baseline) compress: ratio + memory numbers unchanged shape vs prior baseline Lib tests: 533/533, clippy clean (both feature sets), production isolation grep: clean * fix(ci): skip doc-only lines in bench-leak gate The leak gate's rg invocation was raw text search and would trip on identifier mentions inside doc blocks even though the rationale said "identifier references in actual code are not". Pipe through a second `rg -v` that drops lines whose content starts with a Rust documentation prefix (`//`, `/*`, ` *`), so the gate fires only on real code uses of `TrackingAllocator`, `FfiMemTracker`, `PeakWindow`, `ZSTD_customMem`, or `customMem(`. Verified locally: - real code (`let _ = TrackingAllocator;`) → gate trips - doc-block mentions of every flagged symbol → gate passes - current `zstd/src/` → clean (no false positives) * chore(gitignore): ignore claude session state, pycache, zstd fuzz corpus * refactor(bench): rename Rust memory field to peak_rss_delta_bytes The Rust side metric was emitted as `rust_peak_alloc_bytes`, suggesting parity with the FFI side's precise `ZSTD_customMem` byte count. In reality it is OS resident-set-size delta from background sampling — allocations satisfied from already-faulted pages or from the allocator's cached arena do not bump RSS, so warm scenarios underreport against true peak allocated bytes. 
Renames: - bench `emit_memory_report` field: `rust_peak_alloc_bytes` → `rust_peak_rss_delta_bytes` - aggregator MEM_RE regex + JSON record key match the new name - dashboard markdown column header reflects the proxy semantics - FFI field unchanged — `ffi_peak_alloc_bytes` IS precise alloc counter Dashboard metric group key (`peak_alloc_bytes`) is kept stable so the plot continues to group rust vs ffi side-by-side; doc comments on both the bench and the aggregator now spell out the asymmetry so cross-side absolute values aren't read as directly comparable. * fix(ci): parse Cargo.toml with tomllib for bench-feature gate The awk/grep gate matched `required-features = [...]` only when the array fit on one line. A future reformat to a multi-line TOML array would let `bench_internals` slip into `compare_ffi`'s feature set without tripping the gate. Switch to `tomllib.load` (stdlib on Python 3.11+, available on every GitHub runner we use) so single-line and multi-line arrays are handled by the same parse path. Verified locally: - current Cargo.toml → gate passes - same file rewritten with `required-features = [\n "dict_builder", \n "bench_internals"\n]` → gate trips * refactor(bench): preserve decode RSS window + plain FfiMemTracker arithmetic Three small bench corrections: - Decode RSS sampling: `rust_window.finish()` now runs while `target` and `decoder` are still alive. Previously they sat in an inner block that dropped them before `finish` took its final on-thread sample — on sub-poll-interval payloads (small inputs decoding faster than the 250 µs sampler), the background poller can miss the spike entirely and the post-drop final sample underreports actual decode footprint. - `FfiMemTracker::current` updates with plain `+=` / `-=` instead of saturating arithmetic. 
A single-CCtx measurement bench cannot reach `usize::MAX` of live state under any realistic libzstd workload; if the counter ever did over/underflow that's a real bug (alloc/free imbalance or mis-routed opaque pointer) and a panic surfaces it instead of silently freezing at the saturation bound. - Code comments at the FFI-peak read sites document the intentional asymmetry: the FFI tracker observes ONLY libzstd's customMem requests, NOT the Rust-owned output/chunk/target buffers the FFI helpers also allocate. Both sides allocate output via the system allocator and the metric is the libzstd-internal cost on top, so wrapping the whole FFI call in an RSS window would inflate the comparison with bookkeeping the Rust path also pays. * fix(bench): take RSS baseline after sampler thread is live `PeakWindow::start` previously snapshotted `baseline` before `thread::spawn`, so the sampler thread's own startup cost (stack pages faulted in, TLS init) leaked into every reported delta as a fixed positive bias — most visible on the small scenarios where the metric already lives near the noise floor. Spawn the sampler first, have it store the first sample as the initial `peak` value, and only then read `baseline = peak.load(...)` on the main thread. The sampler thread's footprint is now absorbed by `baseline` and the returned delta reflects only the workload allocation. An `Arc<AtomicBool>` ready flag plus a `thread::yield_now()` spin gates the baseline read until the sampler has produced its first sample. Spin is bounded by a single OS RSS query so wall-clock cost is negligible. * refactor(bench): split compare_ffi into timing + memory binaries Single bench tried to do both timing and memory measurement, which forced asymmetric observers (OS RSS sampling for Rust vs customMem counter for FFI). Cross-side comparison was misleading because the two metrics measure different things, and the timing path still paid RSS-sampler thread overhead even when its data wasn't used. 
Split into two binaries with disjoint responsibilities: - `compare_ffi.rs` (timing + ratio only): runs under a vanilla system allocator. No `#[global_allocator]`, no `ZSTD_customMem` hooks, no RSS sampler. Criterion samples are unbiased. - `compare_ffi_memory.rs` (new): standalone `fn main()` that installs a `#[global_allocator]` tracking wrapper and routes libzstd's `ZSTD_customMem` allocate/free callbacks through `std::alloc::alloc` / `std::alloc::dealloc`. A single counter sums Rust-side and FFI bytes — by construction symmetric, so the cross-side ratio is meaningful again. Per-allocation tracking overhead is fine here because this binary doesn't claim to measure timing. Each binary registers in `Cargo.toml` as a separate `[[bench]]` and neither requires `bench_internals`, so the existing CI lint gate (no `bench_internals` in compare_ffi parity benches) extends naturally to cover both names. * ci(bench): wire compare_ffi_memory binary into main-push runs CI plumbing for the new memory bench: - bench-build step: cargo bench now builds both binaries in one invocation. zstd-sys/criterion/etc. only compile once; the binaries ship together in the `bench-binary-<target>` artifact. - run-benchmarks.sh: optional `STRUCTURED_ZSTD_BENCH_MEMORY_BIN` env var. When set, the memory binary runs sequentially after the timing binary and its `REPORT_MEM` output is appended to the same raw file for uniform downstream parsing. When unset (current PR-shard path), no memory measurement is taken. - benchmark shard job: passes `STRUCTURED_ZSTD_BENCH_MEMORY_BIN` only on `event_name == 'push'` (main pushes). PRs keep their fast review-cycle latency; main pushes get the full memory matrix into the dashboard. - bench-leak gate: extended to forbid `bench_internals` in `compare_ffi_memory` too, and the identifier-leak grep now matches `ALLOC_PEAK` / `ALLOC_CURRENT` / `TRACKING_ENABLED` from the new binary in addition to the previous symbol set. 
- run-benchmarks.sh `peak_alloc_bytes` aggregation: restores `delta_ratio` and a `delta>1 means Rust allocates more` style interpretation. Both columns now come from one observer, so the ratio is honest. * fix(bench): correct allocator header math + bypass tracker for FFI sizes Two correctness bugs in `compare_ffi_memory` flagged by CR/Copilot: 1. `augmented_layout` reserved a fixed 16-byte header but `alloc` / `dealloc` computed the actual header offset as `align.max(HEADER_BYTES)`. For SIMD types or any align-greater-than-16 allocation the user pointer ended up past the System.alloc-owned block — OOB writes during init plus a layout mismatch on `dealloc`. Centralised the header size in `tracker_header(layout) -> align.max(HEADER_BYTES)` and routed `augmented_layout`, `alloc`, and `dealloc` through the same helper so the three sites can't drift again. 2. `ffi_alloc` routed `std::alloc::alloc(size + FFI_HEADER)` through `TrackingAllocator`, so every libzstd allocation was counted as `size + 16` while the Rust side counts only `layout.size()` (the original request). That added a fixed 16-byte-per-alloc bias to `ffi_peak_alloc_bytes`, skewing the cross-side ratio Copilot flagged on PR #143. Switched `ffi_alloc` / `ffi_free` to `System.{alloc,dealloc}` (bypassing TrackingAllocator's header bookkeeping) and manually `fetch_add`/`fetch_sub` `size` (not `total`) into `ALLOC_CURRENT` / `ALLOC_PEAK`. libzstd allocations are now counted exactly like Rust allocations are. Dropped accordingly: customMem `decodecorpus-z000033 level_3_dfast` compress FFI peak 3,570,299 → 3,570,267 bytes (delta = 32 bytes header overhead across two allocations). Two CI-side aux fixes flagged in the same round: 3. `run-benchmarks.sh` now treats an empty `memory_rows` list as informational, not fatal — PR shards intentionally don't run the memory bench (no `STRUCTURED_ZSTD_BENCH_MEMORY_BIN` set) and would have crashed the timing-shard reporter otherwise. 4. 
The post-artifact `chmod +x` step covers both bench binaries. `actions/download-artifact` strips the executable bit, and only `compare_ffi` was being marked +x — the memory binary would have failed run-benchmarks.sh's `-x` check on main pushes. Also refreshes the stale "OS RSS delta" wording in the aggregator comment to describe the unified single-observer setup. * fix(ci): omit Peak Allocation section when memory bench did not run On PR shards `memory_rows` is empty (the memory binary is intentionally not invoked there). The Markdown report builder still emitted the `## Peak Allocation Bytes` heading and table header in that case, leaving a blank section. Guard the heading + table builder behind `if memory_rows:` so the section is omitted entirely when there's no data — matches the earlier INFO log that announced the omission. * fix(ci): scope timing fallback to canonical alerts; keep zero-ffi memory rows Two narrow correctness fixes in run-benchmarks.sh: - benchmark-results fallback (timing): previously, when the smoke filter matched zero rows on a strategy shard that contained a canonical alert level, the fallback rebuilt benchmark_results from the unfiltered `timings` list. On a `dfast` shard that also carries `level_2_dfast` this reintroduced the non-canonical sibling into the github-action-benchmark feed, breaking the "alerts fire ONLY on level_3_dfast + level_22_btultra2" contract. The fallback now filters `timing_rows` to `REGRESSION_STAGES ∩ regression_levels` so the canonical contract holds even when scenario mapping drifts. - peak_alloc_bytes row emission: `if delta_ratio is None: continue` dropped entire rows when a scenario had zero libzstd allocations, hiding valid datapoints. Keep the row, set `delta_ratio` and `delta_percent` to null when undefined — the dashboard still gets both `rust_value` and `ffi_value`. 
* docs(bench): align observer wording with current customMem-bypass path

  Two stale comments after the FFI tracker switched to `System.alloc` / `System.dealloc` bypass + manual atomic updates (commit d35a7bd):

  - The module-level docstring in `compare_ffi_memory.rs` still said the FFI side "routes through the same Rust allocator". Updated to describe the actual setup: Rust via `#[global_allocator]`, FFI via customMem callbacks calling `System.alloc` directly and manually updating the shared atomics with the libzstd-requested size only (header bytes excluded to avoid double-counting).
  - Two equivalent comment blocks in `run-benchmarks.sh` (record builder + markdown table intro) carried the same old wording. Rewritten to match.

  No behavior change.

* fix(bench): override realloc in TrackingAllocator + assert exact decode length

  Three fixes from CR/Copilot on compare_ffi_memory.rs:

  - TrackingAllocator now implements `realloc`. The default `GlobalAlloc::realloc` does alloc + copy + dealloc, which keeps both old and new buffers live during the copy window and temporarily inflates `ALLOC_PEAK` by the old buffer's size. Vec growth (the dominant realloc shape in the measured path) was therefore counted larger than what `System.realloc` actually does in production (typically an in-place resize). The new impl routes through `System.realloc` and updates the counters by the size delta only; measured peak drops ~5% on Rust-side compress for the decodecorpus scenario at level_3_dfast (10,500,563 -> 9,953,841 bytes), bringing the metric in line with real allocator behaviour.
  - `ffi_decode` now asserts `written == expected_len`. The Rust decode measurement already asserts exact length; without the matching assertion on the FFI side a partial/truncated decode would still emit a `ffi_peak_alloc_bytes` value, hiding the failure behind a misleading metric.
  - The module-level doc comment describing the observer setup was stale ("FFI callbacks route through std::alloc::alloc"). The customMem callbacks have been bypassing TrackingAllocator since commit d35a7bd; only the atomic counters are shared. Rewrote the doc to describe the actual two-path / one-counter setup.

* perf(bench): reduce sample_size from 10 to 3 for Corpus/Entropy/Large/Silesia

  Criterion warned "Unable to complete 10 samples in 4.0s" on heavy levels (level_22_btultra2 + 1 MiB+ payloads) because a single encode pass already takes well over a second. Drop sample_size to 3 for the scenario classes that fired the warning. Three samples + the existing measurement_time budget is enough for the dashboard's regression-band sensitivity (±5%) on canonical alert levels while keeping per-shard wall time bounded. Small scenarios stay at sample_size=30 (they finish in microseconds, so more samples are cheap).
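The delta-only realloc accounting described above can be sketched like this. It reuses the counter names the commit cites (`ALLOC_CURRENT` / `ALLOC_PEAK`); everything else is an assumed shape, not the actual `compare_ffi_memory.rs` implementation.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOC_CURRENT: AtomicUsize = AtomicUsize::new(0);
static ALLOC_PEAK: AtomicUsize = AtomicUsize::new(0);

fn count_add(bytes: usize) {
    let cur = ALLOC_CURRENT.fetch_add(bytes, Ordering::SeqCst) + bytes;
    ALLOC_PEAK.fetch_max(cur, Ordering::SeqCst);
}

struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        count_add(layout.size());
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOC_CURRENT.fetch_sub(layout.size(), Ordering::SeqCst);
        System.dealloc(ptr, layout)
    }

    // Route through System.realloc and count only the size delta, so the old
    // buffer is never double-counted during the resize window (the default
    // alloc+copy+dealloc path would briefly hold old + new at once).
    unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
        if new_size >= layout.size() {
            count_add(new_size - layout.size());
        } else {
            ALLOC_CURRENT.fetch_sub(layout.size() - new_size, Ordering::SeqCst);
        }
        System.realloc(ptr, layout, new_size)
    }
}

fn main() {
    unsafe {
        let a = TrackingAllocator;
        let old = Layout::from_size_align(100, 8).unwrap();
        let p = a.alloc(old);
        let p2 = a.realloc(p, old, 200);
        // Delta accounting: peak is 200, not the 300 the default realloc
        // would report while both buffers are live.
        assert_eq!(ALLOC_CURRENT.load(Ordering::SeqCst), 200);
        assert_eq!(ALLOC_PEAK.load(Ordering::SeqCst), 200);
        a.dealloc(p2, Layout::from_size_align(200, 8).unwrap());
        assert_eq!(ALLOC_CURRENT.load(Ordering::SeqCst), 0);
    }
}
```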
Summary
Phase 5 of #111; implements #18 (long-distance matching). Lands the full donor LDM producer (gear rolling hash + bucket-based hash table + verify/extend search + aggregate driver) under a new `encoding/ldm/` module. Two commits, ~2.1k LoC including 33 unit tests.

Activation policy (distro-grade)

`LdmProducer` is never activated automatically by any `CompressionLevel` preset. This mirrors upstream `libzstd.so.1` behaviour: `ZSTD_compress(..., level)` keeps LDM off at every standard level (1..22). The opt-in surface lands separately:

- `compress(level)` / `CompressionLevel::*` presets keep LDM off
- `set_parameter(EnableLdm, true)` (mirrors `ZSTD_c_enableLongDistanceMatching`)
- `zstd --long[=N]` CLI flag

`BtMatcher::ldm_producer: Option<LdmProducer>` stays `None` until those issues land; this preserves byte-parity with upstream `ZSTD_compress(..., 22)` and keeps the `level22_sequences_match_donor_on_corpus_proxy` ratio gate green.
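The `Option<LdmProducer>` gate can be sketched as below; the types here are hypothetical stubs standing in for the real `bt/mod.rs` structs, and the `(offset, lit_len, match_len)` tuple is an illustrative candidate shape, not the module's actual sequence type.

```rust
/// Stub standing in for the real producer in ldm/mod.rs.
struct LdmProducer;

impl LdmProducer {
    fn generate_into(&mut self, _history: &[u8], out: &mut Vec<(u32, u32, u32)>) {
        // The real producer walks `history` through the gear hash and fills
        // candidate sequences here; the stub just leaves the buffer empty.
        out.clear();
    }
}

struct BtMatcher {
    // Stays None on every preset level; only the future opt-in surface
    // (EnableLdm parameter / --long flag) will populate it.
    ldm_producer: Option<LdmProducer>,
    ldm_sequences: Vec<(u32, u32, u32)>,
}

impl BtMatcher {
    fn prepare_ldm_candidates(&mut self, history: &[u8]) {
        match self.ldm_producer.as_mut() {
            Some(p) => p.generate_into(history, &mut self.ldm_sequences),
            // Defensive no-op: identical to pre-Phase-5 behaviour, so byte
            // parity with upstream ZSTD_compress(..., 22) is preserved.
            None => self.ldm_sequences.clear(),
        }
    }
}

fn main() {
    let mut m = BtMatcher { ldm_producer: None, ldm_sequences: vec![(1, 2, 3)] };
    m.prepare_ldm_candidates(b"history");
    assert!(m.ldm_sequences.is_empty()); // inactive producer never emits
}
```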
Donor parity anchors
Every primitive cites `lib/compress/zstd_ldm.c` v1.5.7 by file:line:
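As one concrete example of the donor parity above, the gear rolling-hash recurrence `hash = (hash << 1) + GEAR_TAB[byte]` can be sketched as follows. The table function here is an illustrative placeholder: the real `GEAR_TAB` is the donor's fixed 256-entry 64-bit permutation table from `zstd_ldm_geartab.h`, reproduced verbatim in `ldm/gear_hash.rs` (which also carries the 4-wide unroll this sketch omits).

```rust
/// Placeholder for the donor's 256-entry permutation table; NOT the donor
/// constants, just a deterministic stand-in so the sketch is self-contained.
fn gear_tab(byte: u8) -> u64 {
    (byte as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

struct GearHash {
    state: u64,
}

impl GearHash {
    fn new() -> Self {
        GearHash { state: 0 }
    }

    /// Donor recurrence: hash = (hash << 1) + GEAR_TAB[byte], with the
    /// shift and add both wrapping in 64 bits.
    fn feed(&mut self, byte: u8) -> u64 {
        self.state = (self.state << 1).wrapping_add(gear_tab(byte));
        self.state
    }
}

fn main() {
    // One step of the recurrence from a fresh state is just the table entry.
    let mut h = GearHash::new();
    assert_eq!(h.feed(7), gear_tab(7));

    // The hash is a pure function of the byte stream: two fresh states fed
    // the same bytes converge on the same value.
    let (mut a, mut b) = (GearHash::new(), GearHash::new());
    for &byte in b"rolling hash" {
        a.feed(byte);
        b.feed(byte);
    }
    assert_eq!(a.state, b.state);
}
```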
Implementation status
Test coverage
33 new LDM unit tests + 1 end-to-end producer test:
Build + suite
Commits
Test plan
Related issues
Closes #18