
perf(encoding): #18 Phase 5 — LDM producer (gear hash + bucket table + search/emit) #139

Merged: polaz merged 25 commits into main from feat/#18-phase5-ldm on May 15, 2026
Conversation

@polaz polaz commented May 15, 2026

Summary

Phase 5 of #111 — implements #18 (long-distance matching). Lands the full donor LDM producer (gear rolling hash + bucket-based hash table + verify/extend search + aggregate driver) under a new encoding/ldm/ module. Two commits, ~2.1k LoC including 33 unit tests.

Activation policy (distro-grade)

LdmProducer is never activated automatically by any CompressionLevel preset. This mirrors upstream libzstd.so.1 behaviour: ZSTD_compress(..., level) keeps LDM off at every standard level (1..22). The opt-in surface lands separately:

| Surface | LDM activation | Status |
| --- | --- | --- |
| `compress(level)` / `CompressionLevel::*` | off, every level | preserved by this PR |
| Rust parameter API (`set_parameter(EnableLdm, true)`) | on | tracked in #27 |
| C ABI (`ZSTD_c_enableLongDistanceMatching`) | on | tracked in #126 / #127 |
| `zstd --long[=N]` CLI flag | on | tracked in #128 |

BtMatcher::ldm_producer: Option<LdmProducer> stays None until those issues land — preserves byte-parity with upstream ZSTD_compress(..., 22) and keeps the level22_sequences_match_donor_on_corpus_proxy ratio gate green.

Module structure

zstd/src/encoding/ldm/
├── mod.rs        — LdmProducer aggregator + generate_into pipeline
├── gear_hash.rs  — ZSTD_ldm_gear_init/reset/feed + verbatim 256-entry GEAR_TAB
├── params.rs     — LdmParams + ZSTD_ldm_adjustParameters port
├── table.rs      — LdmHashTable (round-robin bucket insert)
└── search.rs     — count_backwards_match + find_best_match (prefix-only path)
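
The round-robin bucket insert that `table.rs` ports can be illustrated with a self-contained toy sketch. Field and method names follow the PR description; the real `LdmHashTable` layout in the crate may differ, so treat this as an assumption-laden model, not the actual implementation:

```rust
#[derive(Clone, Copy, Default)]
struct LdmEntry {
    offset: u32,   // position of the hashed window
    checksum: u32, // high 32 bits of the window's XXH64
}

struct LdmHashTable {
    entries: Vec<LdmEntry>,  // 1 << hash_log slots, grouped into buckets
    bucket_offsets: Vec<u8>, // per-bucket round-robin write cursor
    bucket_size_log: u32,
}

impl LdmHashTable {
    fn new(hash_log: u32, bucket_size_log: u32) -> Self {
        let bucket_size_log = bucket_size_log.min(hash_log); // silent clamp
        LdmHashTable {
            entries: vec![LdmEntry::default(); 1usize << hash_log],
            bucket_offsets: vec![0u8; 1usize << (hash_log - bucket_size_log)],
            bucket_size_log,
        }
    }

    // Overwrite the oldest slot of the bucket; the u8 cursor wraps
    // with a mask, so the bucket never grows.
    fn insert(&mut self, bucket_id: usize, entry: LdmEntry) {
        let cursor = self.bucket_offsets[bucket_id] as usize;
        self.entries[(bucket_id << self.bucket_size_log) + cursor] = entry;
        let wrap_mask = (1usize << self.bucket_size_log) - 1;
        self.bucket_offsets[bucket_id] = ((cursor + 1) & wrap_mask) as u8;
    }
}

fn main() {
    let mut t = LdmHashTable::new(4, 2); // 16 entries, 4 per bucket
    for i in 1..=5u32 {
        t.insert(0, LdmEntry { offset: i, checksum: 0 });
    }
    // The fifth insert wrapped around and overwrote the first slot.
    assert_eq!(t.entries[0].offset, 5);
    assert_eq!(t.entries[1].offset, 2);
}
```

The key property, exercised by the crate's "round-robin wrap" test, is that a full bucket evicts its oldest entry instead of rejecting the insert.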

Donor parity anchors

Every primitive cites `lib/compress/zstd_ldm.c` v1.5.7 by file:line:

  • `gear_hash::GEAR_TAB` ← `zstd_ldm_geartab.h` (verbatim, 256-entry permutation table, cross-checked zero-mismatch against parsed donor header)
  • `gear_hash::GearHashState::new` ← `ZSTD_ldm_gear_init` (zstd_ldm.c:32)
  • `gear_hash::reset` ← `ZSTD_ldm_gear_reset` (zstd_ldm.c:65)
  • `gear_hash::feed` ← `ZSTD_ldm_gear_feed` (zstd_ldm.c:96)
  • `params::LdmParams::adjust_for` ← `ZSTD_ldm_adjustParameters` (zstd_ldm.c:135)
  • `table::LdmHashTable::insert` ← `ZSTD_ldm_insertEntry` (zstd_ldm.c:198)
  • `search::count_backwards_match` ← `ZSTD_ldm_countBackwardsMatch` (zstd_ldm.c:214)
  • `search::find_best_match` ← per-split inner loop (zstd_ldm.c:405-466)
  • `LdmProducer::generate_into` ← `ZSTD_ldm_generateSequences_internal` (zstd_ldm.c:346-515) prefix-only path
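
The gear recurrence the `feed` primitive ports is `hash = (hash << 1) + GEAR_TAB[byte]`. The sketch below demonstrates it with a toy stand-in table (the real 256-entry `GEAR_TAB` is the verbatim donor permutation and is not reproduced here); the split predicate on the masked hash is likewise an assumption of this sketch:

```rust
// Toy GEAR_TAB stand-in: a splitmix64-style mixer, NOT the donor
// constants from zstd_ldm_geartab.h.
fn toy_gear_tab() -> [u64; 256] {
    let mut tab = [0u64; 256];
    for (i, slot) in tab.iter_mut().enumerate() {
        let mut z = (i as u64).wrapping_add(0x9E37_79B9_7F4A_7C15);
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        *slot = z ^ (z >> 27);
    }
    tab
}

// Feed bytes through the recurrence; record a split offset whenever
// all the masked hash bits come up set (assumed split predicate).
fn feed(hash: &mut u64, tab: &[u64; 256], data: &[u8],
        stop_mask: u64, splits: &mut Vec<usize>) {
    for (n, &b) in data.iter().enumerate() {
        *hash = (*hash << 1).wrapping_add(tab[b as usize]);
        if *hash & stop_mask == stop_mask {
            splits.push(n + 1);
        }
    }
}

fn main() {
    let tab = toy_gear_tab();
    let mut hash = 0u64;
    let mut splits = Vec::new();
    // With stop_mask == 0 every byte is a split point, which is the
    // "zero-mask split-per-byte" invariant the test table mentions.
    feed(&mut hash, &tab, b"abcdef", 0, &mut splits);
    assert_eq!(splits, vec![1usize, 2, 3, 4, 5, 6]);
}
```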

Implementation status

  • Prefix-only path — fully implemented. Equivalent to upstream on every input that fits in a single contiguous history slice (the only path the current Rust encoder exercises — no separate `extDict` buffer surface).
  • Two-segment (extDict) backward path — documented and deferred. `search.rs` cites the donor functions (`ZSTD_count_2segments`, `ZSTD_ldm_countBackwardsMatch_2segments`) but does not yet implement them; lands together with the eventual dictionary-history support work.

Test coverage

33 new LDM unit tests + 1 end-to-end producer test:

| Module | Tests | Coverage |
| --- | --- | --- |
| `gear_hash` | 9 | GEAR_TAB anchor entries (donor cross-check), stop_mask derivation (3 paths), recurrence regression, reset/feed equivalence, zero-mask split-per-byte, concatenation invariant, batch-size pre-condition |
| `params` | 5 | strategy→hash_rate_log table, hash_log clamp, min_match halving at btultra, bucket clamp, derived helpers |
| `table` | 7 | sizing math, bucket clamp, round-robin wrap, adjacent-bucket isolation, clear+rewind, donor-max hash_log, bucket_mask |
| `search` | 9 | count_backwards_match (mismatch stop + anchor bound), find_best_match (stale rejection, checksum filter, longest-combined selection, backward extension, short-forward rejection) |
| `mod` (aggregate) | 3 | producer construction with btultra2 defaults, clear-resets-rolling-hash, empty-range no-op |
| `mod` (e2e) | 1 | 4 KiB payload + 64 KiB gap + 4 KiB repeat → producer emits at least one HcRawSeq crossing the gap |

Build + suite

  • 460 / 460 lib tests pass on the full suite (was 427 pre-Phase-5 → +33 LDM tests).
  • `cargo clippy --lib --tests` clean (no warnings).
  • `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS.
  • No new dependencies; reuses existing `twox-hash::XxHash64` already pulled in for frame checksums.

Commits

  • `8cd597b` — Phase 5 foundations: gear hash + params + bucket table + LdmProducer construction + fill path (gear walk + XXH64 + bucket insert)
  • `7759eef` — Phase 5 producer: search.rs (count_backwards_match + find_best_match) + generate_into rewritten end-to-end to emit HcRawSeq entries via verify/extend pipeline

Test plan

  • `cargo nextest run --lib` — 460/460 green
  • `cargo clippy --lib --tests` — clean
  • `level22_sequences_match_donor_on_corpus_proxy` — PASS (byte-parity preserved, LDM stays None by default)
  • End-to-end producer test — long-range match emitted, offset crosses the engineered gap
  • CI matrix (x86_64-linux, aarch64-linux, x86_64-darwin, aarch64-darwin, i686-linux) — pending GitHub Actions
  • Codecov patch coverage — pending

Related issues

  • Closes #18 — long-distance matching
  • Part of #111 Phase 5
  • Unblocks #27 — configurable parameter API (LDM activation hook will plug into `BtMatcher::ldm_producer`)

Closes #18

Summary by CodeRabbit

  • New Features
    • Feature-gated long-distance-match (LDM) encoder: gear-hash split detection, bucketed LDM table, candidate search, and a producer pipeline; BT matcher now uses live-history context for LDM candidates.
  • Bug Fixes
    • Prevented out-of-bounds panics by validating absolute-coordinate relationships and clearing producer state on reset.
  • Tests
    • Extensive unit and regression tests covering hashing, parameter derivation, table behavior, candidate search, producer flow, and panic prevention.
  • Documentation
    • Rewrote README and crate/module docs; enabled richer feature-gated docs for published builds.

Review Change Stack

polaz added 2 commits May 15, 2026 13:58
Lands the foundational primitives of the donor LDM producer
(`lib/compress/zstd_ldm.c` v1.5.7) under a new `encoding/ldm/`
module. The verify + extend + emit half of the pipeline plus the
`window_log >= 27` activation gate follow in a second Phase 5
commit; by default the producer stays `None` on every BtMatcher,
preserving exact pre-Phase-5 behaviour (ratio gate
`level22_sequences_match_donor_on_corpus_proxy` PASS).

What ships:

* `ldm/gear_hash.rs` — 64-bit gear rolling hash with the donor's
  256-entry permutation table (`zstd_ldm_geartab.h`) reproduced
  verbatim; full unit cross-check confirms zero mismatches against
  the parsed donor header. `init` / `reset` / `feed` match the
  donor recurrence `hash = (hash << 1) + GEAR_TAB[byte]` with the
  same 4-wide unroll and `LDM_BATCH_SIZE = 64` cap.
* `ldm/params.rs` — `LdmParams` + `adjust_for(window_log, strategy)`
  is a direct port of `ZSTD_ldm_adjustParameters` (`zstd_ldm.c:135`),
  including the `7 - strategy/3` hash-rate-log mapping, the
  `BOUNDED(6, window_log - hash_rate_log, 30)` hash-log clamp, the
  `minMatchLength /= 2` halving at strategies >= btultra (8), and
  the `BOUNDED(LDM_BUCKET_SIZE_LOG, strategy, 8)` bucket clamp.
* `ldm/table.rs` — `LdmHashTable` with `entries: Vec<LdmEntry>` of
  length `1 << hash_log` and `bucket_offsets: Vec<u8>` for the
  per-bucket round-robin write cursor (donor `ZSTD_ldm_insertEntry`,
  `zstd_ldm.c:198-207`); silent `MIN(bucket_size_log, hash_log)`
  clamp matches `zstd_ldm.c:176`.
* `ldm/mod.rs` — `LdmProducer` aggregator owns the hash state, the
  bucket table, and a `LDM_BATCH_SIZE` splits scratch buffer. The
  fill path (`generate_into`) walks the input through the gear
  hash, XXH64s each `min_match_length`-byte window (donor
  `zstd_ldm.c:315`, seed 0), masks the low bits for the bucket id
  (`hash_log - bucket_size_log` bits) and stores the high 32 bits
  as the entry checksum.
* `bt/mod.rs` carries a new `ldm_producer: Option<LdmProducer>`
  field initialised to `None`; `BtMatcher::reset` propagates
  `clear` into the producer when present. `prepare_ldm_candidates`
  now takes the per-frame `history` slice and delegates to the
  producer when active, otherwise preserves the original
  defensive `ldm_sequences.clear()` no-op.
* `match_generator.rs` callsite threads `self.table.history` into
  `prepare_ldm_candidates` (disjoint-field split borrow keeps the
  borrow checker happy).
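
The parameter derivation summarized in the `ldm/params.rs` bullet can be sketched directly from the cited formulas. The constants `LDM_MIN_MATCH_LENGTH = 64` and `LDM_BUCKET_SIZE_LOG = 3` are assumed donor defaults, and strategies are numbered so that btultra = 8, btultra2 = 9; this is an illustrative model of `adjust_for`, not the crate's code:

```rust
fn bounded(lo: u32, x: u32, hi: u32) -> u32 {
    x.max(lo).min(hi)
}

struct LdmParams {
    hash_rate_log: u32,
    hash_log: u32,
    min_match_length: u32,
    bucket_size_log: u32,
}

fn adjust_for(window_log: u32, strategy: u32) -> LdmParams {
    // Mapping from [fast..btultra2] to hash-rate-logs [7..4].
    let hash_rate_log = 7 - strategy / 3;
    // BOUNDED(6, window_log - hash_rate_log, 30) hash-log clamp.
    let hash_log = bounded(6, window_log - hash_rate_log, 30);
    // Halve the 64-byte floor at btultra (8) and above.
    let mut min_match_length = 64;
    if strategy >= 8 {
        min_match_length /= 2;
    }
    // BOUNDED(LDM_BUCKET_SIZE_LOG, strategy, 8) bucket clamp.
    let bucket_size_log = bounded(3, strategy, 8);
    LdmParams { hash_rate_log, hash_log, min_match_length, bucket_size_log }
}

fn main() {
    // btultra2 (strategy 9) at window_log 27: the defaults the PR's
    // producer-construction test checks.
    let p = adjust_for(27, 9);
    assert_eq!(p.hash_rate_log, 4);
    assert_eq!(p.hash_log, 23); // 1 << 23 table entries
    assert_eq!(p.min_match_length, 32);
    assert_eq!(p.bucket_size_log, 8);
}
```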

Tests:

* 24 new unit tests (9 gear_hash, 5 params, 7 table, 3 producer)
  with donor `file:line` citations on every primitive.
* Anchor entries `GEAR_TAB[0,42,80,255]` cross-checked against
  donor `zstd_ldm_geartab.h`.
* Recurrence regression test compares `reset` vs `feed(mask=
  u64::MAX)` to guard the 4-wide unroll.
* Round-robin insertion + adjacent-bucket isolation + post-clear
  re-init covered for `LdmHashTable`.
* `LdmProducer` constructs with donor btultra2/window=27 defaults
  (min_match=32, hash_rate_log=4, bucket_size_log=8); `clear`
  rewinds the rolling hash to `GEAR_HASH_INIT`.

Build + suite:

* 451 / 451 lib tests pass (full suite, no skips).
* `cargo clippy --lib --tests` clean.
* Ratio gate `level22_sequences_match_donor_on_corpus_proxy` PASS.
* No new warnings; `ldm/mod.rs` carries a Phase-1-style
  `#![allow(dead_code)]` marker (same precedent as `bt/mod.rs`)
  for the bucket-lookup helpers consumed in the follow-up emit
  commit.
Completes the donor LDM producer pipeline by adding the bucket
lookup + forward/backward extension half. With this commit
`LdmProducer::generate_into` is functionally equivalent to donor
`ZSTD_ldm_generateSequences_internal` on the prefix-only path:
when a `BtMatcher` carries a `Some(LdmProducer)` it emits real
[`HcRawSeq`] entries that the existing optimal-parser consumer
(plumbed in #119) walks at block compress time.

What ships:

* `ldm/search.rs` — `count_backwards_match` mirrors donor
  `ZSTD_ldm_countBackwardsMatch` (`zstd_ldm.c:214`); honours both
  the `anchor` and `match_base` lower bounds, capping the
  backward walk at the tighter of `min(p_in - anchor,
  p_match - match_base)`. `find_best_match` walks every slot of
  the bucket associated with `hash_id`, applies the donor
  staleness + checksum filter (`zstd_ldm.c:431`), runs the
  forward `common_prefix_len` and backward count, and returns
  the candidate with the largest combined `forward + backward`.
  Inputs are bundled into `FindBestMatchInputs` to keep
  `clippy::too_many_arguments` happy without flattening the
  donor citations.
* `ldm/mod.rs` — `generate_into` re-implemented end-to-end:
  re-seeds the rolling hash and primes it with the first
  `min_match_length` bytes of the block (donor
  `zstd_ldm.c:381-383`); maintains absolute `anchor` and `ip`
  cursors; per-split branches on `split < anchor` (insert-only),
  no-match (insert-only), or match-found (emit `HcRawSeq` →
  insert new entry AFTER lookup to avoid clobbering bestEntry,
  donor `zstd_ldm.c:490-492` → advance anchor → re-reset hash
  when the emitted match overruns the hashed window, donor
  `zstd_ldm.c:497-508`).
* Module-level docs updated to spell out the activation policy
  explicitly: `LdmProducer` is **never activated automatically
  by any `CompressionLevel` preset** — distro-grade parity with
  upstream `libzstd.so.1`, where `ZSTD_compress(..., level)`
  keeps LDM off at every standard level (1..22). The opt-in
  surface lands separately: `#27` plugs the Rust parameter API
  into `BtMatcher::ldm_producer`, `#126`/`#127` wire
  `ZSTD_c_enableLongDistanceMatching` through the C ABI, and
  `#128` exposes the `zstd --long[=N]` CLI flag.
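
A minimal sketch of the backward-extension primitive described above, written with slice-relative indices (the convention a later review round settles on); names are illustrative:

```rust
// Walk p_in and p_match backwards in lock-step while bytes agree,
// never crossing the anchor / match_base lower bounds.
fn count_backwards_match(
    history: &[u8],
    mut p_in: usize, anchor: usize,
    mut p_match: usize, match_base: usize,
) -> usize {
    let mut len = 0;
    while p_in > anchor
        && p_match > match_base
        && history[p_in - 1] == history[p_match - 1]
    {
        p_in -= 1;
        p_match -= 1;
        len += 1;
    }
    len
}

fn main() {
    //        index: 0 1 2 3 4 5 6 7 8 9 10 11
    let h = b"xxabcdYYabcd";
    // Suppose the forward match starts at p_in = 10 against
    // p_match = 4; the backward walk recovers the "ab" in front.
    assert_eq!(count_backwards_match(h, 10, 0, 4, 0), 2);
    // The anchor bound stops the walk early.
    assert_eq!(count_backwards_match(h, 10, 9, 4, 0), 1);
}
```

The combined `forward + backward` length is what `find_best_match` maximizes over the bucket's candidates.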

Tests:

* 9 new search tests (4 `count_backwards_match` invariants,
  5 `find_best_match` paths: stale-rejection, checksum
  filtering, longest-combined selection, backward extension,
  short-forward rejection). Every donor-staleness fixture uses
  `offset > 0` to satisfy the reserved "empty-slot" sentinel.
* End-to-end producer test
  (`generate_into_emits_long_range_match_on_repeated_payload`):
  4 KiB deterministic LCG payload + 64 KiB unique-byte gap +
  4 KiB repeat of payload; asserts the producer emits at least
  one `HcRawSeq` whose offset >= the gap distance and whose
  match length meets the `min_match_length` floor.

Build + suite:

* 460 / 460 lib tests pass (+9 vs the foundations commit:
  8 search + 1 e2e). Phase-5 LDM tests now total 33.
* `cargo clippy --lib --tests` clean.
* Ratio gate `level22_sequences_match_donor_on_corpus_proxy`
  PASS — `ldm_producer = None` on every `CompressionLevel`
  preset preserves byte-parity with upstream.
Copilot AI review requested due to automatic review settings May 15, 2026 11:25

coderabbitai Bot commented May 15, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 1e737a7 and 6de5ef7.

📒 Files selected for processing (1)
  • zstd/src/encoding/mod.rs

📝 Walkthrough


Adds a complete Long Distance Matching (LDM) subsystem (params, gear hash, table, search, producer) and wires an optional LdmProducer into the BT optimal matcher, updating prepare_ldm_candidates to accept live-history plus absolute coordinates.

Changes

Long Distance Matching (LDM) Producer Implementation

| Layer | File(s) | Summary |
| --- | --- | --- |
| LDM Parameters and Configuration | `zstd/src/encoding/ldm/params.rs` | `LdmParams` and donor-parity tuning constants with `adjust_for`, sizing helpers, and unit tests. |
| Gear Hash Rolling State | `zstd/src/encoding/ldm/gear_hash.rs` | `GearHashState` with `new`, `reset`, and `feed`; verbatim GEAR_TAB; emits split offsets into caller buffer; tests cover masks, recurrence, chunking, and panic behavior. |
| LDM Hash Table Storage | `zstd/src/encoding/ldm/table.rs` | `LdmEntry` and `LdmHashTable` with per-bucket round-robin insertion, `clear`, bucket accessors, absolute-position support (`insert_absolute`/`resolve`/`reduce`/`ensure_room_for`), sizing clamps/asserts, and tests. |
| Match Search and Extension | `zstd/src/encoding/ldm/search.rs` | `find_best_match` and `count_backwards_match` implement bucket scanning, checksum/staleness filtering, forward match and backward extension; includes tests. |
| LDM Producer Orchestration | `zstd/src/encoding/ldm/mod.rs` | `LdmProducer` owning gear state, hash table, split scratch; constructors, `clear`, and `generate_into` scanning history, producing splits, looking up matches, and emitting `HcRawSeq`s; tests included. |
| BT Matcher Integration | `zstd/src/encoding/bt/mod.rs` | `BtMatcher` gains `ldm_producer: Option<LdmProducer>` (default `None`), `reset` clears the producer, `prepare_ldm_candidates` accepts live-history and absolute coordinates and delegates to `LdmProducer::generate_into(...)`; includes regression test. |
| Match generator callsite | `zstd/src/encoding/match_generator.rs` | BT optimal-mode call now borrows `live_history` and `history_abs_start` and passes them into `prepare_ldm_candidates`. |
| Docs & manifest | `README.md`, `zstd/Cargo.toml`, `zstd/src/decoding/mod.rs`, `zstd/src/encoding/mod.rs`, `zstd/src/lib.rs` | README rewritten, docs.rs metadata added to Cargo.toml, decoding module docs expanded, encoding module wires feature-gated `ldm`, and crate docs cfg for docs.rs enabled. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant MatchGen as HcMatchGenerator
  participant Bt as BtMatcher
  participant Prod as LdmProducer
  participant Gear as GearHashState
  participant Table as LdmHashTable
  participant Search as find_best_match
  MatchGen->>Bt: prepare_ldm_candidates(live_history, history_abs_start, current_abs_start, len)
  Bt->>Prod: generate_into(live_history, history_abs_start, block_start_abs, block_end_abs, out)
  Prod->>Gear: reset(prefix_bytes)
  loop feed batches
    Prod->>Gear: feed(data_chunk, &mut splits)
    Gear-->>Prod: split offsets
    loop per split
      Prod->>Search: find_best_match(table, hash_id, checksum, inputs)
      Search-->>Prod: LdmMatch or None
      alt match found
        Prod-->>Bt: append HcRawSeq (litlen, offset, matchlen)
      end
      Prod->>Table: insert_absolute(hash_id, abs_pos, checksum)
    end
  end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


"🐇
I tap the gear table and watch splits form,
Buckets keep echoes steady and warm.
I chase far repeats across a span,
Stitch long matches as only a rabbit can,
Hooray for compacted bytes and code!"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: implementing Phase 5 of the LDM producer with gear hash, bucket table, and search/emit functionality. |
| Linked Issues check | ✅ Passed | Code changes implement core LDM primitives (gear hash, hash table, search pipeline) from issue #18. Core acceptance criteria are met: parameter structure added, matches findable beyond window, C-parity established through donor port. Parameter API exposure and auto-window extension deferred to separate issues as noted. |
| Out of Scope Changes check | ✅ Passed | All changes align with Phase 5 scope: LDM producer implementation with gear hash, table, search, and integration into BtMatcher. Documentation updates and Cargo metadata are supporting scope. Two-segment backward path correctly documented as deferred. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |



codecov Bot commented May 15, 2026


Copilot AI left a comment


Pull request overview

Adds a new long-distance matching (LDM) producer implementation (gear rolling hash + bucketed hash table + verify/extend search + sequence emission) under encoding/ldm/, and plumbs it into the BT optimal path behind an opt-in Option<LdmProducer> so default compression presets remain LDM-off (upstream parity).

Changes:

  • Introduce encoding/ldm/ module implementing the donor-style LDM producer pipeline (gear hash, params, table, search, and aggregator/driver).
  • Extend BtMatcher with an optional ldm_producer and delegate prepare_ldm_candidates() to it when present.
  • Wire the optimal matcher driver to call prepare_ldm_candidates() and register the new ldm module in encoding/mod.rs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `zstd/src/encoding/mod.rs` | Exposes the new `encoding::ldm` module. |
| `zstd/src/encoding/match_generator.rs` | Calls into BT to prepare LDM candidates before optimal parsing. |
| `zstd/src/encoding/bt/mod.rs` | Adds `ldm_producer: Option<LdmProducer>` and delegates candidate generation when enabled. |
| `zstd/src/encoding/ldm/mod.rs` | Aggregates LDM producer state and implements the end-to-end `generate_into` pipeline. |
| `zstd/src/encoding/ldm/gear_hash.rs` | Implements donor-parity gear rolling hash + split-point detection. |
| `zstd/src/encoding/ldm/params.rs` | Implements donor-parity LDM parameter derivation and bounds. |
| `zstd/src/encoding/ldm/table.rs` | Implements the bucketed LDM hash table and insertion policy. |
| `zstd/src/encoding/ldm/search.rs` | Implements bucket lookup, checksum filtering, and forward/backward match extension. |

…mask

1. `prepare_ldm_candidates` translates absolute stream positions
   to indices into the supplied `history` slice. Before, the
   absolute `current_abs_start` (= `history_abs_start +
   window_size − current_len`) was handed to
   `LdmProducer::generate_into` as if it were a slice index, so
   as soon as `history_abs_start != 0` after window eviction the
   producer would index out of bounds in
   `history[block_start..block_end]`. The method now takes an
   explicit `history_abs_start` parameter; the call site in
   `start_matching_optimal` passes `self.table.history_abs_start`
   alongside the `history` slice. Matches the coordinate
   convention every other matcher backend uses (`idx = abs_pos −
   history_abs_start`, see `match_table/storage.rs:317-318`).

2. `LdmHashTable::new` asserts `effective_bucket_log <= 8`. The
   bucket round-robin cursor is a `u8` (mirrors donor `BYTE
   bucketOffsets[]`, `zstd_ldm.c:202`); above 256 slots the
   cursor would silently truncate, dropping new inserts on the
   floor. `LdmParams::adjust_for` already clamps to
   `LDM_BUCKETSIZELOG_MAX = 8`, so the producer's own path is
   unaffected — the assertion guards callers that bypass the
   params helper.

3. `LdmProducer::generate_into` derives `hash_id_mask` from
   `self.hash_table.bucket_mask()` instead of recomputing from
   `params.hash_log / params.bucket_size_log`. The table applies
   a `min(bucket_size_log, hash_log)` clamp (`zstd_ldm.c:176`)
   which the producer-side recomputation did not; using the
   table's authoritative mask eliminates the drift class.
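
The coordinate convention from fix 1 reduces to a single subtraction; `abs_to_idx` below is an illustrative helper, not a function in the crate:

```rust
// Absolute stream positions map to slice indices by subtracting
// history_abs_start, the stream position of history[0]. Using the
// absolute position directly as an index goes out of bounds as soon
// as window eviction advances history_abs_start past zero.
fn abs_to_idx(abs_pos: usize, history_abs_start: usize) -> usize {
    debug_assert!(abs_pos >= history_abs_start, "position evicted from window");
    abs_pos - history_abs_start
}

fn main() {
    // After eviction the slice no longer starts at stream position 0.
    let history_abs_start = 1024;
    let history = vec![0u8; 4096]; // covers absolute positions 1024..5120
    let current_abs_start = 2048;
    let idx = abs_to_idx(current_abs_start, history_abs_start);
    assert_eq!(idx, 1024);
    assert!(idx < history.len());
}
```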

Tests: new
`prepare_ldm_candidates_translates_absolute_positions_to_slice_indices`
regression in `bt/mod.rs` (panics pre-fix on `history[1024..1280]`);
new `new_rejects_bucket_size_log_above_donor_cap` in `table.rs`
(panics with the `ZSTD_LDM_BUCKETSIZELOG_MAX` assertion on
`LdmHashTable::new(12, 9)`).

462 / 462 lib tests pass. `cargo clippy --lib --tests` clean.
`level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS.
@polaz polaz requested a review from Copilot May 15, 2026 12:25

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/bt/mod.rs`:
- Around line 353-362: The producer's persisted positions must be kept in
absolute coordinates: avoid rebasing to slice-relative indexes before passing
positions into ldm_producer; instead thread history_abs_start into the
producer/search APIs so the producer and its hash table continue to store
absolute positions, and only subtract history_abs_start when performing the
final history[..] slice indexing. Update the call site around
ldm_producer.generate_into (and the ldm_sequences usage) to pass absolute
start/end (e.g., use current_abs_start and current_abs_start + current_len or
analogous absolute bounds) or add a parameter history_abs_start to the
producer/search path, and change the producer/search implementations
(generate_into or its new/overloaded variant) to convert to slice indices only
when accessing history. Ensure any internal hash-table insertions/searches use
absolute positions so entries remain valid across evictions.

📥 Commits

Reviewing files that changed from the base of the PR and between 7759eef and bf3bcbd.

📒 Files selected for processing (4)
  • zstd/src/encoding/bt/mod.rs
  • zstd/src/encoding/ldm/mod.rs
  • zstd/src/encoding/ldm/table.rs
  • zstd/src/encoding/match_generator.rs


Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

The producer's bucket table survives across blocks within a frame;
storing slice-relative offsets in `LdmEntry.offset` broke
correctness as soon as `history_abs_start` advanced past zero
under a window slide — old entries pointed at the wrong bytes,
and `offset = split_abs − stale_match_pos` could underflow `u32`
and emit invalid back-references.

This commit moves every cross-block invariant onto absolute
stream coordinates:

* `LdmProducer::generate_into` now takes
  `(live_history, history_abs_start, block_start_abs,
  block_end_abs, out)`. `live_history[0]` corresponds to absolute
  `history_abs_start` (the byte that other matcher backends call
  `base + dictLimit`); the abs→slice translation happens only at
  the moment of `live_history[..]` indexing.
* `LdmEntry.offset` stores absolute stream positions. Entries
  that fall out of the current window after a slide are filtered
  by `find_best_match`'s `entry.offset <= lowest_index_abs`
  staleness check, where the producer now passes the current
  `history_abs_start` as `lowest_index_abs`.
* `FindBestMatchInputs` gains a `history_abs_start` field;
  search-side `live_history`, `split_abs`, `anchor_abs`,
  `lowest_index_abs` are all interpreted as absolute coordinates.
  Internal abs→slice translation lives in a single place.
* `BtMatcher::prepare_ldm_candidates` forwards the four absolute
  coordinates straight to the producer (no translation
  in-between).
* `MatchGeneratorDriver::start_matching_optimal` passes
  `self.table.live_history()` (the post-`history_start` slice)
  instead of the full `history` Vec; `history_abs_start`
  corresponds to `live_history()[0]`, not `history[0]`, so the
  earlier code would have indexed into the dead prefix once
  eviction began.
* `gear_hash.rs` anchor-entry test comment fix: GEAR_TAB[80] is
  donor row 26 col 2, not row 21 col 2 (`80 / 3 = 26 r 2`).
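
A minimal sketch of the staleness filter this commit describes (at this point in the stack entries hold absolute positions; the names here are illustrative, and the real check lives in `find_best_match`):

```rust
#[derive(Clone, Copy)]
struct LdmEntry {
    offset_abs: u64, // absolute stream position of the stored window
}

// Stale when the entry sits at or below the window floor passed as
// lowest_index_abs (the current history_abs_start); the default 0
// doubles as the empty-slot sentinel and is rejected too.
fn is_live(entry: LdmEntry, lowest_index_abs: u64) -> bool {
    entry.offset_abs > lowest_index_abs
}

fn main() {
    let e = LdmEntry { offset_abs: 4096 };
    assert!(is_live(e, 0));       // before any slide
    assert!(!is_live(e, 8192));   // window slid past it
    assert!(!is_live(LdmEntry { offset_abs: 0 }, 0)); // empty sentinel
}
```

Because the filter is a comparison rather than a table rewrite, a window slide costs nothing at slide time; old entries simply stop qualifying.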

Tests:

* New `generate_into_preserves_bucket_entries_across_history_slide`
  in `ldm/mod.rs`: builds a 2×4 KiB payload + 2×32 KiB
  gap frame; invokes `generate_into` first with
  `history_abs_start = 0` (populates the table with entries
  inside payload #1), then again with `history_abs_start = 2048`
  (simulates window eviction that drops the leading half of
  payload #1) targeting payload #2. Asserts (a) the producer
  doesn't panic, (b) every emitted offset stays inside the
  post-slide live history, (c) at least one emitted offset
  >= GAP_A — the long-range win bucket entries from call 1 are
  supposed to produce. Pre-fix slice-relative offsets would
  either miss the match entirely or emit a corrupt
  larger-than-live-history offset.
* `prepare_ldm_candidates_translates_absolute_positions_to_slice_indices`
  regression from round 1 still passes — the
  abs-coords producer makes the original OOB scenario
  impossible by construction.
* All search.rs / generate_into unit tests migrated to the new
  signature (added `history_abs_start = 0` argument).

Build + suite:

* 463 / 463 lib tests pass (+1 cross-slide regression).
* `cargo clippy --lib --tests` clean.
* `level22_sequences_match_donor_on_corpus_proxy` ratio gate
  PASS — `ldm_producer = None` default still preserves
  byte-parity with upstream.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/ldm/gear_hash.rs`:
- Around line 101-116: The degenerate branch in GearHashState::new currently
computes (1u64 << hash_rate_log) which can overflow if hash_rate_log >= 64; make
it defensive by clamping the shift or returning a full-ones mask when
hash_rate_log is out of range: e.g., compute let shift = hash_rate_log.min(63);
if hash_rate_log >= 64 then set stop_mask = u64::MAX else use (1u64 << shift) -
1 (or equivalent) so stop_mask construction in new never shifts by >=64;
reference GearHashState::new, stop_mask and hash_rate_log when applying the fix.

In `@zstd/src/encoding/ldm/search.rs`:
- Around line 49-68: The docstring for count_backwards_match incorrectly calls
parameters "absolute byte positions" while callers like find_best_match pass
slice-relative indices (after subtracting history_abs_start); update the
documentation of count_backwards_match to state that p_in_abs, anchor_abs,
p_match_abs, and match_base_abs are indices relative to the provided history
slice (or, alternatively, change callers to pass true absolute positions), and
ensure the wording and parameter names reflect "slice indices" rather than
"absolute byte positions" so the contract matches the implementation and
callers.
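
A sketch of the defensive construction the first comment asks for, assuming the degenerate branch computes the low-bits mask directly (illustrative; the real code lives in `GearHashState::new`):

```rust
// Shifting a u64 by 64 or more panics in debug builds and is not a
// meaningful mask in release builds, so out-of-range inputs get a
// full-ones mask instead of reaching the shift.
fn stop_mask(hash_rate_log: u32) -> u64 {
    if hash_rate_log >= 64 {
        u64::MAX
    } else {
        (1u64 << hash_rate_log) - 1 // low hash_rate_log bits set
    }
}

fn main() {
    assert_eq!(stop_mask(0), 0);
    assert_eq!(stop_mask(4), 0b1111);
    assert_eq!(stop_mask(64), u64::MAX); // no shift-overflow panic
}
```

`LdmParams::adjust_for` already keeps `hash_rate_log` in 4..7, so the clamp only protects callers that bypass the params helper.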

📥 Commits

Reviewing files that changed from the base of the PR and between bf3bcbd and b036a1d.

📒 Files selected for processing (5)
  • zstd/src/encoding/bt/mod.rs
  • zstd/src/encoding/ldm/gear_hash.rs
  • zstd/src/encoding/ldm/mod.rs
  • zstd/src/encoding/ldm/search.rs
  • zstd/src/encoding/match_generator.rs


Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Wraps the LDM bucket table in donor's rebase scheme so streams
beyond `u32::MAX` (4 GiB) encode without truncation; tightens
small-detail review feedback alongside.

1. **Rebase scheme (donor `ZSTD_ldm_reduceTable`, zstd_ldm.c:520).**
   `LdmEntry.offset` is no longer an absolute stream position
   stored as `u32` (would truncate above 4 GiB). It is now a
   `u32` *relative* to `LdmHashTable::position_base` with a +1
   bias so the default-zeroed value remains the empty-slot
   sentinel. New helpers:
   - `insert_absolute(hash_id, abs_pos, checksum)` — translates
     `abs_pos − position_base + 1` before storing.
   - `resolve(&entry) -> Option<usize>` — inverse translation;
     returns `None` for the sentinel.
   - `ensure_room_for(abs_pos)` — checks whether the relative
     offset would exceed `u32::MAX − REBASE_GUARD_BAND` (2^30
     headroom); if so, calls `reduce`.
   - `reduce(amount)` — subtracts `amount` from every entry's
     relative offset (saturating at 0 = empty), advances
     `position_base`. Matches donor `ZSTD_ldm_reduceTable` shape.
   The producer calls `ensure_room_for` before every insert and
   uses `insert_absolute`; `find_best_match` uses `table.resolve`
   to translate entries back to absolute positions before the
   staleness / window-bounds checks. Same scheme `MatchTable`
   uses for the BT/HC chain table.

2. **Wider position types in the search-side surface.**
   `LdmMatch.match_pos` and `FindBestMatchInputs.lowest_index_abs`
   widened from `u32` to `usize` so values past 4 GiB stay
   representable end-to-end. Internal storage remains `u32`
   (the rebase scheme keeps the value in range); only the
   external API uses `usize`.

3. **Defensive `hash_rate_log.min(63)` clamp** in
   `GearHashState::new` (gear_hash.rs:116). `LdmParams::adjust_for`
   already produces 4..7, but a future caller bypassing the
   params helper could hit the `1u64 << hash_rate_log` shift
   panic — donor calls this path via internal-only entry points
   too, but our `pub(crate)` surface invites a wider audience.

4. **`LdmHashTable::clear` uses `entries.fill(...)`** instead of
   a manual loop. `LdmEntry` is `Copy` so this compiles to a
   bulk memset — meaningful when `hash_log` is large and clear
   runs at every frame boundary.

5. **`count_backwards_match` parameter rename `*_abs → *_idx`**.
   After the absolute-coords refactor, callers translate to
   slice indices via `abs − history_abs_start` before invoking
   the function; the `*_abs` names misled readers (and the
   deferred extDict path) into expecting absolute coordinates.

6. **Test memory.** Three tests that exercised the producer at
   `with_window_and_strategy(27, 9)` allocated a `1 << 23 = 8M`
   entry table (~64 MiB) under parallel nextest, risking OOM on
   32-bit shards. They now use a hand-tuned `LdmParams { hash_log:
   10, .. }` via a `test_params()` helper (~8 KiB allocation)
   without sacrificing test coverage. `producer_constructs_with
   _donor_default_params` was split — the default-derivation
   check now uses `LdmParams::adjust_for(27, 9)` directly so the
   ~64 MiB producer isn't instantiated at all.
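The rebase primitives described above can be sketched with std only. This is a hypothetical, single-slot reduction of the real bucket table — the names (`LdmEntry`, `insert_absolute`, `resolve`, `ensure_room_for`, `reduce`, `REBASE_GUARD_BAND`) follow the PR text, but the round-robin bucket layout and checksum verification are elided:

```rust
// Hypothetical sketch, not the actual zstd/src/encoding/ldm/table.rs code.
const REBASE_GUARD_BAND: u32 = 1 << 30;

#[derive(Clone, Copy, Default)]
struct LdmEntry {
    /// +1-biased offset relative to `position_base`; 0 = empty-slot sentinel.
    offset: u32,
    checksum: u32,
}

struct LdmHashTable {
    entries: Vec<LdmEntry>,
    position_base: usize,
}

impl LdmHashTable {
    fn new(slots: usize) -> Self {
        Self { entries: vec![LdmEntry::default(); slots], position_base: 0 }
    }

    /// Store an absolute stream position as a +1-biased relative offset.
    fn insert_absolute(&mut self, hash_id: usize, abs_pos: usize, checksum: u32) {
        assert!(abs_pos >= self.position_base);
        let rel = (abs_pos - self.position_base + 1) as u32;
        self.entries[hash_id] = LdmEntry { offset: rel, checksum };
    }

    /// Inverse translation; `None` for the empty-slot sentinel.
    fn resolve(&self, entry: &LdmEntry) -> Option<usize> {
        if entry.offset == 0 { return None; }
        Some(self.position_base + entry.offset as usize - 1)
    }

    /// Rebase before the relative offset would leave the guard band.
    fn ensure_room_for(&mut self, abs_pos: usize) {
        while abs_pos - self.position_base >= (u32::MAX - REBASE_GUARD_BAND) as usize {
            self.reduce(REBASE_GUARD_BAND);
        }
    }

    /// Subtract `amount` from every offset (saturating at 0 = now empty),
    /// then advance the base — the donor ZSTD_ldm_reduceTable shape:
    /// surviving entries resolve to the same absolute position.
    fn reduce(&mut self, amount: u32) {
        for e in self.entries.iter_mut() {
            e.offset = e.offset.saturating_sub(amount);
        }
        self.position_base += amount as usize;
    }
}
```

The invariant to check: before a reduce, `abs = base + offset − 1`; after, `base' = base + amount` and `offset' = offset − amount`, so the sum is unchanged for any entry that survives the saturation.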

Tests:

* +4 new rebase coverage in `table.rs`:
  - `insert_absolute_round_trips_through_resolve`
  - `resolve_returns_none_for_empty_slot`
  - `reduce_preserves_resolved_absolute_positions` (verifies the
    donor "subtract reducer, saturate at 0" semantics and
    `position_base` advance preserve resolved positions for
    surviving entries)
  - `ensure_room_for_rebases_above_guard_band` (triggers a
    rebase at `u32::MAX − 2^30 + 1` and confirms a fresh insert
    at the trigger position round-trips through `resolve`).
* All search-side tests migrated to `insert_absolute` so the
  +1 bias is exercised end-to-end.

Build + suite:

* 467 / 467 lib tests pass (+4 rebase, no regressions).
* `cargo clippy --lib --tests` clean.
* `level22_sequences_match_donor_on_corpus_proxy` ratio gate
  PASS — `ldm_producer = None` default still preserves
  byte-parity with upstream.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/ldm/table.rs`:
- Around line 170-173: The clear() method currently zeroes entries and
bucket_offsets but leaves position_base unchanged; update LdmTable::clear (the
clear() impl) to also set self.position_base = 0 so subsequent calls like
insert_absolute(hash_id, 0, checksum) do not produce negative relative offsets
against an old base; ensure you modify the same clear() that fills
LdmEntry::default() and bucket_offsets.fill(0) to include resetting
position_base to 0.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5a70916a-39a8-422c-8e2a-0f3e927acafe

📥 Commits

Reviewing files that changed from the base of the PR and between b036a1d and 70aca66.

📒 Files selected for processing (5)
  • zstd/src/encoding/bt/mod.rs
  • zstd/src/encoding/ldm/gear_hash.rs
  • zstd/src/encoding/ldm/mod.rs
  • zstd/src/encoding/ldm/search.rs
  • zstd/src/encoding/ldm/table.rs

Comment thread zstd/src/encoding/ldm/table.rs
`clear()` previously zeroed the entries and bucket cursors but
left `position_base` at its last rebased value. After a
`reduce()` shifted the base forward, `clear()` would leave the
table in a state where a fresh-frame insert at absolute 0 hit
the `abs_pos >= position_base` assertion in `insert_absolute`.

Resetting `position_base = 0` keeps `clear()` semantically
equivalent to constructing a fresh `LdmHashTable` — the per-
frame producer reset path can now restart absolute positions
from anywhere without surprise.

New `clear_resets_position_base` regression in table.rs:
inserts at abs 1024, calls `reduce(1 << 20)` to shift the base,
verifies `position_base > 0`, calls `clear()`, asserts
`position_base == 0`, and then inserts at abs 0 (would panic on
the assertion if the reset didn't happen).

468 / 468 lib tests pass. `cargo clippy --lib --tests` clean.
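The fixed `clear()` contract can be illustrated with a minimal, hypothetical flat-slot table (the real table also resets bucket cursors, which this sketch omits):

```rust
// Hypothetical sketch of the clear()/position_base fix, not the real table.
struct LdmHashTable {
    offsets: Vec<u32>,   // +1-biased relative offsets; 0 = empty
    position_base: usize,
}

impl LdmHashTable {
    fn insert_absolute(&mut self, slot: usize, abs_pos: usize) {
        // This is the assertion a post-reduce, pre-fix clear() would trip.
        assert!(abs_pos >= self.position_base, "insert below position_base");
        self.offsets[slot] = (abs_pos - self.position_base + 1) as u32;
    }

    fn reduce(&mut self, amount: u32) {
        for o in self.offsets.iter_mut() { *o = o.saturating_sub(amount); }
        self.position_base += amount as usize;
    }

    /// Semantically equivalent to constructing a fresh table: without
    /// resetting position_base, a fresh-frame insert at absolute 0
    /// after a reduce() would hit the assertion above.
    fn clear(&mut self) {
        self.offsets.fill(0);
        self.position_base = 0; // the fix
    }
}
```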
Copilot AI review requested due to automatic review settings May 15, 2026 14:03

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/ldm/table.rs Outdated

@sw-release-bot sw-release-bot Bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.

Benchmark suite Current: 6de5ef7 Previous: 4bf13f0 Ratio
compress/level22/small-4k-log-lines/matrix/c_ffi 0.111 ms 0.085 ms 1.31

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

polaz added 2 commits May 15, 2026 17:19
README:

* Lead with a value-prop tagline (pure Rust, dictionary handle, no FFI,
  no_std) instead of jumping straight into the benchmarks dashboard.
* Add a Quick Start section with `cargo add` and a `no_std` toggle.
* Reframe the status section as decoder / encoder / dictionary
  capability summaries instead of a TODO-style feature checklist.
* Collapse the internal compression-strategy backend table into a
  `<details>` block.
* Move the live benchmark dashboard link into a Performance section
  pointing at the GitHub Pages dashboard; methodology stays in
  BENCHMARKS.md, no headline numbers in README.
* Move the project-relationship paragraph to the bottom and shorten it.
* Drop the internal ADR-013 reference (replaced by a concrete
  no-cmake / no-FFI statement in the tagline).
* Switch the `LICENSE` link from a broken relative path to an absolute
  GitHub URL — the previous form rendered dead on docs.rs.

docs.rs landing:

* Replace the duplicated Decompression/Compression preamble in
  `lib.rs` with a feature-oriented overview that links into the
  per-module documentation, then re-includes the README via
  `include_str!` (single source of truth for the landing page).
* Enable `#![cfg_attr(docsrs, feature(doc_cfg))]` so feature-gated
  items render their feature flag badge on docs.rs.
* Add `[package.metadata.docs.rs] all-features = true, rustdoc-args =
  ["--cfg", "docsrs"]` so the published documentation builds with
  every optional surface visible.
* Expand `decoding/mod.rs` and `encoding/mod.rs` module preambles —
  the previous one-liners pointed users at the wrong entry point.
  The new docs walk through the three entry points each side exposes
  (one-shot / streaming / low-level) with the trade-off for each.

All 468 / 468 lib tests pass. `cargo clippy --lib --tests` clean.
`cargo doc --lib --features dict_builder` builds with zero warnings.
The doc comment referenced a non-existent `LDM_INSERT_LOOKAHEAD`
constant and claimed the function returned `abs_pos -
shift_applied`. The function returns `()` and operates only on
the supplied `abs_pos` — there is no per-call lookahead. The
rewritten comment matches the actual signature and explains the
single-compare common path plus the rebase trigger.

Doc-only change; 468 / 468 lib tests still pass.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/ldm/table.rs`:
- Around line 294-296: The rustdoc for ensure_room_for uses a line that begins
with "- " which rustdoc treats as a list item and triggers
clippy::doc-lazy-continuation; edit the doc comment for ensure_room_for (the
paragraph mentioning u32::MAX and REBASE_GUARD_BAND and Self::reduce) so the
sentence does not start with a leading hyphen — e.g., rephrase the fragment to
remove the leading "-" or wrap the symbol in inline code text so it reads like
"exceed `u32::MAX - REBASE_GUARD_BAND`, advance `position_base` by
`REBASE_GUARD_BAND` and run `Self::reduce` across all…" ensuring the line no
longer begins with "- " to satisfy clippy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8aab66e4-92a3-4663-a828-b57c8890ed3e

📥 Commits

Reviewing files that changed from the base of the PR and between c926dad and a1725fb.

📒 Files selected for processing (6)
  • README.md
  • zstd/Cargo.toml
  • zstd/src/decoding/mod.rs
  • zstd/src/encoding/ldm/table.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/lib.rs

Comment thread zstd/src/encoding/ldm/table.rs Outdated
Clippy 1.95's `doc_lazy_continuation` lint parses a line starting
with `- ` as the head of a markdown list item; the surrounding
sentence continuation then triggers the indentation warning. CI
hits `-D warnings` so the build fails. Rewording to `When the
relative position would exceed u32::MAX − REBASE_GUARD_BAND`
keeps the prose at column 0 with no leading dash.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/ldm/table.rs`:
- Around line 218-237: insert() currently accepts an entry when the bucket
offset is 0, but 0 is the empty-slot sentinel so such inserts become invisible
to resolve(); add a runtime check in pub(crate) fn insert (in table.rs) to
reject or panic on attempts to insert when offset == 0 (use the computed offset
variable and include a clear message referencing the bucket/hash_id), and update
the round-robin test fixture that uses 0 as test data so it uses a non-zero slot
value (also apply the same guard where similar logic exists around the other
insert-like code span noted in the review).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cd8b0a29-def5-487c-b985-8d4de3871b43

📥 Commits

Reviewing files that changed from the base of the PR and between 8be94ad and 23d5474.

📒 Files selected for processing (1)
  • zstd/src/encoding/ldm/table.rs

Comment thread zstd/src/encoding/ldm/table.rs

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/ldm/table.rs Outdated
polaz added 2 commits May 15, 2026 22:04
The new ensure_room_for_loops_across_multiple_guard_bands test
materialises an abs_pos of 5 * REBASE_GUARD_BAND (= 5 GiB) as a
usize literal, which overflows usize on i686 (usize::MAX ≈ 4 GiB)
at compile time under #[deny(arithmetic_overflow)]. Gate the test
to target_pointer_width = "64"; the scenario is unreachable on
32-bit anyway because usize::MAX caps addressable streams below
the multi-rebase threshold the test exercises.
…insert panic docs

CodeRabbit Major: insert() accepted entry.offset == 0, which aliases
the empty-slot sentinel — resolve() would treat the stored entry as
empty and silently drop the candidate. Promote to a runtime assert
(not debug_assert!) so contract violations fail fast in release.
Shift the round-robin fixture from 0..6 to 1..=6 so the test stops
using the sentinel as data, and add a regression test for the new
panic path.

Copilot: the prior panics-doc claim that release would 'corrupt the
next bucket' was inaccurate — Vec OOB indexing panics in both debug
and release. Rephrase: the debug_assert only exists for an earlier,
clearer error message; release falls back to Vec's own bounds check.
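The sentinel guard promoted in this commit can be sketched as follows — a hypothetical, stripped-down `insert` (names `LdmEntry` and the round-robin cursor are illustrative, not the PR's exact signatures):

```rust
// Hypothetical sketch of the sentinel guard, not the real insert().
const EMPTY_SENTINEL: u32 = 0;

#[derive(Clone, Copy)]
struct LdmEntry { offset: u32, checksum: u32 }

fn insert(bucket: &mut [LdmEntry], cursor: &mut usize, entry: LdmEntry) {
    // Runtime assert! (not debug_assert!): an entry whose offset equals
    // the empty-slot sentinel would be silently dropped by resolve(),
    // so the contract violation must fail fast in release builds too.
    assert!(
        entry.offset != EMPTY_SENTINEL,
        "insert: offset 0 aliases the empty-slot sentinel"
    );
    bucket[*cursor] = entry;                 // round-robin slot
    *cursor = (*cursor + 1) % bucket.len();
}
```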

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/mod.rs Outdated
compress streams from Read incrementally and buffers only compressed
blocks until the frame end (to write Frame Content Size in the
header) — peak ≈ output_size, independent of input length.
compress_to_vec eagerly read_to_end's the input into a Vec so the
encoder can be handed &[u8] + an exact source-size hint, peak ≈
input + output. The previous combined wording conflated the two and
overstated compress's input buffering requirements.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/mod.rs`:
- Around line 3-25: The module docs incorrectly state "Three entry points" while
listing four (`compress`, `compress_to_vec`, `StreamingEncoder`,
`FrameCompressor`); update the opening sentence to reflect the correct count
(e.g., "Four entry points cover the common use cases") or rephrase it to avoid a
hard count, and ensure the updated text remains consistent with the subsequent
bullets for `compress`, `compress_to_vec`, `StreamingEncoder`, and
`FrameCompressor`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4ad2428f-7021-4273-be14-74996bfa3d0f

📥 Commits

Reviewing files that changed from the base of the PR and between c1f5b44 and d222e4e.

📒 Files selected for processing (1)
  • zstd/src/encoding/mod.rs

Comment thread zstd/src/encoding/mod.rs Outdated

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/mod.rs Outdated
Comment thread zstd/src/encoding/mod.rs Outdated
…reference

CodeRabbit + Copilot flagged 'Three entry points' enumerating four
bullets (compress, compress_to_vec, StreamingEncoder,
FrameCompressor). Bump to 'Four'.

Copilot also noted StreamingEncoder implements crate::io::Write, not
std::io::Write unconditionally — crate::io re-exports std::io when
the std feature is on and falls back to a no_std-friendly trait
otherwise. Update the bullet to reference crate::io::Write and call
out the alias so no_std consumers are not misled.

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/mod.rs Outdated
The previous wording 'peak ≈ output_size, independent of input length'
misled: output_size grows with input length, and worst-case for
incompressible payloads is O(input_size) (compressed cannot exceed
input + small overhead). Rephrase to 'O(compressed_size) (worst-case
O(input_size))' and call out explicitly that the win vs
compress_to_vec is avoiding materialising the input alongside the
output.

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 05036ce into main May 15, 2026
40 checks passed
@polaz polaz deleted the feat/#18-phase5-ldm branch May 15, 2026 22:04
@sw-release-bot sw-release-bot Bot mentioned this pull request May 15, 2026
polaz added a commit that referenced this pull request May 16, 2026
… customMem tracking

Three coupled changes that together bring Dfast (default level)
memory from 70 MB to ~9 MB on a 1 MiB input and make the
FFI-side memory metric actually count the C heap:

1. **[u32; 4] slots + rebase scheme** — DfastMatchGenerator's
   short_hash / long_hash switch from Vec<[usize; 4]> to
   Vec<[u32; 4]>. A new position_base field anchors the +1-biased
   relative encoding (slot 0 = empty sentinel, slot N = abs_pos
   position_base + N - 1). ensure_room_for() advances the base
   when the relative offset would exceed u32::MAX - GUARD_BAND,
   shifting every stored slot down via reduce(). Same scheme as
   encoding/ldm/table.rs from PR #139.

   Per-slot memory: 32 -> 16 bytes (2x). Combined with item 2 the
   two-table footprint drops from 64 MiB to ~4 MiB on level 3
   large input.

2. **DFAST_HASH_BITS 20 -> 17** — matches donor clevels.h:31
   (level 3 large-input bucket, hashLog = 17). Caps the upper
   bound; dfast_hash_bits_for_window still clamps the runtime
   value to [MIN_WINDOW_LOG, DFAST_HASH_BITS].

3. **ZSTD_customMem hooks for FFI bench** — CR review of PR #143
   flagged the original ffi_peak_alloc_bytes metric as
   misleading: #[global_allocator] only observes Rust
   allocations, so libzstd's CCtx + hashTable + chainTable +
   workspace allocations through libc malloc were invisible.
   Route the FFI path through ZSTD_createCCtx_advanced /
   ZSTD_createDCtx_advanced with a customMem shim that forwards
   alloc/free to alloc::alloc / alloc::dealloc (which the
   TrackingAllocator wraps). FFI memory numbers now reflect real
   C heap use.

   Two new functions in compare_ffi.rs:
   - ffi_encode_all_aligned: replaces zstd::stream::Encoder with
     raw ZSTD_compress2 on a customMem-aware CCtx
   - ffi_decompress_via_custom_mem: same idea for the decode path
   The header-prefix allocator (16-byte align, size stored at
   ptr-16) lets customFree recover the layout for symmetric
   tracking.
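The header-prefix allocator described in item 3 can be sketched with std only. This is an illustrative shape under the stated assumptions (16-byte alignment, requested size stored at `ptr − 16`); the function names are hypothetical, not the bench's actual symbols:

```rust
// Hypothetical sketch of the header-prefix allocator, not compare_ffi.rs.
use std::alloc::{alloc, dealloc, Layout};

const HEADER: usize = 16; // keeps the user pointer 16-byte aligned

/// Allocate `size` user bytes plus a header, stash `size` in the
/// header, and hand out the pointer just past it.
unsafe fn custom_alloc(size: usize) -> *mut u8 {
    let layout = Layout::from_size_align(size + HEADER, 16).unwrap();
    let base = alloc(layout);
    assert!(!base.is_null());
    (base as *mut usize).write(size); // size lives at user_ptr - HEADER
    base.add(HEADER)
}

/// Read the size back from the header so the free side can both
/// reconstruct the Layout and report how many bytes went away —
/// this symmetry is what lets a tracking allocator stay balanced.
unsafe fn custom_free(ptr: *mut u8) -> usize {
    let base = ptr.sub(HEADER);
    let size = (base as *const usize).read();
    dealloc(base, Layout::from_size_align(size + HEADER, 16).unwrap());
    size
}
```

In the real bench this pair would be wrapped in `extern "C"` shims and handed to libzstd via `ZSTD_customMem`.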

Local level_3_dfast measurements (macOS aarch64, 1 MiB
decodecorpus-z000033):

  axis              before     after          win
  peak alloc (rust) 70 MiB     9.7 MiB        7.2x lower
  peak alloc (ffi)  1 MiB*     2.3 MiB        *was misreported
  ratio rust/ffi    67x        4.3x           genuine narrowing
  throughput        9.89 MB/s  16.0 MB/s      1.6x faster
  compression       0.5349     0.5349         unchanged

Ratio gate level22_sequences_match_donor_on_corpus_proxy green
throughout. 478/478 lib tests pass. Part of #111 Phase 7.

Closes #143 review threads (Rust-only allocator scope + heading
rename).
polaz added a commit that referenced this pull request May 16, 2026
Three Copilot review threads on PR #143:

1. compare_ffi.rs:226 — the customMem-instrumented FFI encode helper
   was also used by criterion's c_ffi timing benchmark, so libzstd
   latency numbers included header bookkeeping + atomic tracker
   overhead a real caller never pays. Split into two helpers:
   ffi_encode_all_aligned (uninstrumented, via zstd::stream::Encoder,
   used by the timing bench) and ffi_encode_via_custom_mem
   (customMem-aware, used only inside measure_peak_alloc). Throughput
   numbers now reflect what a real downstream caller observes.

2. dfast/mod.rs:156 — no targeted coverage for the new
   ensure_room_for -> reduce() path. Added
   dfast_ensure_room_for_rebases_above_guard_band, mirroring the
   LdmHashTable rebase tests from PR #139. The test seeds an early
   slot, fires ensure_room_for at u32::MAX - GUARD_BAND + 1 (gated
   to target_pointer_width = 64), and verifies (a) position_base
   advanced by DFAST_REBASE_GUARD_BAND, (b) the pre-rebase entry
   clamped to DFAST_EMPTY_SLOT, (c) a post-rebase pack/unpack
   round-trips the absolute position.

3. match_generator.rs:67 — PR description didn't surface that
   DFAST_HASH_BITS 20 -> 17 plus the [usize;4] -> [u32;4] storage
   change are encoder behavior changes, not just bench/reporting
   work. Rewrote the PR body to explicitly call out the scope:
   Dfast default sizing now matches donor clevels.h, slot
   representation is rebase-backed, compression ratio remains
   identical on every tested scenario (level_3_dfast smoke).

479/479 lib tests pass; level22 ratio gate green.
polaz added a commit that referenced this pull request May 16, 2026
…ine (#143)

* feat(bench): replace buffer-size estimate with real peak-alloc metric

Bench previously emitted REPORT_MEM with rust_buffer_bytes_estimate
= ffi_buffer_bytes_estimate = input.len() + output.len() — a static
shape estimate that ignores matcher state, scratch buffers, and any
per-call internal allocation. The number was identical for both
columns, so it carried no comparative signal.

Replace with a hand-rolled GlobalAlloc wrapper around System (~30
LoC) that tracks live-bytes and high-water mark via two relaxed
atomics. measure_peak_alloc(closure) reset()s the peak before
running the closure and reports the delta of live bytes above the
baseline. bench_compress and bench_decompress each wrap one full
encode/decode pass for Rust and FFI separately so the emitted
rust_peak_alloc_bytes / ffi_peak_alloc_bytes reflect real per-call
high-water marks rather than approximations.

run-benchmarks.sh:
 - Update the REPORT_MEM regex + field names in lockstep.
 - Add peak_alloc_bytes records to benchmark-relative.json (third
   metric alongside compression_ratio + throughput_bytes_per_sec).
   delta_ratio = rust_peak / ffi_peak, same direction semantics as
   compression_ratio (>1 means Rust uses more). Stage normalised
   (compress / decompress-{source} -> stage + source) so the
   dashboard's per-series grouping picks them up under the existing
   filter logic.
 - Markdown report header renamed to 'Rust/C peak alloc bytes'.

Baseline numbers from a local level_3_dfast shard (will land in CI
on next bench-matrix run; numbers below are macOS aarch64 local):

  scenario              rust_kb   ffi_kb   ratio
  small-1k-random         1299      33     39x
  small-10k-random        1335      42     32x
  small-4k-log-lines      1330      32     41x
  high-entropy-1m        71709    2080     34x
  decodecorpus-z000033   71160    1056     67x
  low-entropy-1m         69199      32   2142x
  large-log-stream      180823      51   3543x

This is the first half of the Phase 7 baseline (#111). Encode +
decode latency and compression ratio numbers already land via the
existing bench matrix; this commit closes the gap on the third
dimension. Picking the worst (scenario, level) ratios from the next
main snapshot will drive the target_feature / arena-allocator
discussion in the second half of Phase 7.

* perf(dfast): donor-parity hash storage (u32 slots + rebase) + libzstd customMem tracking

Three coupled changes that together bring Dfast (default level)
memory from 70 MB to ~9 MB on a 1 MiB input and make the
FFI-side memory metric actually count the C heap:

1. **[u32; 4] slots + rebase scheme** — DfastMatchGenerator's
   short_hash / long_hash switch from Vec<[usize; 4]> to
   Vec<[u32; 4]>. A new position_base field anchors the +1-biased
   relative encoding (slot 0 = empty sentinel, slot N = abs_pos
   position_base + N - 1). ensure_room_for() advances the base
   when the relative offset would exceed u32::MAX - GUARD_BAND,
   shifting every stored slot down via reduce(). Same scheme as
   encoding/ldm/table.rs from PR #139.

   Per-slot memory: 32 -> 16 bytes (2x). Combined with item 2 the
   two-table footprint drops from 64 MiB to ~4 MiB on level 3
   large input.

2. **DFAST_HASH_BITS 20 -> 17** — matches donor clevels.h:31
   (level 3 large-input bucket, hashLog = 17). Caps the upper
   bound; dfast_hash_bits_for_window still clamps the runtime
   value to [MIN_WINDOW_LOG, DFAST_HASH_BITS].

3. **ZSTD_customMem hooks for FFI bench** — CR review of PR #143
   flagged the original ffi_peak_alloc_bytes metric as
   misleading: #[global_allocator] only observes Rust
   allocations, so libzstd's CCtx + hashTable + chainTable +
   workspace allocations through libc malloc were invisible.
   Route the FFI path through ZSTD_createCCtx_advanced /
   ZSTD_createDCtx_advanced with a customMem shim that forwards
   alloc/free to alloc::alloc / alloc::dealloc (which the
   TrackingAllocator wraps). FFI memory numbers now reflect real
   C heap use.

   Two new functions in compare_ffi.rs:
   - ffi_encode_all_aligned: replaces zstd::stream::Encoder with
     raw ZSTD_compress2 on a customMem-aware CCtx
   - ffi_decompress_via_custom_mem: same idea for the decode path
   The header-prefix allocator (16-byte align, size stored at
   ptr-16) lets customFree recover the layout for symmetric
   tracking.

Local level_3_dfast measurements (macOS aarch64, 1 MiB
decodecorpus-z000033):

  axis              before     after          win
  peak alloc (rust) 70 MiB     9.7 MiB        7.2x lower
  peak alloc (ffi)  1 MiB*     2.3 MiB        *was misreported
  ratio rust/ffi    67x        4.3x           genuine narrowing
  throughput        9.89 MB/s  16.0 MB/s      1.6x faster
  compression       0.5349     0.5349         unchanged

Ratio gate level22_sequences_match_donor_on_corpus_proxy green
throughout. 478/478 lib tests pass. Part of #111 Phase 7.

Closes #143 review threads (Rust-only allocator scope + heading
rename).

* test(dfast): assert pack_slot never returns sentinel before bucket membership checks

CR review of PR #143 flagged the three dfast_seed_remaining_hashable*
tests as potentially vacuous: the bucket membership assertion
`matcher.long_hash[long_hash].contains(&target_slot)` would still
pass on a partially-empty bucket if `pack_slot(target_abs_pos)`
ever regressed to `DFAST_EMPTY_SLOT` — the sentinel sits in
unused entries, so a false positive could mask a real failure to
seed the position.

Guard each `pack_slot` -> `contains` pairing with an explicit
`assert_ne!(target_slot, DFAST_EMPTY_SLOT)`. Three call sites in
the dfast_seed_remaining_hashable_starts_* tests.

* ci(bench): strategy-grouped shards, PR-fast feedback, alert only on canonical levels

Three coupled CI changes for Phase 7:

1. **Shard plan moved from "one level per shard" to "one strategy
   per shard"**. bench-matrix now emits a 'shards' output: each
   entry has an 'id' (fast / dfast / greedy / lazy / btopt /
   btultra / btultra2) and a CSV of levels ( is comma-aware, so the bench binary iterates all
   levels in one process). Shard count drops from 87 (3 targets ×
   29 levels) to 21 (3 × 7 strategies) on main, with the per-binary
   build cost amortised across each strategy's levels.

2. **PR runs only the canonical pair**. On a 'pull_request' event
   bench-matrix emits a single 'pr-canonical' shard whose levels
   are 'level_3_dfast,level_22_btultra2' — the donor default + max
   compression. Three shards total per PR (one per target) instead
   of 87. Reviewers see ratio + speed + memory deltas for the two
   levels we ship as the primary public guarantees within minutes;
   the full 29-level matrix still runs post-merge to keep the
   gh-pages snapshot complete.

3. **github-action-benchmark alert restricted to canonical levels**.
   ALERT_LEVELS = {level_3_dfast, level_22_btultra2} in
   run-benchmarks.sh. Every other level still lands in
   benchmark-relative.json / benchmark-report.md / the dashboard,
   but only the canonical pair drives regression alerts —
   experimental levels (fast tier, btopt) shouldn't fire false
   positives while they're being tuned.

aggregate-bench-levels.py markdown header changes from
'## Level: <level>' to '## Strategy group: <shard_id>' so the
per-target combined report makes sense when each shard now bundles
many levels.

Smoke-tested locally: single shard with comma-separated filter
'level_3_dfast,level_22_btultra2' emits 98 relative records (49
per level), confirming both levels iterate in one binary invocation.

* fix(bench): unbias FFI latency + add Dfast rebase regression test

Three Copilot review threads on PR #143:

1. compare_ffi.rs:226 — the customMem-instrumented FFI encode helper
   was also used by criterion's c_ffi timing benchmark, so libzstd
   latency numbers included header bookkeeping + atomic tracker
   overhead a real caller never pays. Split into two helpers:
   ffi_encode_all_aligned (uninstrumented, via zstd::stream::Encoder,
   used by the timing bench) and ffi_encode_via_custom_mem
   (customMem-aware, used only inside measure_peak_alloc). Throughput
   numbers now reflect what a real downstream caller observes.

2. dfast/mod.rs:156 — no targeted coverage for the new
   ensure_room_for -> reduce() path. Added
   dfast_ensure_room_for_rebases_above_guard_band, mirroring the
   LdmHashTable rebase tests from PR #139. The test seeds an early
   slot, fires ensure_room_for at u32::MAX - GUARD_BAND + 1 (gated
   to target_pointer_width = 64), and verifies (a) position_base
   advanced by DFAST_REBASE_GUARD_BAND, (b) the pre-rebase entry
   clamped to DFAST_EMPTY_SLOT, (c) a post-rebase pack/unpack
   round-trips the absolute position.

3. match_generator.rs:67 — PR description didn't surface that
   DFAST_HASH_BITS 20 -> 17 plus the [usize;4] -> [u32;4] storage
   change are encoder behavior changes, not just bench/reporting
   work. Rewrote the PR body to explicitly call out the scope:
   Dfast default sizing now matches donor clevels.h, slot
   representation is rebase-backed, compression ratio remains
   identical on every tested scenario (level_3_dfast smoke).

479/479 lib tests pass; level22 ratio gate green.

* fix(bench): gate tracking allocator, stream FFI customMem, skip non-canonical alert fallback

Three coupled fixes from PR #143 review (3 CR + 2 Copilot):

1. **Tracking allocator gated by TRACKING_ENABLED flag** (CR #7 +
   Copilot #9). Until now TrackingAllocator updated atomics on every
   process-wide alloc/free/realloc/alloc_zeroed, including criterion's
   timing loops. That biased every Rust-side throughput sample by the
   atomic fetch_add/fetch_max + branch overhead while the FFI
   uninstrumented path paid nothing. Add a static AtomicBool
   TRACKING_ENABLED = false; flip to true only inside measure_peak_alloc
   via an RAII guard (TrackingGuard) so panic safety is automatic, then
   flip back to false on drop. All four allocator methods now check
   TRACKING_ENABLED before touching ALLOC_CURRENT/ALLOC_PEAK so the
   timing loops pay one branch + Relaxed atomic load per allocation
   instead of two atomic RMWs.

2. **FFI customMem encode now streams via ZSTD_compressStream2** (Copilot
   #10). Previous one-shot via ZSTD_compress2 preallocated
   ZSTD_compressBound(input.len()) output Vec, while the throughput path
   used zstd::stream::Encoder writing into a growing Vec. That made
   ffi_peak_alloc_bytes inflate to the worst-case bound on every
   measurement — visible on large-log-stream where preallocation
   showed ~104 MiB FFI peak while real streaming peaks at ~3.7 MiB.
   Replace with a streaming loop over ZSTD_compressStream2 with
   ZSTD_e_continue/ZSTD_e_end, growing the output Vec on each flush.
   Both paths now allocate the same FFI encode shape.

3. **Non-canonical shards emit empty benchmark-results.json** (CR #6).
   Previously when regression_levels was empty (e.g. the 'fast' or
   'btopt' strategy shard) the fallback repopulated benchmark_results
   with EVERY parsed timing, silently re-expanding alert surface to
   29 levels. Split the fallback: when regression_levels still has
   members (canonical shard mis-scoped on scenarios), keep the fallback
   so the dashboard sees data; when regression_levels is empty
   (non-canonical strategy shard), emit an empty list so
   github-action-benchmark gets no rows to alert on. INFO log shows
   the empty alert-set + present-set for the affected shard.
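The gate in (1) can be sketched as below — the names (`TRACKING_ENABLED`, `TrackingGuard`, `measure_peak_alloc`, `ALLOC_CURRENT`/`ALLOC_PEAK`) follow the commit text, everything else is illustrative, not the actual bench code. Note the sketch deliberately reproduces the drift a later commit in this PR fixes: an allocation that outlives the guard skips the dealloc-side subtraction.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

static TRACKING_ENABLED: AtomicBool = AtomicBool::new(false);
static ALLOC_CURRENT: AtomicUsize = AtomicUsize::new(0);
static ALLOC_PEAK: AtomicUsize = AtomicUsize::new(0);

struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Timing loops outside a measurement window pay only this
        // Relaxed load + branch, not the two atomic RMWs below.
        if TRACKING_ENABLED.load(Ordering::Relaxed) {
            let cur = ALLOC_CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            ALLOC_PEAK.fetch_max(cur, Ordering::Relaxed);
        }
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        if TRACKING_ENABLED.load(Ordering::Relaxed) {
            ALLOC_CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
        }
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static TRACKER: TrackingAllocator = TrackingAllocator;

/// RAII guard: tracking is on for the guard's lifetime and flipped
/// back off on drop, so panic unwinding restores the off state too.
struct TrackingGuard;

impl TrackingGuard {
    fn enable() -> Self {
        TRACKING_ENABLED.store(true, Ordering::SeqCst);
        TrackingGuard
    }
}

impl Drop for TrackingGuard {
    fn drop(&mut self) {
        TRACKING_ENABLED.store(false, Ordering::SeqCst);
    }
}

fn measure_peak_alloc<F: FnOnce()>(workload: F) -> usize {
    ALLOC_CURRENT.store(0, Ordering::SeqCst);
    ALLOC_PEAK.store(0, Ordering::SeqCst);
    let _guard = TrackingGuard::enable();
    workload();
    ALLOC_PEAK.load(Ordering::SeqCst)
}
```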

Also fixed the hash-table size math in DFAST_HASH_BITS rationale (CR
#8): per-bucket vs per-slot mix-up — 1<<17 buckets × 16 bytes for
[u32; 4] is 2 MiB per table (not 512 KiB as written).

479/479 lib tests pass. Smoke at level_3_dfast confirms FFI streaming
encode emits comparable peak numbers (~3.2 MiB for 1 MiB input vs
prior 2.3 MiB preallocated; large-log-stream drops from 104 MiB to
3.7 MiB).

* test(dfast): cover long_hash in rebase regression + drop i686 gate

Two CR nitpicks on dfast_ensure_room_for_rebases_above_guard_band:

1. The test only seeded short_hash before the rebase, so a future
   regression that cleared only short_hash would still pass even
   though reduce() is a shared contract for both tables. Seed
   long_hash[0][0] as well and assert it clamps to DFAST_EMPTY_SLOT
   after ensure_room_for runs.

2. The #[cfg(target_pointer_width = "64")] gate was prophylactic
   but unnecessary — trigger_abs = u32::MAX - DFAST_REBASE_GUARD_BAND
   + 1 = 0xC0000000 fits in usize on i686 (usize::MAX == u32::MAX
   there). Drop the gate so the packed-slot boundary path + the
   u32 ↔ usize round-trip are exercised on every target we ship.
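The rebase contract this test pins down can be sketched like so — the struct layout and the `DFAST_REBASE_GUARD_BAND` value (`1 << 30`, chosen to be consistent with the quoted `0xC0000000` trigger) are modeled on the commit description, not taken from the actual dfast module.

```rust
const DFAST_REBASE_GUARD_BAND: u32 = 1 << 30; // hypothetical; consistent with 0xC0000000 trigger
const DFAST_EMPTY_SLOT: u32 = 0;

struct DfastTable {
    position_base: u64,   // absolute position that relative slot 0 maps to
    short_hash: Vec<u32>, // positions stored relative to position_base
}

impl DfastTable {
    /// Absolute position a slot refers to (the u32 ↔ usize/u64 round-trip).
    fn unpack(&self, slot: u32) -> u64 {
        self.position_base + u64::from(slot)
    }

    /// Called before inserting at `abs`: if the relative position would
    /// cross the guard band, advance position_base and clamp any entry
    /// that falls below the new base to the empty sentinel.
    fn ensure_room_for(&mut self, abs: u64) {
        let rel = abs - self.position_base;
        if rel > u64::from(u32::MAX - DFAST_REBASE_GUARD_BAND) {
            self.position_base += u64::from(DFAST_REBASE_GUARD_BAND);
            for slot in &mut self.short_hash {
                *slot = slot
                    .checked_sub(DFAST_REBASE_GUARD_BAND)
                    .unwrap_or(DFAST_EMPTY_SLOT);
            }
        }
    }
}
```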

* fix(bench): balance dealloc counter via per-allocation header flag

Both reviewers flagged the same correctness bug on the TRACKING_ENABLED
gate: closures returned a Vec, the Vec dropped AFTER measure_peak_alloc
flipped tracking back off, so dealloc skipped the fetch_sub and
ALLOC_CURRENT drifted upward across every sample. On the i686 bench
target the 32-bit AtomicUsize could wrap and corrupt later peak
measurements.

Switch to a header-prefix scheme: each TrackingAllocator allocation
over-allocates max(align, 16) bytes, stores a u8 "counted at alloc
time" flag at the header offset, and returns ptr+header to the user.
dealloc reads the flag and always subtracts the size when set, even
when TRACKING_ENABLED is now false. The fetch_add path on alloc still
checks TRACKING_ENABLED so criterion timing loops outside measurement
windows continue to skip the atomic RMW entirely.

Drop the alloc_zeroed / realloc overrides — GlobalAlloc's defaults
forward to our alloc/dealloc, which already do the right thing under
header accounting. (System.realloc would invalidate the header, so
the default alloc+copy+dealloc path is actually the correct one
here.)
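The header-prefix scheme can be sketched as follows — a minimal illustration shaped after the commit text (flag byte at `base`, header of `max(align, 16)` so the user pointer stays aligned), not the bench binary's actual implementation.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

static TRACKING_ENABLED: AtomicBool = AtomicBool::new(false);
static ALLOC_CURRENT: AtomicUsize = AtomicUsize::new(0);

/// Header is max(align, 16): a multiple of the alignment, so the
/// user pointer at base + header keeps the requested alignment.
fn header_bytes(layout: Layout) -> usize {
    layout.align().max(16)
}

struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let header = header_bytes(layout);
        let full = Layout::from_size_align(layout.size() + header, layout.align()).unwrap();
        let base = unsafe { System.alloc(full) };
        if base.is_null() {
            return base;
        }
        let counted = TRACKING_ENABLED.load(Ordering::Relaxed);
        if counted {
            ALLOC_CURRENT.fetch_add(layout.size(), Ordering::Relaxed);
        }
        // "Counted at alloc time" flag lives in the first header byte.
        unsafe { *base = counted as u8 };
        unsafe { base.add(header) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        let header = header_bytes(layout);
        let base = unsafe { ptr.sub(header) };
        // Subtract whenever the allocation was counted at alloc time,
        // even if TRACKING_ENABLED has since been flipped off — this is
        // what keeps ALLOC_CURRENT balanced across the guard boundary.
        if unsafe { *base } != 0 {
            ALLOC_CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
        }
        let full = Layout::from_size_align(layout.size() + header, layout.align()).unwrap();
        unsafe { System.dealloc(base, full) };
    }
}

#[global_allocator]
static TRACKER: TrackingAllocator = TrackingAllocator;
```

As the commit notes, the default `GlobalAlloc::realloc` (alloc + copy + dealloc) composes correctly with this scheme, whereas forwarding to `System.realloc` would move the block without preserving the header.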

Smoke at level_3_dfast keeps the same headline numbers (decodecorpus
~10 MiB Rust vs ~3.5 MiB FFI), confirming the fix doesn't shift the
measurement scale; it only ensures ALLOC_CURRENT stays balanced
across the full bench process.

479/479 lib tests pass.

* refactor(bench): RSS sampler + per-CCtx customMem, drop global allocator

Memory measurement no longer biases timing samples and the FFI compare
runs the same operation for both metrics:

- Drop `#[global_allocator] TrackingAllocator`. Even with the per-call
  gate flag, every Rust allocation in the bench binary paid an atomic
  load + branch + extra 16-byte header — biasing criterion's timing
  loops against the FFI side which has no such wrapper.
- Replace with OS RSS sampling for the Rust side: a background poller
  reads `mach_task_basic_info.resident_size` (macOS) or
  `/proc/self/statm` (Linux) every 250 µs and updates a peak counter.
  Hot-path overhead is zero — the kernel maintains RSS independently.
- FFI side keeps `ZSTD_customMem` but routes through a per-CCtx
  `FfiMemTracker` (passed as `opaque`), not a process-wide static. No
  global state, no atomic ops; libzstd's single-threaded-per-context
  allocator path is the only writer.
- Unify the encode and decode helpers: one `ffi_encode_to_vec` and one
  `FfiDCtxHandle::decompress_into`. Timing loops pass `mem = None`
  (default malloc, uninstrumented). Memory measurement passes
  `Some(&mut tracker)`. Same `ZSTD_compressStream2` /
  `ZSTD_decompressDCtx` call path in both cases — no
  aligned-vs-customMem split that could let the two metrics describe
  different operations. `FfiDCtxHandle` reuses one DCtx across decode
  iterations so the sample reflects steady-state (matches the pure-Rust
  `FrameDecoder` reuse pattern).
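The Linux half of the poller reduces to a one-line statm read — field 1 of `/proc/self/statm` is resident pages. This sketch hard-codes 4 KiB pages for brevity (real code would query the page size via sysconf) and omits the macOS `mach_task_basic_info` branch.

```rust
/// Resident-set size in bytes, read from /proc/self/statm (Linux).
/// Returns None on platforms without procfs.
fn read_rss_bytes() -> Option<usize> {
    let statm = std::fs::read_to_string("/proc/self/statm").ok()?;
    // statm fields: total_pages resident_pages shared ... — take field 1.
    let resident_pages: usize = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some(resident_pages * 4096) // assumes 4 KiB pages for the sketch
}
```

A background thread calling this every 250 µs and folding the result into an atomic peak counter costs the workload nothing on its hot path, since the kernel maintains RSS on its own.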

CI gates added to lint job:
- `compare_ffi` must not list `bench_internals` in `required-features`
  (would widen Rust-side visibility vs FFI, breaking parity).
- `zstd/src/` must not reference any bench-instrumentation identifier
  (`TrackingAllocator`, `FfiMemTracker`, `PeakWindow`, `ZSTD_customMem`,
  `customMem(`). Bench instrumentation stays in `zstd/benches/`.

Local verification on level_3_dfast / decodecorpus-z000033 (1 MiB):
  decompress c_ffi: 890 µs (no regression vs reusable DCtx baseline)
  compress: ratio + memory numbers unchanged shape vs prior baseline
  Lib tests: 533/533, clippy clean (both feature sets), production
  isolation grep: clean

* fix(ci): skip doc-only lines in bench-leak gate

The leak gate's rg invocation was raw text search and would trip on
identifier mentions inside doc blocks even though the rationale said
"identifier references in actual code are not". Pipe through a second
`rg -v` that drops lines whose content starts with a Rust documentation
prefix (`//`, `/*`, ` *`), so the gate fires only on real code uses of
`TrackingAllocator`, `FfiMemTracker`, `PeakWindow`, `ZSTD_customMem`,
or `customMem(`.

Verified locally:
  - real code (`let _ = TrackingAllocator;`) → gate trips
  - doc-block mentions of every flagged symbol → gate passes
  - current `zstd/src/` → clean (no false positives)

* chore(gitignore): ignore claude session state, pycache, zstd fuzz corpus

* refactor(bench): rename Rust memory field to peak_rss_delta_bytes

The Rust side metric was emitted as `rust_peak_alloc_bytes`, suggesting
parity with the FFI side's precise `ZSTD_customMem` byte count. In
reality it is OS resident-set-size delta from background sampling —
allocations satisfied from already-faulted pages or from the
allocator's cached arena do not bump RSS, so warm scenarios underreport
against true peak allocated bytes.

Renames:
- bench `emit_memory_report` field: `rust_peak_alloc_bytes` →
  `rust_peak_rss_delta_bytes`
- aggregator MEM_RE regex + JSON record key match the new name
- dashboard markdown column header reflects the proxy semantics
- FFI field unchanged — `ffi_peak_alloc_bytes` remains a precise alloc counter

Dashboard metric group key (`peak_alloc_bytes`) is kept stable so the
plot continues to group rust vs ffi side-by-side; doc comments on both
the bench and the aggregator now spell out the asymmetry so cross-side
absolute values aren't read as directly comparable.

* fix(ci): parse Cargo.toml with tomllib for bench-feature gate

The awk/grep gate matched `required-features = [...]` only when the
array fit on one line. A future reformat to a multi-line TOML array
would let `bench_internals` slip into `compare_ffi`'s feature set
without tripping the gate. Switch to `tomllib.load` (stdlib on Python
3.11+, available on every GitHub runner we use) so single-line and
multi-line arrays are handled by the same parse path.

Verified locally:
  - current Cargo.toml → gate passes
  - same file rewritten with `required-features = [\n  "dict_builder",
    \n  "bench_internals"\n]` → gate trips

* refactor(bench): preserve decode RSS window + plain FfiMemTracker arithmetic

Three small bench corrections:

- Decode RSS sampling: `rust_window.finish()` now runs while `target`
  and `decoder` are still alive. Previously they sat in an inner block
  that dropped them before `finish` took its final on-thread sample —
  on sub-poll-interval payloads (small inputs decoding faster than the
  250 µs sampler), the background poller can miss the spike entirely
  and the post-drop final sample underreports actual decode footprint.
- `FfiMemTracker::current` updates with plain `+=` / `-=` instead of
  saturating arithmetic. A single-CCtx measurement bench cannot reach
  `usize::MAX` of live state under any realistic libzstd workload; if
  the counter ever did over/underflow that's a real bug (alloc/free
  imbalance or mis-routed opaque pointer) and a panic surfaces it
  instead of silently freezing at the saturation bound.
- Code comments at the FFI-peak read sites document the intentional
  asymmetry: the FFI tracker observes ONLY libzstd's customMem
  requests, NOT the Rust-owned output/chunk/target buffers the FFI
  helpers also allocate. Both sides allocate output via the system
  allocator and the metric is the libzstd-internal cost on top, so
  wrapping the whole FFI call in an RSS window would inflate the
  comparison with bookkeeping the Rust path also pays.

* fix(bench): take RSS baseline after sampler thread is live

`PeakWindow::start` previously snapshotted `baseline` before
`thread::spawn`, so the sampler thread's own startup cost (stack pages
faulted in, TLS init) leaked into every reported delta as a fixed
positive bias — most visible on the small scenarios where the metric
already lives near the noise floor.

Spawn the sampler first, have it store the first sample as the
initial `peak` value, and only then read `baseline = peak.load(...)`
on the main thread. The sampler thread's footprint is now absorbed
by `baseline` and the returned delta reflects only the workload
allocation.

An `Arc<AtomicBool>` ready flag plus a `thread::yield_now()` spin
gates the baseline read until the sampler has produced its first
sample. Spin is bounded by a single OS RSS query so wall-clock cost
is negligible.
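The spawn-first ordering can be sketched like this — `PeakWindow` and the ready-flag spin follow the commit text, but the sampling source is abstracted to a plain `fn() -> usize` so the sketch stays platform-neutral; everything else (field names, 250 µs interval) is illustrative.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

struct PeakWindow {
    stop: Arc<AtomicBool>,
    peak: Arc<AtomicUsize>,
    baseline: usize,
    handle: thread::JoinHandle<()>,
    sample: fn() -> usize,
}

impl PeakWindow {
    /// Spawn the sampler FIRST; its first stored sample doubles as the
    /// ready signal, and only then is `baseline` read on the main
    /// thread — so the sampler thread's own startup footprint is
    /// absorbed by the baseline rather than biasing the delta.
    fn start(sample: fn() -> usize) -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let peak = Arc::new(AtomicUsize::new(0));
        let ready = Arc::new(AtomicBool::new(false));
        let (s, p, r) = (stop.clone(), peak.clone(), ready.clone());
        let handle = thread::spawn(move || loop {
            p.fetch_max(sample(), Ordering::Relaxed);
            r.store(true, Ordering::Release);
            if s.load(Ordering::Relaxed) {
                break;
            }
            thread::sleep(Duration::from_micros(250));
        });
        // Bounded spin: one sample is produced almost immediately.
        while !ready.load(Ordering::Acquire) {
            thread::yield_now();
        }
        let baseline = peak.load(Ordering::Relaxed);
        PeakWindow { stop, peak, baseline, handle, sample }
    }

    /// Take a final on-thread sample while the workload's buffers are
    /// still alive, then report peak-over-baseline.
    fn finish(self) -> usize {
        self.peak.fetch_max((self.sample)(), Ordering::Relaxed);
        self.stop.store(true, Ordering::Relaxed);
        let _ = self.handle.join();
        self.peak.load(Ordering::Relaxed).saturating_sub(self.baseline)
    }
}
```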

* refactor(bench): split compare_ffi into timing + memory binaries

A single bench tried to do both timing and memory measurement, which
forced asymmetric observers (OS RSS sampling for Rust vs customMem
counter for FFI). Cross-side comparison was misleading because the
two metrics measure different things, and the timing path still paid
RSS-sampler thread overhead even when its data wasn't used.

Split into two binaries with disjoint responsibilities:

- `compare_ffi.rs` (timing + ratio only): runs under a vanilla system
  allocator. No `#[global_allocator]`, no `ZSTD_customMem` hooks, no
  RSS sampler. Criterion samples are unbiased.

- `compare_ffi_memory.rs` (new): standalone `fn main()` that installs
  a `#[global_allocator]` tracking wrapper and routes libzstd's
  `ZSTD_customMem` allocate/free callbacks through `std::alloc::alloc`
  / `std::alloc::dealloc`. A single counter sums Rust-side and FFI
  bytes — by construction symmetric, so the cross-side ratio is
  meaningful again. Per-allocation tracking overhead is fine here
  because this binary doesn't claim to measure timing.

Each binary registers in `Cargo.toml` as a separate `[[bench]]` and
neither requires `bench_internals`, so the existing CI lint gate
(no `bench_internals` in compare_ffi parity benches) extends naturally
to cover both names.

* ci(bench): wire compare_ffi_memory binary into main-push runs

CI plumbing for the new memory bench:

- bench-build step: cargo bench now builds both binaries in one
  invocation. zstd-sys/criterion/etc. only compile once; the binaries
  ship together in the `bench-binary-<target>` artifact.
- run-benchmarks.sh: optional `STRUCTURED_ZSTD_BENCH_MEMORY_BIN` env
  var. When set, the memory binary runs sequentially after the timing
  binary and its `REPORT_MEM` output is appended to the same raw file
  for uniform downstream parsing. When unset (current PR-shard path),
  no memory measurement is taken.
- benchmark shard job: passes `STRUCTURED_ZSTD_BENCH_MEMORY_BIN` only
  on `event_name == 'push'` (main pushes). PRs keep their fast
  review-cycle latency; main pushes get the full memory matrix into
  the dashboard.
- bench-leak gate: extended to forbid `bench_internals` in
  `compare_ffi_memory` too, and the identifier-leak grep now matches
  `ALLOC_PEAK` / `ALLOC_CURRENT` / `TRACKING_ENABLED` from the new
  binary in addition to the previous symbol set.
- run-benchmarks.sh `peak_alloc_bytes` aggregation: restores
  `delta_ratio` and a `delta>1 means Rust allocates more` style
  interpretation. Both columns now come from one observer, so the
  ratio is honest.

* fix(bench): correct allocator header math + bypass tracker for FFI sizes

Two correctness bugs in `compare_ffi_memory` flagged by CR/Copilot:

1. `augmented_layout` reserved a fixed 16-byte header but `alloc` /
   `dealloc` computed the actual header offset as
   `align.max(HEADER_BYTES)`. For SIMD types or any
   align-greater-than-16 allocation the user pointer ended up past
   the System.alloc-owned block — OOB writes during init plus a
   layout mismatch on `dealloc`. Centralised the header size in
   `tracker_header(layout) -> align.max(HEADER_BYTES)` and routed
   `augmented_layout`, `alloc`, and `dealloc` through the same
   helper so the three sites can't drift again.

2. `ffi_alloc` routed `std::alloc::alloc(size + FFI_HEADER)` through
   `TrackingAllocator`, so every libzstd allocation was counted as
   `size + 16` while the Rust side counts only `layout.size()` (the
   original request). That added a fixed 16-byte-per-alloc bias to
   `ffi_peak_alloc_bytes`, skewing the cross-side ratio Copilot
   flagged on PR #143. Switched `ffi_alloc` / `ffi_free` to
   `System.{alloc,dealloc}` (bypassing TrackingAllocator's header
   bookkeeping) and manually `fetch_add`/`fetch_sub` `size` (not
   `total`) into `ALLOC_CURRENT` / `ALLOC_PEAK`. libzstd allocations
   are now counted exactly like Rust allocations. The measured peak
   drops accordingly: customMem `decodecorpus-z000033 level_3_dfast`
   compress FFI peak 3,570,299 → 3,570,267 bytes (delta = 32 bytes of
   header overhead across two allocations).

Two CI-side aux fixes flagged in the same round:

3. `run-benchmarks.sh` now treats an empty `memory_rows` list as
   informational, not fatal — PR shards intentionally don't run the
   memory bench (no `STRUCTURED_ZSTD_BENCH_MEMORY_BIN` set) and would
   have crashed the timing-shard reporter otherwise.

4. The post-artifact `chmod +x` step covers both bench binaries.
   `actions/download-artifact` strips the executable bit, and only
   `compare_ffi` was being marked +x — the memory binary would have
   failed run-benchmarks.sh's `-x` check on main pushes.

Also refreshes the stale "OS RSS delta" wording in the aggregator
comment to describe the unified single-observer setup.
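The bypass in (2) can be sketched as below. The callback signatures mirror `ZSTD_customMem`'s alloc/free function pointers (the free callback receives no size, hence the size-stashing header); `FFI_HEADER`, the counter names, and the 16-byte header size follow the commit description, while the rest is illustrative rather than the actual bench code.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::ffi::c_void;
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOC_CURRENT: AtomicUsize = AtomicUsize::new(0);
static ALLOC_PEAK: AtomicUsize = AtomicUsize::new(0);

// Header stores the request size so ffi_free can recover it; the free
// callback only receives the address.
const FFI_HEADER: usize = 16;

unsafe extern "C" fn ffi_alloc(_opaque: *mut c_void, size: usize) -> *mut c_void {
    let layout = Layout::from_size_align(size + FFI_HEADER, FFI_HEADER).unwrap();
    let base = unsafe { System.alloc(layout) }; // bypasses TrackingAllocator
    if base.is_null() {
        return std::ptr::null_mut();
    }
    unsafe { (base as *mut usize).write(size) };
    // Count the libzstd-requested size only — header bytes excluded —
    // so FFI allocations are tallied exactly like Rust allocations.
    let cur = ALLOC_CURRENT.fetch_add(size, Ordering::Relaxed) + size;
    ALLOC_PEAK.fetch_max(cur, Ordering::Relaxed);
    unsafe { base.add(FFI_HEADER) as *mut c_void }
}

unsafe extern "C" fn ffi_free(_opaque: *mut c_void, ptr: *mut c_void) {
    if ptr.is_null() {
        return;
    }
    let base = unsafe { (ptr as *mut u8).sub(FFI_HEADER) };
    let size = unsafe { (base as *mut usize).read() };
    ALLOC_CURRENT.fetch_sub(size, Ordering::Relaxed);
    let layout = Layout::from_size_align(size + FFI_HEADER, FFI_HEADER).unwrap();
    unsafe { System.dealloc(base, layout) };
}
```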

* fix(ci): omit Peak Allocation section when memory bench did not run

On PR shards `memory_rows` is empty (the memory binary is intentionally
not invoked there). The Markdown report builder still emitted the
`## Peak Allocation Bytes` heading and table header in that case,
leaving a blank section. Guard the heading + table builder behind
`if memory_rows:` so the section is omitted entirely when there's no
data — matches the earlier INFO log that announced the omission.

* fix(ci): scope timing fallback to canonical alerts; keep zero-ffi memory rows

Two narrow correctness fixes in run-benchmarks.sh:

- benchmark-results fallback (timing): previously, when the smoke
  filter matched zero rows on a strategy shard that contained a
  canonical alert level, the fallback rebuilt benchmark_results from
  the unfiltered `timings` list. On a `dfast` shard that also carries
  `level_2_dfast` this reintroduced the non-canonical sibling into
  the github-action-benchmark feed, breaking the "alerts fire ONLY
  on level_3_dfast + level_22_btultra2" contract. The fallback now
  filters `timing_rows` to `REGRESSION_STAGES ∩ regression_levels`
  so the canonical contract holds even when scenario mapping drifts.
- peak_alloc_bytes row emission: `if delta_ratio is None: continue`
  dropped entire rows when a scenario had zero libzstd allocations,
  hiding valid datapoints. Keep the row, set `delta_ratio` and
  `delta_percent` to null when undefined — the dashboard still gets
  both `rust_value` and `ffi_value`.

* docs(bench): align observer wording with current customMem-bypass path

Two stale comments after the FFI tracker switched to `System.alloc` /
`System.dealloc` bypass + manual atomic updates (commit d35a7bd):

- Module-level docstring in `compare_ffi_memory.rs` still said the
  FFI side "routes through the same Rust allocator". Updated to
  describe the actual setup: Rust via `#[global_allocator]`, FFI via
  customMem callbacks calling `System.alloc` directly and manually
  updating shared atomics with the libzstd-requested size only
  (header bytes excluded to avoid double-counting).
- Two equivalent comment blocks in `run-benchmarks.sh` (record
  builder + markdown table intro) carried the same old wording.
  Rewritten to match. No behavior change.

* fix(bench): override realloc in TrackingAllocator + assert exact decode length

Three fixes from CR/Copilot on compare_ffi_memory.rs:

- TrackingAllocator now implements realloc. The default
  GlobalAlloc::realloc does alloc+copy+dealloc, which keeps both old
  and new buffers live during the copy window and temporarily
  inflates ALLOC_PEAK by the old buffer's size. Vec growth (the
  dominant realloc shape in the measured path) was therefore counted
  larger than what System.realloc actually does in production
  (typically in-place resize). New impl routes through System.realloc
  and updates counters by the size delta only — measured peak drops
  ~5% on Rust-side compress for the decodecorpus scenario at
  level_3_dfast (10,500,563 → 9,953,841 bytes), bringing the metric
  in line with real allocator behaviour.
- ffi_decode now asserts written == expected_len. The Rust decode
  measurement already asserts exact length; without the matching
  assertion on the FFI side a partial/truncated decode would still
  emit a ffi_peak_alloc_bytes value, hiding the failure behind a
  misleading metric.
- Module-level doc comment describing the observer setup was stale
  ("FFI callbacks route through std::alloc::alloc"). The customMem
  callbacks have been bypassing TrackingAllocator since commit
  d35a7bd — only the atomic counters are shared. Rewrote the doc to
  describe the actual two-path / one-counter setup.
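The delta-accounting realloc can be sketched as follows — a minimal self-contained tracker, always-on for clarity (the real memory binary's gating and header bookkeeping are omitted), with only the counter names taken from the commit text.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOC_CURRENT: AtomicUsize = AtomicUsize::new(0);
static ALLOC_PEAK: AtomicUsize = AtomicUsize::new(0);

fn bump(n: usize) {
    let cur = ALLOC_CURRENT.fetch_add(n, Ordering::Relaxed) + n;
    ALLOC_PEAK.fetch_max(cur, Ordering::Relaxed);
}

struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let p = unsafe { System.alloc(layout) };
        if !p.is_null() {
            bump(layout.size());
        }
        p
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOC_CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) };
    }

    // Route through System.realloc and account the size DELTA only.
    // The default alloc+copy+dealloc would keep both buffers live
    // during the copy window and overstate peak by the old buffer's
    // size on every Vec growth.
    unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
        let p = unsafe { System.realloc(ptr, layout, new_size) };
        if !p.is_null() {
            if new_size >= layout.size() {
                bump(new_size - layout.size());
            } else {
                ALLOC_CURRENT.fetch_sub(layout.size() - new_size, Ordering::Relaxed);
            }
        }
        p
    }
}

#[global_allocator]
static TRACKER: TrackingAllocator = TrackingAllocator;
```

Vec growth reaches this override via `__rust_realloc`, so the dominant realloc shape in the measured path is counted by its delta rather than its full new size.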

* perf(bench): reduce sample_size from 10 to 3 for Corpus/Entropy/Large/Silesia

Criterion warned 'Unable to complete 10 samples in 4.0s' on heavy
levels (level_22_btultra2 + 1 MiB+ payloads) because a single encode
pass already takes well over a second. Drop sample_size to 3 for the
scenario classes that fired the warning. Three samples + the
existing measurement_time budget is enough for the dashboard's
regression-band sensitivity (±5%) on canonical alert levels while
keeping per-shard wall time bounded. Small scenarios stay at
sample_size=30 (they finish in microseconds, so extra samples are cheap).


Development

Successfully merging this pull request may close these issues.

feat: long distance matching (LDM) for compression

2 participants