Skip to content

perf(encoder): donor-parity greedy parse at L4 — ratio + speed win#179

Merged
polaz merged 7 commits into
mainfrom
perf/#178-dedicated-greedy
May 18, 2026
Merged

perf(encoder): donor-parity greedy parse at L4 — ratio + speed win#179
polaz merged 7 commits into
mainfrom
perf/#178-dedicated-greedy

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 18, 2026

Summary

L4 (StrategyTag::Greedy) was routed through the Row backend's plain start_matching with lazy_depth = 1, paying for an unconditional depth-1 lookahead while still calling itself "greedy". Donor ZSTD_compressBlock_lazy_generic at depth == 0 (zstd_lazy.c:1560) is structurally different — that's why the strategy is named "greedy": not because the matches are long, but because each iteration greedily prefers the cheaper 1 literal + repcode encoding over a 0-literal regular match.

This PR adds RowMatchGenerator::start_matching_greedy mirroring donor's depth-0 lazy_generic flow, and dispatches L4 (lazy_depth == 0) to it via a debug_assert_eq!(matcher.lazy_depth, 0) invariant — the runtime side of the routing is pinned in tests via driver.row_matcher().lazy_depth == 0. L5+ keep the existing plain start_matching path (their lazy lookahead already gives ratio wins vs donor); the plain row-lazy methods are kept as #[allow(dead_code)] scaffolding for a future routing of L5+ through the Row backend.

Structural changes carried over from donor

  • Default start = pos + 1: probe the repcode bank at abs_pos + 1 before regular search. A near-repcode beats a longer regular offset on encoding cost (4-5 bits vs 9-13 bits for the offset code).
  • Hybrid commit (not pure donor goto): donor's pure commit-on-first-rep was tried and cost +1.1pp ratio on z000033 because donor recovers that loss via components we don't replicate (minMatch=5, block splitting). Our hybrid still runs the regular search at abs_pos and picks the longer of (rep@+1, regular@pos) — keeps the speed shape, reclaims the ratio.
  • Skip-step grows on miss, shift = 10: donor uses 8 (kSearchStrength). Shift = 8 cost ~3pp ratio loss; shift = infinity cost ~+14% speed. Shift = 10 sits at the sweet spot — only kicks in past ~1 KiB literal run, so on structured data the step stays at 1.
  • Repcode min_match = 4 (donor MEM_read32 + ZSTD_count). ROW_MIN_MATCH_LEN = 6 gates the regular row-table search; the independent repcode probe benefits from the lower threshold because 4-5-byte reps are cheap to encode and frequently beat the literal alternative. Outer-loop guard is pos + GREEDY_MIN_LOOKAHEAD <= current_len (= 5) so the tail position whose abs_pos + 1 repcode probe needs 4 bytes of lookahead is reachable.

pick_lazy_match_shared is intentionally not called from the new path — depth == 0 means no lookahead, emit on first viable hit.

Note: donor's immediate-repcode loop after store (probe offset_2 for back-to-back hits + swap) was implemented and then removed. A panic! probe inside the inner loop ran the full 528-test suite + bench matrix without firing. The donor design is single-rep; ours is three-rep (repcode_candidate_shared evaluates rep1, rep2, rep3 plus the ll0 fallback), so the inner-loop slot is subsumed by the next main-loop iteration's rep probe. Removing it nudged decodecorpus-z000033 from 538142 to 537897 bytes (-245).

Measurements (L4 only, criterion --baseline main, p < 0.05 unless noted)

Scenario rust_bytes vs main time vs main
small-1k-random 0 +3.4%
small-10k-random 0 -0.2% (p=0.96, noise)
small-4k-log-lines 0 -5.1%
decodecorpus-z000033 -1522 bytes -24.0%
high-entropy-1m 0 -7.8%
low-entropy-1m 0 -2.4%
large-log-stream -2 bytes +9.0%

Ratio: 5 cells equal, 2 cells better, 0 cells worse.
Speed: 4 wins (incl. -24% on z000033 — the cell that motivated #178), 1 noise, 2 regressions.

Tests

8 new L4-targeted tests (plus a private l4_greedy_round_trip helper):

  • driver_level4_selects_row_backend — backend routing + lazy_depth == 0 assertion (pins greedy vs lazy choice in release tests too).
  • driver_level4_greedy_round_trip_single_slice — non-empty triple emission on a repeating pattern.
  • driver_level4_greedy_round_trip_cross_slice — cross-slice match (offset >= chunk.len()).
  • driver_level4_greedy_tail_rep_only_reachable — guards the GREEDY_MIN_LOOKAHEAD = REP_MIN_MATCH_LEN + 1 loop bound; without the fix the 5-byte-tail rep-only position was unreachable.
  • driver_level4_greedy_empty_input_emits_nothing — current_len == 0 early-return path.
  • driver_level4_greedy_sub_min_lookahead_input — 4-byte payload below GREEDY_MIN_LOOKAHEAD; outer loop never executes.
  • driver_level4_greedy_incompressible_input — pseudo-random 256B exercises the miss-branch.
  • driver_level4_greedy_long_literal_run_skip_step_growth — 2 KiB pseudo-random drives lit_len >> SKIP_STRENGTH.
  • driver_level4_greedy_all_zeros_heavy_rep1 — max_offset == 1 rep1 commit-on-first-hit.
  • driver_level4_greedy_periodic_pattern_rep_cascade — 16-byte period steady-state cascade.

row/mod.rs coverage: 89.12% to 96.69%. Remaining gaps are micro-branches (rep probe at exact end-of-buffer, regular-strictly-longer-than-rep arm, compact_history drain) and the #[allow(dead_code)] scaffold.

Known follow-up

The +9% large-log-stream regression and the +3.4% small-1k-random one reflect a different hotpath. samply profile on L4 large-log-stream shows RowMatchGenerator::skip_matching_with_hint consuming ~25% of self time — it inserts every position of each 128 KiB block into the row hash table (~102M insert_position calls on a 100 MB stream), dominating any per-iteration delta in the parse loop. Independent of this PR; tracked as #180.

Test plan

  • cargo nextest run: 528/528 passed (3 skipped)
  • cargo clippy --all-targets --features dict_builder: clean
  • cargo fmt --check: clean
  • cargo llvm-cov --lib: row/mod.rs at 96.69%
  • compare_ffi L4 matrix: full scenario sweep, ratios captured above
  • Ratio gate: rust_bytes <= ffi_bytes constraint from CLAUDE.md respected — no ratio regression on any cell

Related

Summary by CodeRabbit

  • Performance

    • Level 4 Row backend now uses a more aggressive greedy matching path, improving matching behavior for repeats and periodic patterns and yielding faster/more consistent compression.
  • Tests

    • Added extensive Level 4 test coverage: routing/invariant checks, single- and cross-slice greedy round-trips, and multiple regression cases (empty input, incompressible data, long literals, heavy repeats).

Review Change Stack

L4 (`StrategyTag::Greedy`) was routed through the Row backend's plain
`start_matching` with `lazy_depth = 1`, paying for an unconditional
depth-1 lookahead while still calling itself "greedy". Donor
`ZSTD_compressBlock_lazy_generic` at `depth == 0` (`zstd_lazy.c:1560`)
is structurally different — that's why the strategy is named "greedy":
not because the matches are long, but because each iteration greedily
prefers the cheaper `1 literal + repcode` encoding over a 0-literal
regular match.

Add `RowMatchGenerator::start_matching_greedy` that mirrors donor's
depth-0 lazy_generic flow and dispatch L4 (lazy_depth = 0) to it.
Structural features carried over:

  - **Default `start = pos + 1`**: probe the repcode bank at `abs_pos +
    1` before regular search, so a near-repcode beats a longer regular
    offset on encoding cost.
  - **`if depth == 0: goto _storeSequence`**: a rep hit ends the
    iteration without paying for the regular search at `abs_pos`. Our
    hybrid still runs the regular search and picks the longer of (rep@
    +1, regular@pos) — pure donor commit-on-first-rep was tried and
    cost +1.1pp ratio vs main because donor recovers the loss via
    components we don't replicate (`minMatch = 5`, block splitting).
    The hybrid keeps the speed shape and reclaims the ratio.
  - **Immediate-repcode loop after store**: scan forward for
    back-to-back `offset_2` hits with `lit_len = 0`. The donor
    `offset_1 ↔ offset_2` swap is reproduced by
    `encode_offset_with_history(actual = offset_hist[1], lit_len = 0)`.
  - **Skip-step grows on miss**: shift right by 10 (donor uses 8).
    Shift = 8 cost ~3pp ratio loss; shift = ∞ (no grow) cost ~+14%
    speed. Shift = 10 sits at the sweet spot — only kicks in past ~1
    KiB literal run, so on structured data the step stays at 1.
  - **Repcode min_match = 4** (donor `MEM_read32 + ZSTD_count`).
    `ROW_MIN_MATCH_LEN = 6` gates the regular row-table search; the
    independent repcode probe benefits from the lower threshold because
    4–5-byte reps are cheap to encode and frequently beat the literal
    alternative.

`pick_lazy_match_shared` is intentionally not called from the new path
— depth == 0 means no lookahead, emit on first viable hit.

L4 measurements vs main (criterion `--baseline`, p < 0.05 unless noted):

| Scenario              | rust_bytes Δ | time Δ |
|-----------------------|--------------|--------|
| small-1k-random       | 0            | +3.4%  |
| small-10k-random      | 0            | -0.2% (noise, p=0.96) |
| small-4k-log-lines    | 0            | -5.1%  |
| decodecorpus-z000033  | -1277 bytes  | -24.0% |
| high-entropy-1m       | 0            | -7.8%  |
| low-entropy-1m        | 0            | -2.4%  |
| large-log-stream      | -2 bytes     | +9.0%  |

Ratio: 5 cells equal, 2 cells better, **0 cells worse**. The 2 speed
regressions (small-1k-random, large-log-stream) reflect a different
hotpath: profile on large-log-stream shows
`RowMatchGenerator::skip_matching_with_hint` consuming ~25% of self
time, dominating any per-iteration delta in the parse loop. Tracked
separately (see follow-up issue) — independent of this PR.

Refs #178
Copilot AI review requested due to automatic review settings May 18, 2026 15:18
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c27fcc1f-d69f-4821-96d0-85b5ba8bb646

📥 Commits

Reviewing files that changed from the base of the PR and between db07f98 and 3250a07.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

📝 Walkthrough

Walkthrough

Adds a depth-0 greedy parse path to the Row matcher, sets level 4’s Row matcher lazy_depth = 0, routes the backend dispatcher to call the greedy path, and expands tests for level‑4 greedy single-slice, cross-slice, and edge-case behaviors.

Changes

Greedy Matching Path for Compression Level 4

Layer / File(s) Summary
Greedy matcher implementation
zstd/src/encoding/row/mod.rs
Implements RowMatchGenerator::start_matching_greedy (depth‑0 greedy loop), table backfill/insert logic, rep-vs-row candidate selection, skip-step on misses, offset_hist updates, tail insertion, and trailing Sequence::Literals; adds lazy-style docs and scoped #[allow(dead_code)] on helpers.
Dispatcher wiring, level 4 configuration, and tests
zstd/src/encoding/match_generator.rs
Sets LEVEL_TABLE level 4 Row lazy_depth = 0; changes BackendTag::Row to call start_matching_greedy with debug_assert_eq!(matcher.lazy_depth, 0) and adds Level(4) greedy round-trip and regression tests (single-slice, cross-slice, empty input, sub-min lookahead, incompressible data, long literal-run skip growth, heavy rep1, periodic-pattern rep cascades).

Sequence Diagram

sequenceDiagram
  participant TestHarness
  participant MatchGeneratorDriver
  participant RowMatchGenerator
  participant RowTable
  participant OffsetHistory
  TestHarness->>MatchGeneratorDriver: request compress_block(Level(4), slices)
  MatchGeneratorDriver->>RowMatchGenerator: debug_assert_lazy_depth_then_start_matching_greedy()
  RowMatchGenerator->>RowTable: backfill & lookup (abs_pos, abs_pos+1)
  RowMatchGenerator->>OffsetHistory: encode_offset_with_history(offset)
  RowMatchGenerator->>MatchGeneratorDriver: emit Sequence::Triple / Sequence::Literals
  MatchGeneratorDriver->>TestHarness: reconstructed output / match metadata
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

"I hopped through rows with tuned delight,
Probing repcodes by day and emitting triples by night.
Level four taught me to choose the longer chase,
Backfilled tables and left literals a trace.
A greedy rabbit's hop—swift, clever, and light."

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'perf(encoder): donor-parity greedy parse at L4 — ratio + speed win' clearly and specifically describes the main change: implementing a donor-parity greedy parse for compression level 4 with performance improvements.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#178-dedicated-greedy

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 92.64706% with 20 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/encoding/match_generator.rs 91.13% 18 Missing ⚠️
zstd/src/encoding/row/mod.rs 97.10% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/row/mod.rs`:
- Around line 290-315: The outer greedy loop is still gated by ROW_MIN_MATCH_LEN
which prevents evaluating the rep-only tail where REP_MIN_MATCH_LEN = 4 should
apply; introduce a GREEDY_MIN_LOOKAHEAD = REP_MIN_MATCH_LEN + 1 (or simply use
REP_MIN_MATCH_LEN + 1) and replace the loop guard `while pos + ROW_MIN_MATCH_LEN
<= current_len` with `while pos + GREEDY_MIN_LOOKAHEAD <= current_len` so the
`rep_probe_pos = abs_pos + 1` / `rep_match` probe (which checks `rep_probe_pos +
REP_MIN_MATCH_LEN <= self.history_abs_end()`) becomes reachable for the final
bytes. Ensure REP_MIN_MATCH_LEN is defined/visible before the loop so the new
GREEDY_MIN_LOOKAHEAD can reference it.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2274cc9f-d4b5-46c2-9afb-d3d5c1be2110

📥 Commits

Reviewing files that changed from the base of the PR and between c71559e and d832376.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/row/mod.rs

Comment thread zstd/src/encoding/row/mod.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines the level-4 (Greedy) encoder path by adding a dedicated donor-parity greedy loop for the Row backend and routing L4 to it (via lazy_depth = 0), aiming to reduce overhead from unnecessary lazy lookahead while preserving (or improving) ratio via repcode-first probing and an immediate rep loop.

Changes:

  • Added RowMatchGenerator::start_matching_greedy() implementing a donor-inspired greedy parse loop for lazy_depth == 0.
  • Updated level-4 parameters to set lazy_depth: 0 so Greedy no longer routes through the depth-1 lazy machinery.
  • Added Row-backend dispatch logic to select the greedy path when lazy_depth == 0.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
zstd/src/encoding/row/mod.rs Adds the new greedy parsing loop (start_matching_greedy) with repcode-first probing, miss skip-step, and immediate rep loop.
zstd/src/encoding/match_generator.rs Routes level 4 to lazy_depth = 0 and dispatches Row matching to the new greedy path.

Comment thread zstd/src/encoding/row/mod.rs Outdated
Comment thread zstd/src/encoding/row/mod.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs Outdated
`row/mod.rs:290`: outer-loop guard relaxed from
`pos + ROW_MIN_MATCH_LEN <= current_len` to
`pos + GREEDY_MIN_LOOKAHEAD <= current_len` where
`GREEDY_MIN_LOOKAHEAD = REP_MIN_MATCH_LEN + 1 = 5`. The previous
bound was over-strict: a repcode probe at `abs_pos + 1` only needs
4 bytes of lookahead beyond that probe point, so the last 5-byte
position was unreachable even though its rep probe would succeed.
`REP_MIN_MATCH_LEN` is hoisted out of the loop body so the new
outer guard can reference it.

`row/mod.rs:242` (doc): bullet (2) of the greedy parse description
claimed "the regular search at abs_pos is skipped entirely" — that
is donor's depth-0 behavior, not ours. Rewritten to document the
hybrid form (rep probe at +1 *and* regular search at pos, longer
wins, ties go to rep). Explicitly notes why donor's pure
commit-on-first-rep was rejected: ratio cliff without minMatch=5
and superblock entropy sharing.

`row/mod.rs:367` (doc): comment said `SKIP_STRENGTH = 12` (~4 KiB
miss-run threshold) but the constant is 10 (~1 KiB). Comment
realigned to the actual value.

`match_generator.rs:1223` (dispatch): the `Row` backend arm had a
runtime `if matcher.lazy_depth == 0` with a fall-through to the
plain `start_matching` path, but `StrategyTag::backend` maps only
`Greedy` → `Row`, so the `else` was unreachable in normal
operation. Replaced with unconditional `start_matching_greedy`
plus a `debug_assert_eq!(matcher.lazy_depth, 0, …)` documenting
the invariant.

The plain `start_matching` / `best_match` / `pick_lazy_match` /
`repcode_candidate` methods on `RowMatchGenerator` now have no
in-tree caller. Marked `#[allow(dead_code)]` rather than deleted:
a future routing of a lazy strategy through the Row backend would
otherwise have to re-derive the rep+row best-of-two pick and the
`pick_lazy_match` driver. Doc comment makes the rationale explicit.

`match_generator.rs` tests: new
`driver_level4_greedy_round_trip_single_slice` locks down L4 greedy
reconstruction on a repeating-pattern single slice and asserts at
least one `Sequence::Triple` is emitted (the rep probe must produce
a match). `driver_level4_greedy_round_trip_cross_slice` commits two
identical slices and asserts the parse matches into the first
slice's history (offset >= chunk.len()). Covers the new
`start_matching_greedy` entry point + cross-block hash-table
seeding.

Verified locally:
  - cargo nextest run: 521 / 521 pass (3 skipped)
  - cargo clippy --all-targets --features dict_builder: clean
  - cargo fmt --check: clean
  - compare_ffi L4 REPORT lines: ratio unchanged on all scenarios
    (538142 z000033, 8947 large-log-stream, ...)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 3997-4118: The tests only verify round-trip reconstruction but
don't assert that Level(4) actually uses the greedy path; after calling
driver.reset(CompressionLevel::Level(4)) assert the matcher routing state (e.g.
assert_eq!(driver.row_matcher().lazy_depth, 0)) to prove greedy selection, and
add a targeted fixture in driver_level4_greedy_round_trip_single_slice (or a new
test) that inspects the first emitted sequence from start_matching (or uses a
probe of start_matching_greedy vs lazy behavior) to assert a greedy-specific
behavior (for example expecting an immediate rep Triple at start rather than the
lazy parser's different first sequence), referencing
MatchGeneratorDriver::reset, MatchGeneratorDriver::row_matcher, and
start_matching/start_matching_greedy to locate changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d39f8572-2f5f-4744-bb66-d9ce477cdaec

📥 Commits

Reviewing files that changed from the base of the PR and between d832376 and 6bf8628.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/row/mod.rs

Comment thread zstd/src/encoding/match_generator.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/row/mod.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs Outdated
polaz added 2 commits May 18, 2026 19:04
The immediate-rep loop inside `start_matching_greedy` was unreachable
on every test and bench workload: a `panic!` probe placed inside the
loop body ran the full 528-test suite without firing. The donor design
it mirrors is single-rep, where the inner loop catches hits the main
loop's single-slot probe wouldn't. Our main-loop rep probe goes
through `repcode_candidate_shared` which already evaluates rep1, rep2,
rep3 plus the `ll0` fallback — the inner-loop slot is subsumed by the
next outer-iteration rep probe of the same slot, so the inner loop is
dead by construction. Removed.

Side effect on `decodecorpus-z000033` (L4): rust_bytes 538142 -> 537897
(-245 bytes), because the loop's offset_hist rotations on the few
inputs where its slice-equality check passed turned out to bias
subsequent encoding decisions slightly the wrong way. Other scenarios
unchanged.

Tests added (all hit `start_matching_greedy` via L4 driver):
- `l4_greedy_round_trip` private test helper: commits multi-slice data
  per-slice with drive interleaved (single-shot driving loses earlier
  slices' bytes once they fall out of the window).
- `driver_level4_greedy_tail_rep_only_reachable` — cross-slice payload
  whose tail position lives in the 5-byte window that was unreachable
  under the old `pos + ROW_MIN_MATCH_LEN <= current_len` guard. Asserts
  at least one match emitted from that region.
- `driver_level4_greedy_empty_input_emits_nothing` — `current_len == 0`
  early-return guard exercises.
- `driver_level4_greedy_sub_min_lookahead_input` — 4-byte payload
  below `GREEDY_MIN_LOOKAHEAD = 5`; outer loop never runs, all bytes
  emit as literals.
- `driver_level4_greedy_incompressible_input` — 256 pseudo-random
  bytes hit the miss-branch + skip-step path with no rep / row
  candidate.
- `driver_level4_greedy_long_literal_run_skip_step_growth` — 2 KiB of
  pseudo-random data drives the literal-run length past the
  `SKIP_STRENGTH = 10` threshold so the per-miss step grows beyond 1.
- `driver_level4_greedy_all_zeros_heavy_rep1` — heavy rep1 path
  (offset = 1, byte-against-prev-byte). Asserts `max_offset == 1`.
- `driver_level4_greedy_periodic_pattern_rep_cascade` — 16-byte period
  repeated 32x; locks the steady-state rep-cascade at offset >= 16.

Doc on `start_matching_greedy` updated: the donor immediate-rep
paragraph is rewritten to explain WHY it's omitted (single-rep donor
vs three-rep us) and that removal was empirically validated via the
panic probe.

Coverage on `row/mod.rs`: 89.12% -> 96.69%. Remaining gaps are micro-
branches (rep probe at exact end-of-buffer, regular-strictly-longer-
than-rep arm, history compaction drain) and the `#[allow(dead_code)]`
`start_matching` / `best_match` / `pick_lazy_match` / `repcode_candidate`
scaffold for future lazy-on-Row routing.

Verified:
  - cargo nextest run: 528 / 528 pass (3 skipped)
  - cargo clippy --all-targets --features dict_builder: clean
  - cargo fmt --check: clean
  - compare_ffi L4 REPORT (post-removal):
      decodecorpus-z000033: rust_bytes 537897 (was 538142, -245)
      large-log-stream:     rust_bytes 8947  (unchanged)
      others:               unchanged
`driver_level4_selects_row_backend`: extended to also assert
`driver.row_matcher().lazy_depth == 0` after `reset(Level(4))`. The
dispatcher in `MatchGeneratorDriver::start_matching` has a
`debug_assert_eq!(matcher.lazy_depth, 0)` invariant that routes
`BackendTag::Row` unconditionally into `start_matching_greedy` — but
that invariant is only checked in debug builds. The runtime assertion
here pins the routing choice in release tests too, so a future change
that rerouted L4 through the lazy parse (`start_matching` with
`lazy_depth >= 1`) would surface before the round-trip suite runs.
Round-trip correctness alone passes on either parser, which is why
that suite needs a routing-specific companion.

Doc fixes:
  - `row/mod.rs:237` (start_matching_greedy docs): reworded "from a
    depth-1 lazy with `lazy_depth = 0`" — internally inconsistent
    (lazy_depth = 0 IS greedy, not depth-1 lazy). Now reads "from
    the lazy parse in `start_matching` which `lazy_depth >= 1`
    strategies use", removing the contradictory phrasing.
  - Round-trip test doc updated to acknowledge that round-trip alone
    does NOT pin the greedy-vs-lazy choice (the lazy_depth assertion
    in the sibling test does), and to describe what the round-trip
    suite actually guards (non-empty triple emission on structured
    input).

Verified:
  - cargo nextest run: 528 / 528 pass (3 skipped)
@polaz polaz requested a review from Copilot May 18, 2026 16:14
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/row/mod.rs`:
- Around line 409-425: The code currently calls self.insert_positions(abs_pos,
candidate.start + candidate.match_len) which fails to index bytes between
candidate.start and abs_pos when extend_backwards_shared moves candidate.start
backwards; update the insertion range to start from candidate.start (i.e. call
insert_positions with candidate.start as the first arg) so the committed match
region [candidate.start, candidate.start + candidate.match_len) is backfilled
into the row table before emitting the sequence (affecting insert_positions,
candidate.start, abs_pos, and candidate.match_len).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0fb07c76-aba3-45c2-b9bc-daad938da461

📥 Commits

Reviewing files that changed from the base of the PR and between 6bf8628 and 6fb4d37.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/row/mod.rs

Comment thread zstd/src/encoding/row/mod.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/row/mod.rs
Comment thread zstd/src/encoding/match_generator.rs Outdated
… bound

Two doc-only corrections + one declined suggestion with measurement
rationale baked into the source comment.

`row/mod.rs:172` (dead-code `start_matching` doc): the dispatch
invariant referenced by this scaffold method lives in the
`BackendTag::Row` arm of `MatchGeneratorDriver::compress_block`
(`match_generator.rs:1230` area), not in
`MatchGeneratorDriver::start_matching`. The doc comment had it on the
wrong method, which would mislead anyone navigating the call stack
from this scaffold back to its sole caller.

`match_generator.rs:3995` (`driver_level4_selects_row_backend` test
comment): same reference fix — the `debug_assert_eq!` lives inside
`compress_block` not `start_matching`.

`row/mod.rs:411` (`insert_positions` bound): added an explanatory
comment pinning the `abs_pos` lower bound and explaining WHY the
narrower range is intentional. The wider range
`[candidate.start, candidate.start + match_len)` was tried and
regresses `rust_bytes` on `decodecorpus-z000033` by +447
(537897 -> 538344). `extend_backwards_shared` absorbs literal bytes
that were already row-indexed on earlier miss iterations via
`insert_position(abs_pos)`, so re-indexing them at emit time
overwrites the same `abs_pos -> position` mapping a second time
and evicts more recent row-slot tenants. The narrow `abs_pos` bound
keeps the row table healthier and the ratio +0 / -447 the right
direction.

No code-behavior change (the function body's emit logic is the same).

Verified:
  - cargo nextest run: 528 / 528 pass (3 skipped)
  - cargo clippy --all-targets --features dict_builder: clean
  - cargo fmt --check: clean
  - compare_ffi L4 z000033 rust_bytes: 537897 (unchanged from
    pre-doc-fix baseline)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 4216-4225: The test currently asserts on total triples (triples >=
1) which can be satisfied by matches inside the first slice; instead inspect the
detailed match information returned by l4_greedy_round_trip and assert that at
least one match originates in the second slice (i.e., has a start position >=
first.len()). Replace the weak assert with a check on the second return value
from l4_greedy_round_trip (the matches/list/positions it returns) to ensure
there exists a match whose position >= first.len(), referencing the variables
first, second, combined and the l4_greedy_round_trip call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ff899b25-64b2-4a30-ad1d-21318756ca7f

📥 Commits

Reviewing files that changed from the base of the PR and between 6fb4d37 and c2df3f9.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/row/mod.rs

Comment thread zstd/src/encoding/match_generator.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

…sertion

The previous `driver_level4_greedy_tail_rep_only_reachable` asserted
`triples >= 1` on a combined-payload round-trip — too weak, because
the original first slice (`"BEEF_BEEF_aaaaaa"`) already contained
"BEEF" twice and would emit at least one match through the regular
row-table path even if the `GREEDY_MIN_LOOKAHEAD = REP_MIN_MATCH_LEN
+ 1` guard relaxation regressed and the second-slice tail rep stayed
unreachable. Pass-through assertion masking the case under test.

Replace with a slice-by-slice drive that counts triples emitted from
the **second slice's** `start_matching` pass and a deliberately
constructed payload that triggers the exact tail boundary:

- First slice: `"ABCDABCDABCDABCD"` (16 bytes, strict period 4)
  guarantees `offset_hist[0] = 4` by the time the parse reaches the
  end of the slice.
- Second slice: `"ABCDA"` — exactly 5 bytes
  (= `GREEDY_MIN_LOOKAHEAD`). Outer loop runs once at `pos = 0`. The
  regular `row_candidate` requires 6 bytes from `abs_pos`, which is
  past the live-history end, so the only viable hit is the
  `abs_pos + 1` rep probe. `second[1..5] == "BCDA"` matches the
  4-byte sequence `first[13..16] ++ second[0] == "BCDA"` at offset
  4, and `extend_backwards_shared` then absorbs `second[0]` for a
  5-byte rep match. With the old `pos + ROW_MIN_MATCH_LEN <=
  current_len` guard the outer loop would have refused to enter
  (0 + 6 > 5); with the relaxed `pos + GREEDY_MIN_LOOKAHEAD <=
  current_len` guard it enters and the rep fires.

Without this strengthening, a future regression on the loop guard
would not surface here. With it, the test fails immediately.

Verified:
  - cargo nextest run: 528 / 528 pass (3 skipped)
  - cargo clippy --all-targets --features dict_builder: clean
  - cargo fmt --check: clean
@polaz polaz requested a review from Copilot May 18, 2026 21:11
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/match_generator.rs Outdated
…y asserts

The `driver_level4_greedy_long_literal_run_skip_step_growth` test
comment claimed it would catch "a future change to SKIP_STRENGTH"
that would silently skip valid match positions, but the test body
only asserts bit-exact round-trip — round-trip would pass on
SKIP_STRENGTH = 6 or 14 as well, since both produce valid sequences,
just at different per-iteration costs. The doc overstated the
guarantee.

Rewrite the comment to describe what the test actually verifies: a
smoke that the miss branch + step-grow path in
start_matching_greedy survives 2 KiB of incompressible input without
panicking or violating round-trip. Pinning the exact step growth
would require returning step / iteration metadata from the parse,
which is invasive plumbing for a constant that has been stable since
the original tuning round.

Doc-only change. No code-behavior delta. Existing test continues to
run as a stress smoke for the skip-step path.

Verified compile: cargo build --release clean.
@polaz polaz requested a review from Copilot May 18, 2026 21:30
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 62fe6e1 into main May 18, 2026
25 checks passed
@polaz polaz deleted the perf/#178-dedicated-greedy branch May 18, 2026 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants