perf(encoder): inline hash-chain walk into hash_chain_candidate (lazy L1) #185

Merged: polaz merged 2 commits into main from feat/#184-lazy-investigation on May 19, 2026

@polaz commented May 19, 2026

Summary

Inline the hash-chain walk directly into hash_chain_candidate to
eliminate the 4 KiB stack array that chain_candidates materialized on
every call. With lazy_depth = 2 (levels 7+), pick_lazy_match triggers
three chain walks per committed position, so the array form cost
~12 KiB of stack zero-fill and return-copy traffic per accepted match
before any useful comparison happened. The donor implementation,
ZSTD_HcFindBestMatch in zstd_lazy.c, runs a single fused loop with no
intermediate buffer; this change mirrors that.
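
The ~12 KiB figure follows from simple arithmetic, sketched below. The value 512 for MAX_HC_SEARCH_DEPTH is an assumption picked so that a `[usize; N]` buffer comes to 4 KiB on a 64-bit target (the real constant is not shown in this PR), and the helper names are hypothetical. The "one walk per lookahead step plus one" rule is inferred from the three walks quoted for lazy_depth = 2.

```rust
use std::mem::size_of;

// Assumption: 512 is a hypothetical value that makes the buffer 4 KiB on a
// 64-bit target; the crate's real MAX_HC_SEARCH_DEPTH is not shown in this PR.
const MAX_HC_SEARCH_DEPTH: usize = 512;

/// Bytes occupied by one materialized candidate array.
fn candidate_buffer_bytes() -> usize {
    size_of::<[usize; MAX_HC_SEARCH_DEPTH]>()
}

/// Bytes of candidate buffers touched per committed position, assuming one
/// chain walk at the current position plus one per lazy lookahead step.
fn buffer_bytes_per_position(lazy_depth: usize) -> usize {
    (lazy_depth + 1) * candidate_buffer_bytes()
}

fn main() {
    let per_walk = candidate_buffer_bytes();
    let per_position = buffer_bytes_per_position(2);
    println!("{per_walk} bytes per walk, {per_position} bytes per position");
}
```

Under these assumptions a 64-bit build reports 4096 bytes per walk and 12288 bytes per position, before counting the return-by-value copy.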

chain_candidates itself stays live — the chain-walk unit tests
drive it directly, and the BT-optimal HC candidate collector in
match_generator.rs (around line 2437) consumes it through a macro
pipeline that inherits the array form. Inlining the array out of that
BT-optimal site is a separate, larger refactor and is NOT in this PR.

Scope (only lazy hot path)

Single file changed: zstd/src/encoding/hc/mod.rs.

  • hash_chain_candidate: chain walk is now inlined. Loop body fuses
    candidate-position extraction, range check, donor speculative tail
    gate, common_prefix_len, extend_backwards, and best update.
  • chain_candidates: unchanged signature and behavior, still used by
    the BT-optimal HC collector and the chain-walk unit tests.
  • No public API changes, no behavior changes outside the lazy band's
    internal candidate selection (same candidates considered, same
    better_candidate ordering).
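
The difference between the array form and the fused form can be sketched on a toy hash chain. This is a minimal illustration, not the code in zstd/src/encoding/hc/mod.rs: MAX_DEPTH, NONE, best_via_array, best_fused, and the simplified chain_candidates and common_prefix_len signatures are all assumptions, and the range check, speculative tail gate, and extend_backwards steps are left out.

```rust
const MAX_DEPTH: usize = 8; // hypothetical search-depth cap
const NONE: usize = usize::MAX; // hypothetical end-of-chain sentinel

/// Length of the common prefix of `data[a..]` and `data[b..]`.
fn common_prefix_len(data: &[u8], a: usize, b: usize) -> usize {
    data[a..].iter().zip(&data[b..]).take_while(|(x, y)| x == y).count()
}

/// Array form: collect every chain link into a fixed buffer first.
fn chain_candidates(chain: &[usize], head: usize) -> ([usize; MAX_DEPTH], usize) {
    let mut out = [NONE; MAX_DEPTH];
    let mut count = 0;
    let mut cur = head;
    while cur != NONE && count < MAX_DEPTH {
        out[count] = cur;
        count += 1;
        let next = chain[cur];
        if next == cur {
            break; // self-loop: stop instead of spinning
        }
        cur = next;
    }
    (out, count)
}

/// Scores the materialized candidates; returns (position, match length).
fn best_via_array(data: &[u8], chain: &[usize], head: usize, pos: usize) -> Option<(usize, usize)> {
    let (cands, count) = chain_candidates(chain, head);
    let mut best: Option<(usize, usize)> = None;
    for &cand in &cands[..count] {
        let len = common_prefix_len(data, cand, pos);
        if best.map_or(true, |(_, best_len)| len > best_len) {
            best = Some((cand, len));
        }
    }
    best
}

/// Fused form: walk the chain and score each link in the same loop,
/// with no intermediate buffer.
fn best_fused(data: &[u8], chain: &[usize], head: usize, pos: usize) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None;
    let mut cur = head;
    let mut steps = 0;
    while cur != NONE && steps < MAX_DEPTH {
        let len = common_prefix_len(data, cur, pos);
        if best.map_or(true, |(_, best_len)| len > best_len) {
            best = Some((cur, len));
        }
        steps += 1;
        let next = chain[cur];
        if next == cur {
            break; // self-loop: stop instead of spinning
        }
        cur = next;
    }
    best
}

fn main() {
    let data = b"abcdXabcYabcd";
    // Earlier occurrences of "abc" sit at positions 5 and 0; 5 links back to 0.
    let mut chain = vec![NONE; data.len()];
    chain[5] = 0;
    let fused = best_fused(data, &chain, 5, 9);
    assert_eq!(fused, best_via_array(data, &chain, 5, 9));
    println!("{fused:?}");
}
```

Both forms visit the same links in the same order and apply the same comparison, so they agree on the result; the fused form simply never writes or copies the candidate array.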

Measurements

compress/level_{5,8,12,15}_lazy/decodecorpus-z000033/matrix/pure_rust
— criterion 10 samples each, clean back-to-back vs origin/main,
p = 0.00 across all four cells:

| level | main thrpt | branch thrpt | speedup |
|---|---|---|---|
| L5 lazy | 13.5 MiB/s | 25.8 MiB/s | 1.91× |
| L8 lazy | 9.6 MiB/s | 17.0 MiB/s | 1.77× |
| L12 lazy | 8.3 MiB/s | 14.0 MiB/s | 1.69× |
| L15 lazy | 8.1 MiB/s | 13.7 MiB/s | 1.70× |

Ratio (full lazy × scenario matrix via REPORT lines, 77 cells):
bit-identical to origin/main. No ratio change anywhere, no
correctness change — the inlined walk visits the same chain links in
the same order and applies the same predicates.

Verification

  • 534/534 lib tests pass (debug profile)
  • lint pass — clean
  • format check — clean
  • Ratio matrix unchanged vs main

Out of scope (tracked in #184)

  • L2: fuse chain-walk and speculative gate further; add donor
    PREFETCH_L1(chain_table[next & chain_mask])
  • L3: share rep + chain results across the lazy lookahead at
    pos, pos+1, pos+2
  • L4: validate target_len early-exit parity vs donor
  • BT-optimal chain_candidates callsite inlining

Summary by CodeRabbit

  • Refactor
    • Optimized internal compression matching logic to improve efficiency and reduce memory overhead.

`hash_chain_candidate` previously consumed the output of
`chain_candidates`, which returned `[usize; MAX_HC_SEARCH_DEPTH]` —
a 4 KiB stack array that was zero-filled on entry and returned by
value. With `lazy_depth = 2` (levels 7+) `pick_lazy_match` runs three
chain walks per committed position, so the array form spent ~12 KiB of
stack zero-fill and return-copy traffic per accepted match before any
useful work happened.

Inline the chain walk directly into `hash_chain_candidate`: one fused
loop that produces a candidate, runs the donor speculative tail check,
runs `common_prefix_len`, and updates `best` — no intermediate buffer.
Mirrors donor `zstd_lazy.c` `ZSTD_HcFindBestMatch`, which never
materializes a candidate array. `chain_candidates` is kept as the
dump-style helper that the chain-walk unit tests still drive directly.

Verified on `compress/level_{5,8,12,15}_lazy/decodecorpus-z000033/matrix/pure_rust`
(criterion 10 samples, clean back-to-back vs origin/main, p = 0.00 across the board):

| level | main thrpt | this thrpt | speedup |
|---|---|---|---|
| L5 lazy | 13.5 MiB/s | 25.8 MiB/s | 1.91× |
| L8 lazy | 9.6 MiB/s | 17.0 MiB/s | 1.77× |
| L12 lazy | 8.3 MiB/s | 14.0 MiB/s | 1.69× |
| L15 lazy | 8.1 MiB/s | 13.7 MiB/s | 1.70× |

Ratio matrix (lazy band × all 7 scenarios): bit-identical to
origin/main. 534/534 lib tests pass, clippy and fmt clean.

Part of #184.
Copilot AI review requested due to automatic review settings May 19, 2026 07:11

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 25e490ea-8caa-4be8-ae24-5f648802c099

📥 Commits

Reviewing files that changed from the base of the PR and between 62b8f5e and cc63fe2.

📒 Files selected for processing (1)
  • zstd/src/encoding/hc/mod.rs

📝 Walkthrough

HcMatcher::hash_chain_candidate is refactored to inline hash-chain traversal with self-loop detection and speculative 4-byte tail gating. The chain is walked directly using cached hash-table state, candidates are filtered to the live window, and matching is evaluated with a gate that skips expensive prefix computation when monotonicity fails.

Changes

Hash-chain candidate matching optimization

| Layer / File(s) | Summary |
|---|---|
| Inline chain walk initialization (zstd/src/encoding/hc/mod.rs) | Chain traversal is set up inline: hash chain and mask are computed, the current chain cursor is initialized, and max iteration steps are capped by min(self.search_depth, MAX_HC_SEARCH_DEPTH). |
| Speculative matching in chain traversal loop (zstd/src/encoding/hc/mod.rs) | The while-loop walks the hash chain with self-loop detection, filters candidates to the live window [history_abs_start, abs_pos), and applies speculative 4-byte tail gating only when best exists and new_offset >= best.offset. When gating fails, the candidate is skipped; otherwise full common_prefix_len and backward extension are performed. Early return triggers when best.match_len >= target_len. |
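
The speculative tail gate described above can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: the Best struct, passes_tail_gate, should_score, and the exact 4-byte window (the bytes ending at index best_len) are hypothetical choices. The idea is that a candidate can only beat a best match of length best_len if it also agrees with the input at index best_len, so comparing a few bytes there is a cheap filter before the full common_prefix_len scan.

```rust
/// Hypothetical stand-in for the matcher's current best candidate.
#[derive(Clone, Copy)]
struct Best {
    offset: usize,
    match_len: usize,
}

/// Compares the 4 bytes ending at index `best_len` (relative to `cand` and
/// `pos`). A mismatch means the candidate cannot be longer than `best_len`.
/// Requires `cand < pos`.
fn passes_tail_gate(data: &[u8], cand: usize, pos: usize, best_len: usize) -> bool {
    // Too short to form a 4-byte tail; let the full comparison decide.
    if best_len < 3 {
        return true;
    }
    let end = best_len + 1; // one past the byte that would extend the match
    // Not enough input left for any candidate to beat the current best.
    if pos + end > data.len() {
        return false;
    }
    data[cand + best_len - 3..cand + end] == data[pos + best_len - 3..pos + end]
}

/// Applies the gate only when a best exists and the new candidate is not
/// closer than it; closer candidates always get the full comparison.
fn should_score(data: &[u8], cand: usize, pos: usize, best: Option<Best>) -> bool {
    match best {
        Some(b) if pos - cand >= b.offset => passes_tail_gate(data, cand, pos, b.match_len),
        _ => true,
    }
}

fn main() {
    let data = b"abcdXabcYabcd";
    // Current best: 3 bytes at offset 4 (candidate position 5, input position 9).
    let best = Some(Best { offset: 4, match_len: 3 });
    // Candidate 0 is farther away, so it must pass the gate to be scored.
    println!("score candidate 0: {}", should_score(data, 0, 9, best));
}
```

Restricting the gate to candidates with new_offset >= best.offset presumably keeps closer candidates eligible even when they are not longer, since the candidate ordering may prefer a smaller offset; that reading is an inference from the walkthrough, not something the PR states.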

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped along the hash-chain line,
Sniffed self-loops, skipped the wasted time,
Tail-gated matches, quick and lean,
No buffer baggage in between,
A tiny hop for faster rhyme.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: inlining the hash-chain walk into hash_chain_candidate for performance optimization in the encoder, with scope limited to the lazy L1 path. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 98.14815% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/encoding/hc/mod.rs | 98.14% | 1 Missing ⚠️ |


Copilot AI left a comment

Pull request overview

This PR optimizes the HC (hash-chain) match finder by inlining the chain-walk loop directly into HcMatcher::hash_chain_candidate, avoiding materializing a large fixed-size candidate buffer on the stack and reducing per-position stack traffic in lazy parsing.

Changes:

  • Inline the hash-chain walk into hash_chain_candidate (replacing the chain_candidates() buffer materialization in this hot path).
  • Preserve existing behaviors during the walk (window filtering, speculative tail gating, self-loop handling, and search depth cap).

Two doc-only adjustments to the inlined chain walk:

- Outer rationale block: correct the claim that `chain_candidates` is
  a test-only helper. It is still consumed by the BT-optimal HC
  candidate collector in match_generator.rs (around the
  `chain_candidates(...).into_iter()` callsite). Inlining the array
  out of that BT path is a separate refactor and is called out as
  out-of-scope.

- Per-iteration block inside the chain loop: drop the duplicate
  speculative-tail-gate rationale that restated the outer block.
  Keep one short pointer to the outer comment so the hot path stays
  readable.

No code-behavior change; 534/534 lib tests pass, clippy and fmt clean.

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@polaz polaz merged commit af4fddd into main May 19, 2026
25 checks passed
@polaz polaz deleted the feat/#184-lazy-investigation branch May 19, 2026 07:34
@sw-release-bot sw-release-bot Bot mentioned this pull request May 19, 2026