perf(encoder): inline hash-chain walk into hash_chain_candidate (lazy L1)#185
Conversation
`hash_chain_candidate` previously consumed the output of
`chain_candidates`, which returned `[usize; MAX_HC_SEARCH_DEPTH]` —
a 4 KiB stack array that was zero-filled on entry and returned by
value. With `lazy_depth = 2` (levels 7+) `pick_lazy_match` runs three
chain walks per committed position, so the array form spent ~12 KiB of
stack zero-fill and return-copy traffic per accepted match before any
useful work happened.
Inline the chain walk directly into `hash_chain_candidate`: one fused
loop that produces a candidate, runs the donor speculative tail check,
runs `common_prefix_len`, and updates `best` — no intermediate buffer.
Mirrors donor `zstd_lazy.c` `ZSTD_HcFindBestMatch`, which never
materializes a candidate array. `chain_candidates` is kept as the
dump-style helper that the chain-walk unit tests still drive directly.
Verified on `compress/level_{5,8,12,15}_lazy/decodecorpus-z000033/matrix/pure_rust`
(criterion 10 samples, clean back-to-back vs origin/main, p = 0.00 across the board):
| level | main thrpt | this thrpt | speedup |
|---|---|---|---|
| L5 lazy | 13.5 MiB/s | 25.8 MiB/s | 1.91× |
| L8 lazy | 9.6 MiB/s | 17.0 MiB/s | 1.77× |
| L12 lazy | 8.3 MiB/s | 14.0 MiB/s | 1.69× |
| L15 lazy | 8.1 MiB/s | 13.7 MiB/s | 1.70× |
Ratio matrix (lazy band × all 7 scenarios): bit-identical to
origin/main. 534/534 lib tests pass, clippy and fmt clean.
Part of #184.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughHcMatcher::hash_chain_candidate is refactored to inline hash-chain traversal with self-loop detection and speculative 4-byte tail gating. The chain is walked directly using cached hash-table state, candidates are filtered to the live window, and matching is evaluated with a gate that skips expensive prefix computation when monotonicity fails. ChangesHash-chain candidate matching optimization
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR optimizes the HC (hash-chain) match finder by inlining the chain-walk loop directly into HcMatcher::hash_chain_candidate, avoiding materializing a large fixed-size candidate buffer on the stack and reducing per-position stack traffic in lazy parsing.
Changes:
- Inline the hash-chain walk into
hash_chain_candidate(replacing thechain_candidates()buffer materialization in this hot path). - Preserve existing behaviors during the walk (window filtering, speculative tail gating, self-loop handling, and search depth cap).
Two doc-only adjustments to the inlined chain walk: - Outer rationale block: correct the claim that `chain_candidates` is a test-only helper. It is still consumed by the BT-optimal HC candidate collector in match_generator.rs (around the `chain_candidates(...).into_iter()` callsite). Inlining the array out of that BT path is a separate refactor and is called out as out-of-scope. - Per-iteration block inside the chain loop: drop the duplicate speculative-tail-gate rationale that restated the outer block. Keep one short pointer to the outer comment so the hot path stays readable. No code-behavior change; 534/534 lib tests pass, clippy and fmt clean.
Summary
Inline the hash-chain walk directly into
hash_chain_candidatetoeliminate the 4 KiB-per-call stack array that
chain_candidatesmaterialized. With
lazy_depth = 2(levels 7+),pick_lazy_matchtriggers three chain walks per committed position, so the array form
cost ~12 KiB of stack zero-fill + return-copy traffic per accepted
match before any useful comparison happened. Donor
ZSTD_HcFindBestMatchruns a single fused loop with no intermediatebuffer; this mirrors that.
chain_candidatesitself stays live — the chain-walk unit testsdrive it directly, and the BT-optimal HC candidate collector in
match_generator.rs(around line 2437) consumes it through a macropipeline that inherits the array form. Inlining the array out of that
BT-optimal site is a separate, larger refactor and is NOT in this PR.
Scope (only lazy hot path)
Single file changed:
zstd/src/encoding/hc/mod.rs.hash_chain_candidate: chain walk is now inlined. Loop body fusescandidate-position extraction, range check, donor speculative tail
gate,
common_prefix_len,extend_backwards, andbestupdate.chain_candidates: unchanged signature and behavior, still used bythe BT-optimal HC collector and the chain-walk unit tests.
internal candidate selection (same candidates considered, same
better_candidateordering).Measurements
compress/level_{5,8,12,15}_lazy/decodecorpus-z000033/matrix/pure_rust— criterion 10 samples each, clean back-to-back vs
origin/main,p = 0.00across all four cells:Ratio (full lazy × scenario matrix via REPORT lines, 77 cells):
bit-identical to
origin/main. No ratio change anywhere, nocorrectness change — the inlined walk visits the same chain links in
the same order and applies the same predicates.
Verification
Out of scope (tracked in #184)
PREFETCH_L1(chain_table[next & chain_mask])pos,pos+1,pos+2target_lenearly-exit parity vs donorchain_candidatescallsite inliningRelated
Summary by CodeRabbit