perf(decode): unroll HUF 4-stream burst inner loop via const-generic #296
The HUF literal burst inner loop used a runtime `for _ in
0..symbols_per_burst` trip count, leaving an induction variable +
per-iteration check that LLVM could not eliminate. Donor's
`HUF_decompress4X1_usingDTable_internal_fast_c_loop` hardcodes
`for symbol in 0..5`, fully unrolling the 5×4 decode into
straight-line code.
Pure-decode profiling (decode_loop, i9, z000033 L2 dfast) showed
the burst body (`decompress_literals_avx2`) at 27.6% self-time vs
donor's ~4-5% for the entire HUF literal decode — on literal-heavy
weak-compression frames (the worst dashboard r/f points: 0.12-0.29)
this is the single largest decode gap.
Extract the inner symbol loop into `burst_decode_symbols::<SPB>`,
monomorphised on the compile-time symbol count, and dispatch on the
loop-invariant `symbols_per_burst`. SPB=5 covers max_num_bits ∈
{10,11} (large-alphabet literal-heavy — the dominant cost), 6 covers
{8,9}, 7 covers {7}. Rarer small-max tables (few symbols, cheap
overall) keep the dynamic loop. The match is loop-invariant so it
unswitches out of the `while`; each arm gets a fully-unrolled body.
653/653 nextest pass.
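
The dispatch shape described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the PR's code: `Stream`, `TABLE`, `TABLE_LOG`, `decode_one`, `burst_decode_symbols_dyn` and `decode_bursts` are hypothetical names, the 2-bit lookup table stands in for a real HUF decode table, and output goes to a `Vec` rather than raw pointers. Only `burst_decode_symbols::<SPB>` and `symbols_per_burst` are names taken from the description. Whether the optimiser fully unrolls each arm and hoists the `match` out of the loop depends on the build settings, so the generated assembly is worth checking.

```rust
/// Toy stand-in for one bitstream: bits are consumed from the top of `bits`.
/// (Hypothetical; the real decoder's stream state is not shown here.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Stream {
    pub bits: u64,
}

/// Toy prefix code: `0` -> 'a' (1 bit), `10` -> 'b', `11` -> 'c' (2 bits each).
/// The table is indexed by the top `TABLE_LOG` bits and yields (symbol, bits consumed).
const TABLE_LOG: u32 = 2;
const TABLE: [(u8, u32); 4] = [(b'a', 1), (b'a', 1), (b'b', 2), (b'c', 2)];

#[inline(always)]
fn decode_one(s: &mut Stream) -> u8 {
    let (sym, nbits) = TABLE[(s.bits >> (64 - TABLE_LOG)) as usize];
    s.bits <<= nbits;
    sym
}

/// Burst body with a compile-time trip count: each instantiation sees a
/// constant loop bound, which gives the optimiser the chance to unroll it.
#[inline(always)]
fn burst_decode_symbols<const SPB: usize>(streams: &mut [Stream; 4], out: &mut Vec<u8>) {
    for _ in 0..SPB {
        for s in streams.iter_mut() {
            out.push(decode_one(s));
        }
    }
}

/// Fallback with a runtime trip count for the rarer symbol counts.
fn burst_decode_symbols_dyn(streams: &mut [Stream; 4], out: &mut Vec<u8>, n: usize) {
    for _ in 0..n {
        for s in streams.iter_mut() {
            out.push(decode_one(s));
        }
    }
}

/// Outer loop: `symbols_per_burst` never changes inside it, so the `match`
/// is loop-invariant and every arm calls a separately monomorphised body.
pub fn decode_bursts(
    streams: &mut [Stream; 4],
    out: &mut Vec<u8>,
    symbols_per_burst: usize,
    bursts: usize,
) {
    for _ in 0..bursts {
        match symbols_per_burst {
            5 => burst_decode_symbols::<5>(streams, out),
            6 => burst_decode_symbols::<6>(streams, out),
            7 => burst_decode_symbols::<7>(streams, out),
            n => burst_decode_symbols_dyn(streams, out, n),
        }
    }
}
```

Calling `decode_bursts(&mut streams, &mut out, 5, 1)` appends 5 symbols from each of the 4 streams, interleaved stream by stream; a count outside 5/6/7 takes the dynamic path and produces the same symbols for the same inputs.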
Pull request overview
Refactors the inner HUF 4-stream burst loop to extract per-symbol decode into a `burst_decode_symbols::<const SPB: usize>` helper, dispatching on `symbols_per_burst` values 5/6/7 for full LLVM unrolling, with a dynamic fallback for other cases. Matches donor's hardcoded `for symbol in 0..5` unroll structure to reduce decode cost on literal-heavy frames.
Changes:
- Add `burst_decode_symbols<const SPB: usize>` helper containing the 4-stream symbol decode body
- Dispatch the inner burst loop on `symbols_per_burst` with monomorphised arms (5, 6, 7) and a dynamic fallback
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary
Const-generic unroll of the HUF 4-stream literal burst inner loop. Extracts the per-symbol decode into `burst_decode_symbols::<SPB>`, dispatched on the loop-invariant `symbols_per_burst` (5/6/7 monomorphised + dynamic fallback), matching donor's hardcoded `for symbol in 0..5` full unroll in `HUF_decompress4X1_usingDTable_internal_fast_c_loop`.
Bench (i9, decompress pure_rust, vs main)
Neutral-or-better across both fixture classes, zero regression. Donor-parity structure; foundation for further HUF burst work.
Context
Profiling (`decode_loop` pure-decode) showed the HUF burst dominates on literal-heavy frames (`decompress_literals_avx2` 27.6% self vs donor's ~0.08% asm `_fast_asm_loop`). The full gap is asm-gated (donor uses hand-written `huf_decompress_amd64.S` with register-perfect pipelining, issue #205); LLVM-level changes (unroll, software-pipeline, scalarise) are either marginal or fixture trade-offs on the shared decode hot path. This PR keeps ONLY the unambiguously-safe unroll. Reverted experiments (scalarise: literal-heavy −0.98% but match-heavy +0.85%; software-pipeline: −3.9%) are documented out of scope.
Related to #178 / #247.