perf(decode): unroll HUF 4-stream burst inner loop via const-generic by polaz · Pull Request #296 · structured-world/structured-zstd

polaz · 2026-05-29T03:36:34Z

Summary

Unrolls the HUF 4-stream literal burst inner loop via a const generic. The per-symbol decode is extracted into burst_decode_symbols::<SPB> and dispatched on the loop-invariant symbols_per_burst (5/6/7 monomorphised, plus a dynamic fallback). This matches the donor's hardcoded for symbol in 0..5 full unroll in HUF_decompress4X1_usingDTable_internal_fast_c_loop.

Bench (i9, decompress pure_rust, vs main)

Fixture	change
z000033 L2 dfast (literal-heavy)	−0.01% (p=0.83, neutral)
low-entropy-1m btultra2 (match-heavy)	−0.45% (p=0.00)

Neutral-or-better across both fixture classes, zero regression. Donor-parity structure; foundation for further HUF burst work.

Context

Profiling (decode_loop pure-decode) showed the HUF burst dominates on literal-heavy frames (decompress_literals_avx2 27.6% self vs donor's ~0.08% asm _fast_asm_loop). The full gap is asm-gated (donor uses hand-written huf_decompress_amd64.S with register-perfect pipelining — issue #205); LLVM-level changes (unroll, software-pipeline, scalarise) are either marginal or fixture trade-offs on the shared decode hot path. This PR keeps ONLY the unambiguously-safe unroll. Reverted experiments (scalarise: literal-heavy −0.98% but match-heavy +0.85%; software-pipeline: −3.9%) are documented out of scope.

Related to #178 / #247.

The HUF literal burst inner loop used a runtime `for _ in 0..symbols_per_burst` trip count, leaving an induction variable plus a per-iteration check that LLVM could not eliminate. Donor's `HUF_decompress4X1_usingDTable_internal_fast_c_loop` hardcodes `for symbol in 0..5`, fully unrolling the 5×4 decode into straight-line code.

Pure-decode profiling (decode_loop, i9, z000033 L2 dfast) showed the burst body (`decompress_literals_avx2`) at 27.6% self-time vs donor's ~4-5% for the entire HUF literal decode. On literal-heavy weak-compression frames (the worst dashboard r/f points: 0.12-0.29) this is the single largest decode gap.

Extract the inner symbol loop into `burst_decode_symbols::<SPB>`, monomorphised on the compile-time symbol count, and dispatch on the loop-invariant `symbols_per_burst`. SPB=5 covers max_num_bits ∈ {10,11} (large-alphabet literal-heavy, the dominant cost), 6 covers {8,9}, 7 covers {7}. Rarer small-max tables (few symbols, cheap overall) keep the dynamic loop. The match is loop-invariant so it unswitches out of the `while`; each arm gets a fully-unrolled body.

653/653 nextest pass.
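The dispatch shape described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the code in literals_section_decoder.rs: only the names `burst_decode_symbols`, `SPB`, and `symbols_per_burst` come from this PR, while `Entry`, `decode_round`, `burst_decode_dynamic`, `decode_bursts`, and `demo_table` are hypothetical. Each stream is modelled as a single left-aligned `u64` with no refill.

```rust
/// Hypothetical single-symbol table entry; not the crate's real type.
#[derive(Clone, Copy)]
struct Entry {
    symbol: u8,
    nbits: u8,
}

/// Decode one symbol from each of the 4 streams (toy model, no refill).
#[inline(always)]
fn decode_round(table: &[Entry], table_log: u32, bits: &mut [u64; 4], out: &mut [Vec<u8>; 4]) {
    for s in 0..4 {
        // The top `table_log` bits of the stream index the table.
        let e = table[(bits[s] >> (64 - table_log)) as usize];
        out[s].push(e.symbol);
        // Consume the bits used by this symbol.
        bits[s] <<= e.nbits;
    }
}

/// Trip count is the compile-time constant SPB, so the optimizer is free
/// to unroll the loop into straight-line code for each instantiation.
#[inline(always)]
fn burst_decode_symbols<const SPB: usize>(
    table: &[Entry],
    table_log: u32,
    bits: &mut [u64; 4],
    out: &mut [Vec<u8>; 4],
) {
    for _ in 0..SPB {
        decode_round(table, table_log, bits, out);
    }
}

/// Dynamic fallback: the trip count is only known at run time.
fn burst_decode_dynamic(
    n: usize,
    table: &[Entry],
    table_log: u32,
    bits: &mut [u64; 4],
    out: &mut [Vec<u8>; 4],
) {
    for _ in 0..n {
        decode_round(table, table_log, bits, out);
    }
}

/// Outer loop: the match operand never changes between iterations.
fn decode_bursts(
    symbols_per_burst: usize,
    bursts: usize,
    table: &[Entry],
    table_log: u32,
    bits: &mut [u64; 4],
    out: &mut [Vec<u8>; 4],
) {
    for _ in 0..bursts {
        match symbols_per_burst {
            5 => burst_decode_symbols::<5>(table, table_log, bits, out),
            6 => burst_decode_symbols::<6>(table, table_log, bits, out),
            7 => burst_decode_symbols::<7>(table, table_log, bits, out),
            n => burst_decode_dynamic(n, table, table_log, bits, out),
        }
    }
}

/// Toy 2-bit table: "0" -> a (1 bit), "10" -> b (2 bits), "11" -> c (2 bits).
fn demo_table() -> [Entry; 4] {
    [
        Entry { symbol: b'a', nbits: 1 },
        Entry { symbol: b'a', nbits: 1 },
        Entry { symbol: b'b', nbits: 2 },
        Entry { symbol: b'c', nbits: 2 },
    ]
}
```

All arms produce the same output for the same total symbol count; the only difference is whether the trip count is visible to the compiler. Whether a given arm is actually unrolled is up to the optimizer and would need to be confirmed in the generated assembly.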

coderabbitai · 2026-05-29T03:36:41Z

Warning

Review limit reached

@polaz, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 49 minutes. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 470bdac1-7eed-4087-9c38-cc69938d14b5

📥 Commits

Reviewing files that changed from the base of the PR and between 71ba37d and 55b70f1.

📒 Files selected for processing (1)

zstd/src/decoding/literals_section_decoder.rs

Copilot

Pull request overview

Refactors the inner HUF 4-stream burst loop to extract per-symbol decode into a burst_decode_symbols::<const SPB: usize> helper, dispatching on symbols_per_burst values 5/6/7 for full LLVM unrolling, with a dynamic fallback for other cases. Matches donor's hardcoded for symbol in 0..5 unroll structure to reduce decode cost on literal-heavy frames.

Changes:

Add burst_decode_symbols<const SPB: usize> helper containing the 4-stream symbol decode body
Dispatch the inner burst loop on symbols_per_burst with monomorphised arms (5, 6, 7) and a dynamic fallback

codecov · 2026-05-29T07:41:40Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 29, 2026 03:36

Copilot started reviewing on behalf of polaz May 29, 2026 03:36 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

style: rustfmt burst_decode_symbols call sites to multiline

55b70f1

polaz requested a review from Copilot May 29, 2026 07:40

Copilot started reviewing on behalf of polaz May 29, 2026 07:40 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

polaz merged commit 1efbb1d into main May 29, 2026
23 checks passed

polaz deleted the perf/#178-huf-burst-unroll branch May 29, 2026 07:58

sw-release-bot Bot mentioned this pull request May 28, 2026

chore: release v0.0.27 #284

Open

polaz merged 2 commits into main from perf/#178-huf-burst-unroll
