perf(decode): HUF 4-stream rewrite — sentinel-bit refill + 5-symbol unroll + batch RELOAD + const-generic kernel

## Status

Follow-up to **#178 Phase 2** (closed with c_stream gap 3.69× → 2.11×, but `rust_stream` still 3.60× and HUF 4-stream loop unaddressed). Encoder-heavy `L=4+ rust_stream` profile shows ~50 % self-time concentrated in `literals_section_decoder::decode_literals` — HUF decode is the dominant remaining decode bottleneck on M1 aarch64.

Roadmap #28 already references this as the residual "HUF rewrite — tracked as backlog, combining former decode-side tasks #135 (register-resident bit reader) + #136 (const-generic kernel dispatch) into a single coherent rewrite (~250–350 LOC, est. 1.5–2× HUF speedup)".

## Current Rust hot loop

`literals_section_decoder.rs:148-191` decodes **1 symbol per iteration per stream** across the 4 parallel streams. Per symbol: `decode4_symbols_and_num_bits` batches the 4 table lookups, then 4 independent `advance_state_by_bits` calls each refill on demand via `BitReaderReversed::get_bits`. Kernel dispatch (Scalar/Bmi2/Avx2/Vbmi2/Neon/Sve) is a runtime `match` inside `decode_symbol_and_advance` — paid per symbol.

`BitReaderReversed` (`bit_io/bit_reader_reverse.rs`) keeps an explicit `bits_consumed: u8` counter; every `get_bits(n)` does a conditional branch `if self.bits_consumed + n > 64`.

## Donor hot loop (huf_decompress.c:738-819)

`HUF_decompress4X1_usingDTable_internal_fast_c_loop` decodes **5 symbols per stream per iteration** (20 symbols total), then issues **one batch RELOAD for all 4 streams** via `HUF_4X1_RELOAD_STREAM` (lines 795-804, 817-818). Key trick: `bits[stream]` is a 64-bit container with a **sentinel 1 in the MSB**. `ZSTD_countTrailingZeros64(bits[stream])` gives the exact consumed-bit offset without an explicit counter — no per-`get_bits` branch.

Kernel-specialised bodies are macro-generated via `HUF_DGEN(fn)` (lines 98-133) into `fn_default` and `fn_bmi2` — the kernel flag is checked **once per function entry**, not per symbol.

## Adaptation plan (ranked)

Land in this order; later items compose on top of earlier ones.

### A. Sentinel-bit `BitReaderReversed` + ctz refill (~80 LOC)

**File:** `zstd/src/bit_io/bit_reader_reverse.rs:154-300`
**Donor ref:** `bitstream.h:384-405`, `huf_decompress.c:751,789-802`
**Effect:** ~+3-5 % decode. Drops the per-`get_bits` `bits_consumed + n > 64` branch. Foundational — items B and D compose cleaner once the reader exposes the sentinel-bit shape.

### B. 5-symbols-per-stream unroll (~150 LOC)

**File:** `zstd/src/decoding/literals_section_decoder.rs:148-191` (primary target)
**Donor ref:** `huf_decompress.c:809-815` (HUF_4X1_DECODE_SYMBOL macro × 5 × 4)
**Effect:** ~+8-12 %. Cuts main-loop bound checks 5×. Requires explicit 20-symbol unroll inside the per-iteration body.

### D. Batch RELOAD after the 5-symbol burst (~120 LOC, combines with B in one PR)

**File:** `zstd/src/decoding/literals_section_decoder.rs:148-191`
**Donor ref:** `huf_decompress.c:817-818`
**Effect:** ~+5-7 %. Separates symbol-lookup from state advance so 4 RELOADs run together and the OOO engine overlaps them.

### E. Const-generic kernel dispatch (~200 LOC)

**Files:** `zstd/src/huff0/huff0_decoder.rs`, `zstd/src/decoding/literals_section_decoder.rs`
**Donor ref:** `huf_decompress.c:98-133` (`HUF_DGEN(fn)` → `fn_default` / `fn_bmi2`)
**Effect:** ~+1-2 %. Monomorphise `decode_literals::<Scalar>` / `::<X86Bmi2>` / `::<Aarch64Neon>` etc.; kernel match is paid once per function entry, not per symbol.

### C. `refill_fast` vs `refill_slow` split (~60 LOC)

**File:** `zstd/src/bit_io/bit_reader_reverse.rs:200-257`
**Donor ref:** `bitstream.h:400-405` (`BIT_reloadDStreamFast`)
**Effect:** ~+2-3 %. Inline-always fast path with the overflow early-exit, fallback to existing slow path. Builds on top of A.

### G. Pre-computed `olimit` per iteration (~40 LOC)

**File:** `zstd/src/decoding/literals_section_decoder.rs:148-191`
**Donor ref:** `huf_decompress.c:749-765`
**Effect:** ~+1-2 %. Replaces per-symbol multi-stream bound checks with a single `op[3] < olimit` guard.

### F. x86-64 inline-asm fast loop *(stretch goal, only if A-E don't hit the target)*

**File:** new `zstd/src/huff0/huff0_decoder_x86_asm.rs` (or inline)
**Donor ref:** `huf_decompress.c:714-920` (`HUF_decompress4X1_usingDTable_internal_fast_asm_loop`)
**Effect:** +0-1 % on top. High implementation cost (~300 LOC of inline asm), pursue only if A-E leave a measurable gap on x86_64.

## Expected aggregate

A + B + D + E ≈ **15-25 % speedup on HUF 4-stream**, which translates to ~7-12 % on `literals_section_decoder::decode_literals` (HUF is ~50 % self-time there).

Donor parity reference target: `decompress/level_4_dfast/decodecorpus-z000033/rust_stream/matrix/pure_rust` should drop from current 3.60× donor toward `≤ 2.5×`.

## Acceptance criteria

- [ ] A landed: `BitReaderReversed` carries sentinel-bit, `bits_consumed` derived via `ctz`. Per-`get_bits` branch removed. Bench delta on `c_stream`/`rust_stream` recorded.
- [ ] B + D landed (same PR): `decode_literals` 4-stream loop decodes 5 symbols per stream then batch-RELOADs. Bench delta recorded.
- [ ] E landed: const-generic kernel dispatch on the decode-literals entry point. Runtime kernel `match` removed from the per-symbol path.
- [ ] No ratio regression on `compare_ffi` REPORT lines.
- [ ] 548/548 nextest + clippy clean across all 3 feature combos.
- [ ] Hot-path flamegraph attached before/after on `rust_stream/level_4_dfast/decodecorpus-z000033`.

## Out of scope

- C and G are nice-to-have follow-ups, not gates for closing this issue.
- F (inline asm) is stretch — separate issue if pursued.
- Compression-side HUF (encoder) not touched; this issue is decode-only.

## Estimate

5-7 working days (A: 1d, B+D bundled: 2-3d, E: 1-2d, validation + flamegraphs: 1d).

## Related

- Closes the HUF residual from **#178 Phase 2** (closed with c_stream kill-switch hit but `rust_stream` still 3.60×).
- Part of **#28** Tier-1 follow-up after #178.
- Supersedes backlog references to former task IDs #135 (register-resident bit reader) + #136 (const-generic kernel dispatch).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(decode): HUF 4-stream rewrite — sentinel-bit refill + 5-symbol unroll + batch RELOAD + const-generic kernel #199

Status

Current Rust hot loop

Donor hot loop (huf_decompress.c:738-819)

Adaptation plan (ranked)

A. Sentinel-bit `BitReaderReversed` + ctz refill (~80 LOC)

B. 5-symbols-per-stream unroll (~150 LOC)

D. Batch RELOAD after the 5-symbol burst (~120 LOC, combines with B in one PR)

E. Const-generic kernel dispatch (~200 LOC)

C. `refill_fast` vs `refill_slow` split (~60 LOC)

G. Pre-computed `olimit` per iteration (~40 LOC)

F. x86-64 inline-asm fast loop (stretch goal, only if A-E don't hit the target)

Expected aggregate

Acceptance criteria

Out of scope

Estimate

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

perf(decode): HUF 4-stream rewrite — sentinel-bit refill + 5-symbol unroll + batch RELOAD + const-generic kernel #199

Description

Status

Current Rust hot loop

Donor hot loop (huf_decompress.c:738-819)

Adaptation plan (ranked)

A. Sentinel-bit BitReaderReversed + ctz refill (~80 LOC)

B. 5-symbols-per-stream unroll (~150 LOC)

D. Batch RELOAD after the 5-symbol burst (~120 LOC, combines with B in one PR)

E. Const-generic kernel dispatch (~200 LOC)

C. refill_fast vs refill_slow split (~60 LOC)

G. Pre-computed olimit per iteration (~40 LOC)

F. x86-64 inline-asm fast loop (stretch goal, only if A-E don't hit the target)

Expected aggregate

Acceptance criteria

Out of scope

Estimate

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

A. Sentinel-bit `BitReaderReversed` + ctz refill (~80 LOC)

C. `refill_fast` vs `refill_slow` split (~60 LOC)

G. Pre-computed `olimit` per iteration (~40 LOC)