Skip to content

perf(decode): HUF 4-stream rewrite — sentinel-bit refill + 5-symbol unroll + batch RELOAD + const-generic kernel #199

@polaz

Description

@polaz

Status

Follow-up to #178 Phase 2 (closed with c_stream gap 3.69× → 2.11×, but rust_stream still 3.60× and HUF 4-stream loop unaddressed). Encoder-heavy L=4+ rust_stream profile shows ~50 % self-time concentrated in literals_section_decoder::decode_literals — HUF decode is the dominant remaining decode bottleneck on M1 aarch64.

Roadmap #28 already references this as the residual "HUF rewrite — tracked as backlog, combining former decode-side tasks #135 (register-resident bit reader) + #136 (const-generic kernel dispatch) into a single coherent rewrite (~250–350 LOC, est. 1.5–2× HUF speedup)".

Current Rust hot loop

literals_section_decoder.rs:148-191 decodes 1 symbol per iteration per stream across the 4 parallel streams. Per symbol: decode4_symbols_and_num_bits batches the 4 table lookups, then 4 independent advance_state_by_bits calls each refill on demand via BitReaderReversed::get_bits. Kernel dispatch (Scalar/Bmi2/Avx2/Vbmi2/Neon/Sve) is a runtime match inside decode_symbol_and_advance — paid per symbol.

BitReaderReversed (bit_io/bit_reader_reverse.rs) keeps an explicit bits_consumed: u8 counter; every get_bits(n) does a conditional branch if self.bits_consumed + n > 64.

Donor hot loop (huf_decompress.c:738-819)

HUF_decompress4X1_usingDTable_internal_fast_c_loop decodes 5 symbols per stream per iteration (20 symbols total), then issues one batch RELOAD for all 4 streams via HUF_4X1_RELOAD_STREAM (lines 795-804, 817-818). Key trick: bits[stream] is a 64-bit container with a sentinel 1 in the MSB. ZSTD_countTrailingZeros64(bits[stream]) gives the exact consumed-bit offset without an explicit counter — no per-get_bits branch.

Kernel-specialised bodies are macro-generated via HUF_DGEN(fn) (lines 98-133) into fn_default and fn_bmi2 — the kernel flag is checked once per function entry, not per symbol.

Adaptation plan (ranked)

Land in this order; later items compose on top of earlier ones.

A. Sentinel-bit BitReaderReversed + ctz refill (~80 LOC)

File: zstd/src/bit_io/bit_reader_reverse.rs:154-300
Donor ref: bitstream.h:384-405, huf_decompress.c:751,789-802
Effect: ~+3-5 % decode. Drops the per-get_bits bits_consumed + n > 64 branch. Foundational — items B and D compose cleaner once the reader exposes the sentinel-bit shape.

B. 5-symbols-per-stream unroll (~150 LOC)

File: zstd/src/decoding/literals_section_decoder.rs:148-191 (primary target)
Donor ref: huf_decompress.c:809-815 (HUF_4X1_DECODE_SYMBOL macro × 5 × 4)
Effect: ~+8-12 %. Cuts main-loop bound checks 5×. Requires explicit 20-symbol unroll inside the per-iteration body.

D. Batch RELOAD after the 5-symbol burst (~120 LOC, combines with B in one PR)

File: zstd/src/decoding/literals_section_decoder.rs:148-191
Donor ref: huf_decompress.c:817-818
Effect: ~+5-7 %. Separates symbol-lookup from state advance so 4 RELOADs run together and the OOO engine overlaps them.

E. Const-generic kernel dispatch (~200 LOC)

Files: zstd/src/huff0/huff0_decoder.rs, zstd/src/decoding/literals_section_decoder.rs
Donor ref: huf_decompress.c:98-133 (HUF_DGEN(fn)fn_default / fn_bmi2)
Effect: ~+1-2 %. Monomorphise decode_literals::<Scalar> / ::<X86Bmi2> / ::<Aarch64Neon> etc.; kernel match is paid once per function entry, not per symbol.

C. refill_fast vs refill_slow split (~60 LOC)

File: zstd/src/bit_io/bit_reader_reverse.rs:200-257
Donor ref: bitstream.h:400-405 (BIT_reloadDStreamFast)
Effect: ~+2-3 %. Inline-always fast path with the overflow early-exit, fallback to existing slow path. Builds on top of A.

G. Pre-computed olimit per iteration (~40 LOC)

File: zstd/src/decoding/literals_section_decoder.rs:148-191
Donor ref: huf_decompress.c:749-765
Effect: ~+1-2 %. Replaces per-symbol multi-stream bound checks with a single op[3] < olimit guard.

F. x86-64 inline-asm fast loop (stretch goal, only if A-E don't hit the target)

File: new zstd/src/huff0/huff0_decoder_x86_asm.rs (or inline)
Donor ref: huf_decompress.c:714-920 (HUF_decompress4X1_usingDTable_internal_fast_asm_loop)
Effect: +0-1 % on top. High implementation cost (~300 LOC of inline asm), pursue only if A-E leave a measurable gap on x86_64.

Expected aggregate

A + B + D + E ≈ 15-25 % speedup on HUF 4-stream, which translates to ~7-12 % on literals_section_decoder::decode_literals (HUF is ~50 % self-time there).

Donor parity reference target: decompress/level_4_dfast/decodecorpus-z000033/rust_stream/matrix/pure_rust should drop from current 3.60× donor toward ≤ 2.5×.

Acceptance criteria

  • A landed: BitReaderReversed carries sentinel-bit, bits_consumed derived via ctz. Per-get_bits branch removed. Bench delta on c_stream/rust_stream recorded.
  • B + D landed (same PR): decode_literals 4-stream loop decodes 5 symbols per stream then batch-RELOADs. Bench delta recorded.
  • E landed: const-generic kernel dispatch on the decode-literals entry point. Runtime kernel match removed from the per-symbol path.
  • No ratio regression on compare_ffi REPORT lines.
  • 548/548 nextest + clippy clean across all 3 feature combos.
  • Hot-path flamegraph attached before/after on rust_stream/level_4_dfast/decodecorpus-z000033.

Out of scope

  • C and G are nice-to-have follow-ups, not gates for closing this issue.
  • F (inline asm) is stretch — separate issue if pursued.
  • Compression-side HUF (encoder) not touched; this issue is decode-only.

Estimate

5-7 working days (A: 1d, B+D bundled: 2-3d, E: 1-2d, validation + flamegraphs: 1d).

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2-mediumMedium priority — important improvementenhancementNew feature or requestperformancePerformance optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions