Status
Follow-up to #178 Phase 2 (closed with c_stream gap 3.69× → 2.11×, but rust_stream still 3.60× and HUF 4-stream loop unaddressed). Encoder-heavy L=4+ rust_stream profile shows ~50 % self-time concentrated in literals_section_decoder::decode_literals — HUF decode is the dominant remaining decode bottleneck on M1 aarch64.
Roadmap #28 already references this as the residual "HUF rewrite — tracked as backlog, combining former decode-side tasks #135 (register-resident bit reader) + #136 (const-generic kernel dispatch) into a single coherent rewrite (~250–350 LOC, est. 1.5–2× HUF speedup)".
Current Rust hot loop
literals_section_decoder.rs:148-191 decodes 1 symbol per iteration per stream across the 4 parallel streams. Per symbol: decode4_symbols_and_num_bits batches the 4 table lookups, then 4 independent advance_state_by_bits calls each refill on demand via BitReaderReversed::get_bits. Kernel dispatch (Scalar/Bmi2/Avx2/Vbmi2/Neon/Sve) is a runtime match inside decode_symbol_and_advance — paid per symbol.
BitReaderReversed (bit_io/bit_reader_reverse.rs) keeps an explicit bits_consumed: u8 counter; every get_bits(n) does a conditional branch if self.bits_consumed + n > 64.
Donor hot loop (huf_decompress.c:738-819)
HUF_decompress4X1_usingDTable_internal_fast_c_loop decodes 5 symbols per stream per iteration (20 symbols total), then issues one batch RELOAD for all 4 streams via HUF_4X1_RELOAD_STREAM (lines 795-804, 817-818). Key trick: bits[stream] is a 64-bit container with a sentinel 1 in the MSB. ZSTD_countTrailingZeros64(bits[stream]) gives the exact consumed-bit offset without an explicit counter — no per-get_bits branch.
Kernel-specialised bodies are macro-generated via HUF_DGEN(fn) (lines 98-133) into fn_default and fn_bmi2 — the kernel flag is checked once per function entry, not per symbol.
Adaptation plan (ranked)
Land in this order; later items compose on top of earlier ones.
A. Sentinel-bit BitReaderReversed + ctz refill (~80 LOC)
File: zstd/src/bit_io/bit_reader_reverse.rs:154-300
Donor ref: bitstream.h:384-405, huf_decompress.c:751,789-802
Effect: ~+3-5 % decode. Drops the per-get_bits bits_consumed + n > 64 branch. Foundational — items B and D compose cleaner once the reader exposes the sentinel-bit shape.
B. 5-symbols-per-stream unroll (~150 LOC)
File: zstd/src/decoding/literals_section_decoder.rs:148-191 (primary target)
Donor ref: huf_decompress.c:809-815 (HUF_4X1_DECODE_SYMBOL macro × 5 × 4)
Effect: ~+8-12 %. Cuts main-loop bound checks 5×. Requires explicit 20-symbol unroll inside the per-iteration body.
D. Batch RELOAD after the 5-symbol burst (~120 LOC, combines with B in one PR)
File: zstd/src/decoding/literals_section_decoder.rs:148-191
Donor ref: huf_decompress.c:817-818
Effect: ~+5-7 %. Separates symbol-lookup from state advance so 4 RELOADs run together and the OOO engine overlaps them.
E. Const-generic kernel dispatch (~200 LOC)
Files: zstd/src/huff0/huff0_decoder.rs, zstd/src/decoding/literals_section_decoder.rs
Donor ref: huf_decompress.c:98-133 (HUF_DGEN(fn) → fn_default / fn_bmi2)
Effect: ~+1-2 %. Monomorphise decode_literals::<Scalar> / ::<X86Bmi2> / ::<Aarch64Neon> etc.; kernel match is paid once per function entry, not per symbol.
C. refill_fast vs refill_slow split (~60 LOC)
File: zstd/src/bit_io/bit_reader_reverse.rs:200-257
Donor ref: bitstream.h:400-405 (BIT_reloadDStreamFast)
Effect: ~+2-3 %. Inline-always fast path with the overflow early-exit, fallback to existing slow path. Builds on top of A.
G. Pre-computed olimit per iteration (~40 LOC)
File: zstd/src/decoding/literals_section_decoder.rs:148-191
Donor ref: huf_decompress.c:749-765
Effect: ~+1-2 %. Replaces per-symbol multi-stream bound checks with a single op[3] < olimit guard.
F. x86-64 inline-asm fast loop (stretch goal, only if A-E don't hit the target)
File: new zstd/src/huff0/huff0_decoder_x86_asm.rs (or inline)
Donor ref: huf_decompress.c:714-920 (HUF_decompress4X1_usingDTable_internal_fast_asm_loop)
Effect: +0-1 % on top. High implementation cost (~300 LOC of inline asm), pursue only if A-E leave a measurable gap on x86_64.
Expected aggregate
A + B + D + E ≈ 15-25 % speedup on HUF 4-stream, which translates to ~7-12 % on literals_section_decoder::decode_literals (HUF is ~50 % self-time there).
Donor parity reference target: decompress/level_4_dfast/decodecorpus-z000033/rust_stream/matrix/pure_rust should drop from current 3.60× donor toward ≤ 2.5×.
Acceptance criteria
Out of scope
- C and G are nice-to-have follow-ups, not gates for closing this issue.
- F (inline asm) is stretch — separate issue if pursued.
- Compression-side HUF (encoder) not touched; this issue is decode-only.
Estimate
5-7 working days (A: 1d, B+D bundled: 2-3d, E: 1-2d, validation + flamegraphs: 1d).
Related
Status
Follow-up to #178 Phase 2 (closed with c_stream gap 3.69× → 2.11×, but
rust_streamstill 3.60× and HUF 4-stream loop unaddressed). Encoder-heavyL=4+ rust_streamprofile shows ~50 % self-time concentrated inliterals_section_decoder::decode_literals— HUF decode is the dominant remaining decode bottleneck on M1 aarch64.Roadmap #28 already references this as the residual "HUF rewrite — tracked as backlog, combining former decode-side tasks #135 (register-resident bit reader) + #136 (const-generic kernel dispatch) into a single coherent rewrite (~250–350 LOC, est. 1.5–2× HUF speedup)".
Current Rust hot loop
literals_section_decoder.rs:148-191decodes 1 symbol per iteration per stream across the 4 parallel streams. Per symbol:decode4_symbols_and_num_bitsbatches the 4 table lookups, then 4 independentadvance_state_by_bitscalls each refill on demand viaBitReaderReversed::get_bits. Kernel dispatch (Scalar/Bmi2/Avx2/Vbmi2/Neon/Sve) is a runtimematchinsidedecode_symbol_and_advance— paid per symbol.BitReaderReversed(bit_io/bit_reader_reverse.rs) keeps an explicitbits_consumed: u8counter; everyget_bits(n)does a conditional branchif self.bits_consumed + n > 64.Donor hot loop (huf_decompress.c:738-819)
HUF_decompress4X1_usingDTable_internal_fast_c_loopdecodes 5 symbols per stream per iteration (20 symbols total), then issues one batch RELOAD for all 4 streams viaHUF_4X1_RELOAD_STREAM(lines 795-804, 817-818). Key trick:bits[stream]is a 64-bit container with a sentinel 1 in the MSB.ZSTD_countTrailingZeros64(bits[stream])gives the exact consumed-bit offset without an explicit counter — no per-get_bitsbranch.Kernel-specialised bodies are macro-generated via
HUF_DGEN(fn)(lines 98-133) intofn_defaultandfn_bmi2— the kernel flag is checked once per function entry, not per symbol.Adaptation plan (ranked)
Land in this order; later items compose on top of earlier ones.
A. Sentinel-bit
BitReaderReversed+ ctz refill (~80 LOC)File:
zstd/src/bit_io/bit_reader_reverse.rs:154-300Donor ref:
bitstream.h:384-405,huf_decompress.c:751,789-802Effect: ~+3-5 % decode. Drops the per-
get_bitsbits_consumed + n > 64branch. Foundational — items B and D compose cleaner once the reader exposes the sentinel-bit shape.B. 5-symbols-per-stream unroll (~150 LOC)
File:
zstd/src/decoding/literals_section_decoder.rs:148-191(primary target)Donor ref:
huf_decompress.c:809-815(HUF_4X1_DECODE_SYMBOL macro × 5 × 4)Effect: ~+8-12 %. Cuts main-loop bound checks 5×. Requires explicit 20-symbol unroll inside the per-iteration body.
D. Batch RELOAD after the 5-symbol burst (~120 LOC, combines with B in one PR)
File:
zstd/src/decoding/literals_section_decoder.rs:148-191Donor ref:
huf_decompress.c:817-818Effect: ~+5-7 %. Separates symbol-lookup from state advance so 4 RELOADs run together and the OOO engine overlaps them.
E. Const-generic kernel dispatch (~200 LOC)
Files:
zstd/src/huff0/huff0_decoder.rs,zstd/src/decoding/literals_section_decoder.rsDonor ref:
huf_decompress.c:98-133(HUF_DGEN(fn)→fn_default/fn_bmi2)Effect: ~+1-2 %. Monomorphise
decode_literals::<Scalar>/::<X86Bmi2>/::<Aarch64Neon>etc.; kernel match is paid once per function entry, not per symbol.C.
refill_fastvsrefill_slowsplit (~60 LOC)File:
zstd/src/bit_io/bit_reader_reverse.rs:200-257Donor ref:
bitstream.h:400-405(BIT_reloadDStreamFast)Effect: ~+2-3 %. Inline-always fast path with the overflow early-exit, fallback to existing slow path. Builds on top of A.
G. Pre-computed
olimitper iteration (~40 LOC)File:
zstd/src/decoding/literals_section_decoder.rs:148-191Donor ref:
huf_decompress.c:749-765Effect: ~+1-2 %. Replaces per-symbol multi-stream bound checks with a single
op[3] < olimitguard.F. x86-64 inline-asm fast loop (stretch goal, only if A-E don't hit the target)
File: new
zstd/src/huff0/huff0_decoder_x86_asm.rs(or inline)Donor ref:
huf_decompress.c:714-920(HUF_decompress4X1_usingDTable_internal_fast_asm_loop)Effect: +0-1 % on top. High implementation cost (~300 LOC of inline asm), pursue only if A-E leave a measurable gap on x86_64.
Expected aggregate
A + B + D + E ≈ 15-25 % speedup on HUF 4-stream, which translates to ~7-12 % on
literals_section_decoder::decode_literals(HUF is ~50 % self-time there).Donor parity reference target:
decompress/level_4_dfast/decodecorpus-z000033/rust_stream/matrix/pure_rustshould drop from current 3.60× donor toward≤ 2.5×.Acceptance criteria
BitReaderReversedcarries sentinel-bit,bits_consumedderived viactz. Per-get_bitsbranch removed. Bench delta onc_stream/rust_streamrecorded.decode_literals4-stream loop decodes 5 symbols per stream then batch-RELOADs. Bench delta recorded.matchremoved from the per-symbol path.compare_ffiREPORT lines.rust_stream/level_4_dfast/decodecorpus-z000033.Out of scope
Estimate
5-7 working days (A: 1d, B+D bundled: 2-3d, E: 1-2d, validation + flamegraphs: 1d).
Related
rust_streamstill 3.60×).