You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Investigation + implementation track. Decode parity vs donor regressed broadly on cold-budget paths (negative levels + small-block frames), with CI guard x86_64-musl triggering 0.1136 (-88.64%) outside band on decompress/level_-1_fast/decodecorpus-z000033/c_stream. Local re-measurement on main (post-PR #170) confirms the gap is structural — decode is 3-5× slower than donor across the negative-level corpus, not a one-off. The fix touches shared decode/encode primitives (XXH64, bitstream readers, SIMD wildcopy, FSE state machine), so a single optimization sweep benefits both sides. Same investigation also fills the missing dedicated greedy encoder strategy at L4 — currently routed through the Row backend with lazy_depth = 1, which is a placeholder, not a donor-parity greedy impl.
Measured regression
Local A/B on main @ c71559eb, compare_ffi --features dict_builder, scenario decodecorpus-z000033 (1 MiB structured corpus), aarch64-darwin (the CI ratio gate triggers on x86_64-musl; numbers below are aarch64 baseline but the structural conclusion holds across targets):
Level
Stream source
Rust decode
C decode
Gap
-1_fast
c_stream (decompress C-encoded blob)
1.77 ms (550 MiB/s)
479 µs (1.99 GiB/s)
3.69×
-1_fast
rust_stream (decompress Rust-encoded blob)
4.73 ms (205 MiB/s)
885 µs (1.07 GiB/s)
5.35×
The rust_stream cell is worse because the Rust encoder emits more compressed blocks (vs C at L-1 which leans on raw blocks for low-entropy spans) — more compressed blocks → more FSE / Huffman state-machine work per decoded byte. Either way both source paths are far behind donor on the SAME decoder.
User-flagged signal: CI guard on x86_64-musl reports 0.1136 (-88.64%) outside band — that's a normalized throughput ratio (Rust / donor) of 11.36%, equivalent to ~8.8× slower on the worst cell observed by the regression checker. The 3-5× we measure locally is the slower end of the same distribution.
samply on --bench 'decompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust' (5 s profile-time):
% inclusive
Function
~21%
frame_decoder::FrameDecoder::decode_blocks (the giant state machine — block dispatch, sequence loop, output drain)
~3%
xxhash64::Hasher::write (frame-checksum XXH64 over decoded bytes — twox_hash crate, scalar)
~2.5%
decoding::decode_buffer::DecodeBuffer::repeat (match-execution wildcopy on the rolling-history window)
~0.6%
decoding::simd_copy::copy_scalar (the no-SIMD fallback inside the SIMD-dispatch family — indicates not all overshooting copies are hitting the SIMD path)
Donor decode hot path (ZSTD_decompressBlock_internal → ZSTD_decompressSequences* → BIT_reloadDStream → HUF_decompress*_usingDTable_internal → ZSTD_execSequence*) compiles into ~5× less wall-clock for the same input bytes. The structural difference is concentrated in:
Sequence-execution / wildcopy inner loop — donor uses overshoot-tolerant SIMD with separate safe-copy tail; our simd_copy::copy_bytes_overshooting bails to scalar on overlapping copies.
Currently L4 → StrategyTag::Greedy → routes through BackendTag::Row with lazy_depth = 1 (zstd/src/encoding/match_generator.rs:286). This is a placeholder — correct output but uses the Row matcher's full lazy-depth machinery without the lazy_depth == 0 short-circuit. Donor ZSTD_compressBlock_greedy* has a dedicated hot loop with no lazy-bookkeeping overhead.
Donor greedy:
Walks input one position at a time.
Single hash-table lookup per position (no row-bucket scan).
Emits the match immediately on hit (no second-position retry).
Skip-step on miss: linear (no skip-by-distance heuristic that lazy uses).
Our Row backend with lazy_depth = 1 does all of the above PLUS row-bucket scan, second-position lookahead, skip-by-distance miss handler. For L4 specifically all of that is wasted work.
Acceptance criteria
Land in independent PRs, in priority order:
Phase 1 — dedicated greedy strategy (FOUNDATIONAL — do first)
This is core encoder architecture. Builds the proper greedy path before any shared-primitive optimization, because optimizing primitives on top of a placeholder strategy bakes the placeholder in.
Add StrategyTag::Greedy dispatch arm in MatchGeneratorDriver::compress_block that uses a dedicated greedy matcher (single hash table, single-position walk, linear skip-step) — donor-parity with ZSTD_compressBlock_greedy_extDict_generic.
New GreedyMatchGenerator (or fold into existing MatchGenerator with lazy_depth == 0 short-circuit through a separate dedicated code path — no shared lazy bookkeeping on the hot loop).
Benchmark L4 on compare_ffi --features dict_builder matrix; expect 10-30% encode speedup at L4 without ratio change.
Verify ratio preserved (rust_bytes <= ffi_bytes per CLAUDE.md "Ratio first" rule).
Phase 2 — shared decode/encode primitives
Now that we have correct dedicated greedy, optimize the hot primitives that both sides share. Run profile cleanly first on x86_64-musl (current profile is aarch64-darwin) to confirm targets.
simd_copy::copy_scalar audit: identify which overshooting-copy call sites end up on the scalar fallback on x86_64-musl. Likely: small-match (≤8 bytes) execution + RLE/pattern matches where offset < copy_len. Add a small-match-specialized SIMD path or split the dispatch table so the overlap-safe variant is also SIMD-accelerated.
Bitstream reader BMI2 specialization (bit_reader_reverse.rs): runtime is_x86_feature_detected!("bmi2") dispatch for bextr-based variable-width bit extraction. Hot in FSE/Huffman decode kernels.
Phase 3 — decode_blocks body slim-down
If after Phase 1+2 the decode gap is still > 2×, attack FrameDecoder::decode_blocks directly: identify which of the multiple offsets the samples cluster around, extract sub-functions with explicit #[inline] for the per-iteration sequence-execution loop, let the compiler specialize / vectorize.
Kill-switch criteria
Phase 1 (greedy) must not regress L4 ratio vs donor. If dedicated greedy produces worse ratio than the current Row-backend placeholder, keep the placeholder and close greedy as "no win." (Per CLAUDE.md: ratio first.)
Phase 2 must close the L-1 c_stream gap from 3.69× to ≤ 2.0× donor. If it doesn't, proceed to Phase 3.
chore: release v0.0.20 #112 Decompress: apply SIMD/inline fixes to top 3-5 hot funcs from profile (renamed — this issue is the implementation track for that local task)
Status
Investigation + implementation track. Decode parity vs donor regressed broadly on cold-budget paths (negative levels + small-block frames), with CI guard
x86_64-musltriggering0.1136 (-88.64%)outside band ondecompress/level_-1_fast/decodecorpus-z000033/c_stream. Local re-measurement onmain(post-PR #170) confirms the gap is structural — decode is 3-5× slower than donor across the negative-level corpus, not a one-off. The fix touches shared decode/encode primitives (XXH64, bitstream readers, SIMD wildcopy, FSE state machine), so a single optimization sweep benefits both sides. Same investigation also fills the missing dedicated greedy encoder strategy at L4 — currently routed through the Row backend withlazy_depth = 1, which is a placeholder, not a donor-parity greedy impl.Measured regression
Local A/B on
main@c71559eb,compare_ffi --features dict_builder, scenariodecodecorpus-z000033(1 MiB structured corpus), aarch64-darwin (the CI ratio gate triggers on x86_64-musl; numbers below are aarch64 baseline but the structural conclusion holds across targets):-1_fastc_stream(decompress C-encoded blob)-1_fastrust_stream(decompress Rust-encoded blob)The
rust_streamcell is worse because the Rust encoder emits more compressed blocks (vs C at L-1 which leans on raw blocks for low-entropy spans) — more compressed blocks → more FSE / Huffman state-machine work per decoded byte. Either way both source paths are far behind donor on the SAME decoder.User-flagged signal: CI guard on
x86_64-muslreports0.1136 (-88.64%)outside band — that's a normalized throughput ratio (Rust / donor) of 11.36%, equivalent to ~8.8× slower on the worst cell observed by the regression checker. The 3-5× we measure locally is the slower end of the same distribution.Profile findings (decode subtree only, filtered from
compress_ffibench iteration window)samplyon--bench 'decompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust'(5 s profile-time):frame_decoder::FrameDecoder::decode_blocks(the giant state machine — block dispatch, sequence loop, output drain)xxhash64::Hasher::write(frame-checksum XXH64 over decoded bytes —twox_hashcrate, scalar)decoding::decode_buffer::DecodeBuffer::repeat(match-execution wildcopy on the rolling-history window)decoding::simd_copy::copy_scalar(the no-SIMD fallback inside the SIMD-dispatch family — indicates not all overshooting copies are hitting the SIMD path)Donor decode hot path (
ZSTD_decompressBlock_internal→ZSTD_decompressSequences*→BIT_reloadDStream→HUF_decompress*_usingDTable_internal→ZSTD_execSequence*) compiles into ~5× less wall-clock for the same input bytes. The structural difference is concentrated in:simd_copy::copy_bytes_overshootingbails to scalar on overlapping copies.BIT_reloadDStream+FSE_decodeSymbolheavily inlined / target_feature-specialized; ourbit_reader_reverse.rsis generic.twox_hash; already tracked as perf(decoding): integrate AVX2 unroll2 wildcopy candidate #108.Hot primitives shared with encoder
simd_copy::copy_bytes_overshooting: match execution (decode) + match-table population (encode).bit_reader_reverse(decode) +bit_writer(encode), same packing convention.state_table_flat(PR perf(fse): donor FSE_buildCTable_wksp parity — drop per-symbol Vec<State> #166); decoderFSETable::decode.Greedy strategy — missing dedicated implementation
Currently L4 →
StrategyTag::Greedy→ routes throughBackendTag::Rowwithlazy_depth = 1(zstd/src/encoding/match_generator.rs:286). This is a placeholder — correct output but uses the Row matcher's full lazy-depth machinery without thelazy_depth == 0short-circuit. DonorZSTD_compressBlock_greedy*has a dedicated hot loop with no lazy-bookkeeping overhead.Donor greedy:
Our Row backend with
lazy_depth = 1does all of the above PLUS row-bucket scan, second-position lookahead, skip-by-distance miss handler. For L4 specifically all of that is wasted work.Acceptance criteria
Land in independent PRs, in priority order:
Phase 1 — dedicated greedy strategy (FOUNDATIONAL — do first)
This is core encoder architecture. Builds the proper greedy path before any shared-primitive optimization, because optimizing primitives on top of a placeholder strategy bakes the placeholder in.
StrategyTag::Greedydispatch arm inMatchGeneratorDriver::compress_blockthat uses a dedicated greedy matcher (single hash table, single-position walk, linear skip-step) — donor-parity withZSTD_compressBlock_greedy_extDict_generic.GreedyMatchGenerator(or fold into existingMatchGeneratorwithlazy_depth == 0short-circuit through a separate dedicated code path — no shared lazy bookkeeping on the hot loop).compare_ffi --features dict_buildermatrix; expect 10-30% encode speedup at L4 without ratio change.rust_bytes <= ffi_bytesper CLAUDE.md "Ratio first" rule).Phase 2 — shared decode/encode primitives
Now that we have correct dedicated greedy, optimize the hot primitives that both sides share. Run profile cleanly first on
x86_64-musl(current profile isaarch64-darwin) to confirm targets.simd_copy::copy_scalaraudit: identify which overshooting-copy call sites end up on the scalar fallback onx86_64-musl. Likely: small-match (≤8 bytes) execution + RLE/pattern matches whereoffset < copy_len. Add a small-match-specialized SIMD path or split the dispatch table so the overlap-safe variant is also SIMD-accelerated.bit_reader_reverse.rs): runtimeis_x86_feature_detected!("bmi2")dispatch forbextr-based variable-width bit extraction. Hot in FSE/Huffman decode kernels.Phase 3 — decode_blocks body slim-down
If after Phase 1+2 the decode gap is still > 2×, attack
FrameDecoder::decode_blocksdirectly: identify which of the multiple offsets the samples cluster around, extract sub-functions with explicit#[inline]for the per-iteration sequence-execution loop, let the compiler specialize / vectorize.Kill-switch criteria
Related