Skip to content

perf(decode + encode-greedy): close 3-5× donor gap on negative-level decompress; share SIMD primitives + add dedicated greedy strategy #178

@polaz

Description

@polaz

Status

Investigation + implementation track. Decode parity vs donor regressed broadly on cold-budget paths (negative levels + small-block frames), with CI guard x86_64-musl triggering 0.1136 (-88.64%) outside band on decompress/level_-1_fast/decodecorpus-z000033/c_stream. Local re-measurement on main (post-PR #170) confirms the gap is structural — decode is 3-5× slower than donor across the negative-level corpus, not a one-off. The fix touches shared decode/encode primitives (XXH64, bitstream readers, SIMD wildcopy, FSE state machine), so a single optimization sweep benefits both sides. Same investigation also fills the missing dedicated greedy encoder strategy at L4 — currently routed through the Row backend with lazy_depth = 1, which is a placeholder, not a donor-parity greedy impl.

Measured regression

Local A/B on main @ c71559eb, compare_ffi --features dict_builder, scenario decodecorpus-z000033 (1 MiB structured corpus), aarch64-darwin (the CI ratio gate triggers on x86_64-musl; numbers below are aarch64 baseline but the structural conclusion holds across targets):

Level Stream source Rust decode C decode Gap
-1_fast c_stream (decompress C-encoded blob) 1.77 ms (550 MiB/s) 479 µs (1.99 GiB/s) 3.69×
-1_fast rust_stream (decompress Rust-encoded blob) 4.73 ms (205 MiB/s) 885 µs (1.07 GiB/s) 5.35×

The rust_stream cell is worse because the Rust encoder emits more compressed blocks (vs C at L-1 which leans on raw blocks for low-entropy spans) — more compressed blocks → more FSE / Huffman state-machine work per decoded byte. Either way both source paths are far behind donor on the SAME decoder.

User-flagged signal: CI guard on x86_64-musl reports 0.1136 (-88.64%) outside band — that's a normalized throughput ratio (Rust / donor) of 11.36%, equivalent to ~8.8× slower on the worst cell observed by the regression checker. The 3-5× we measure locally is the slower end of the same distribution.

Profile findings (decode subtree only, filtered from compress_ffi bench iteration window)

samply on --bench 'decompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust' (5 s profile-time):

% inclusive Function
~21% frame_decoder::FrameDecoder::decode_blocks (the giant state machine — block dispatch, sequence loop, output drain)
~3% xxhash64::Hasher::write (frame-checksum XXH64 over decoded bytes — twox_hash crate, scalar)
~2.5% decoding::decode_buffer::DecodeBuffer::repeat (match-execution wildcopy on the rolling-history window)
~0.6% decoding::simd_copy::copy_scalar (the no-SIMD fallback inside the SIMD-dispatch family — indicates not all overshooting copies are hitting the SIMD path)

Donor decode hot path (ZSTD_decompressBlock_internalZSTD_decompressSequences*BIT_reloadDStreamHUF_decompress*_usingDTable_internalZSTD_execSequence*) compiles into ~5× less wall-clock for the same input bytes. The structural difference is concentrated in:

  1. Sequence-execution / wildcopy inner loop — donor uses overshoot-tolerant SIMD with separate safe-copy tail; our simd_copy::copy_bytes_overshooting bails to scalar on overlapping copies.
  2. FSE state-machine update — donor BIT_reloadDStream + FSE_decodeSymbol heavily inlined / target_feature-specialized; our bit_reader_reverse.rs is generic.
  3. XXH64 frame checksum — ~3% on scalar twox_hash; already tracked as perf(decoding): integrate AVX2 unroll2 wildcopy candidate #108.

Hot primitives shared with encoder

Greedy strategy — missing dedicated implementation

Currently L4 → StrategyTag::Greedy → routes through BackendTag::Row with lazy_depth = 1 (zstd/src/encoding/match_generator.rs:286). This is a placeholder — correct output but uses the Row matcher's full lazy-depth machinery without the lazy_depth == 0 short-circuit. Donor ZSTD_compressBlock_greedy* has a dedicated hot loop with no lazy-bookkeeping overhead.

Donor greedy:

  • Walks input one position at a time.
  • Single hash-table lookup per position (no row-bucket scan).
  • Emits the match immediately on hit (no second-position retry).
  • Skip-step on miss: linear (no skip-by-distance heuristic that lazy uses).

Our Row backend with lazy_depth = 1 does all of the above PLUS row-bucket scan, second-position lookahead, skip-by-distance miss handler. For L4 specifically all of that is wasted work.

Acceptance criteria

Land in independent PRs, in priority order:

Phase 1 — dedicated greedy strategy (FOUNDATIONAL — do first)

This is core encoder architecture. Builds the proper greedy path before any shared-primitive optimization, because optimizing primitives on top of a placeholder strategy bakes the placeholder in.

  • Add StrategyTag::Greedy dispatch arm in MatchGeneratorDriver::compress_block that uses a dedicated greedy matcher (single hash table, single-position walk, linear skip-step) — donor-parity with ZSTD_compressBlock_greedy_extDict_generic.
  • New GreedyMatchGenerator (or fold into existing MatchGenerator with lazy_depth == 0 short-circuit through a separate dedicated code path — no shared lazy bookkeeping on the hot loop).
  • Benchmark L4 on compare_ffi --features dict_builder matrix; expect 10-30% encode speedup at L4 without ratio change.
  • Verify ratio preserved (rust_bytes <= ffi_bytes per CLAUDE.md "Ratio first" rule).

Phase 2 — shared decode/encode primitives

Now that we have correct dedicated greedy, optimize the hot primitives that both sides share. Run profile cleanly first on x86_64-musl (current profile is aarch64-darwin) to confirm targets.

  • SIMD-XXH64 (already perf(decoding): integrate AVX2 unroll2 wildcopy candidate #108): rolls into this track. Donor parity for frame-checksum kernel; benefits both decode + encode.
  • simd_copy::copy_scalar audit: identify which overshooting-copy call sites end up on the scalar fallback on x86_64-musl. Likely: small-match (≤8 bytes) execution + RLE/pattern matches where offset < copy_len. Add a small-match-specialized SIMD path or split the dispatch table so the overlap-safe variant is also SIMD-accelerated.
  • Bitstream reader BMI2 specialization (bit_reader_reverse.rs): runtime is_x86_feature_detected!("bmi2") dispatch for bextr-based variable-width bit extraction. Hot in FSE/Huffman decode kernels.

Phase 3 — decode_blocks body slim-down

If after Phase 1+2 the decode gap is still > 2×, attack FrameDecoder::decode_blocks directly: identify which of the multiple offsets the samples cluster around, extract sub-functions with explicit #[inline] for the per-iteration sequence-execution loop, let the compiler specialize / vectorize.

Kill-switch criteria

  • Phase 1 (greedy) must not regress L4 ratio vs donor. If dedicated greedy produces worse ratio than the current Row-backend placeholder, keep the placeholder and close greedy as "no win." (Per CLAUDE.md: ratio first.)
  • Phase 2 must close the L-1 c_stream gap from 3.69× to ≤ 2.0× donor. If it doesn't, proceed to Phase 3.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1-highHigh priority — core functionalityenhancementNew feature or requestperformancePerformance optimization

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions