perf(decode + encode-greedy): close 3-5× donor gap on negative-level decompress; share SIMD primitives + add dedicated greedy strategy

## Status

Investigation + implementation track. Decode parity vs donor regressed broadly on cold-budget paths (negative levels + small-block frames), with CI guard `x86_64-musl` triggering `0.1136 (-88.64%)` outside band on `decompress/level_-1_fast/decodecorpus-z000033/c_stream`. Local re-measurement on `main` (post-PR #170) confirms the gap is structural — decode is **3-5× slower than donor** across the negative-level corpus, not a one-off. The fix touches shared decode/encode primitives (XXH64, bitstream readers, SIMD wildcopy, FSE state machine), so a single optimization sweep benefits both sides. Same investigation also fills the missing **dedicated greedy encoder strategy** at L4 — currently routed through the Row backend with `lazy_depth = 1`, which is a placeholder, not a donor-parity greedy impl.

## Measured regression

Local A/B on `main` @ `c71559eb`, `compare_ffi --features dict_builder`, scenario `decodecorpus-z000033` (1 MiB structured corpus), aarch64-darwin (the CI ratio gate triggers on x86_64-musl; numbers below are aarch64 baseline but the structural conclusion holds across targets):

| Level | Stream source | Rust decode | C decode | Gap |
|-------|---------------|-------------|----------|-----|
| `-1_fast` | `c_stream` (decompress C-encoded blob) | 1.77 ms (550 MiB/s) | 479 µs (1.99 GiB/s) | **3.69×** |
| `-1_fast` | `rust_stream` (decompress Rust-encoded blob) | 4.73 ms (205 MiB/s) | 885 µs (1.07 GiB/s) | **5.35×** |

The `rust_stream` cell is worse because the Rust encoder emits more compressed blocks (vs C at L-1 which leans on raw blocks for low-entropy spans) — more compressed blocks → more FSE / Huffman state-machine work per decoded byte. Either way both source paths are far behind donor on the SAME decoder.

User-flagged signal: CI guard on `x86_64-musl` reports `0.1136 (-88.64%)` outside band — that's a normalized throughput ratio (Rust / donor) of 11.36%, equivalent to ~8.8× slower on the worst cell observed by the regression checker. The 3-5× we measure locally is the slower end of the same distribution.

## Profile findings (decode subtree only, filtered from `compress_ffi` bench iteration window)

`samply` on `--bench 'decompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust'` (5 s profile-time):

| % inclusive | Function |
|---|---|
| ~21% | `frame_decoder::FrameDecoder::decode_blocks` (the giant state machine — block dispatch, sequence loop, output drain) |
| ~3% | `xxhash64::Hasher::write` (frame-checksum XXH64 over decoded bytes — `twox_hash` crate, scalar) |
| ~2.5% | `decoding::decode_buffer::DecodeBuffer::repeat` (match-execution wildcopy on the rolling-history window) |
| ~0.6% | `decoding::simd_copy::copy_scalar` (the no-SIMD fallback inside the SIMD-dispatch family — indicates not all overshooting copies are hitting the SIMD path) |

Donor decode hot path (`ZSTD_decompressBlock_internal` → `ZSTD_decompressSequences*` → `BIT_reloadDStream` → `HUF_decompress*_usingDTable_internal` → `ZSTD_execSequence*`) compiles into ~5× less wall-clock for the same input bytes. The structural difference is concentrated in:

1. **Sequence-execution / wildcopy inner loop** — donor uses overshoot-tolerant SIMD with separate safe-copy tail; our `simd_copy::copy_bytes_overshooting` bails to scalar on overlapping copies.
2. **FSE state-machine update** — donor `BIT_reloadDStream` + `FSE_decodeSymbol` heavily inlined / target_feature-specialized; our `bit_reader_reverse.rs` is generic.
3. **XXH64 frame checksum** — ~3% on scalar `twox_hash`; already tracked as #108.

## Hot primitives shared with encoder

- **XXH64**: frame-checksum (decode) + match-finder hashes (encode). Single SIMD-XXH64 lifts both. → #108
- **`simd_copy::copy_bytes_overshooting`**: match execution (decode) + match-table population (encode).
- **Bitstream**: `bit_reader_reverse` (decode) + `bit_writer` (encode), same packing convention.
- **FSE next-state layout**: encoder `state_table_flat` (PR #166); decoder `FSETable::decode`.

## Greedy strategy — missing dedicated implementation

Currently L4 → `StrategyTag::Greedy` → routes through `BackendTag::Row` with `lazy_depth = 1` (`zstd/src/encoding/match_generator.rs:286`). This is a placeholder — correct output but uses the Row matcher's full lazy-depth machinery without the `lazy_depth == 0` short-circuit. Donor `ZSTD_compressBlock_greedy*` has a dedicated hot loop with no lazy-bookkeeping overhead.

Donor greedy:
- Walks input one position at a time.
- Single hash-table lookup per position (no row-bucket scan).
- Emits the match immediately on hit (no second-position retry).
- Skip-step on miss: linear (no skip-by-distance heuristic that lazy uses).

Our Row backend with `lazy_depth = 1` does all of the above PLUS row-bucket scan, second-position lookahead, skip-by-distance miss handler. For L4 specifically all of that is wasted work.

## Acceptance criteria

Land in independent PRs, in priority order:

### Phase 1 — dedicated greedy strategy (FOUNDATIONAL — do first)

This is core encoder architecture. Builds the proper greedy path before any shared-primitive optimization, because optimizing primitives on top of a placeholder strategy bakes the placeholder in.

- [ ] Add `StrategyTag::Greedy` dispatch arm in `MatchGeneratorDriver::compress_block` that uses a dedicated greedy matcher (single hash table, single-position walk, linear skip-step) — donor-parity with `ZSTD_compressBlock_greedy_extDict_generic`.
- [ ] New `GreedyMatchGenerator` (or fold into existing `MatchGenerator` with `lazy_depth == 0` short-circuit through a separate dedicated code path — no shared lazy bookkeeping on the hot loop).
- [ ] Benchmark L4 on `compare_ffi --features dict_builder` matrix; expect 10-30% encode speedup at L4 without ratio change.
- [ ] Verify ratio preserved (`rust_bytes <= ffi_bytes` per CLAUDE.md "Ratio first" rule).

### Phase 2 — shared decode/encode primitives

Now that we have correct dedicated greedy, optimize the hot primitives that both sides share. Run profile cleanly first on `x86_64-musl` (current profile is `aarch64-darwin`) to confirm targets.

- [ ] **SIMD-XXH64** (already #108): rolls into this track. Donor parity for frame-checksum kernel; benefits both decode + encode.
- [ ] **`simd_copy::copy_scalar` audit**: identify which overshooting-copy call sites end up on the scalar fallback on `x86_64-musl`. Likely: small-match (≤8 bytes) execution + RLE/pattern matches where `offset < copy_len`. Add a small-match-specialized SIMD path or split the dispatch table so the overlap-safe variant is also SIMD-accelerated.
- [ ] **Bitstream reader BMI2 specialization** (`bit_reader_reverse.rs`): runtime `is_x86_feature_detected!("bmi2")` dispatch for `bextr`-based variable-width bit extraction. Hot in FSE/Huffman decode kernels.

### Phase 3 — decode_blocks body slim-down

If after Phase 1+2 the decode gap is still > 2×, attack `FrameDecoder::decode_blocks` directly: identify which of the multiple offsets the samples cluster around, extract sub-functions with explicit `#[inline]` for the per-iteration sequence-execution loop, let the compiler specialize / vectorize.

## Kill-switch criteria

- Phase 1 (greedy) must not regress L4 ratio vs donor. If dedicated greedy produces worse ratio than the current Row-backend placeholder, keep the placeholder and close greedy as "no win." (Per CLAUDE.md: ratio first.)
- Phase 2 must close the L-1 c_stream gap from 3.69× to ≤ 2.0× donor. If it doesn't, proceed to Phase 3.

## Related

- #108 SIMD-XXH64 (subsumes / blocks Phase 2 step 1)
- #112 Decompress: apply SIMD/inline fixes to top 3-5 hot funcs from profile (renamed — this issue is the implementation track for that local task)
- Part of #28 Phase 7 Lane B (Decompress) inside #111


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(decode + encode-greedy): close 3-5× donor gap on negative-level decompress; share SIMD primitives + add dedicated greedy strategy #178

Status

Measured regression

Profile findings (decode subtree only, filtered from `compress_ffi` bench iteration window)

Hot primitives shared with encoder

Greedy strategy — missing dedicated implementation

Acceptance criteria

Phase 1 — dedicated greedy strategy (FOUNDATIONAL — do first)

Phase 2 — shared decode/encode primitives

Phase 3 — decode_blocks body slim-down

Kill-switch criteria

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Level	Stream source	Rust decode	C decode	Gap
`-1_fast`	`c_stream` (decompress C-encoded blob)	1.77 ms (550 MiB/s)	479 µs (1.99 GiB/s)	3.69×
`-1_fast`	`rust_stream` (decompress Rust-encoded blob)	4.73 ms (205 MiB/s)	885 µs (1.07 GiB/s)	5.35×

% inclusive	Function
~21%	`frame_decoder::FrameDecoder::decode_blocks` (the giant state machine — block dispatch, sequence loop, output drain)
~3%	`xxhash64::Hasher::write` (frame-checksum XXH64 over decoded bytes — `twox_hash` crate, scalar)
~2.5%	`decoding::decode_buffer::DecodeBuffer::repeat` (match-execution wildcopy on the rolling-history window)
~0.6%	`decoding::simd_copy::copy_scalar` (the no-SIMD fallback inside the SIMD-dispatch family — indicates not all overshooting copies are hitting the SIMD path)

perf(decode + encode-greedy): close 3-5× donor gap on negative-level decompress; share SIMD primitives + add dedicated greedy strategy #178

Description

Status

Measured regression

Profile findings (decode subtree only, filtered from compress_ffi bench iteration window)

Hot primitives shared with encoder

Greedy strategy — missing dedicated implementation

Acceptance criteria

Phase 1 — dedicated greedy strategy (FOUNDATIONAL — do first)

Phase 2 — shared decode/encode primitives

Phase 3 — decode_blocks body slim-down

Kill-switch criteria

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Profile findings (decode subtree only, filtered from `compress_ffi` bench iteration window)