perf(decode): collapse FrameDecoder → block_decoder → ... 9-layer chain to donor's 3-layer ZSTD_decompressBlock_internal shape

## Context

Layer count from `FrameDecoder::decode_to_slice_trusted` to actual byte-level work:

```
FrameDecoder::decode_to_slice_trusted
  → decode_block_content_from_slice<DirectScratch>
    → decompress_block_inplace<DirectScratch>
      → decompress_block_inplace_with_parts<UserSliceBackend>
        → decode_and_execute_sequences<UserSliceBackend>
          → run_pipelined_sequence_loop<UserSliceBackend>
            → execute_one_sequence_pipelined<UserSliceBackend>
              → buffer.push / buffer.repeat_lookahead_prefetched
                → UserSliceBackend::extend / extend_from_within_unchecked
                  → simd_copy::copy_bytes_overshooting
                    → single_op_copy_16 / chunked SIMD
```

That's **9+ layers** of function-call indirection per sequence. Most are marked `#[inline(always)]`, but LLVM doesn't always flatten a chain this deep, even under fat LTO + `codegen-units=1`, and when it does, the resulting IR bloat increases pressure on later passes. PR #263 collapsed the last 4 layers for `UserSliceBackend`'s donor inline path; the outer 5 remain.
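To make the shape concrete, here is a toy sketch of a layered, generic forwarding chain next to a collapsed equivalent. Every name in it (`Backend`, `VecBackend`, `push_literals`, `execute_one`, `run_loop`, `run_loop_flat`) is hypothetical and stands in for the real decoder types; it only illustrates the structure, not the crate's actual code or its codegen.

```rust
// Hypothetical stand-in for an output backend; not the crate's real trait.
trait Backend {
    fn extend(&mut self, bytes: &[u8]);
}

struct VecBackend(Vec<u8>);

impl Backend for VecBackend {
    #[inline(always)]
    fn extend(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
}

// Layered shape: each wrapper only forwards to the next one, so a flat
// hot loop depends on the inliner flattening every level.
#[inline(always)]
fn push_literals<B: Backend>(out: &mut B, lits: &[u8]) {
    out.extend(lits);
}

#[inline(always)]
fn execute_one<B: Backend>(out: &mut B, lits: &[u8]) {
    push_literals(out, lits);
}

#[inline(always)]
fn run_loop<B: Backend>(out: &mut B, seqs: &[&[u8]]) {
    for lits in seqs {
        execute_one(out, lits);
    }
}

// Collapsed shape: same behaviour written as one body for one concrete
// backend, so the loop is flat regardless of what the inliner decides.
fn run_loop_flat(out: &mut Vec<u8>, seqs: &[&[u8]]) {
    for lits in seqs {
        out.extend_from_slice(lits);
    }
}
```

Both variants produce the same bytes; the difference is only whether the flat loop is guaranteed by the source or left to the optimizer.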

Donor zstd's equivalent shape (`lib/decompress/zstd_decompress_block.c::ZSTD_decompressBlock_internal`):

```
ZSTD_decompressFrame
  → ZSTD_decompressBlock_internal
    → ZSTD_decodeLiteralsBlock + ZSTD_decompressSequences_body
```

3 layers max, with the hot sequence body (`ZSTD_decompressSequences_body`) explicitly `FORCE_INLINE_TEMPLATE`. The compiler emits that body flat, with its inner helpers pasted inline rather than reached through a chain of calls.

## Proposal

Collapse the decoder's outer block-driver chain into a single donor-shape `decompress_block_donor` function for the `UserSliceBackend` direct-decode path. Specifically:

1. New `block_decoder_donor.rs` module with `unsafe fn decompress_block_donor(...)` — mirrors donor's `ZSTD_decompressBlock_internal` shape: receives header, source ptr, dest ptr, scratch refs; dispatches block type (Raw / RLE / Compressed); for Compressed, decodes literals THEN runs the sequence loop inline (no `decompress_block_inplace_*` wrappers).
2. `FrameDecoder::decode_to_slice_trusted` calls this directly for compressed blocks, skipping the BlockDecoder layer entirely.
3. Existing `block_decoder.rs` stays for FlatBuf / RingBuffer paths — preserves the legacy abstraction surface on backends that need it.
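The single-function dispatch described in step 1 could look roughly like the sketch below. It is a safe-slice approximation, not the proposed `unsafe fn` with raw pointers: `BlockHeader`, `BlockError`, the field names, and the function name `decompress_block_donor_sketch` are all assumptions made for illustration, and the Compressed arm is deliberately left as a stub where literal decoding and the sequence loop would be written inline.

```rust
/// Block types defined by the Zstandard frame format.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BlockType {
    Raw,
    Rle,
    Compressed,
}

/// Hypothetical parsed block header; field names are assumptions.
#[derive(Clone, Copy, Debug)]
struct BlockHeader {
    block_type: BlockType,
    /// Raw: bytes to copy. RLE: repeat count. Compressed: compressed size.
    block_size: usize,
}

#[derive(Debug, PartialEq)]
enum BlockError {
    SrcTooShort,
    DstTooSmall,
    Unsupported,
}

/// Dispatches on block type in one body and returns the bytes written.
fn decompress_block_donor_sketch(
    header: BlockHeader,
    src: &[u8],
    dst: &mut [u8],
) -> Result<usize, BlockError> {
    let n = header.block_size;
    match header.block_type {
        BlockType::Raw => {
            // Raw block: copy `n` bytes from source to destination.
            let input = src.get(..n).ok_or(BlockError::SrcTooShort)?;
            let out = dst.get_mut(..n).ok_or(BlockError::DstTooSmall)?;
            out.copy_from_slice(input);
            Ok(n)
        }
        BlockType::Rle => {
            // RLE block: one source byte repeated `n` times.
            let byte = *src.first().ok_or(BlockError::SrcTooShort)?;
            let out = dst.get_mut(..n).ok_or(BlockError::DstTooSmall)?;
            out.fill(byte);
            Ok(n)
        }
        BlockType::Compressed => {
            // Stub: the real function would decode literals and then run
            // the sequence loop here, with no wrapper calls in between.
            Err(BlockError::Unsupported)
        }
    }
}
```

The point of the shape is that all three block types are handled in one function body, so the compressed path has nowhere to hide extra wrapper layers.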

Out of scope (separate refactors): legacy `decode_all` chain, RingBuffer multi-segment streaming.

## Acceptance criteria

- [ ] `block_decoder_donor.rs` with the collapsed body.
- [ ] `FrameDecoder::decode_to_slice_trusted` dispatches to it on the UserSliceBackend + non-dict path.
- [ ] Layer count from entry to byte work drops from 9+ to ≤ 4 (verify via `cargo asm` / disasm).
- [ ] All 637+ nextest tests pass.
- [ ] Bench delta on i9 across `decompress/level_-1_fast/{decodecorpus-z000033/c_stream, low-entropy-1m/rust_stream}/matrix/pure_rust_direct` and `small-10k-random/level_-6_fast/c_stream`: ≥ 5% time reduction averaged, no regression on any single fixture.
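One way to read the bench criterion as a mechanical check is sketched below. It assumes "averaged" means the arithmetic mean of per-fixture fractional time reductions and that "no regression" means no fixture gets slower at all; the issue does not pin down either, nor a noise tolerance, so treat both as assumptions. The function names are hypothetical.

```rust
/// Per-fixture fractional time reduction: (before - after) / before.
/// Positive values mean the fixture got faster.
fn reductions(before_ns: &[f64], after_ns: &[f64]) -> Vec<f64> {
    before_ns
        .iter()
        .zip(after_ns)
        .map(|(b, a)| (b - a) / b)
        .collect()
}

/// True when the mean reduction is at least 5% and no fixture regressed.
/// Assumes both slices list the same fixtures in the same order.
fn meets_bench_criterion(before_ns: &[f64], after_ns: &[f64]) -> bool {
    if before_ns.is_empty() || before_ns.len() != after_ns.len() {
        return false;
    }
    let r = reductions(before_ns, after_ns);
    let mean = r.iter().sum::<f64>() / r.len() as f64;
    let no_regression = r.iter().all(|&x| x >= 0.0);
    mean >= 0.05 && no_regression
}
```

Under this reading, a large win on one fixture cannot mask a slowdown on another, which matches the "no regression on any single fixture" clause.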

Part of #247.

