perf(decode): match prefetch 4-stage pipeline for long-distance matches #208

## Context

`execute_sequences` reads from history at `cursor - offset` for each match.
For long-distance matches (offset > L2/L3 size), the source line is cold
in cache and the read stalls the pipeline.

Donor `ZSTD_decompressSequencesLong` (upstream zstd,
`lib/decompress/zstd_decompress_block.c`) uses a 4-stage software pipeline:
it decodes sequence N+3 while executing sequence N, so each match source
line is prefetched 3-4 iterations before it is read.

## Goal

Port the 4-stage prefetch pipeline:

1. Decode sequence N+3 → emit prefetch for source line of match N+3
2. Decode sequence N+2 → continue prefetching
3. Decode sequence N+1 → source line for N now warm, execute match N
4. Steady state: every iteration decodes and prefetches one new sequence
   (N+3) and executes the oldest pending one (N)
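
The steps above can be sketched roughly as follows. This is a minimal illustration, not the donor's code or the existing `execute_sequences`: the `Sequence` struct, `PIPELINE_DEPTH`, `prefetch_read`, `execute_one`, and `execute_pipelined` are all hypothetical names, and a plain `Vec<u8>` stands in for the real decode buffer. Sequences are assumed to be already decoded, so "decode" here just means taking the next item from a slice. The sketch prefetches every match whose source bytes are already written; gating on offset is left out.

```rust
use std::collections::VecDeque;

/// Hypothetical decoded sequence; field names are illustrative only.
#[derive(Clone, Copy, Debug)]
struct Sequence {
    literal_len: usize,
    match_len: usize,
    offset: usize,
}

/// Assumed number of sequences kept decoded ahead of execution (N+1..N+3).
const PIPELINE_DEPTH: usize = 3;

/// Issue a read prefetch hint; a no-op on targets other than x86_64.
#[cfg(target_arch = "x86_64")]
fn prefetch_read(ptr: *const u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // Prefetch is only a hint and does not fault; SSE is baseline on x86_64.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr.cast::<i8>()) };
}

#[cfg(not(target_arch = "x86_64"))]
fn prefetch_read(_ptr: *const u8) {}

/// Copy the literals, then the match, for one sequence.
fn execute_one(
    seq: Sequence,
    literals: &[u8],
    lit_pos: &mut usize,
    out: &mut Vec<u8>,
) -> Result<(), &'static str> {
    let lit_end = *lit_pos + seq.literal_len;
    let lits = literals.get(*lit_pos..lit_end).ok_or("literals exhausted")?;
    out.extend_from_slice(lits);
    *lit_pos = lit_end;
    if seq.match_len == 0 {
        return Ok(());
    }
    if seq.offset == 0 || seq.offset > out.len() {
        return Err("offset out of range");
    }
    let start = out.len() - seq.offset; // cursor - offset
    for i in 0..seq.match_len {
        // Byte-wise copy so overlapping matches (offset < match_len) repeat correctly.
        let byte = out[start + i];
        out.push(byte);
    }
    Ok(())
}

/// Prefetch each match source up to PIPELINE_DEPTH sequences before executing it.
fn execute_pipelined(
    seqs: &[Sequence],
    literals: &[u8],
    out: &mut Vec<u8>,
) -> Result<(), &'static str> {
    // Reserve up front so the buffer does not move after a prefetch was issued.
    let total: usize = seqs.iter().map(|s| s.literal_len + s.match_len).sum();
    out.reserve(total);

    let mut pending: VecDeque<Sequence> = VecDeque::with_capacity(PIPELINE_DEPTH + 1);
    let mut lit_pos = 0usize;
    // Output length once every pending sequence has been executed.
    let mut predicted_end = out.len();

    for &seq in seqs {
        // Front of the pipeline: work out where this match will read from.
        let match_pos = predicted_end + seq.literal_len;
        if let Some(src) = match_pos.checked_sub(seq.offset) {
            // Only bytes that are already written can be prefetched.
            if src < out.len() {
                prefetch_read(out[src..].as_ptr());
            }
        }
        predicted_end = match_pos + seq.match_len;
        pending.push_back(seq);

        // Back of the pipeline: execute the oldest sequence once the queue is full.
        if pending.len() > PIPELINE_DEPTH {
            if let Some(oldest) = pending.pop_front() {
                execute_one(oldest, literals, &mut lit_pos, out)?;
            }
        }
    }
    // Drain the sequences still in flight.
    while let Some(seq) = pending.pop_front() {
        execute_one(seq, literals, &mut lit_pos, out)?;
    }
    Ok(())
}
```

The prefetch address depends only on lengths and offsets, so it can be computed before the earlier sequences have executed. A real port would fold sequence decoding into the loop and handle the decode buffer's own wrap-around, which this sketch ignores.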

## Acceptance criteria

- Pipeline triggered when match offset > prefetch threshold (~32 KiB, or
  some proxy for whether the source line is still resident in L2)
- No regression on small-offset matches (short-offset path stays
  prefetch-free since data is already warm)
- Bench on at least one large-window corpus (`decodecorpus-large-window`
  or equivalent) shows measurable improvement
- All existing decode tests pass; no ratio regression
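
The first two criteria amount to a per-match gate. A minimal sketch of such a gate, assuming a fixed 32 KiB threshold; `PREFETCH_THRESHOLD`, `MatchPath`, and `classify` are hypothetical names, and the actual value would need to come from benchmarking:

```rust
/// Assumed threshold taken from the ~32 KiB figure above; tune by benchmark.
const PREFETCH_THRESHOLD: usize = 32 * 1024;

/// Which execution path a match should take (illustrative only).
#[derive(Debug, PartialEq, Eq)]
enum MatchPath {
    /// Source is likely still warm; copy directly with no prefetch.
    Short,
    /// Source is likely cold; route through the prefetch pipeline.
    LongPrefetch,
}

/// Pick a path from the match offset alone.
fn classify(offset: usize) -> MatchPath {
    if offset > PREFETCH_THRESHOLD {
        MatchPath::LongPrefetch
    } else {
        MatchPath::Short
    }
}
```

Keeping the gate a single comparison is one way to keep the short-offset path cheap, which is what the no-regression criterion asks for.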

## Files involved

- zstd/src/decoding/sequence_execution.rs (or wherever execute_sequences
  lives post-#133 interleave)
- zstd/src/decoding/decode_buffer.rs

## References

- Donor: long-distance match prefetch in `ZSTD_decompressSequencesLong`
