Context
execute_sequences reads from history at cursor - offset for each match.
For long-distance matches (offset > L2/L3 size), the source line is cold
in cache and the read stalls the pipeline.
Donor ZSTD_decompressSequencesLong (huf_decompress.c-adjacent path) uses
a 4-stage software pipeline: decode N+3 ahead of execute N, prefetching the
match source line 3-4 iterations in advance.
Goal
Port the 4-stage prefetch pipeline:
- Decode sequence N+3 → emit prefetch for source line of match N+3
- Decode sequence N+2 → continue prefetching
- Decode sequence N+1 → source line for N now warm, execute match N
- Steady state: each iteration decodes and prefetches one new sequence (N+3) and executes one older sequence (N) whose source line was prefetched three iterations earlier
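The stages above can be sketched in C as a small ring buffer of decoded sequences. This is a minimal illustration under stated assumptions, not the donor's implementation. The `Seq` layout, the function names, and the single flat output buffer that doubles as history are all hypothetical. Sequences arrive pre-decoded in an array (standing in for the entropy decoder), and `__builtin_prefetch` assumes GCC or Clang. The match address of a sequence that has not executed yet is computed from a predicted cursor, which works because all lengths are known at decode time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded sequence: literals, then a match copied from history. */
typedef struct {
    size_t lit_len;   /* literal bytes copied before the match */
    size_t match_len; /* match bytes copied from history */
    size_t offset;    /* distance back from the match start */
} Seq;

enum { LOOKAHEAD = 3, RING = 4 }; /* decode N+3 while executing N */

/* Prefetch the source line of a sequence that has not executed yet.
 * `predicted` is where the cursor will be once all earlier sequences ran.
 * Assumes valid input: offset <= predicted + lit_len. */
static size_t prefetch_match(const uint8_t *dst, size_t predicted, Seq s)
{
    size_t match_start = predicted + s.lit_len;
    __builtin_prefetch(dst + match_start - s.offset); /* GCC/Clang builtin */
    return match_start + s.match_len;
}

/* Execute one sequence at the real cursor and return the new cursor. */
static size_t execute_one(uint8_t *dst, size_t cursor,
                          const uint8_t **lits, Seq s)
{
    memcpy(dst + cursor, *lits, s.lit_len);
    *lits += s.lit_len;
    cursor += s.lit_len;
    for (size_t i = 0; i < s.match_len; i++) /* byte-wise copy is overlap-safe */
        dst[cursor + i] = dst[cursor + i - s.offset];
    return cursor + s.match_len;
}

/* Pipelined driver: returns the number of bytes written to dst. */
size_t execute_sequences_pipelined(uint8_t *dst, const uint8_t *lits,
                                   const Seq *seqs, size_t n)
{
    Seq ring[RING];
    size_t cursor = 0;    /* actual write position */
    size_t predicted = 0; /* write position after every decoded sequence */

    for (size_t i = 0; i < n + LOOKAHEAD; i++) {
        if (i < n) { /* decode stage: sequence i, prefetch its source */
            ring[i % RING] = seqs[i];
            predicted = prefetch_match(dst, predicted, seqs[i]);
        }
        if (i >= LOOKAHEAD) /* execute stage: sequence i - LOOKAHEAD */
            cursor = execute_one(dst, cursor, &lits,
                                 ring[(i - LOOKAHEAD) % RING]);
    }
    return cursor;
}
```

The first `LOOKAHEAD` iterations only decode and prefetch (pipeline fill), and the last `LOOKAHEAD` iterations only execute (drain). A ring of four slots is enough because at most sequences N through N+3 are in flight at once.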
Acceptance criteria
- Pipeline triggered when match offset > prefetch threshold (~32 KiB or
L2 line residency proxy)
- No regression on small-offset matches (short-offset path stays
prefetch-free since data is already warm)
- Bench on at least one large-window corpus (decodecorpus-large-window or equivalent) shows measurable improvement
- All existing decode tests pass; no ratio regression
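The first two criteria amount to gating the prefetch on match distance. A minimal sketch in C, assuming a fixed 32 KiB threshold taken from the estimate above and assuming GCC or Clang for `__builtin_prefetch`. The names `PREFETCH_THRESHOLD` and `maybe_prefetch_match` are hypothetical, and the right threshold would have to come from benchmarking rather than from this constant.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold: ~32 KiB, per the acceptance criteria. Tune by benchmark. */
#define PREFETCH_THRESHOLD ((size_t)32 * 1024)

/* Issue a prefetch only for long-distance matches.
 * Returns 1 if a prefetch was issued, 0 if the short-offset path was taken.
 * Assumes offset <= match_start, i.e. the match source lies inside history. */
static int maybe_prefetch_match(const uint8_t *history, size_t match_start,
                                size_t offset)
{
    if (offset <= PREFETCH_THRESHOLD)
        return 0; /* source is likely still warm: stay prefetch-free */
    __builtin_prefetch(history + match_start - offset); /* GCC/Clang builtin */
    return 1;
}
```

Keeping the check to a single comparison means the short-offset path pays one predictable branch and nothing else.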
References
- Donor: long-distance match prefetch in ZSTD_decompressSequencesLong
Files involved
- […] lives post-#133 interleave)