perf(decode): match prefetch 4-stage pipeline for long-distance matches #208

@polaz

Context

execute_sequences reads from history at cursor - offset for each match.
For long-distance matches (offset larger than the L2/L3 cache), the source
line is cold in cache and the load stalls the pipeline.

Donor ZSTD_decompressSequencesLong (in zstd_decompress_block.c) uses
a 4-stage software pipeline: it decodes sequence N+3 while executing
sequence N, prefetching each match source line 3-4 iterations before it is read.

Goal

Port the 4-stage prefetch pipeline:

  1. Decode sequence N+3 → emit prefetch for source line of match N+3
  2. Decode sequence N+2 → continue prefetching
  3. Decode sequence N+1 → source line for N now warm, execute match N
  4. Steady state: every iteration prefetches one ahead, executes one behind
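
The stages above can be sketched roughly as follows. This is a minimal illustration, not the donor implementation or this crate's actual code: `Seq`, `ADVANCE`, `prefetch`, `execute`, and `run_pipelined` are hypothetical names, the "decode" stage is stood in for by an iterator of already-decoded sequences, sequences are assumed valid (offset never exceeds the bytes produced so far), and the prefetch is issued unconditionally for brevity.

```rust
/// Hypothetical decoded sequence; field names are assumptions for this sketch.
#[derive(Clone, Copy, Default)]
struct Seq {
    lit_len: usize,
    match_len: usize,
    offset: usize,
}

/// How many sequences decode runs ahead of execute (assumed value).
const ADVANCE: usize = 3;

/// Hint that `buf[idx]` will be read soon. Does nothing off x86_64
/// or when `idx` is not yet inside the buffer.
fn prefetch(buf: &[u8], idx: usize) {
    #[cfg(target_arch = "x86_64")]
    {
        use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        if idx < buf.len() {
            // SAFETY: idx is in bounds; a prefetch only hints the cache.
            unsafe { _mm_prefetch::<_MM_HINT_T0>(buf.as_ptr().add(idx) as *const i8) };
        }
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = (buf, idx);
}

/// Copy the literals, then the match, for one sequence.
fn execute(out: &mut Vec<u8>, lits: &[u8], lit_pos: &mut usize, s: Seq) {
    out.extend_from_slice(&lits[*lit_pos..*lit_pos + s.lit_len]);
    *lit_pos += s.lit_len;
    let start = out.len() - s.offset;
    for i in 0..s.match_len {
        // Byte-wise copy so overlapping matches (offset < match_len) work.
        let b = out[start + i];
        out.push(b);
    }
}

/// `out` already holds the history; `seqs` stands in for the sequence decoder.
fn run_pipelined(out: &mut Vec<u8>, seqs: impl Iterator<Item = Seq>, lits: &[u8]) {
    let mut ring = [Seq::default(); ADVANCE];
    let mut lit_pos = 0;
    let mut count = 0;
    // Output position the most recently decoded sequence will write at.
    let mut prefetch_pos = out.len();
    for s in seqs {
        // Decode stage: the match source is known before the match runs.
        prefetch_pos += s.lit_len;
        if let Some(src) = prefetch_pos.checked_sub(s.offset) {
            prefetch(out, src);
        }
        prefetch_pos += s.match_len;
        // Execute stage: run the sequence decoded ADVANCE iterations ago.
        if count >= ADVANCE {
            execute(out, lits, &mut lit_pos, ring[count % ADVANCE]);
        }
        ring[count % ADVANCE] = s;
        count += 1;
    }
    // Drain: execute the sequences still waiting in the ring, in order.
    for i in count.saturating_sub(ADVANCE)..count {
        execute(out, lits, &mut lit_pos, ring[i % ADVANCE]);
    }
}
```

The ring buffer holds exactly `ADVANCE` decoded sequences, so the output is byte-for-byte what a non-pipelined loop would produce; only the timing of the source reads changes.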

Acceptance criteria

  • Pipeline triggered when match offset > prefetch threshold (~32 KiB or
    L2 line residency proxy)
  • No regression on small-offset matches (short-offset path stays
    prefetch-free since data is already warm)
  • Bench on at least one large-window corpus (decodecorpus-large-window
    or equivalent) shows measurable improvement
  • All existing decode tests pass; no ratio regression
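
One way to read the first two criteria is as a gate in front of the pipeline. The sketch below is an assumption about how that gate might look, not a description of the donor's heuristic: `PREFETCH_THRESHOLD`, `DecodePath`, and `choose_path` are hypothetical names, the 32 KiB value is taken from the criterion above, and deciding once per block (rather than per match) is a design choice made here for illustration.

```rust
/// Assumed threshold from the acceptance criteria (~32 KiB).
const PREFETCH_THRESHOLD: usize = 32 * 1024;

/// Which executor a block of sequences is routed to (hypothetical).
#[derive(Debug, PartialEq)]
enum DecodePath {
    /// Existing prefetch-free loop; match sources are expected to be warm.
    Plain,
    /// Pipelined loop that prefetches match sources ahead of execution.
    Prefetch,
}

/// Pick the prefetch path only if some match offset exceeds the threshold.
fn choose_path(offsets: &[usize]) -> DecodePath {
    if offsets.iter().any(|&o| o > PREFETCH_THRESHOLD) {
        DecodePath::Prefetch
    } else {
        DecodePath::Plain
    }
}
```

Keeping the decision outside the hot loop means the short-offset path pays nothing beyond one scan of the offsets, which is what the no-regression criterion asks for.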

Files involved

References

  • Donor: long-distance match prefetch in ZSTD_decompressSequencesLong

Metadata

Labels

  • P2-medium (medium priority, important improvement)
  • enhancement (new feature or request)
  • performance (performance optimization)
