Skip to content

ratio(encoder): fast L1 emits 9.66× more bytes than donor on large-log-stream #186

@polaz

Description

@polaz

Problem

On the large-log-stream fixture (100 MiB of repeating 4-line log
cycle, ~380 byte period), level_1_fast produces 94 175 bytes of
compressed output. Donor FFI at the same level produces 9 752 bytes.
We are 9.66× larger than donor on this fixture.

REPORT (origin/main af4fddd6):

level=level_1_fast  rust_bytes=94175  ffi_bytes=9752
level=level_2_dfast rust_bytes=8947   ffi_bytes=9750   ← we are better here
level=level_4_greedy rust_bytes=8947  ffi_bytes=9748   ← we are better here

dfast (L2/L3) and greedy (L4) on the same fixture produce 8 947 bytes
(we beat donor). Fast (L1) somehow fails to find the cycle's long
matches and emits ~10× the bytes. dashboard x86_64-gnu • compress • large-log-stream • level_1_fast: 8.911 flags it as the largest
ratio-vs-FFI gap in the entire matrix.

Investigation scope

Phase 1 — narrow the failure mode (no code changes):

  • Decompress our 94 175-byte output and verify it round-trips to the
    original 100 MiB (sanity — make sure this is "bigger compressed"
    not "broken output").
  • Dump our sequence stream for L1 fast × large-log-stream and count:
    total sequence count, match-length distribution, average literal
    run, repcode hit rate.
  • Same dump for donor at L1 fast (via stand-alone FFI program).
  • Diff the two.

Phase 2 — root cause from the diff:

  • Fast strategy in our impl lives behind BackendTag::Simple. The
    per-block matcher is SimpleMatcher (see
    match_generator.rs::MatcherStorage::Simple). Reading entry point
    is simple_mut().next_sequence(...) loop at line 1219.
  • Donor lib/compress/zstd_fast.c ZSTD_compressBlock_fast is the
    reference. Key items to compare:
    • hash table fill / dual-position lookahead (ip, ip + step)
    • skip-step growth on miss (donor: step = 1 baseline, grows on
      consecutive misses)
    • repcode probe before chain hash
    • emitted-match length lower bound (mls — donor L1 default uses
      hashLog=14, mls=7 per clevels.h row 1).
  • Hypothesis short list:
    1. Our Fast matcher uses an mls/hash kernel that misses the cycle's
      periodic match because the hash collides differently on the
      mostly-ASCII log payload.
    2. Our window-log is 17 (= 128 KiB), donor's L1 default is
      windowLog=19 (= 512 KiB). 380-byte cycle fits in both, so this
      should not be it — but verify the match-distance cap doesn't
      trim valid candidates.
    3. The Simple backend may not emit repcode-aware sequences, missing
      the offset_1 = period-length rep that kicks in on the second
      cycle and dominates the rest.

Phase 3 — fix scope decided by Phase 2.

Acceptance criteria

  • L1 fast × large-log-stream ratio improves from 9.66× to ≤ 1.5× FFI
    (we may not match dfast's 0.92×, but should not be 10× worse).
  • No regression on speed for L1 fast (currently fast is already the
    speed champion; do not trade speed for ratio).
  • No ratio change on incompressible scenarios (small-1k-random,
    high-entropy-1m).

Files involved (likely)

  • zstd/src/encoding/match_generator.rsSimpleMatcher /
    BackendTag::Simple path
  • zstd/src/encoding/strategy.rs — Fast strategy definitions
  • Donor reference: lib/compress/zstd_fast.c,
    lib/compress/clevels.h

Related

Estimate

  • Phase 1 (sequence-stream diff): 4h
  • Phase 2 (root cause): 6h
  • Phase 3 (fix subtasks): TBD per findings

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1-highHigh priority — core functionalityperformancePerformance optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions