ratio(encoder): fast L1 emits 9.66× more bytes than donor on large-log-stream

## Problem

On the `large-log-stream` fixture (100 MiB of repeating 4-line log
cycle, ~380 byte period), `level_1_fast` produces 94 175 bytes of
compressed output. Donor FFI at the same level produces 9 752 bytes.
We are **9.66× larger than donor** on this fixture.

REPORT (origin/main `af4fddd6`):
```
level=level_1_fast  rust_bytes=94175  ffi_bytes=9752
level=level_2_dfast rust_bytes=8947   ffi_bytes=9750   ← we are better here
level=level_4_greedy rust_bytes=8947  ffi_bytes=9748   ← we are better here
```

dfast (L2/L3) and greedy (L4) on the same fixture produce 8 947 bytes
(we beat donor). Fast (L1) somehow fails to find the cycle's long
matches and emits ~10× the bytes. dashboard `x86_64-gnu • compress •
large-log-stream • level_1_fast: 8.911` flags it as the largest
ratio-vs-FFI gap in the entire matrix.

## Investigation scope

Phase 1 — narrow the failure mode (no code changes):
- Decompress our 94 175-byte output and verify it round-trips to the
  original 100 MiB (sanity — make sure this is "bigger compressed"
  not "broken output").
- Dump our sequence stream for L1 fast × large-log-stream and count:
  total sequence count, match-length distribution, average literal
  run, repcode hit rate.
- Same dump for donor at L1 fast (via stand-alone FFI program).
- Diff the two.

Phase 2 — root cause from the diff:
- Fast strategy in our impl lives behind `BackendTag::Simple`. The
  per-block matcher is `SimpleMatcher` (see
  `match_generator.rs::MatcherStorage::Simple`). Reading entry point
  is `simple_mut().next_sequence(...)` loop at line 1219.
- Donor `lib/compress/zstd_fast.c` `ZSTD_compressBlock_fast` is the
  reference. Key items to compare:
  - hash table fill / dual-position lookahead (`ip`, `ip + step`)
  - skip-step growth on miss (donor: `step = 1` baseline, grows on
    consecutive misses)
  - repcode probe before chain hash
  - emitted-match length lower bound (mls — donor L1 default uses
    `hashLog=14, mls=7` per `clevels.h` row 1).
- Hypothesis short list:
  1. Our Fast matcher uses an mls/hash kernel that misses the cycle's
     periodic match because the hash collides differently on the
     mostly-ASCII log payload.
  2. Our window-log is 17 (= 128 KiB), donor's L1 default is
     `windowLog=19` (= 512 KiB). 380-byte cycle fits in both, so this
     should not be it — but verify the match-distance cap doesn't
     trim valid candidates.
  3. The Simple backend may not emit repcode-aware sequences, missing
     the offset_1 = period-length rep that kicks in on the second
     cycle and dominates the rest.

Phase 3 — fix scope decided by Phase 2.

## Acceptance criteria

- L1 fast × large-log-stream ratio improves from 9.66× to ≤ 1.5× FFI
  (we may not match dfast's 0.92×, but should not be 10× worse).
- No regression on speed for L1 fast (currently fast is already the
  speed champion; do not trade speed for ratio).
- No ratio change on incompressible scenarios (`small-1k-random`,
  `high-entropy-1m`).

## Files involved (likely)

- `zstd/src/encoding/match_generator.rs` — `SimpleMatcher` /
  `BackendTag::Simple` path
- `zstd/src/encoding/strategy.rs` — Fast strategy definitions
- Donor reference: `lib/compress/zstd_fast.c`,
  `lib/compress/clevels.h`

## Related

- #184 (lazy investigation) — closely related but different code path
- #182 G4+G5 — surfaced this gap during ratio matrix scan

## Estimate

- Phase 1 (sequence-stream diff): 4h
- Phase 2 (root cause): 6h
- Phase 3 (fix subtasks): TBD per findings


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ratio(encoder): fast L1 emits 9.66× more bytes than donor on large-log-stream #186

Problem

Investigation scope

Acceptance criteria

Files involved (likely)

Related

Estimate

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ratio(encoder): fast L1 emits 9.66× more bytes than donor on large-log-stream #186

Description

Problem

Investigation scope

Acceptance criteria

Files involved (likely)

Related

Estimate

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions