Problem
On the large-log-stream fixture (100 MiB of a repeating 4-line log
cycle, ~380-byte period), level_1_fast produces 94 175 bytes of
compressed output. Donor FFI at the same level produces 9 752 bytes.
Our output is 9.66× the size of donor's on this fixture.
REPORT (origin/main af4fddd6):
level=level_1_fast rust_bytes=94175 ffi_bytes=9752
level=level_2_dfast rust_bytes=8947 ffi_bytes=9750 ← we are better here
level=level_4_greedy rust_bytes=8947 ffi_bytes=9748 ← we are better here
dfast (L2/L3) and greedy (L4) on the same fixture produce 8 947 bytes
(we beat donor). Fast (L1) fails to find the cycle's long matches and
emits ~10× the bytes. The dashboard entry
x86_64-gnu • compress • large-log-stream • level_1_fast: 8.911
flags it as the largest ratio-vs-FFI gap in the entire matrix.
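To make the ratios above reproducible from the REPORT lines, here is a minimal sketch that parses one line and computes our size relative to donor's. The names ReportRow and parse_report_line are hypothetical helpers for illustration and are not part of the crate or the dashboard tooling.

```rust
/// One row of the REPORT block: level name plus both output sizes.
/// Hypothetical helper type, not part of the crate.
#[derive(Debug, PartialEq)]
struct ReportRow {
    level: String,
    rust_bytes: u64,
    ffi_bytes: u64,
}

impl ReportRow {
    /// Our output size relative to donor's (1.0 = parity, > 1.0 = we are larger).
    fn ratio_vs_ffi(&self) -> f64 {
        self.rust_bytes as f64 / self.ffi_bytes as f64
    }
}

/// Parse a line such as `level=level_1_fast rust_bytes=94175 ffi_bytes=9752`.
/// Tokens without `=` (for example trailing annotations) are ignored.
fn parse_report_line(line: &str) -> Option<ReportRow> {
    let mut level: Option<String> = None;
    let mut rust_bytes: Option<u64> = None;
    let mut ffi_bytes: Option<u64> = None;
    for token in line.split_whitespace() {
        if let Some((key, value)) = token.split_once('=') {
            match key {
                "level" => level = Some(value.to_string()),
                "rust_bytes" => rust_bytes = value.parse().ok(),
                "ffi_bytes" => ffi_bytes = value.parse().ok(),
                _ => {}
            }
        }
    }
    Some(ReportRow {
        level: level?,
        rust_bytes: rust_bytes?,
        ffi_bytes: ffi_bytes?,
    })
}
```

Feeding it the three REPORT lines gives a ratio of roughly 9.66 for level_1_fast and roughly 0.92 for the dfast and greedy rows.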
Investigation scope
Phase 1 — narrow the failure mode (no code changes):
- Decompress our 94 175-byte output and verify it round-trips to the
original 100 MiB (sanity — make sure this is "bigger compressed"
not "broken output").
- Dump our sequence stream for L1 fast × large-log-stream and count:
total sequence count, match-length distribution, average literal
run, repcode hit rate.
- Same dump for donor at L1 fast (via stand-alone FFI program).
- Diff the two.
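The Phase 1 counters can be sketched as a small summary over a dumped sequence list. This assumes the dump can be flattened into a record with a literal length, a match length, and a repcode flag; the Seq and SeqStats types and the summarize function are hypothetical, and the power-of-two bucketing of match lengths is only one possible way to present the distribution.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened view of one emitted sequence.
#[derive(Debug, Clone, Copy)]
struct Seq {
    lit_len: u32,
    match_len: u32,
    is_repcode: bool,
}

/// Summary counters for one sequence dump.
#[derive(Debug, Default)]
struct SeqStats {
    count: usize,
    /// Match-length histogram keyed by the largest power of two <= length.
    match_len_hist: BTreeMap<u32, usize>,
    avg_lit_run: f64,
    repcode_hit_rate: f64,
}

/// Largest power of two that is <= `len` (0 stays 0).
fn bucket(len: u32) -> u32 {
    if len == 0 {
        0
    } else {
        1 << (31 - len.leading_zeros())
    }
}

fn summarize(seqs: &[Seq]) -> SeqStats {
    let mut stats = SeqStats {
        count: seqs.len(),
        ..SeqStats::default()
    };
    if seqs.is_empty() {
        return stats; // avoid dividing by zero below
    }
    let mut lit_total = 0u64;
    let mut rep_hits = 0usize;
    for s in seqs {
        *stats.match_len_hist.entry(bucket(s.match_len)).or_insert(0) += 1;
        lit_total += u64::from(s.lit_len);
        rep_hits += usize::from(s.is_repcode);
    }
    stats.avg_lit_run = lit_total as f64 / seqs.len() as f64;
    stats.repcode_hit_rate = rep_hits as f64 / seqs.len() as f64;
    stats
}
```

Running summarize on both dumps and comparing the two SeqStats values side by side is one way to do the diff step: a large gap in count or repcode_hit_rate would point at the matcher rather than at entropy coding.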
Phase 2 — root cause from the diff:
- Fast strategy in our impl lives behind BackendTag::Simple. The
per-block matcher is SimpleMatcher (see
match_generator.rs::MatcherStorage::Simple). The reading entry point
is the simple_mut().next_sequence(...) loop at line 1219.
- Donor lib/compress/zstd_fast.c ZSTD_compressBlock_fast is the
reference. Key items to compare:
  - hash table fill / dual-position lookahead (ip, ip + step)
  - skip-step growth on miss (donor: step = 1 baseline, grows on
    consecutive misses)
  - repcode probe before the hash-table lookup
  - emitted-match length lower bound (mls; donor L1 default uses
    hashLog=14, mls=7 per clevels.h row 1).
- Hypothesis short list:
  - Our Fast matcher uses an mls/hash kernel that misses the cycle's
    periodic match because the hash collides differently on the
    mostly-ASCII log payload.
  - Our window-log is 17 (= 128 KiB); donor's L1 default is
    windowLog=19 (= 512 KiB). The 380-byte cycle fits in both, so this
    should not be it, but verify the match-distance cap doesn't
    trim valid candidates.
  - The Simple backend may not emit repcode-aware sequences, missing
    the offset_1 = period-length rep that kicks in on the second
    cycle and dominates the rest.
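As a reference point for the comparison items and hypotheses above, here is a toy fast matcher that combines an mls-byte hash, a repcode probe before the hash-table lookup, and a step that grows with the unmatched run. It is a simplified sketch under stated assumptions: it is neither donor's ZSTD_compressBlock_fast nor our SimpleMatcher. The names Seq, hash_mls, match_len, and toy_fast are hypothetical, the hash multiplier is an arbitrary odd constant, the shift of 8 in the step formula is an assumed tuning value, and max_match is an artificial cap standing in for block boundaries.

```rust
/// Hypothetical sequence record for this sketch (not the crate's real type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Seq {
    lit_len: usize,
    match_len: usize,
    offset: usize,
    from_rep: bool,
}

const MLS: usize = 7; // minimum match length quoted for donor L1
const HASH_LOG: u32 = 14; // table has 2^14 slots, as quoted for donor L1
const EMPTY: usize = usize::MAX;

/// Hash the first MLS bytes at `pos`. Requires `pos + 8 <= data.len()`.
/// The multiplier is an arbitrary odd constant, not donor's.
fn hash_mls(data: &[u8], pos: usize) -> usize {
    let mut b = [0u8; 8];
    b.copy_from_slice(&data[pos..pos + 8]);
    // Little-endian load, then shift out the bytes beyond MLS.
    let keep = u64::from_le_bytes(b) << (64 - 8 * MLS as u32);
    (keep.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> (64 - HASH_LOG)) as usize
}

/// Count equal bytes between `pos` and `pos - offset`, up to `cap`.
/// Overlapping matches (offset < length) are handled naturally.
fn match_len(data: &[u8], pos: usize, offset: usize, cap: usize) -> usize {
    data[pos..]
        .iter()
        .zip(&data[pos - offset..])
        .take(cap)
        .take_while(|(a, b)| a == b)
        .count()
}

fn toy_fast(data: &[u8], max_match: usize) -> Vec<Seq> {
    assert!(max_match >= MLS, "cap must allow at least one minimum match");
    let mut table = vec![EMPTY; 1 << HASH_LOG];
    let mut seqs = Vec::new();
    let (mut ip, mut anchor, mut rep) = (0usize, 0usize, 0usize);
    while ip + 8 <= data.len() {
        let h = hash_mls(data, ip);
        let cand = table[h];
        table[h] = ip;
        // Repcode probe first, hash-table candidate second.
        let mut found = None;
        if rep != 0 && match_len(data, ip, rep, MLS) == MLS {
            found = Some((rep, true));
        } else if cand != EMPTY && match_len(data, ip, ip - cand, MLS) == MLS {
            found = Some((ip - cand, false));
        }
        match found {
            Some((offset, from_rep)) => {
                let len = match_len(data, ip, offset, max_match);
                seqs.push(Seq { lit_len: ip - anchor, match_len: len, offset, from_rep });
                ip += len;
                anchor = ip;
                rep = offset;
            }
            // Miss: step is 1 at first and grows with the unmatched run.
            None => ip += 1 + ((ip - anchor) >> 8),
        }
    }
    seqs
}
```

On strictly periodic input, every sequence after the first should come from the repcode probe with zero literals, which is the behaviour the third hypothesis suspects the Simple backend of missing. Removing the repcode branch or raising the step growth in this toy is a cheap way to see how each item changes the sequence count before touching the real matcher.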
Phase 3 — fix scope decided by Phase 2.
Acceptance criteria
- L1 fast × large-log-stream ratio improves from 9.66× to ≤ 1.5× FFI
(we may not match dfast's 0.92×, but should not be 10× worse).
- No regression on speed for L1 fast (currently fast is already the
speed champion; do not trade speed for ratio).
- No ratio change on incompressible scenarios (small-1k-random,
high-entropy-1m).
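The ratio criterion can be turned into a byte budget. The sketch below assumes the target is expressed in percent so the check stays in integer arithmetic; max_allowed_bytes and meets_ratio_target are hypothetical helper names, not existing benchmark code.

```rust
/// Largest output size in bytes that still satisfies
/// `ours <= (max_ratio_percent / 100) * ffi`, rounded down.
fn max_allowed_bytes(ffi_bytes: u64, max_ratio_percent: u64) -> u64 {
    ffi_bytes * max_ratio_percent / 100
}

/// True when our output is within the allowed ratio of donor's output.
fn meets_ratio_target(rust_bytes: u64, ffi_bytes: u64, max_ratio_percent: u64) -> bool {
    rust_bytes * 100 <= ffi_bytes * max_ratio_percent
}
```

With donor at 9 752 bytes and a 1.5× limit, max_allowed_bytes(9752, 150) gives a budget of 14 628 bytes for L1 fast on this fixture, against the current 94 175.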
Files involved (likely)
- zstd/src/encoding/match_generator.rs: SimpleMatcher /
BackendTag::Simple path
- zstd/src/encoding/strategy.rs: Fast strategy definitions
- Donor reference: lib/compress/zstd_fast.c, lib/compress/clevels.h
Related
Estimate
- Phase 1 (sequence-stream diff): 4h
- Phase 2 (root cause): 6h
- Phase 3 (fix subtasks): TBD per findings