
feat(encoding): add Best compression level (zstd level 11, btlazy2 strategy)#53

Merged
polaz merged 12 commits into main from feat/#7-feat-implement-best-compression-level-zstd-level-1
Apr 2, 2026

Conversation

@polaz polaz commented Apr 2, 2026

Summary

  • Implement CompressionLevel::Best — roughly equivalent to zstd level 11
  • Uses enhanced hash-chain matcher with deep lazy2 matching: 16 MiB window, 2M/1M hash/chain tables, 32-candidate search depth, 128 target match length
  • HC matcher parameters (hash_log, chain_log, search_depth, target_len) are now configurable per level via configure() method
  • Full streaming encoder and frame compressor support
  • CLI level 4 maps to Best

Compression ratio vs C zstd

| Dataset | Input | Better | Best | C-L7 | C-L11 | Best/C11 | Better/C7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| corpus z000033 | 1.0M | 541,592 | 540,153 | 510,030 | 509,317 | 106.1% | 106.2% |
| huffman 4M | 4.0M | 2,957,814 | 2,959,960 | 2,949,731 | 2,949,773 | 100.3% | 100.3% |
| log lines 4M | 4.0M | 582 | 580 | 526 | 866 | 67.0% | 110.6% |
| log lines 16M | 16M | 1,830 | 1,828 | 1,684 | 3,077 | 59.4% | 108.7% |
  • Best/C-L11 gap (6.1%) matches Better/C-L7 gap (6.2%) — systematic encoder pipeline gap
  • On structured log data, Best beats C-L11 by 1.5–1.7× (C btlazy2 degrades on repetitive data)
  • Improvement Better→Best (0.27% on corpus) exceeds C-L7→C-L11 improvement (0.14%)
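The percentage columns above are plain size ratios (Rust output size over C output size, times 100). As a sanity check, the corpus row works out as follows; this is just arithmetic on the numbers already in the table:

```rust
fn main() {
    // Corpus z000033 row: Best (Rust) vs C zstd level 11.
    let best_bytes = 540_153.0_f64;
    let c_l11_bytes = 509_317.0_f64;
    let ratio_pct = 100.0 * best_bytes / c_l11_bytes;
    // Matches the 106.1% shown in the table (within rounding).
    assert!((ratio_pct - 106.1).abs() < 0.1);
    println!("Best/C-L11 = {ratio_pct:.1}%");
}
```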

Test plan

  • 11 roundtrip integrity tests (compressible, random, multi-block, streaming, edge cases, repeat offsets, large window)
  • Cross-validation: Rust Best → C FFI decompress on corpus
  • Ratio assertion: Best < Better (strict) on corpus proxy
  • Large-window regression: 9 MiB gap between duplicated regions (Best beats Better)
  • Streaming encoder roundtrip at Best level
  • CLI level 4 parsing and compression
  • All 204 tests pass, zero warnings

Closes #7

Summary by CodeRabbit

  • New Features

    • "Best" compression level is now supported in the CLI and library for frame and streaming compression (accessible via the CLI level option), offering higher compression with a larger window and deeper matching (≈ zstd level 11).
  • Documentation

    • README and benchmark docs updated to mark Best implemented and to describe its behavior and limits.
  • Tests

    • Added/expanded tests for Best roundtrips, streaming correctness, cross-language decompression, and comparative compression quality.

polaz added 2 commits April 2, 2026 18:30
…rategy)

- Implement Best level using hash-chain matcher with deep lazy2 matching:
  16 MiB window, 2M/1M hash/chain tables, 32-candidate search depth,
  128 target match length
- Make HC matcher parameters (hash_log, chain_log, search_depth,
  target_len) configurable per level via new configure() method
- Add Best to streaming encoder, frame compressor, and CLI (level 4)
- Add 11 roundtrip integrity tests covering compressible, random,
  multi-block, streaming, edge cases, repeat offsets, and large window
- Add cross-validation: Rust Best → C FFI decompress, ratio assertion
  (Best beats Better on corpus proxy)
- Add Best to benchmark level matrix (vs C zstd level 11)
- Update README checklist and BENCHMARKS level mapping

Closes #7
All CompressionLevel variants are now covered exhaustively after
adding Best; the wildcard arm was dead code producing a warning.
Copilot AI review requested due to automatic review settings April 2, 2026 15:52

coderabbitai bot commented Apr 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 916aa631-026a-4eff-afc4-6fbf08a5240f

📥 Commits

Reviewing files that changed from the base of the PR and between 18bf7cc and 0f44b8c.

📒 Files selected for processing (1)
  • zstd/src/tests/roundtrip_integrity.rs

📝 Walkthrough

Walkthrough

Implements CompressionLevel::Best end-to-end: CLI mapping and bench entry, encoder dispatch and streaming support, hash‑chain match‑generator tunables/configuration for Best, and added unit and cross‑validation tests exercising Best-level compression.

Changes

Cohort / File(s) Summary
Documentation
BENCHMARKS.md, README.md, zstd/src/encoding/mod.rs
Marked Best implemented in docs/benchmarks and README; expanded CompressionLevel::Best doc comment describing hash‑chain + lazy2 behavior, larger tables, window/target_len tuning, and large-frame limitations.
CLI
cli/src/main.rs
Extended --level allowed range from 0..=3 to 0..=4, mapped --level 4 → CompressionLevel::Best, and updated parse-time test expectation.
Bench harness
zstd/benches/support/mod.rs
Expanded supported_levels() array from 3 → 4 entries by adding best mapped to ffi_level = 11 and rust_level = Best.
Encoder dispatch
zstd/src/encoding/frame_compressor.rs
Routed CompressionLevel::Best into existing compressed-block pipeline (removed prior unimplemented!() fallback).
Match generator (HC)
zstd/src/encoding/match_generator.rs
Added per-instance tunables (hash_log, chain_log, search_depth, target_len) + configure(); Best selects HashChain backend with BEST_* params; table sizing and candidate search depth now driven by runtime fields; added cleanup when switching backends; tests for resizing/cleanup.
Streaming encoder
zstd/src/encoding/streaming_encoder.rs
Enabled matcher allocation/encoding for Best, updated ensure_level_supported() and block encode path to handle Best; replaced unsupported-level error test with a Best streaming roundtrip test.
Tests (roundtrip & cross‑validation)
zstd/src/tests/roundtrip_integrity.rs, zstd/tests/cross_validation.rs
Added Best roundtrip helpers and test-suite macro for Better+Best; new Best-focused tests validating size, long-window matching, streaming roundtrip, and cross-language corpus assertions.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI/User
    participant FC as FrameCompressor
    participant MG as MatchGenerator (HashChain)
    participant Enc as Encoder

    CLI->>FC: compress(data, level=Best)
    activate FC
    FC->>MG: configure(level=Best)
    activate MG
    MG->>MG: set hash_log/chain_log/search_depth/target_len
    MG-->>FC: configured matcher
    deactivate MG

    loop per block
        FC->>MG: find_matches(block)
        activate MG
        MG->>MG: traverse hash_chain (up to search_depth)
        MG-->>FC: matches (lazy2 selection)
        deactivate MG
        FC->>Enc: encode block with matches
    end

    Enc-->>CLI: compressed frame
    deactivate FC

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🐰 I hop through chains of hashed delight,
Lazy2 dreams search deep into the night,
Bigger windows, deeper chase,
Rabbity matches find their place,
Best is baked — small bytes, warm and tight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: implementation of Best compression level matching zstd level 11 with btlazy2 strategy, directly aligned with the file changes and objectives. |
| Linked Issues check | ✅ Passed | The PR fully implements the coding requirements from issue #7: CompressionLevel::Best no longer panics, roundtrip/interop tests pass, compression ratio is within 10% of C zstd level 11, and Best compresses better than Better across tested datasets. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing CompressionLevel::Best: encoder/matcher updates, CLI integration, documentation, and comprehensive testing—no out-of-scope modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Copilot AI left a comment

Pull request overview

Adds a new highest compression preset (CompressionLevel::Best) to the pure-Rust encoder pipeline, extending the existing hash-chain + lazy matching infrastructure to use larger tables, deeper search, and a larger window, and wires it through streaming, CLI, tests, and benchmarks.

Changes:

  • Implement CompressionLevel::Best using the hash-chain backend with level-specific HC parameters (16 MiB window, larger tables, deeper search, higher target length).
  • Extend streaming encoder + frame compressor + CLI to accept/use Best.
  • Add regression/interop tests and update benchmark/docs to include Best.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
zstd/src/encoding/match_generator.rs Adds Best-level HC configuration knobs and routes CompressionLevel::Best to the HC backend with a 16 MiB window.
zstd/src/encoding/mod.rs Documents CompressionLevel::Best behavior and limitations.
zstd/src/encoding/streaming_encoder.rs Enables Best for streaming encoding and adds a streaming roundtrip test.
zstd/src/encoding/frame_compressor.rs Routes Best through the shared block-encoding pipeline (removes unimplemented!() path).
cli/src/main.rs Adds CLI --level 4 mapping to Best and updates parsing/tests.
zstd/tests/cross_validation.rs Adds Rust Best → FFI decode regression and Best-vs-Better size assertion on corpus proxy.
zstd/src/tests/roundtrip_integrity.rs Adds multiple roundtrip tests covering Best (including large-window and streaming).
zstd/benches/support/mod.rs Adds Best to benchmark level set (maps to FFI level 11).
BENCHMARKS.md Updates benchmark matrix to include Best.
README.md Updates feature checklist/docs to mark Best as implemented.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 280-287: The public CompressionLevel::Best is currently routed to
MatcherBackend::HashChain; update the mapping in
zstd/src/encoding/match_generator.rs (the function that maps CompressionLevel ->
MatcherBackend) so that CompressionLevel::Best selects the DUBT/btlazy2 backend
(e.g., MatcherBackend::Btlazy2 or the Dubt variant used in the repo) with the
intended tuned parameters instead of MatcherBackend::HashChain, and adjust any
associated parameterization or tests that assume HashChain for Best.

In `@zstd/src/encoding/match_generator.rs`:
- Around line 1426-1438: The configure method can accept a search_depth larger
than MAX_HC_SEARCH_DEPTH which causes chain_candidates to write past its fixed
[usize; MAX_HC_SEARCH_DEPTH] buffer; clamp/bound the value assigned to
self.search_depth inside configure (and the other configure-like block around
chain_candidates usage) by setting self.search_depth =
search_depth.min(MAX_HC_SEARCH_DEPTH) (or equivalent), ensuring any code paths
in chain_candidates and related functions use a capped search depth that cannot
exceed MAX_HC_SEARCH_DEPTH.
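A minimal sketch of the clamp this comment asks for. The constant's value and the struct shape here are assumptions for illustration; the point is only that configure() bounds the depth before it can reach the fixed-size buffer:

```rust
// Assumed value; in the real code this bounds a fixed
// [usize; MAX_HC_SEARCH_DEPTH] candidate buffer in chain_candidates.
const MAX_HC_SEARCH_DEPTH: usize = 32;

struct Hc {
    search_depth: usize,
}

impl Hc {
    fn configure(&mut self, requested_depth: usize) {
        // Cap the depth so later candidate writes can never index
        // past the fixed-size array, even if a future level config
        // asks for more.
        self.search_depth = requested_depth.min(MAX_HC_SEARCH_DEPTH);
    }
}

fn main() {
    let mut hc = Hc { search_depth: 0 };
    hc.configure(1000); // over-large request from a hypothetical future level
    assert_eq!(hc.search_depth, MAX_HC_SEARCH_DEPTH);
}
```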

In `@zstd/src/encoding/streaming_encoder.rs`:
- Around line 752-763: The test best_level_streaming_roundtrip currently uses a
payload too small to trigger multi-block emission for StreamingEncoder with
CompressionLevel::Best; update the test so the encoder emits at least two blocks
by increasing the payload size or changing chunking (e.g., repeat the base
string more times or use smaller chunk sizes so total data exceeds a
single-block threshold), ensuring StreamingEncoder::new/CompressionLevel::Best
path crosses a block boundary and the roundtrip still decodes correctly.

In `@zstd/tests/cross_validation.rs`:
- Around line 281-295: The test best_level_beats_better_on_corpus_proxy
currently uses <= so it will not detect if CompressionLevel::Best accidentally
aliases CompressionLevel::Better; update the assertion to require a strict
improvement (use <) when comparing best.len() and better.len() in the test that
calls compress_to_vec with CompressionLevel::Best and CompressionLevel::Better;
if equal sizes are acceptable for this corpus instead, add a separate
fixture/test where Best is known to win (or add a ratio-check) so the regression
(Best == Better aliasing) is caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ad4f5a7b-8dfd-4eae-aaca-7240ba2d2cec

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2fe9b and 301b643.

📒 Files selected for processing (10)
  • BENCHMARKS.md
  • README.md
  • cli/src/main.rs
  • zstd/benches/support/mod.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/encoding/streaming_encoder.rs
  • zstd/src/tests/roundtrip_integrity.rs
  • zstd/tests/cross_validation.rs

polaz added 2 commits April 2, 2026 19:28
- Clamp search_depth to MAX_HC_SEARCH_DEPTH in configure() to prevent
  OOB panic if a future level config exceeds the fixed-size array
- Strict assertion in best_beats_better test (< not <=) to catch
  accidental Best/Better aliasing
- Increase best_level_streaming_roundtrip payload to 200 KiB to cross
  the 128 KiB block boundary
- Document ensure_level_supported exhaustive guard and chain_candidates
  fixed-size array design

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 97.77778% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| zstd/src/encoding/match_generator.rs | 97.43% | 2 Missing ⚠️ |


Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 543-569: Summary: The test's use of generate_compressible for
region allows Better to match within the region itself, weakening the assertion
that Best's larger 16 MiB window is responsible for the compression advantage.
Fix: in roundtrip_best_level_large_window replace the compressible region
generation (generate_compressible(...)) with a high-entropy region that has low
internal repetition (e.g., generate_random or generate_high_entropy(...) of the
same size) so the only meaningful matches come from the long-distance duplicate;
keep the rest of the test (gap generation, concatenation, calls to
compress_to_vec and CompressionLevel::Best/Better, and the final length
comparison) unchanged so the size win requires window reach rather than
intra-region matches.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: aa07cf7a-1b08-498e-81a6-1730e3a45a94

📥 Commits

Reviewing files that changed from the base of the PR and between 301b643 and 7f16fcc.

📒 Files selected for processing (6)
  • cli/src/main.rs
  • zstd/benches/support/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/streaming_encoder.rs
  • zstd/src/tests/roundtrip_integrity.rs
  • zstd/tests/cross_validation.rs

@sw-release-bot sw-release-bot bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

| Benchmark suite | Current: 0f44b8c | Previous: 8f2fe9b | Ratio |
| --- | --- | --- | --- |
| compress/better/small-1k-random/matrix/pure_rust | 0.173 ms | 0.141 ms | 1.23 |
| compress/better/small-1k-random/matrix/c_ffi | 0.108 ms | 0.086 ms | 1.26 |
| compress/default/small-4k-log-lines/matrix/c_ffi | 0.021 ms | 0.018 ms | 1.17 |
| compress/better/decodecorpus-z000033/matrix/pure_rust | 65.38 ms | 52.89 ms | 1.24 |
| compress/fastest/high-entropy-1m/matrix/c_ffi | 0.284 ms | 0.234 ms | 1.21 |
| compress/default/high-entropy-1m/matrix/c_ffi | 0.353 ms | 0.287 ms | 1.23 |
| compress/better/high-entropy-1m/matrix/pure_rust | 71.464 ms | 62.008 ms | 1.15 |
| compress/better/high-entropy-1m/matrix/c_ffi | 0.674 ms | 0.578 ms | 1.17 |
| compress/fastest/low-entropy-1m/matrix/c_ffi | 0.18 ms | 0.133 ms | 1.35 |
| compress/better/low-entropy-1m/matrix/pure_rust | 5.117 ms | 4.38 ms | 1.17 |
| compress/fastest/large-log-stream/matrix/c_ffi | 2.656 ms | 2.134 ms | 1.24 |
| compress/default/large-log-stream/matrix/pure_rust | 109.811 ms | 91.296 ms | 1.20 |
| decompress/default/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
| decompress/better/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
| compress-dict/fastest/small-10k-random/matrix/c_ffi_with_dict | 0.018 ms | 0.015 ms | 1.20 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

Use generate_data (random) instead of generate_compressible for the
duplicated region so the size win depends on window reach, not
intra-region compression.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Line 13: The import group in the use statement is out of alphabetical order;
reorder the items inside crate::encoding::{...} so identifiers starting with
uppercase come before lowercase (e.g., CompressionLevel, FrameCompressor,
compress_to_vec) to satisfy cargo fmt; update the use line that currently
references compress_to_vec, CompressionLevel, FrameCompressor accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0f8c879d-0abd-4e1e-a47c-9cf99b80d714

📥 Commits

Reviewing files that changed from the base of the PR and between 7f16fcc and 146d571.

📒 Files selected for processing (1)
  • zstd/src/tests/roundtrip_integrity.rs

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

polaz added 2 commits April 2, 2026 21:58
The stack array size depends on pointer width; "256 bytes" is only
correct on 64-bit targets.
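The commit note above is easy to verify. Assuming the candidate buffer is `[usize; 32]` (one slot per search candidate, matching Best's depth of 32; the exact array is an assumption here), its byte size tracks pointer width:

```rust
use std::mem::size_of;

fn main() {
    // One usize slot per search candidate at depth 32 (assumed shape).
    let bytes = size_of::<[usize; 32]>();
    // 256 bytes only where usize is 8 bytes (64-bit targets);
    // a 32-bit target would give 128 bytes instead.
    assert_eq!(bytes, 32 * size_of::<usize>());
    println!("candidate buffer: {bytes} bytes on this target");
}
```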
@polaz polaz requested a review from Copilot April 2, 2026 19:01

polaz commented Apr 2, 2026

@coderabbitai re-review


coderabbitai bot commented Apr 2, 2026

@polaz Sure, I'll re-review the PR now!


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/encoding/frame_compressor.rs:280-287
Timestamp: 2026-04-02T16:31:58.167Z
Learning: In `structured-zstd`, `CompressionLevel::Best` intentionally uses `MatcherBackend::HashChain` (not a binary-tree/DUBT backend) due to memory constraints: a BT matcher for a 16 MiB window requires ~128–256 MB of tables (C zstd level 11 uses chainLog=25 → 256 MB), which is impractical. Best is distinguished from Better by deeper HC search (32 vs 16 candidates), larger hash/chain tables (2M/1M vs 1M/512K), and higher target match length (128 vs 48), keeping memory under 20 MB. It achieves a proportional compression gain (~0.27% better than Better on corpus) comparable to C zstd's L7→L11 delta (~0.14%). A proper BT backend is deferred as a future optimization. Do not flag Best's use of HashChain as missing a BT implementation.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:34.233Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:40.012Z
Learning: In the structured-zstd codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` for lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These are distinct backends with separate lazy-seeding paths.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1470-1490
Timestamp: 2026-04-02T07:27:15.870Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::pick_lazy_match` intentionally queries `find_best_match(abs_pos + 1, ...)` and `find_best_match(abs_pos + 2, ...)` *before* those positions are inserted into the hash-chain tables. This matches C zstd lazy evaluation semantics: pre-seeding those positions would allow a position to match against itself, producing invalid self-matches. Do not flag the absent pre-seeding as a bug.
✅ Actions performed

Full review triggered.

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 39-47: Replace the positional usize parameters to the HC
configure(...) API with a typed HcConfig struct and constant instances to avoid
parameter-order mistakes: define a HcConfig { hash_log: usize, chain_log: usize,
search_depth: usize, target_len: usize, default_window_size: usize } and replace
the individual BEST_HC_* constants with a single BEST_HC_CONFIG constant; change
the configure(...) function signature to take HcConfig (or &HcConfig) instead of
multiple usize args and update all call sites (including other occurrences
around chain_candidates and higher-level configurations) to pass the named
HcConfig constants (e.g., BEST_HC_CONFIG) so each level’s tuning is bundled and
order-sensitive positional arguments are removed.
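A minimal sketch of the requested refactor. The struct and field names follow the comment; the numeric values are illustrative, chosen to match the PR's stated Best tuning (2M/1M tables, depth 32, target 128, 16 MiB window):

```rust
// Typed config replaces four order-sensitive positional usize args.
#[derive(Clone, Copy, Debug, PartialEq)]
struct HcConfig {
    hash_log: usize,
    chain_log: usize,
    search_depth: usize,
    target_len: usize,
    default_window_size: usize,
}

// Bundled Best preset: 2^21 = 2M hash entries, 2^20 = 1M chain entries.
const BEST_HC_CONFIG: HcConfig = HcConfig {
    hash_log: 21,
    chain_log: 20,
    search_depth: 32,
    target_len: 128,
    default_window_size: 16 * 1024 * 1024,
};

struct Hc {
    cfg: HcConfig,
}

impl Hc {
    // Callers now pass a named constant; fields can't be transposed
    // the way four bare usize arguments could.
    fn configure(&mut self, cfg: HcConfig) {
        self.cfg = cfg;
    }
}

fn main() {
    let mut hc = Hc { cfg: BEST_HC_CONFIG };
    hc.configure(BEST_HC_CONFIG);
    assert_eq!(1usize << hc.cfg.hash_log, 2 * 1024 * 1024);
    assert_eq!(hc.cfg.default_window_size, 16 * 1024 * 1024);
}
```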

In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 479-523: Multiple tests duplicate the same fixtures for "Best"
that mirror the "Better" suite; extract a shared test matrix to avoid
duplication and drift: create a helper function or macro (e.g., a function
run_roundtrip_matrix(roundtrip_fn: fn(&[u8])->Vec<u8>, streaming_roundtrip_fn:
Option<fn(&[u8])->Vec<u8>>) or a macro run_roundtrip_tests! that accepts the
pair of functions) and replace the repeated assertions that call roundtrip_best
and roundtrip_best_streaming (and their "Better" equivalents) with invocations
of that helper, moving the common data generators (generate_compressible,
generate_data, repeat_offset_fixture, generate_huffman_friendly, edge cases)
into the shared matrix so both Best and Better test modules call the same table
of cases.
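One way to realize the shared matrix this comment suggests. The macro name comes from the later commit message on this PR; the case table and the identity stand-in for a real compress-then-decompress roundtrip are illustrative:

```rust
// Stamps out the same roundtrip checks for any level-specific roundtrip
// function, so Better and Best share one table of cases.
macro_rules! level_roundtrip_suite {
    ($name:ident, $roundtrip:expr) => {
        fn $name() {
            let cases: Vec<Vec<u8>> = vec![
                vec![b'a'; 4096],                        // compressible run
                (0..255u8).cycle().take(1000).collect(), // varied bytes
                Vec::new(),                              // empty edge case
            ];
            for data in cases {
                assert_eq!($roundtrip(&data), data, "roundtrip mismatch");
            }
        }
    };
}

// Stand-in: a real suite would compress at the given level and decompress.
fn identity_roundtrip(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}

level_roundtrip_suite!(best_suite, identity_roundtrip);
level_roundtrip_suite!(better_suite, identity_roundtrip);

fn main() {
    best_suite();
    better_suite();
    println!("all suites passed");
}
```

In the real test module the macro body would hold the full seven-case matrix and each invocation would pass that level's roundtrip helper, so adding a case once covers every level.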

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0c1ecd83-7cda-450b-8035-d22139bba2c7

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2fe9b and c1b224e.

📒 Files selected for processing (10)
  • BENCHMARKS.md
  • README.md
  • cli/src/main.rs
  • zstd/benches/support/mod.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/encoding/streaming_encoder.rs
  • zstd/src/tests/roundtrip_integrity.rs
  • zstd/tests/cross_validation.rs

…te macro

- Replace positional configure(hash_log, chain_log, search_depth,
  target_len) with configure(HcConfig), eliminating parameter-order
  hazard. Add HC_CONFIG and BEST_HC_CONFIG const presets.
- Extract level_roundtrip_suite! macro that generates the standard 7-test
  roundtrip suite (compressible, random, multi-block, streaming, edge
  cases, repeat offsets, large literals) for any compression level.
  Better and Best now share the same test matrix via module-scoped
  macro invocations.
@polaz polaz requested a review from Copilot April 2, 2026 19:22
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/encoding/match_generator.rs (1)

275-283: ⚠️ Potential issue | 🟠 Major

Release oversized HC tables when switching away from Best.

MatchGeneratorDriver keeps hc_match_generator alive across frames, and the backend-switch path only calls hc.reset(). After one CompressionLevel::Best frame, the 2M/1M HC tables remain resident even if later frames use Default or Fastest, so a long-lived compressor keeps the higher Best memory floor until the matcher is dropped or re-enters HC and reallocates smaller tables.

💡 Possible fix
             MatcherBackend::HashChain => {
                 if let Some(hc) = self.hc_match_generator.as_mut() {
+                    if backend != MatcherBackend::HashChain
+                        && (hc.hash_log != HC_HASH_LOG || hc.chain_log != HC_CHAIN_LOG)
+                    {
+                        hc.hash_table = Vec::new();
+                        hc.chain_table = Vec::new();
+                    }
                     let vec_pool = &mut self.vec_pool;
                     hc.reset(|mut data| {
                         data.resize(data.capacity(), 0);
                         vec_pool.push(data);
                     });
                 }
             }

Also applies to: 317-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/match_generator.rs` around lines 275 - 283, The
hc_match_generator instance keeps large HC tables alive after a Best-frame even
when switching back to non-HashChain backends; update the backend-switch path in
MatchGeneratorDriver so that when leaving MatcherBackend::HashChain you release
those oversized tables by dropping or replacing the hc_match_generator instead
of only calling hc.reset() — e.g., if self.hc_match_generator.is_some() then
take() and drop it (self.hc_match_generator = None) or call a dedicated release
method on the hc before returning, ensuring hc_match_generator and its large
buffers are freed when switching away from HashChain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 275-283: The hc_match_generator instance keeps large HC tables
alive after a Best-frame even when switching back to non-HashChain backends;
update the backend-switch path in MatchGeneratorDriver so that when leaving
MatcherBackend::HashChain you release those oversized tables by dropping or
replacing the hc_match_generator instead of only calling hc.reset() — e.g., if
self.hc_match_generator.is_some() then take() and drop it
(self.hc_match_generator = None) or call a dedicated release method on the hc
before returning, ensuring hc_match_generator and its large buffers are freed
when switching away from HashChain.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bbe00f70-bb21-4a0e-a40f-dc95dbce9027

📥 Commits

Reviewing files that changed from the base of the PR and between c1b224e and c4c79e5.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/tests/roundtrip_integrity.rs

When MatchGeneratorDriver switches from HashChain to another backend,
drop the hash/chain tables so Best's larger 2M/1M allocations don't
persist across frames that use Default or Fastest.

polaz commented Apr 2, 2026

@coderabbitai Fixed — HC tables are now released when switching away from HashChain backend, so Best's larger allocations don't persist across frames using other levels.


coderabbitai bot commented Apr 2, 2026


🧩 Analysis chain

Let me verify the fix in the current code.


🏁 Script executed:

#!/bin/bash
# Look at the reset/configure area in match_generator.rs for HC table release logic
rg -n "release|drop|clear|tables|configure|HashChain|reset" zstd/src/encoding/match_generator.rs -A2 -B2 | head -120

Length of output: 4895


🏁 Script executed:

#!/bin/bash
# Get the full reset() function context
sed -n '248,360p' zstd/src/encoding/match_generator.rs

Length of output: 4998


@polaz The fix looks correct. In reset(), when the active backend is HashChain and the new backend is not (e.g., switching from Best to Fastest/Default), the guard at line 279:

if backend != MatcherBackend::HashChain {
    hc.hash_table = Vec::new();
    hc.chain_table = Vec::new();
}

drops the oversized allocations before hc.reset() runs, so the 2M/1M hash/chain tables from Best don't persist. The complementary HC→HC case (e.g., Best → Better, both staying on HashChain) is covered by hc.configure(HC_CONFIG) in the second match, which already clears tables when hash_log/chain_log shrink. Both paths are handled cleanly; this is good to go.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/encoding/frame_compressor.rs:280-287
Timestamp: 2026-04-02T16:31:58.167Z
Learning: In `structured-zstd`, `CompressionLevel::Best` intentionally uses `MatcherBackend::HashChain` (not a binary-tree/DUBT backend) due to memory constraints: a BT matcher for a 16 MiB window requires ~128–256 MB of tables (C zstd level 11 uses chainLog=25 → 256 MB), which is impractical. Best is distinguished from Better by deeper HC search (32 vs 16 candidates), larger hash/chain tables (2M/1M vs 1M/512K), and higher target match length (128 vs 48), keeping memory under 20 MB. It achieves a proportional compression gain (~0.27% better than Better on corpus) comparable to C zstd's L7→L11 delta (~0.14%). A proper BT backend is deferred as a future optimization. Do not flag Best's use of HashChain as missing a BT implementation.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:34.233Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:40.012Z
Learning: In the structured-zstd codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` for lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These are distinct backends with separate lazy-seeding paths.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1470-1490
Timestamp: 2026-04-02T07:27:15.870Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::pick_lazy_match` intentionally queries `find_best_match(abs_pos + 1, ...)` and `find_best_match(abs_pos + 2, ...)` *before* those positions are inserted into the hash-chain tables. This matches C zstd lazy evaluation semantics: pre-seeding those positions would allow a position to match against itself, producing invalid self-matches. Do not flag the absent pre-seeding as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:336-339
Timestamp: 2026-04-02T12:03:03.856Z
Learning: In `zstd/src/encoding/match_generator.rs`, `min_primed_tail = 4` is correct for both `MatcherBackend::Dfast` and `MatcherBackend::HashChain` because `HcMatchGenerator::insert_position` unconditionally requires 4 bytes of hash lookahead (`if idx + 4 > concat.len() { return; }`). `backfill_boundary_positions` only re-inserts positions already fully within extended history; it cannot make a <4-byte dictionary tail fragment hashable. Do not suggest lowering `min_primed_tail` below 4 for the `HashChain` backend.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:479-523
Timestamp: 2026-04-02T19:19:41.513Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), the per-level roundtrip test suites (Better, Best) share a common `level_roundtrip_suite!` macro that expands the same 7-test matrix (compressible, random, multi-block, streaming, edge cases, repeat offsets, large literals) for each level via module-scoped macro invocations. Do not suggest re-inlining these as separate flat test functions — the macro was added intentionally to eliminate cross-level drift.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1531-1535
Timestamp: 2026-04-01T15:33:17.229Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` intentionally casts `abs_pos: usize` to `u32` (wrapping on streams > 4 GiB). This is safe by design: (1) the chain table slot is computed as `(pos_u32 as usize) & chain_mask`, so only the low bits matter and wrapping leaves the addressed slot unchanged; (2) `chain_candidates` filters any recovered `cur as usize` value against `history_abs_start` — a wrapped (small) value is always less than the current sliding-window start and is discarded. Do not flag this cast as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

- driver_best_to_fastest_releases_oversized_hc_tables: verifies HC
  hash/chain tables are freed when switching from Best to Fastest
- driver_better_to_best_resizes_hc_tables: verifies tables grow when
  switching from Better (1M/512K) to Best (2M/1M)
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

- Best doc: point to Default (not Better) for >4 GiB streams since
  Better also uses HC with u32 positions
- Remove redundant `if backend != HashChain` check inside the
  `active_backend != backend` branch where it is always true
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 498-509: The test best_level_compresses_close_to_better currently
allows a 1% regression between CompressionLevel::Best and
CompressionLevel::Better; change the assertion to a strict non-regression check
so compressed_best is never larger than compressed_better (e.g.
assert!(compressed_best.len() <= compressed_better.len(), "Best level must not
produce larger output than Better: best={}, better={}", compressed_best.len(),
compressed_better.len())); update the comment to reflect a strict check if
present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5eac94da-d62a-42d9-adb8-c6fb7b8c3598

📥 Commits

Reviewing files that changed from the base of the PR and between c1b224e and 18bf7cc.

📒 Files selected for processing (3)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/tests/roundtrip_integrity.rs

…ound

Rename to best_level_does_not_regress_vs_better and keep <= assertion:
on this repetitive fixture HC finds identical matches at any depth
(best == better == 30243 bytes). The strict < check lives in
cross_validation::best_level_beats_better_on_corpus_proxy where the
more diverse decodecorpus sample differentiates the levels.
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.



Development

Successfully merging this pull request may close these issues.

feat: implement Best compression level (zstd level 11, btlazy2 strategy)

2 participants