Conversation
…rategy)
- Implement Best level using hash-chain matcher with deep lazy2 matching: 16 MiB window, 2M/1M hash/chain tables, 32-candidate search depth, 128 target match length
- Make HC matcher parameters (hash_log, chain_log, search_depth, target_len) configurable per level via new configure() method
- Add Best to streaming encoder, frame compressor, and CLI (level 4)
- Add 11 roundtrip integrity tests covering compressible, random, multi-block, streaming, edge cases, repeat offsets, and large window
- Add cross-validation: Rust Best → C FFI decompress, ratio assertion (Best beats Better on corpus proxy)
- Add Best to benchmark level matrix (vs C zstd level 11)
- Update README checklist and BENCHMARKS level mapping

Closes #7
All CompressionLevel variants are now covered exhaustively after adding Best; the wildcard arm was dead code producing a warning.
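The value of removing the wildcard arm can be sketched with a minimal stand-in enum (names mirror the PR; the window sizes for levels other than Best are illustrative assumptions, only the 16 MiB Best window comes from this change):

```rust
// Hypothetical mirror of the level enum; variant set matches the PR,
// but window sizes other than Best's 16 MiB are assumed for illustration.
#[derive(Clone, Copy)]
enum CompressionLevel {
    Fastest,
    Default,
    Better,
    Best,
}

// With every variant listed explicitly, adding a future variant becomes a
// compile error at this match instead of silently falling into a dead
// `_ => ...` arm that only produces a warning.
fn window_size(level: CompressionLevel) -> usize {
    match level {
        CompressionLevel::Fastest => 1 << 17, // assumed
        CompressionLevel::Default => 1 << 21, // assumed
        CompressionLevel::Better => 1 << 23,  // assumed
        CompressionLevel::Best => 16 << 20,   // 16 MiB, per this PR
    }
}

fn main() {
    assert_eq!(window_size(CompressionLevel::Best), 16 * 1024 * 1024);
    println!("Best window: {} bytes", window_size(CompressionLevel::Best));
}
```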
No actionable comments were generated in the recent review. 🎉
Walkthrough: Implements CompressionLevel::Best end-to-end: CLI mapping and bench entry, encoder dispatch and streaming support, hash‑chain match‑generator tunables/configuration for Best, and added unit and cross‑validation tests exercising Best-level compression.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI/User
    participant FC as FrameCompressor
    participant MG as MatchGenerator (HashChain)
    participant Enc as Encoder
    CLI->>FC: compress(data, level=Best)
    activate FC
    FC->>MG: configure(level=Best)
    activate MG
    MG->>MG: set hash_log/chain_log/search_depth/target_len
    MG-->>FC: configured matcher
    deactivate MG
    loop per block
        FC->>MG: find_matches(block)
        activate MG
        MG->>MG: traverse hash_chain (up to search_depth)
        MG-->>FC: matches (lazy2 selection)
        deactivate MG
        FC->>Enc: encode block with matches
    end
    Enc-->>CLI: compressed frame
    deactivate FC
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 5 passed
Pull request overview
Adds a new highest compression preset (CompressionLevel::Best) to the pure-Rust encoder pipeline, extending the existing hash-chain + lazy matching infrastructure to use larger tables, deeper search, and a larger window, and wires it through streaming, CLI, tests, and benchmarks.
Changes:
- Implement `CompressionLevel::Best` using the hash-chain backend with level-specific HC parameters (16 MiB window, larger tables, deeper search, higher target length).
- Extend streaming encoder + frame compressor + CLI to accept/use Best.
- Add regression/interop tests and update benchmark/docs to include Best.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| zstd/src/encoding/match_generator.rs | Adds Best-level HC configuration knobs and routes CompressionLevel::Best to the HC backend with a 16 MiB window. |
| zstd/src/encoding/mod.rs | Documents CompressionLevel::Best behavior and limitations. |
| zstd/src/encoding/streaming_encoder.rs | Enables Best for streaming encoding and adds a streaming roundtrip test. |
| zstd/src/encoding/frame_compressor.rs | Routes Best through the shared block-encoding pipeline (removes unimplemented!() path). |
| cli/src/main.rs | Adds CLI --level 4 mapping to Best and updates parsing/tests. |
| zstd/tests/cross_validation.rs | Adds Rust Best → FFI decode regression and Best-vs-Better size assertion on corpus proxy. |
| zstd/src/tests/roundtrip_integrity.rs | Adds multiple roundtrip tests covering Best (including large-window and streaming). |
| zstd/benches/support/mod.rs | Adds Best to benchmark level set (maps to FFI level 11). |
| BENCHMARKS.md | Updates benchmark matrix to include Best. |
| README.md | Updates feature checklist/docs to mark Best as implemented. |
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 280-287: The public CompressionLevel::Best is currently routed to
MatcherBackend::HashChain; update the mapping in
zstd/src/encoding/match_generator.rs (the function that maps CompressionLevel ->
MatcherBackend) so that CompressionLevel::Best selects the DUBT/btlazy2 backend
(e.g., MatcherBackend::Btlazy2 or the Dubt variant used in the repo) with the
intended tuned parameters instead of MatcherBackend::HashChain, and adjust any
associated parameterization or tests that assume HashChain for Best.
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1426-1438: The configure method can accept a search_depth larger
than MAX_HC_SEARCH_DEPTH which causes chain_candidates to write past its fixed
[usize; MAX_HC_SEARCH_DEPTH] buffer; clamp/bound the value assigned to
self.search_depth inside configure (and the other configure-like block around
chain_candidates usage) by setting self.search_depth =
search_depth.min(MAX_HC_SEARCH_DEPTH) (or equivalent), ensuring any code paths
in chain_candidates and related functions use a capped search depth that cannot
exceed MAX_HC_SEARCH_DEPTH.
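A minimal sketch of the clamp this comment asks for (struct and field names mirror the review text; the real matcher carries many more fields):

```rust
// Depth of the fixed-size candidate buffer, matching the PR's
// 32-candidate search depth.
const MAX_HC_SEARCH_DEPTH: usize = 32;

// Simplified stand-in for the hash-chain matcher; only the field
// relevant to the clamp is shown.
struct HashChainMatcher {
    search_depth: usize,
}

impl HashChainMatcher {
    fn configure(&mut self, search_depth: usize) {
        // Clamp instead of trusting the caller: a future level preset with
        // depth > 32 would otherwise index past the
        // `[usize; MAX_HC_SEARCH_DEPTH]` buffer in chain_candidates.
        self.search_depth = search_depth.min(MAX_HC_SEARCH_DEPTH);
    }
}

fn main() {
    let mut m = HashChainMatcher { search_depth: 0 };
    m.configure(64); // oversized request gets capped
    assert_eq!(m.search_depth, MAX_HC_SEARCH_DEPTH);
    m.configure(16); // in-range request passes through
    assert_eq!(m.search_depth, 16);
    println!("clamped ok");
}
```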
In `@zstd/src/encoding/streaming_encoder.rs`:
- Around line 752-763: The test best_level_streaming_roundtrip currently uses a
payload too small to trigger multi-block emission for StreamingEncoder with
CompressionLevel::Best; update the test so the encoder emits at least two blocks
by increasing the payload size or changing chunking (e.g., repeat the base
string more times or use smaller chunk sizes so total data exceeds a
single-block threshold), ensuring StreamingEncoder::new/CompressionLevel::Best
path crosses a block boundary and the roundtrip still decodes correctly.
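The arithmetic behind the comment can be checked directly, assuming zstd's 128 KiB maximum block size (the later commit note in this thread picks a 200 KiB payload on the same basis):

```rust
// zstd's maximum block size per RFC 8878 is 128 KiB.
const MAX_BLOCK_SIZE: usize = 128 * 1024;

// Lower bound on how many blocks a payload of this size must emit.
fn min_block_count(payload_len: usize) -> usize {
    (payload_len + MAX_BLOCK_SIZE - 1) / MAX_BLOCK_SIZE
}

fn main() {
    // A small payload fits in one block, so the test never crosses a
    // block boundary; 200 KiB is guaranteed to span at least two.
    assert_eq!(min_block_count(64 * 1024), 1);
    assert_eq!(min_block_count(200 * 1024), 2);
    println!("200 KiB spans at least {} blocks", min_block_count(200 * 1024));
}
```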
In `@zstd/tests/cross_validation.rs`:
- Around line 281-295: The test best_level_beats_better_on_corpus_proxy
currently uses <= so it will not detect if CompressionLevel::Best accidentally
aliases CompressionLevel::Better; update the assertion to require a strict
improvement (use <) when comparing best.len() and better.len() in the test that
calls compress_to_vec with CompressionLevel::Best and CompressionLevel::Better;
if equal sizes are acceptable for this corpus instead, add a separate
fixture/test where Best is known to win (or add a ratio-check) so the regression
(Best == Better aliasing) is caught.
ℹ️ Review info — Configuration: .coderabbit.yaml, profile ASSERTIVE, plan Pro, run ID ad4f5a7b-8dfd-4eae-aaca-7240ba2d2cec

📒 Files selected for processing (10): BENCHMARKS.md, README.md, cli/src/main.rs, zstd/benches/support/mod.rs, zstd/src/encoding/frame_compressor.rs, zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs, zstd/src/encoding/streaming_encoder.rs, zstd/src/tests/roundtrip_integrity.rs, zstd/tests/cross_validation.rs
- Clamp search_depth to MAX_HC_SEARCH_DEPTH in configure() to prevent OOB panic if a future level config exceeds the fixed-size array
- Strict assertion in best_beats_better test (< not <=) to catch accidental Best/Better aliasing
- Increase best_level_streaming_roundtrip payload to 200 KiB to cross the 128 KiB block boundary
- Document ensure_level_supported exhaustive guard and chain_candidates fixed-size array design
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 543-569: Summary: The test's use of generate_compressible for
region allows Better to match within the region itself, weakening the assertion
that Best's larger 16 MiB window is responsible for the compression advantage.
Fix: in roundtrip_best_level_large_window replace the compressible region
generation (generate_compressible(...)) with a high-entropy region that has low
internal repetition (e.g., generate_random or generate_high_entropy(...) of the
same size) so the only meaningful matches come from the long-distance duplicate;
keep the rest of the test (gap generation, concatenation, calls to
compress_to_vec and CompressionLevel::Best/Better, and the final length
comparison) unchanged so the size win requires window reach rather than
intra-region matches.
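A self-contained sketch of the fixture shape the comment asks for, assuming nothing about the repo's helpers (the xorshift generator here is a stand-in for whatever RNG the test utilities actually use):

```rust
// Tiny xorshift64 PRNG, a stand-in for the repo's random-data helper,
// used to produce a high-entropy region with low internal repetition.
fn xorshift(state: &mut u64) -> u8 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state & 0xFF) as u8
}

// Build: [random region][random gap][exact duplicate of region].
// Because the region is high-entropy, the only meaningful matches are
// long-distance ones into the first copy, so a compression win requires
// window reach rather than intra-region matching.
fn build_large_window_fixture(region_len: usize, gap_len: usize) -> Vec<u8> {
    let mut s = 0x9E37_79B9_7F4A_7C15u64;
    let region: Vec<u8> = (0..region_len).map(|_| xorshift(&mut s)).collect();
    let gap: Vec<u8> = (0..gap_len).map(|_| xorshift(&mut s)).collect();
    let mut data = region.clone();
    data.extend_from_slice(&gap);
    data.extend_from_slice(&region); // duplicate only reachable with a big window
    data
}

fn main() {
    let data = build_large_window_fixture(4096, 8192);
    assert_eq!(data.len(), 4096 + 8192 + 4096);
    // The head and tail regions are byte-identical duplicates.
    assert_eq!(&data[..4096], &data[4096 + 8192..]);
    println!("fixture ok: {} bytes", data.len());
}
```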
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.
| Benchmark suite | Current: 0f44b8c | Previous: 8f2fe9b | Ratio |
|---|---|---|---|
| compress/better/small-1k-random/matrix/pure_rust | 0.173 ms | 0.141 ms | 1.23 |
| compress/better/small-1k-random/matrix/c_ffi | 0.108 ms | 0.086 ms | 1.26 |
| compress/default/small-4k-log-lines/matrix/c_ffi | 0.021 ms | 0.018 ms | 1.17 |
| compress/better/decodecorpus-z000033/matrix/pure_rust | 65.38 ms | 52.89 ms | 1.24 |
| compress/fastest/high-entropy-1m/matrix/c_ffi | 0.284 ms | 0.234 ms | 1.21 |
| compress/default/high-entropy-1m/matrix/c_ffi | 0.353 ms | 0.287 ms | 1.23 |
| compress/better/high-entropy-1m/matrix/pure_rust | 71.464 ms | 62.008 ms | 1.15 |
| compress/better/high-entropy-1m/matrix/c_ffi | 0.674 ms | 0.578 ms | 1.17 |
| compress/fastest/low-entropy-1m/matrix/c_ffi | 0.18 ms | 0.133 ms | 1.35 |
| compress/better/low-entropy-1m/matrix/pure_rust | 5.117 ms | 4.38 ms | 1.17 |
| compress/fastest/large-log-stream/matrix/c_ffi | 2.656 ms | 2.134 ms | 1.24 |
| compress/default/large-log-stream/matrix/pure_rust | 109.811 ms | 91.296 ms | 1.20 |
| decompress/default/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
| decompress/better/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
| compress-dict/fastest/small-10k-random/matrix/c_ffi_with_dict | 0.018 ms | 0.015 ms | 1.20 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
Use generate_data (random) instead of generate_compressible for the duplicated region so the size win depends on window reach, not intra-region compression.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Line 13: The import group in the use statement is out of alphabetical order;
reorder the items inside crate::encoding::{...} so identifiers starting with
uppercase come before lowercase (e.g., CompressionLevel, FrameCompressor,
compress_to_vec) to satisfy cargo fmt; update the use line that currently
references compress_to_vec, CompressionLevel, FrameCompressor accordingly.
The stack array size depends on pointer width; "256 bytes" is only correct on 64-bit targets.
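The pointer-width dependency is easy to make concrete; with the PR's 32-candidate search depth, the `[usize; 32]` candidate buffer is 256 bytes on 64-bit targets but half that on 32-bit:

```rust
use std::mem::size_of;

fn main() {
    // An array of N usize occupies N * size_of::<usize>() bytes:
    // 256 bytes when usize is 8 bytes (64-bit), 128 bytes when it is
    // 4 bytes (32-bit). Documenting a fixed "256 bytes" bakes in the
    // 64-bit assumption.
    let per_elem = size_of::<usize>();
    let array_bytes = size_of::<[usize; 32]>();
    assert_eq!(array_bytes, 32 * per_elem);
    println!("[usize; 32] = {array_bytes} bytes on this target");
}
```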
@coderabbitai re-review

✅ Actions performed: full review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 39-47: Replace the positional usize parameters to the HC
configure(...) API with a typed HcConfig struct and constant instances to avoid
parameter-order mistakes: define a HcConfig { hash_log: usize, chain_log: usize,
search_depth: usize, target_len: usize, default_window_size: usize } and replace
the individual BEST_HC_* constants with a single BEST_HC_CONFIG constant; change
the configure(...) function signature to take HcConfig (or &HcConfig) instead of
multiple usize args and update all call sites (including other occurrences
around chain_candidates and higher-level configurations) to pass the named
HcConfig constants (e.g., BEST_HC_CONFIG) so each level’s tuning is bundled and
order-sensitive positional arguments are removed.
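A sketch of the typed-config refactor this comment describes. The field set follows the comment; the exact log values are assumptions consistent with the PR's "2M/1M hash/chain tables" (2M = 2^21, 1M = 2^20):

```rust
// Bundle the HC tunables so call sites can't transpose four positional
// `usize` arguments. Values mirror the Best preset described in this PR.
#[derive(Clone, Copy, Debug, PartialEq)]
struct HcConfig {
    hash_log: usize,
    chain_log: usize,
    search_depth: usize,
    target_len: usize,
    default_window_size: usize,
}

const BEST_HC_CONFIG: HcConfig = HcConfig {
    hash_log: 21,                  // 2^21 = 2M hash entries (assumed log)
    chain_log: 20,                 // 2^20 = 1M chain entries (assumed log)
    search_depth: 32,
    target_len: 128,
    default_window_size: 16 << 20, // 16 MiB
};

// Simplified matcher stand-in: one named argument replaces four
// order-sensitive ones.
struct Matcher {
    cfg: HcConfig,
}

impl Matcher {
    fn configure(&mut self, cfg: HcConfig) {
        self.cfg = cfg;
    }
}

fn main() {
    let mut m = Matcher { cfg: BEST_HC_CONFIG };
    m.configure(BEST_HC_CONFIG);
    assert_eq!(m.cfg.default_window_size, 16 * 1024 * 1024);
    println!("{:?}", m.cfg);
}
```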
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 479-523: Multiple tests duplicate the same fixtures for "Best"
that mirror the "Better" suite; extract a shared test matrix to avoid
duplication and drift: create a helper function or macro (e.g., a function
run_roundtrip_matrix(roundtrip_fn: fn(&[u8])->Vec<u8>, streaming_roundtrip_fn:
Option<fn(&[u8])->Vec<u8>>) or a macro run_roundtrip_tests! that accepts the
pair of functions) and replace the repeated assertions that call roundtrip_best
and roundtrip_best_streaming (and their "Better" equivalents) with invocations
of that helper, moving the common data generators (generate_compressible,
generate_data, repeat_offset_fixture, generate_huffman_friendly, edge cases)
into the shared matrix so both Best and Better test modules call the same table
of cases.
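The macro variant of the suggestion can be sketched like this; the case table and `roundtrip_identity` helper are illustrative stand-ins for the repo's actual generators and compress/decompress roundtrip:

```rust
// One macro stamps out the same case table for any level, so the Best
// and Better suites can't drift apart. The helper here is a placeholder
// identity; a real suite would compress at $level and decode.
macro_rules! level_roundtrip_suite {
    ($suite:ident, $level:expr) => {
        mod $suite {
            pub fn run_all() -> usize {
                let cases: &[&[u8]] = &[
                    b"",                                 // empty-input edge case
                    b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", // highly compressible
                    b"q8Zx1mP0vK3sT7wL",                 // incompressible-ish
                ];
                for &case in cases {
                    assert_eq!(roundtrip_identity(case, $level), case);
                }
                cases.len()
            }

            fn roundtrip_identity(data: &[u8], _level: u32) -> Vec<u8> {
                data.to_vec() // placeholder for compress-then-decompress
            }
        }
    };
}

level_roundtrip_suite!(better_suite, 3);
level_roundtrip_suite!(best_suite, 4);

fn main() {
    assert_eq!(better_suite::run_all(), best_suite::run_all());
    println!("both suites ran the same case table");
}
```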
…te macro
- Replace positional configure(hash_log, chain_log, search_depth, target_len) with configure(HcConfig), eliminating parameter-order hazard. Add HC_CONFIG and BEST_HC_CONFIG const presets.
- Extract level_roundtrip_suite! macro that generates the standard 7-test roundtrip suite (compressible, random, multi-block, streaming, edge cases, repeat offsets, large literals) for any compression level. Better and Best now share the same test matrix via module-scoped macro invocations.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/encoding/match_generator.rs (1)
275-283: ⚠️ Potential issue | 🟠 Major — Release oversized HC tables when switching away from Best.

`MatchGeneratorDriver` keeps `hc_match_generator` alive across frames, and the backend-switch path only calls `hc.reset()`. After one `CompressionLevel::Best` frame, the 2M/1M HC tables remain resident even if later frames use `Default` or `Fastest`, so a long-lived compressor keeps the higher Best memory floor until the matcher is dropped or re-enters HC and reallocates smaller tables.

💡 Possible fix

```diff
 MatcherBackend::HashChain => {
     if let Some(hc) = self.hc_match_generator.as_mut() {
+        if backend != MatcherBackend::HashChain
+            && (hc.hash_log != HC_HASH_LOG || hc.chain_log != HC_CHAIN_LOG)
+        {
+            hc.hash_table = Vec::new();
+            hc.chain_table = Vec::new();
+        }
         let vec_pool = &mut self.vec_pool;
         hc.reset(|mut data| {
             data.resize(data.capacity(), 0);
             vec_pool.push(data);
         });
     }
 }
```

Also applies to: 317-325
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/encoding/match_generator.rs` around lines 275 - 283, The hc_match_generator instance keeps large HC tables alive after a Best-frame even when switching back to non-HashChain backends; update the backend-switch path in MatchGeneratorDriver so that when leaving MatcherBackend::HashChain you release those oversized tables by dropping or replacing the hc_match_generator instead of only calling hc.reset() — e.g., if self.hc_match_generator.is_some() then take() and drop it (self.hc_match_generator = None) or call a dedicated release method on the hc before returning, ensuring hc_match_generator and its large buffers are freed when switching away from HashChain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 275-283: The hc_match_generator instance keeps large HC tables
alive after a Best-frame even when switching back to non-HashChain backends;
update the backend-switch path in MatchGeneratorDriver so that when leaving
MatcherBackend::HashChain you release those oversized tables by dropping or
replacing the hc_match_generator instead of only calling hc.reset() — e.g., if
self.hc_match_generator.is_some() then take() and drop it
(self.hc_match_generator = None) or call a dedicated release method on the hc
before returning, ensuring hc_match_generator and its large buffers are freed
when switching away from HashChain.
When MatchGeneratorDriver switches from HashChain to another backend, drop the hash/chain tables so Best's larger 2M/1M allocations don't persist across frames that use Default or Fastest.
@coderabbitai Fixed — HC tables are now released when switching away from HashChain backend, so Best's larger allocations don't persist across frames using other levels.
🧩 Analysis chain: Let me verify the fix in the current code.

Verified — the new branch

```rust
if backend != MatcherBackend::HashChain {
    hc.hash_table = Vec::new();
    hc.chain_table = Vec::new();
}
```

drops the oversized allocations before `hc.reset()` runs.
- driver_best_to_fastest_releases_oversized_hc_tables: verifies HC hash/chain tables are freed when switching from Best to Fastest
- driver_better_to_best_resizes_hc_tables: verifies tables grow when switching from Better (1M/512K) to Best (2M/1M)
- Best doc: point to Default (not Better) for >4 GiB streams, since Better also uses HC with u32 positions
- Remove redundant `if backend != HashChain` check inside the `active_backend != backend` branch, where it is always true
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 498-509: The test best_level_compresses_close_to_better currently
allows a 1% regression between CompressionLevel::Best and
CompressionLevel::Better; change the assertion to a strict non-regression check
so compressed_best is never larger than compressed_better (e.g.
assert!(compressed_best.len() <= compressed_better.len(), "Best level must not
produce larger output than Better: best={}, better={}", compressed_best.len(),
compressed_better.len())); update the comment to reflect a strict check if
present.
…ound
Rename to best_level_does_not_regress_vs_better and keep the <= assertion: on this repetitive fixture HC finds identical matches at any depth (best == better == 30243 bytes). The strict < check lives in cross_validation::best_level_beats_better_on_corpus_proxy, where the more diverse decodecorpus sample differentiates the levels.
Summary
- `CompressionLevel::Best` — roughly equivalent to zstd level 11
- HC matcher parameters (`hash_log`, `chain_log`, `search_depth`, `target_len`) are now configurable per level via the `configure()` method
- CLI level `4` maps to Best
- Compression ratio vs C zstd
Test plan
Closes #7