feat(encoding): add dictionary compression support #44
Conversation
- add FrameCompressor dictionary APIs, including parse-from-bytes helper
- write dictionary id into frame header and prime matcher with dictionary history
- support raw-content dictionaries for dict_builder outputs
- add regression tests for dict-id enforcement, C interop, and dict_builder roundtrip

Closes #8
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (9)
📝 Walkthrough

Adds encoder-side dictionary support: new Dictionary constructor, stricter dictionary decode validation and errors, matcher priming and window-budget tracking, FrameCompressor APIs to attach/seed dictionaries and advertise dictionary_id, helpers to convert decoder tables to encoder tables, tests, and a Cargo packaging tweak.
Sequence diagram

sequenceDiagram
participant User as "User"
participant Dict as "Dictionary"
participant Compressor as "FrameCompressor"
participant MatchGen as "MatchGenerator"
participant Matcher as "Matcher"
User->>Dict: decode(bytes) / from_raw_content(id,content)
Dict-->>User: Dictionary | Err
User->>Compressor: set_dictionary(Dictionary)
User->>Compressor: compress(data)
Compressor->>Compressor: detect attached dictionary
alt dictionary attached and matcher supports priming
Compressor->>MatchGen: prime_with_dictionary(dict.content, dict.offset_hist)
MatchGen->>Matcher: commit chunks / populate hash & chains
MatchGen->>Matcher: apply offset_hist
Compressor->>Compressor: seed previous Huffman & FSE tables
Compressor->>Compressor: set FrameHeader.dictionary_id
end
Compressor-->>User: compressed bytes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Pull request overview
Adds dictionary compression support to the encoder pipeline, enabling frames to be compressed with a provided Zstd dictionary and advertising the dictionary ID in the frame header for decoder interoperability.
Changes:
- Extend the `Matcher` trait + default matcher to support priming matcher state from dictionary history/content.
- Add dictionary attachment APIs to `FrameCompressor` and emit `dictionary_id` in the encoded frame header.
- Add dictionary constructors/validation (including rejecting dictionary ID 0) and new regression tests for dictionary-compressed roundtrips + zstd-ffi interop.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| zstd/src/encoding/mod.rs | Adds prime_with_dictionary hook to the Matcher trait. |
| zstd/src/encoding/match_generator.rs | Implements dictionary priming for the default matcher backends and sets repeat-offset history. |
| zstd/src/encoding/frame_compressor.rs | Stores an attached dictionary, primes state per frame, writes dict ID in header, and adds dictionary compression tests. |
| zstd/src/decoding/errors.rs | Adds ZeroDictionaryId decode error variant and Display formatting. |
| zstd/src/decoding/dictionary.rs | Adds Dictionary::from_raw_content and rejects zero dictionary IDs during decode. |
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
The benchmark result of this commit is worse than the previous benchmark result, exceeding the threshold of 1.15.
| Benchmark suite | Current: 127d41d | Previous: 29de24a | Ratio |
|---|---|---|---|
| compress/fastest/small-4k-log-lines/matrix/pure_rust | 0.035 ms | 0.026 ms | 1.35 |
| compress/fastest/high-entropy-1m/matrix/c_ffi | 0.309 ms | 0.267 ms | 1.16 |
| compress/default/low-entropy-1m/matrix/c_ffi | 0.233 ms | 0.197 ms | 1.18 |
| decompress/fastest/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
| decompress/default/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
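The Ratio column is simply the current time divided by the previous time. A minimal sketch of the threshold check github-action-benchmark performs (the 1.15 threshold comes from the alert text above; the helper name is hypothetical):

```rust
// Flag a benchmark as a regression when current/previous exceeds the
// configured alert threshold (1.15 per the workflow's alert above).
fn is_regression(current_ms: f64, previous_ms: f64, threshold: f64) -> bool {
    previous_ms > 0.0 && current_ms / previous_ms > threshold
}

fn main() {
    // Values from the first table row: 0.035 / 0.026 ≈ 1.35 > 1.15.
    assert!(is_regression(0.035, 0.026, 1.15));
    // An improvement never trips the alert.
    assert!(!is_regression(0.020, 0.026, 1.15));
    println!("threshold check behaves as described");
}
```

Note that low-millisecond decompress timings (0.002 ms vs 0.001 ms) sit near timer resolution, so their "2x" ratios are noisier than the compress rows.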
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
- reject dictionary id 0 in FrameCompressor::set_dictionary
- return explicit DictionaryDecodeError on undersized dictionary buffers
- keep dict_tests assets in crate package so include_bytes tests compile downstream
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/decoding/dictionary.rs (1)
Lines 77-83: ⚠️ Potential issue | 🟠 Major

Critical: Initial `offset_hist` uses an incorrect default value.

Line 82 initializes `offset_hist` to `[2, 4, 8]`, but the ZSTD specification (RFC 8878 §3.1.2.5) mandates that the default repeat offsets are `[1, 4, 8]`. This is inconsistent with:

- `from_raw_content` at line 60, which uses `[1, 4, 8]`
- `DecoderScratch::new()` in `scratch.rs:44`, which uses `[1, 4, 8]`

This value is overwritten later (lines 136-138) when parsing a valid dictionary, so it only affects the intermediate state. However, if any error path or future refactoring reads this before it is overwritten, it would produce incorrect offset decoding.
Proposed fix:

```diff
 let mut new_dict = Dictionary {
     id: 0,
     fse: FSEScratch::new(),
     huf: HuffmanScratch::new(),
     dict_content: Vec::new(),
-    offset_hist: [2, 4, 8],
+    offset_hist: [1, 4, 8],
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/dictionary.rs` around lines 77 - 83, The Dictionary struct is initialized with the wrong default repeat offsets; change the initial offset_hist in the Dictionary::new-path (where Dictionary is constructed with FSEScratch::new() and HuffmanScratch::new()) from [2, 4, 8] to the correct RFC8878 defaults [1, 4, 8] so it matches from_raw_content and DecoderScratch::new(), ensuring any interim/error-path reads see the correct repeat offsets; update the offset_hist field in that Dictionary initialization accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 163-173: The test function
decode_dict_malformed_input_returns_error_instead_of_panicking contains assert!
macro calls with formatting that fails cargo fmt; reformat those assert!
invocations to adhere to rustfmt (put the condition and message on the same line
or follow rustfmt's preferred multi-line style) so the file passes cargo fmt;
locate the test function and update the two assert! calls around
Dictionary::decode_dict(&raw) and result.unwrap().is_err() (and any related
string literals referencing decode_dict or malformed dictionary messages) to be
formatted properly.
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 164-169: Add a short clarifying comment explaining why offset_hist
is assigned twice: once to FrameCompressor.state.offset_hist (used during
sequence encoding) and again via Matcher.prime_with_dictionary (which sets the
matcher's internal offset history used for match generation); place the comment
near the block in frame_compressor.rs where self.dictionary is primed (around
the self.state.offset_hist assignment and matcher.prime_with_dictionary call)
and mention both uses (sequence encoding vs match generation) and reference the
match_generator behavior so future maintainers understand the intentional dual
assignment.
- Around line 318-327: The method FrameCompressor::set_dictionary currently
panics via assert when dictionary.id == 0; change it to return a Result to match
set_dictionary_from_bytes so callers can handle the zero-ID error: replace the
assert in set_dictionary with a check that returns an Err variant (introduce or
reuse a DictionaryError/FrameCompressorError variant like InvalidDictionaryId)
when id == 0 and return Ok(previous_dictionary_opt) on success, and update
callers/tests accordingly to handle Result; keep the function name
set_dictionary and its return semantics (previous Option<Dictionary>) wrapped in
Result<Option<Dictionary>, YourErrorType> so API usage mirrors
set_dictionary_from_bytes.
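A hedged sketch of the suggested Result-based API shape. The struct, field, and error types below are simplified stand-ins for illustration, not the crate's actual definitions:

```rust
#[derive(Debug, PartialEq)]
enum SetDictionaryError {
    InvalidDictionaryId,
}

#[derive(Debug, PartialEq)]
struct Dictionary {
    id: u32,
}

struct FrameCompressor {
    dictionary: Option<Dictionary>,
}

impl FrameCompressor {
    /// Returns the previously attached dictionary on success instead of
    /// panicking on a zero id, mirroring set_dictionary_from_bytes.
    fn set_dictionary(
        &mut self,
        dictionary: Dictionary,
    ) -> Result<Option<Dictionary>, SetDictionaryError> {
        if dictionary.id == 0 {
            // Fail fast: id 0 is reserved for "no dictionary".
            return Err(SetDictionaryError::InvalidDictionaryId);
        }
        Ok(self.dictionary.replace(dictionary))
    }
}

fn main() {
    let mut compressor = FrameCompressor { dictionary: None };
    assert!(compressor.set_dictionary(Dictionary { id: 0 }).is_err());
    assert!(compressor.set_dictionary(Dictionary { id: 7 }).unwrap().is_none());
    println!("zero id rejected; valid id attached");
}
```

The `Ok(Option<Dictionary>)` shape preserves the existing "previous dictionary" return semantics while moving the zero-id contract out of a panic and into the type system.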
---
Outside diff comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 77-83: The Dictionary struct is initialized with the wrong default
repeat offsets; change the initial offset_hist in the Dictionary::new-path
(where Dictionary is constructed with FSEScratch::new() and
HuffmanScratch::new()) from [2, 4, 8] to the correct RFC8878 defaults [1, 4, 8]
so it matches from_raw_content and DecoderScratch::new(), ensuring any
interim/error-path reads see the correct repeat offsets; update the offset_hist
field in that Dictionary initialization accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 364b693b-b273-4d81-b744-5170a275b26f
📒 Files selected for processing (4)
- zstd/Cargo.toml
- zstd/src/decoding/dictionary.rs
- zstd/src/decoding/errors.rs
- zstd/src/encoding/frame_compressor.rs
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/decoding/dictionary.rs (1)
Lines 77-83: ⚠️ Potential issue | 🟡 Minor

Inconsistent default `offset_hist` initialization.

Line 82 initializes `offset_hist` to `[2, 4, 8]`, but:

- `from_raw_content` (line 60) uses `[1, 4, 8]`
- `CompressState` in `frame_compressor.rs` (lines 108, 127, 163) uses `[1, 4, 8]`
- RFC 8878 §3.1.2.5 specifies the default as `[1, 4, 8]`

This value gets overwritten by lines 136-138 for fully-parsed dictionaries, so it only affects the transient state. However, for consistency and correctness if the overwrite logic ever changes, this should match the RFC default.
Proposed fix:

```diff
 let mut new_dict = Dictionary {
     id: 0,
     fse: FSEScratch::new(),
     huf: HuffmanScratch::new(),
     dict_content: Vec::new(),
-    offset_hist: [2, 4, 8],
+    offset_hist: [1, 4, 8],
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/dictionary.rs` around lines 77 - 83, The Dictionary struct's transient default for offset_hist is inconsistent (currently [2, 4, 8]) and should match RFC 8878 and other code paths; change the default initialization in Dictionary::new (the Dictionary literal created in the constructor at the diff) to [1, 4, 8] so it matches from_raw_content and the CompressState usage; ensure you update the offset_hist field in the Dictionary instantiation (symbol: Dictionary, field: offset_hist) to [1, 4, 8] to keep behavior consistent with RFC 8878 §3.1.2.5.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 314-327: The method FrameCompressor::set_dictionary currently
asserts (panics) when dictionary.id == 0; change it to return a Result instead
to match set_dictionary_from_bytes and avoid panics for a public API: replace
the assert_ne! with an early Err return (e.g.,
Err(crate::error::InvalidDictionaryId) or an appropriate crate error type) and
change the signature to return Result<Option<crate::decoding::Dictionary>, _>,
keeping the existing behavior of self.dictionary.replace(dictionary) as
Ok(Some/None) on success; update callers/tests accordingly to handle the Result.
---
Outside diff comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 77-83: The Dictionary struct's transient default for offset_hist
is inconsistent (currently [2, 4, 8]) and should match RFC 8878 and other code
paths; change the default initialization in Dictionary::new (the Dictionary
literal created in the constructor at the diff) to [1, 4, 8] so it matches
from_raw_content and the CompressState usage; ensure you update the offset_hist
field in the Dictionary instantiation (symbol: Dictionary, field: offset_hist)
to [1, 4, 8] to keep behavior consistent with RFC 8878 §3.1.2.5.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7c3e769b-37d3-496f-87aa-3b7424082751
📒 Files selected for processing (4)
- zstd/Cargo.toml
- zstd/src/decoding/dictionary.rs
- zstd/src/decoding/errors.rs
- zstd/src/encoding/frame_compressor.rs
- use RFC default repeat offsets [1,4,8] in decode_dict initialization
- document intentional dual offset history priming in compressor state and matcher
- document fail-fast zero-id contract for set_dictionary

- assert first set_dictionary_from_bytes insert returns None
- explicitly discard optional previous dictionary in dict_builder roundtrip test
- exclude dict_tests/files/** while keeping dict_tests/dictionary for include_bytes tests
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/decoding/dictionary.rs (1)
Lines 120-138: ⚠️ Potential issue | 🟠 Major

Reject zero repeat offsets during dictionary parsing.

The new length guard prevents the slice panic, but `decode_dict()` still accepts `offset1..offset3 == 0` verbatim. Those values are copied straight into live decoder state by `zstd/src/decoding/scratch.rs`, and `zstd/src/decoding/sequence_execution.rs` treats them as repeat offsets (`scratch[0] - 1` on one branch). A malformed dictionary can therefore survive parsing and only fail much later when it is actually used. Please validate all three parsed repeat offsets here and return a decode error if any are zero.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/dictionary.rs` around lines 120 - 138, In decode_dict (dictionary.rs) validate the three parsed repeat offsets (offset1, offset2, offset3) after converting from raw_tables and before assigning into new_dict.offset_hist: if any offset == 0 return a DictionaryDecodeError indicating an invalid/zero repeat offset (e.g., Err(DictionaryDecodeError::InvalidRepeatOffset { index: <0|1|2>, got: 0 }) or the crate's closest error variant) instead of accepting them verbatim; this prevents zero offsets from being copied into new_dict.offset_hist and later used by scratch.rs / sequence_execution.rs.
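The suggested validation can be sketched in isolation; the error variant and helper name below are illustrative, not the crate's actual API:

```rust
#[derive(Debug, PartialEq)]
enum DictionaryDecodeError {
    ZeroRepeatOffset { index: usize },
}

/// Reject any zero repeat offset before it can reach live decoder state,
/// so a malformed dictionary fails at parse time rather than during use.
fn validate_offset_hist(offsets: [u32; 3]) -> Result<[u32; 3], DictionaryDecodeError> {
    for (index, &offset) in offsets.iter().enumerate() {
        if offset == 0 {
            return Err(DictionaryDecodeError::ZeroRepeatOffset { index });
        }
    }
    Ok(offsets)
}

fn main() {
    assert!(validate_offset_hist([1, 4, 8]).is_ok());
    assert_eq!(
        validate_offset_hist([1, 0, 8]),
        Err(DictionaryDecodeError::ZeroRepeatOffset { index: 1 })
    );
    println!("zero repeat offsets rejected at parse time");
}
```

Reporting which offset was zero (rather than a bare `is_err`) also lets the regression test pin the exact failure, as a later review comment asks for.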
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 164-170: When a dictionary is present, you must also restore the
encoder's previous entropy tables so the first block can use the "repeat
previous table" path: after priming offsets in the branch that checks
self.dictionary.as_ref(), detect whether dict.huf and dict.fse are populated
and, if so, set self.state.last_huff_table to the dictionary's Huffman tables
and set self.state.fse_tables.ll_previous, self.state.fse_tables.ml_previous,
and self.state.fse_tables.of_previous from dict.fse (or the corresponding fields
on dict) so the encoder's previous-table state matches the parsed dictionary
entropy tables.
---
Outside diff comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 120-138: In decode_dict (dictionary.rs) validate the three parsed
repeat offsets (offset1, offset2, offset3) after converting from raw_tables and
before assigning into new_dict.offset_hist: if any offset == 0 return a
DictionaryDecodeError indicating an invalid/zero repeat offset (e.g.,
Err(DictionaryDecodeError::InvalidRepeatOffset { index: <0|1|2>, got: 0 }) or
the crate's closest error variant) instead of accepting them verbatim; this
prevents zero offsets from being copied into new_dict.offset_hist and later used
by scratch.rs / sequence_execution.rs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f5742492-88f2-444d-9408-59e1818f8aae
📒 Files selected for processing (3)
- zstd/Cargo.toml
- zstd/src/decoding/dictionary.rs
- zstd/src/encoding/frame_compressor.rs
- restore previous Huffman/FSE encoder tables from parsed dictionaries before first block
- convert decoder-side entropy tables into encoder tables for dictionary priming
- reject zero repeat offsets during dictionary parsing with explicit decode error
- add regression tests for entropy seeding and zero-repeat-offset rejection

- precompute decoder->encoder entropy conversions when dictionary is set
- reuse cached tables across compress() calls to avoid per-frame rebuild
- keep explicit fail-fast comment for zero dictionary id API contract
- derive Clone for encoder HuffmanTable to support cache reuse

- clarify set_dictionary docs for uncompressed and non-priming matchers
- track retained dictionary budget separately from advertised live window
- shrink matcher capacity as primed dictionary bytes are evicted
- add regression tests for simple and dfast budget retirement

- seed huffman table directly via Option::clone_from from cached entropy
- cache FSE previous tables as PreviousFseTable to avoid per-frame reboxing
- remove temporary clone/map allocations in dictionary seeding path
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 196-209: The current seeding creates an extra temporary by calling
cached_entropy.and_then(|cache| cache.ll_previous.clone()) before clone_from;
instead, avoid the intermediate clone by matching borrowed cached_entropy and
passing a reference to the inner Option to clone_from: e.g. if let Some(cache) =
&cached_entropy {
self.state.fse_tables.ll_previous.clone_from(&cache.ll_previous); } else {
self.state.fse_tables.ll_previous.clone_from(&None); } and do the same for
ml_previous and of_previous so clone_from operates on a borrowed Option
reference rather than a freshly cloned value.
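The borrowed `clone_from` pattern the comment describes can be shown standalone; `CachedEntropy`, `ll_previous`, and `seed` below are simplified stand-ins, not the PR's real types:

```rust
// Cloning from a borrowed Option lets Clone::clone_from reuse the
// destination's existing allocation, whereas and_then(..).clone() first
// materializes a throwaway temporary Option.
struct CachedEntropy {
    ll_previous: Option<Vec<u8>>,
}

fn seed(dest: &mut Option<Vec<u8>>, cached: Option<&CachedEntropy>) {
    match cached {
        // clone_from operates on a borrowed Option reference: no temporary.
        Some(cache) => dest.clone_from(&cache.ll_previous),
        None => dest.clone_from(&None),
    }
}

fn main() {
    let cache = CachedEntropy { ll_previous: Some(vec![1, 2, 3]) };
    let mut dest = Some(vec![0u8; 16]); // existing buffer may be reused
    seed(&mut dest, Some(&cache));
    assert_eq!(dest, Some(vec![1, 2, 3]));
    seed(&mut dest, None);
    assert_eq!(dest, None);
    println!("seeded from borrowed cache without intermediate clones");
}
```

The same shape would apply to the `ml_previous` and `of_previous` fields the comment mentions.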
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 206b40f6-cc75-4c1d-8bb3-91e38f82131c
📒 Files selected for processing (1)
zstd/src/encoding/frame_compressor.rs
- keep dfast eviction callbacks on logical slice length, not vec capacity
- add regression tests for add_data/trim_to_window eviction length semantics
- remove intermediate Option clones in FSE dictionary seeding path
@coderabbitai re-review

🧠 Learnings used · ✅ Actions performed: full review triggered.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 233-245: Update the test decode_dict_rejects_zero_repeat_offsets
to assert the exact error variant instead of just is_err(): call
Dictionary::decode_dict(&raw), unwrap the Err and match it against
DictionaryDecodeError::ZeroRepeatOffsetInDictionary { index: 0 } (or use
assert_eq! on the error) so the test fails for unrelated parse errors; reference
the existing test name, Dictionary::decode_dict, offset_history_start, and the
enum variant DictionaryDecodeError::ZeroRepeatOffsetInDictionary when making the
change.
- Around line 43-61: The from_raw_content constructor currently accepts an empty
dict_content which yields a dictionary with no usable primed entropy; add a
guard at the start of Dictionary::from_raw_content to reject empty content (e.g.
check dict_content.is_empty()) and return an appropriate DictionaryDecodeError
variant (add a new variant like EmptyDictionaryDecode or reuse an existing
suitable error) so callers cannot create empty raw-content dictionaries; update
any error enum and tests accordingly.
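The empty-content guard can be sketched as follows; `RawDictionary`, the error variant, and the free function are illustrative stand-ins for `Dictionary::from_raw_content`, not the crate's real signatures:

```rust
#[derive(Debug, PartialEq)]
enum DictionaryDecodeError {
    EmptyDictionary,
}

#[derive(Debug, PartialEq)]
struct RawDictionary {
    id: u32,
    content: Vec<u8>,
}

/// Reject empty raw content up front: a dictionary with no bytes has
/// nothing to prime the matcher or entropy tables with.
fn from_raw_content(id: u32, content: Vec<u8>) -> Result<RawDictionary, DictionaryDecodeError> {
    if content.is_empty() {
        return Err(DictionaryDecodeError::EmptyDictionary);
    }
    Ok(RawDictionary { id, content })
}

fn main() {
    assert!(from_raw_content(5, Vec::new()).is_err());
    assert!(from_raw_content(5, b"sample history".to_vec()).is_ok());
    println!("empty raw-content dictionaries rejected");
}
```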
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 624-678: The test
dictionary_compression_roundtrips_with_dict_builder_dictionary currently only
verifies correctness and large-payload compression; add a no-dictionary baseline
and an assertion that the dict-trained compressor actually reduces size on a
small input: create a small representative payload (1–10 KB, e.g. take first N
bytes of payload or build a shorter payload) and compress it twice — once with a
FrameCompressor without calling set_dictionary and once with
set_dictionary(encoder_dict) — then compare the resulting compressed buffer
lengths and assert that the dict-compressed length is strictly smaller than the
no-dict compressed length; use the existing symbols FrameCompressor,
set_dictionary, set_source, set_drain, and compress and keep the
roundtrip/decoder checks unchanged.
- Around line 681-711: The assertions currently run after compressor.compress(),
which only checks the final encoder state; instead verify seeding immediately
after setting the dictionary and source but before any block is encoded: after
calling FrameCompressor::new(...), compressor.set_dictionary_from_bytes(...),
compressor.set_source(...), and compressor.set_drain(...), move the assertions
that check compressor.state.last_huff_table and
compressor.state.fse_tables.{ll_previous,ml_previous,of_previous} to run before
compressor.compress(); this ensures you assert the tables were seeded prior to
the first block being encoded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c996e1dc-f635-412b-a6cd-d76e2a7f8301
📒 Files selected for processing (9)
- zstd/Cargo.toml
- zstd/src/decoding/dictionary.rs
- zstd/src/decoding/errors.rs
- zstd/src/encoding/frame_compressor.rs
- zstd/src/encoding/match_generator.rs
- zstd/src/encoding/mod.rs
- zstd/src/fse/fse_decoder.rs
- zstd/src/huff0/huff0_decoder.rs
- zstd/src/huff0/huff0_encoder.rs
@coderabbitai re-review

🧠 Learnings used · ✅ Actions performed: full review triggered.
Summary

- `FrameCompressor::set_dictionary` and `set_dictionary_from_bytes`
- raw-content dictionaries (`Dictionary::from_raw_content`) for dict_builder output

Validation

Closes #8