
feat(encoding): add dictionary compression support#44

Merged
polaz merged 24 commits into main from feat/#8-dictionary-compression
Mar 29, 2026

Conversation

@polaz
Member

@polaz polaz commented Mar 28, 2026

Summary

  • add encoder dictionary support via FrameCompressor::set_dictionary and set_dictionary_from_bytes
  • write dictionary id to frame header and prime matcher state with dictionary content/history before encoding
  • keep advertised frame window size decoupled from internal dictionary-retention budget
  • cap dictionary-retention budget to bytes actually committed to matcher history (ignore short uncommittable tails)
  • add raw-content dictionary constructor (Dictionary::from_raw_content) for dict_builder output
  • add dictionary validation errors for zero dictionary id and zero repeat offsets
  • add regression tests covering:
    • dict id enforcement and roundtrip
    • C zstd decompression of dict-compressed output
    • roundtrip with dict_builder-generated dictionary
    • dictionary tail budgeting and window-size invariants
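The validation contract in the bullets above (reject a zero dictionary id and zero repeat offsets) can be sketched as a standalone check. This is a minimal illustration, not the crate's actual API: `DictError`, `RawDictionary`, and `validate` are hypothetical names.

```rust
// Hypothetical sketch of the dictionary validation contract described above.
// `DictError`, `RawDictionary`, and `validate` are illustrative, not crate APIs.
#[derive(Debug, PartialEq)]
enum DictError {
    ZeroDictionaryId,
    ZeroRepeatOffset { index: usize },
}

struct RawDictionary {
    id: u32,
    offset_hist: [u32; 3],
}

fn validate(dict: &RawDictionary) -> Result<(), DictError> {
    // Dictionary id 0 is reserved for "no dictionary" in the frame header.
    if dict.id == 0 {
        return Err(DictError::ZeroDictionaryId);
    }
    // Repeat offsets are 1-based; a zero offset would corrupt sequence decoding.
    for (index, &off) in dict.offset_hist.iter().enumerate() {
        if off == 0 {
            return Err(DictError::ZeroRepeatOffset { index });
        }
    }
    Ok(())
}

fn main() {
    let good = RawDictionary { id: 7, offset_hist: [1, 4, 8] };
    assert_eq!(validate(&good), Ok(()));
    let bad_id = RawDictionary { id: 0, offset_hist: [1, 4, 8] };
    assert_eq!(validate(&bad_id), Err(DictError::ZeroDictionaryId));
    let bad_off = RawDictionary { id: 7, offset_hist: [1, 0, 8] };
    assert_eq!(validate(&bad_off), Err(DictError::ZeroRepeatOffset { index: 1 }));
}
```

Failing fast here keeps malformed dictionaries from surviving until decode time, which is the same rationale the regression tests below exercise.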

Validation

  • cargo fmt -- --check
  • cargo build --workspace
  • cargo nextest run --workspace
  • cargo clippy -p structured-zstd --all-targets -- -D warnings
  • cargo clippy -p structured-zstd --all-targets --features dict_builder -- -D warnings
  • cargo nextest run -p structured-zstd --features dict_builder -E "test(dictionary_compression_roundtrips_with_dict_builder_dictionary) | test(dictionary_compression_sets_required_dict_id_and_roundtrips)"

Closes #8

Summary by CodeRabbit

  • New Features

    • Attach, load, clear, and advertise dictionaries for compression; compressors can prime matcher and encoder state from attached dictionaries
    • Load dictionaries from raw bytes; added encoder helpers and clonable tables for reuse
  • Bug Fixes

    • Reject invalid dictionaries (zero IDs or zero repeat offsets) with specific errors and clearer size/bounds checks; adjusted default offset history to improve safety
  • Tests

    • Unit tests for decoding, priming behavior, and non‑panic edge cases
  • Chores

    • Adjusted package exclude paths in metadata

- add FrameCompressor dictionary APIs, including parse-from-bytes helper
- write dictionary id into frame header and prime matcher with dictionary history
- support raw-content dictionaries for dict_builder outputs
- add regression tests for dict-id enforcement, C interop, and dict_builder roundtrip

Closes #8
Copilot AI review requested due to automatic review settings March 28, 2026 19:54
@coderabbitai

coderabbitai bot commented Mar 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f66e1240-bd79-4a71-92af-b36893f26fef

📥 Commits

Reviewing files that changed from the base of the PR and between 29de24a and 127d41d.

📒 Files selected for processing (9)
  • zstd/Cargo.toml
  • zstd/src/decoding/dictionary.rs
  • zstd/src/decoding/errors.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/fse/fse_decoder.rs
  • zstd/src/huff0/huff0_decoder.rs
  • zstd/src/huff0/huff0_encoder.rs

📝 Walkthrough

Walkthrough

Adds encoder-side dictionary support: new Dictionary constructor, stricter dictionary decode validation and errors, matcher priming and window-budget tracking, FrameCompressor APIs to attach/seed dictionaries and advertise dictionary_id, helpers to convert decoder tables to encoder tables, tests, and a Cargo packaging tweak.

Changes

Cohort / File(s) Summary
Dictionary decoding & errors
zstd/src/decoding/dictionary.rs, zstd/src/decoding/errors.rs
Add Dictionary::from_raw_content(id, Vec<u8>); strengthen Dictionary::decode_dict with minimum-size checks, reject dict_id == 0, require non‑zero repeat offsets, change default offset_hist to [1,4,8]; add DictionaryTooSmall, ZeroDictionaryId, ZeroRepeatOffsetInDictionary error variants and Display messages; add unit tests.
Frame compressor & public API + tests
zstd/src/encoding/frame_compressor.rs
Add dictionary and dictionary_entropy_cache fields and set_dictionary/set_dictionary_from_bytes/clear_dictionary APIs; conditionally prime matcher and seed entropy tables based on compression level and matcher capability; emit FrameHeader.dictionary_id only when priming is active; tests for id advertisement, seeding, and validation.
Match generator / Matcher trait
zstd/src/encoding/match_generator.rs, zstd/src/encoding/mod.rs
Introduce reported_window_size and dictionary_retained_budget; implement prime_with_dictionary and supports_dictionary_priming; commit dict chunks into matcher history with backend-specific tail rules; budget-driven eviction/trim loops and retirement logic; SuffixStore key hardening and eviction-report fixes; tests.
Entropy conversion & encoder clone
zstd/src/fse/fse_decoder.rs, zstd/src/huff0/huff0_decoder.rs, zstd/src/huff0/huff0_encoder.rs
Add to_encoder_table() helpers converting decoder FSE/Huffman tables to encoder-side tables (returning Option); derive Clone for encoder HuffmanTable.
Packaging metadata
zstd/Cargo.toml
Adjust package exclude to dict_tests/files/** (stop excluding dict_tests/*).
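The window/budget split described in the walkthrough (advertised frame window stays fixed while the dictionary-retention budget shrinks as primed bytes are evicted, and only bytes actually committed to matcher history count toward the budget) can be modeled in miniature. All names here are hypothetical, not the crate's real types:

```rust
// Illustrative model of the budget bookkeeping described above; the struct and
// method names are hypothetical, not the crate's real matcher types.
struct MatcherWindow {
    reported_window_size: usize,       // advertised in the frame header, never shrunk
    dictionary_retained_budget: usize, // primed dictionary bytes still held
    history_len: usize,                // bytes currently in matcher history
}

impl MatcherWindow {
    fn prime(&mut self, dict_len: usize, committable: usize) {
        // Only bytes actually committed to history count toward the budget;
        // a short uncommittable tail is ignored.
        self.dictionary_retained_budget = dict_len.min(committable);
        self.history_len += self.dictionary_retained_budget;
    }

    fn evict(&mut self, bytes: usize) {
        // Evicted bytes retire dictionary budget first.
        let from_dict = bytes.min(self.dictionary_retained_budget);
        self.dictionary_retained_budget -= from_dict;
        self.history_len = self.history_len.saturating_sub(bytes);
        // reported_window_size is deliberately left untouched.
    }
}

fn main() {
    let mut w = MatcherWindow {
        reported_window_size: 1 << 20,
        dictionary_retained_budget: 0,
        history_len: 0,
    };
    w.prime(10_000, 9_984); // the last 16 bytes are an uncommittable tail
    assert_eq!(w.dictionary_retained_budget, 9_984);
    w.evict(4_000);
    assert_eq!(w.dictionary_retained_budget, 5_984);
    assert_eq!(w.history_len, 5_984);
    assert_eq!(w.reported_window_size, 1 << 20);
}
```

Decoupling the advertised window from the retention budget means a decoder sees a stable window size even while the encoder trims primed dictionary history internally.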

Sequence Diagram

sequenceDiagram
    participant User as "User"
    participant Dict as "Dictionary"
    participant Compressor as "FrameCompressor"
    participant MatchGen as "MatchGenerator"
    participant Matcher as "Matcher"

    User->>Dict: decode(bytes) / from_raw_content(id,content)
    Dict-->>User: Dictionary | Err

    User->>Compressor: set_dictionary(Dictionary)
    User->>Compressor: compress(data)
    Compressor->>Compressor: detect attached dictionary

    alt dictionary attached and matcher supports priming
        Compressor->>MatchGen: prime_with_dictionary(dict.content, dict.offset_hist)
        MatchGen->>Matcher: commit chunks / populate hash & chains
        MatchGen->>Matcher: apply offset_hist
        Compressor->>Compressor: seed previous Huffman & FSE tables
        Compressor->>Compressor: set FrameHeader.dictionary_id
    end

    Compressor-->>User: compressed bytes

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰
I nibble bytes and line them neat,
I prime the matcher, set repeats complete.
I stash the tables, hum and hum,
Frames wear IDs before they run.
Tiny bytes compress — hop, yum!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the title 'feat(encoding): add dictionary compression support' directly and clearly summarizes the main change: adding dictionary compression to the encoding module)
  • Linked Issues Check: ✅ Passed (all acceptance criteria from issue #8 are met: FrameCompressor accepts dictionaries via set_dictionary/set_dictionary_from_bytes, the dictionary ID is written to the frame header, C zstd decompression is tested, dict_builder roundtrips are tested, and offset history/entropy tables are properly initialized)
  • Out of Scope Changes Check: ✅ Passed (all changes are directly related to implementing dictionary compression support as specified in issue #8; no unrelated modifications or scope creep detected)
  • Docstring Coverage: ✅ Passed (docstring coverage is 97.26%, above the required 80.00% threshold)



@codecov

codecov bot commented Mar 28, 2026


Copilot AI left a comment


Pull request overview

Adds dictionary compression support to the encoder pipeline, enabling frames to be compressed with a provided Zstd dictionary and advertising the dictionary ID in the frame header for decoder interoperability.

Changes:

  • Extend the Matcher trait + default matcher to support priming matcher state from dictionary history/content.
  • Add dictionary attachment APIs to FrameCompressor and emit dictionary_id in the encoded frame header.
  • Add dictionary constructors/validation (including rejecting dictionary ID 0) and new regression tests for dictionary-compressed roundtrips + zstd-ffi interop.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
zstd/src/encoding/mod.rs Adds prime_with_dictionary hook to the Matcher trait.
zstd/src/encoding/match_generator.rs Implements dictionary priming for the default matcher backends and sets repeat-offset history.
zstd/src/encoding/frame_compressor.rs Stores an attached dictionary, primes state per frame, writes dict ID in header, and adds dictionary compression tests.
zstd/src/decoding/errors.rs Adds ZeroDictionaryId decode error variant and Display formatting.
zstd/src/decoding/dictionary.rs Adds Dictionary::from_raw_content and rejects zero dictionary IDs during decode.


@sw-release-bot sw-release-bot bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
This commit's benchmark results are worse than the previous run's, exceeding the 1.15 regression threshold.

Benchmark suite (Current: 127d41d vs Previous: 29de24a, Ratio)
  • compress/fastest/small-4k-log-lines/matrix/pure_rust: 0.035 ms vs 0.026 ms (1.35)
  • compress/fastest/high-entropy-1m/matrix/c_ffi: 0.309 ms vs 0.267 ms (1.16)
  • compress/default/low-entropy-1m/matrix/c_ffi: 0.233 ms vs 0.197 ms (1.18)
  • decompress/fastest/small-10k-random/rust_stream/matrix/c_ffi: 0.002 ms vs 0.001 ms (2)
  • decompress/default/small-10k-random/rust_stream/matrix/c_ffi: 0.002 ms vs 0.001 ms (2)

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

polaz added 2 commits March 28, 2026 22:13
- reject dictionary id 0 in FrameCompressor::set_dictionary

- return explicit DictionaryDecodeError on undersized dictionary buffers

- keep dict_tests assets in crate package so include_bytes tests compile downstream

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/decoding/dictionary.rs (1)

77-83: ⚠️ Potential issue | 🟠 Major

Critical: Initial offset_hist uses incorrect default value.

Line 82 initializes offset_hist to [2, 4, 8], but the ZSTD specification (RFC 8878 §3.1.2.5) mandates the default repeat offsets are [1, 4, 8]. This is inconsistent with:

  • from_raw_content at line 60 which uses [1, 4, 8]
  • DecoderScratch::new() in scratch.rs:44 which uses [1, 4, 8]

This value is overwritten later (lines 136-138) when parsing a valid dictionary, so it only affects the intermediate state. However, if any error path or future refactoring reads this before it's overwritten, it would produce incorrect offset decoding.

Proposed fix
         let mut new_dict = Dictionary {
             id: 0,
             fse: FSEScratch::new(),
             huf: HuffmanScratch::new(),
             dict_content: Vec::new(),
-            offset_hist: [2, 4, 8],
+            offset_hist: [1, 4, 8],
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/decoding/dictionary.rs` around lines 77 - 83, The Dictionary struct
is initialized with the wrong default repeat offsets; change the initial
offset_hist in the Dictionary::new-path (where Dictionary is constructed with
FSEScratch::new() and HuffmanScratch::new()) from [2, 4, 8] to the correct
RFC8878 defaults [1, 4, 8] so it matches from_raw_content and
DecoderScratch::new(), ensuring any interim/error-path reads see the correct
repeat offsets; update the offset_hist field in that Dictionary initialization
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 163-173: The test function
decode_dict_malformed_input_returns_error_instead_of_panicking contains assert!
macro calls with formatting that fails cargo fmt; reformat those assert!
invocations to adhere to rustfmt (put the condition and message on the same line
or follow rustfmt's preferred multi-line style) so the file passes cargo fmt;
locate the test function and update the two assert! calls around
Dictionary::decode_dict(&raw) and result.unwrap().is_err() (and any related
string literals referencing decode_dict or malformed dictionary messages) to be
formatted properly.

In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 164-169: Add a short clarifying comment explaining why offset_hist
is assigned twice: once to FrameCompressor.state.offset_hist (used during
sequence encoding) and again via Matcher.prime_with_dictionary (which sets the
matcher's internal offset history used for match generation); place the comment
near the block in frame_compressor.rs where self.dictionary is primed (around
the self.state.offset_hist assignment and matcher.prime_with_dictionary call)
and mention both uses (sequence encoding vs match generation) and reference the
match_generator behavior so future maintainers understand the intentional dual
assignment.
- Around line 318-327: The method FrameCompressor::set_dictionary currently
panics via assert when dictionary.id == 0; change it to return a Result to match
set_dictionary_from_bytes so callers can handle the zero-ID error: replace the
assert in set_dictionary with a check that returns an Err variant (introduce or
reuse a DictionaryError/FrameCompressorError variant like InvalidDictionaryId)
when id == 0 and return Ok(previous_dictionary_opt) on success, and update
callers/tests accordingly to handle Result; keep the function name
set_dictionary and its return semantics (previous Option<Dictionary>) wrapped in
Result<Option<Dictionary>, YourErrorType> so API usage mirrors
set_dictionary_from_bytes.

---

Outside diff comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 77-83: The Dictionary struct is initialized with the wrong default
repeat offsets; change the initial offset_hist in the Dictionary::new-path
(where Dictionary is constructed with FSEScratch::new() and
HuffmanScratch::new()) from [2, 4, 8] to the correct RFC8878 defaults [1, 4, 8]
so it matches from_raw_content and DecoderScratch::new(), ensuring any
interim/error-path reads see the correct repeat offsets; update the offset_hist
field in that Dictionary initialization accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 364b693b-b273-4d81-b744-5170a275b26f

📥 Commits

Reviewing files that changed from the base of the PR and between f45a5a8 and 373eef0.

📒 Files selected for processing (4)
  • zstd/Cargo.toml
  • zstd/src/decoding/dictionary.rs
  • zstd/src/decoding/errors.rs
  • zstd/src/encoding/frame_compressor.rs


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/decoding/dictionary.rs (1)

77-83: ⚠️ Potential issue | 🟡 Minor

Inconsistent default offset_hist initialization.

Line 82 initializes offset_hist to [2, 4, 8], but:

  • from_raw_content (line 60) uses [1, 4, 8]
  • CompressState in frame_compressor.rs (lines 108, 127, 163) uses [1, 4, 8]
  • RFC 8878 §3.1.2.5 specifies the default as [1, 4, 8]

This value gets overwritten by lines 136-138 for fully-parsed dictionaries, so it only affects the transient state. However, for consistency and correctness if the overwrite logic ever changes, this should match the RFC default.

Proposed fix
         let mut new_dict = Dictionary {
             id: 0,
             fse: FSEScratch::new(),
             huf: HuffmanScratch::new(),
             dict_content: Vec::new(),
-            offset_hist: [2, 4, 8],
+            offset_hist: [1, 4, 8],
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/decoding/dictionary.rs` around lines 77 - 83, The Dictionary
struct's transient default for offset_hist is inconsistent (currently [2, 4, 8])
and should match RFC 8878 and other code paths; change the default
initialization in Dictionary::new (the Dictionary literal created in the
constructor at the diff) to [1, 4, 8] so it matches from_raw_content and the
CompressState usage; ensure you update the offset_hist field in the Dictionary
instantiation (symbol: Dictionary, field: offset_hist) to [1, 4, 8] to keep
behavior consistent with RFC 8878 §3.1.2.5.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 314-327: The method FrameCompressor::set_dictionary currently
asserts (panics) when dictionary.id == 0; change it to return a Result instead
to match set_dictionary_from_bytes and avoid panics for a public API: replace
the assert_ne! with an early Err return (e.g.,
Err(crate::error::InvalidDictionaryId) or an appropriate crate error type) and
change the signature to return Result<Option<crate::decoding::Dictionary>, _>,
keeping the existing behavior of self.dictionary.replace(dictionary) as
Ok(Some/None) on success; update callers/tests accordingly to handle the Result.

---

Outside diff comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 77-83: The Dictionary struct's transient default for offset_hist
is inconsistent (currently [2, 4, 8]) and should match RFC 8878 and other code
paths; change the default initialization in Dictionary::new (the Dictionary
literal created in the constructor at the diff) to [1, 4, 8] so it matches
from_raw_content and the CompressState usage; ensure you update the offset_hist
field in the Dictionary instantiation (symbol: Dictionary, field: offset_hist)
to [1, 4, 8] to keep behavior consistent with RFC 8878 §3.1.2.5.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7c3e769b-37d3-496f-87aa-3b7424082751

📥 Commits

Reviewing files that changed from the base of the PR and between f45a5a8 and 05a712c.

📒 Files selected for processing (4)
  • zstd/Cargo.toml
  • zstd/src/decoding/dictionary.rs
  • zstd/src/decoding/errors.rs
  • zstd/src/encoding/frame_compressor.rs

- use RFC default repeat offsets [1,4,8] in decode_dict initialization

- document intentional dual offset history priming in compressor state and matcher

- document fail-fast zero-id contract for set_dictionary

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

- assert first set_dictionary_from_bytes insert returns None

- explicitly discard optional previous dictionary in dict_builder roundtrip test

- exclude dict_tests/files/** while keeping dict_tests/dictionary for include_bytes tests

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/decoding/dictionary.rs (1)

120-138: ⚠️ Potential issue | 🟠 Major

Reject zero repeat offsets during dictionary parsing.

The new length guard prevents the slice panic, but decode_dict() still accepts offset1..offset3 == 0 verbatim. Those values are copied straight into live decoder state by zstd/src/decoding/scratch.rs, and zstd/src/decoding/sequence_execution.rs treats them as repeat offsets (scratch[0] - 1 on one branch). A malformed dictionary can therefore survive parsing and only fail much later when it is actually used. Please validate all three parsed repeat offsets here and return a decode error if any are zero.
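The failure mode described here can be demonstrated in isolation: the repeat-offset branch effectively computes `offset - 1`, which underflows on a zero offset, so rejecting zero at parse time converts a late panic into an early, explicit error. The snippet below is a self-contained illustration, not the crate's actual decoder code:

```rust
// Self-contained illustration (not the crate's decoder code) of why a zero
// repeat offset must be rejected when the dictionary is parsed.
fn main() {
    // A malformed dictionary could carry a zero in the parsed offset history.
    let parsed_offsets: [usize; 3] = [0, 4, 8];

    // The decoder's repeat-offset branch effectively computes `offset - 1`
    // (the `scratch[0] - 1` mentioned above); on zero this underflows.
    assert_eq!(parsed_offsets[0].checked_sub(1), None);

    // Validating at parse time surfaces the problem immediately instead:
    let result: Result<(), &str> = if parsed_offsets.iter().any(|&o| o == 0) {
        Err("zero repeat offset in dictionary")
    } else {
        Ok(())
    };
    assert!(result.is_err());
}
```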

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/decoding/dictionary.rs` around lines 120 - 138, In decode_dict
(dictionary.rs) validate the three parsed repeat offsets (offset1, offset2,
offset3) after converting from raw_tables and before assigning into
new_dict.offset_hist: if any offset == 0 return a DictionaryDecodeError
indicating an invalid/zero repeat offset (e.g.,
Err(DictionaryDecodeError::InvalidRepeatOffset { index: <0|1|2>, got: 0 }) or
the crate's closest error variant) instead of accepting them verbatim; this
prevents zero offsets from being copied into new_dict.offset_hist and later used
by scratch.rs / sequence_execution.rs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 164-170: When a dictionary is present, you must also restore the
encoder's previous entropy tables so the first block can use the "repeat
previous table" path: after priming offsets in the branch that checks
self.dictionary.as_ref(), detect whether dict.huf and dict.fse are populated
and, if so, set self.state.last_huff_table to the dictionary's Huffman tables
and set self.state.fse_tables.ll_previous, self.state.fse_tables.ml_previous,
and self.state.fse_tables.of_previous from dict.fse (or the corresponding fields
on dict) so the encoder's previous-table state matches the parsed dictionary
entropy tables.
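The seeding step described in this comment can be modeled generically: when the dictionary carries entropy tables, they are copied into the encoder's previous-table slots so the first block may take the repeat-previous-table path. The types and field names below are hypothetical stand-ins, not the crate's real state structs:

```rust
// Hypothetical model of "seed previous tables from the dictionary"; the types
// and field names are illustrative, not the crate's real encoder state.
#[derive(Clone, Debug, PartialEq)]
struct Table(Vec<u16>);

#[derive(Default)]
struct EncoderState {
    last_huff_table: Option<Table>, // previous Huffman table slot
    ll_previous: Option<Table>,     // previous literal-length FSE table slot
}

struct Dict {
    huf: Option<Table>,
    ll: Option<Table>,
}

fn seed(state: &mut EncoderState, dict: &Dict) {
    // Only populated dictionary tables overwrite the previous-table slots,
    // so an absent table leaves the encoder state unchanged.
    if dict.huf.is_some() {
        state.last_huff_table.clone_from(&dict.huf);
    }
    if dict.ll.is_some() {
        state.ll_previous.clone_from(&dict.ll);
    }
}

fn main() {
    let mut state = EncoderState::default();
    let dict = Dict { huf: Some(Table(vec![1, 2])), ll: None };
    seed(&mut state, &dict);
    assert_eq!(state.last_huff_table, Some(Table(vec![1, 2])));
    assert_eq!(state.ll_previous, None);
}
```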

---

Outside diff comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 120-138: In decode_dict (dictionary.rs) validate the three parsed
repeat offsets (offset1, offset2, offset3) after converting from raw_tables and
before assigning into new_dict.offset_hist: if any offset == 0 return a
DictionaryDecodeError indicating an invalid/zero repeat offset (e.g.,
Err(DictionaryDecodeError::InvalidRepeatOffset { index: <0|1|2>, got: 0 }) or
the crate's closest error variant) instead of accepting them verbatim; this
prevents zero offsets from being copied into new_dict.offset_hist and later used
by scratch.rs / sequence_execution.rs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f5742492-88f2-444d-9408-59e1818f8aae

📥 Commits

Reviewing files that changed from the base of the PR and between 05a712c and 9093834.

📒 Files selected for processing (3)
  • zstd/Cargo.toml
  • zstd/src/decoding/dictionary.rs
  • zstd/src/encoding/frame_compressor.rs

- restore previous Huffman/FSE encoder tables from parsed dictionaries before first block

- convert decoder-side entropy tables into encoder tables for dictionary priming

- reject zero repeat offsets during dictionary parsing with explicit decode error

- add regression tests for entropy seeding and zero-repeat-offset rejection

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

- precompute decoder->encoder entropy conversions when dictionary is set

- reuse cached tables across compress() calls to avoid per-frame rebuild

- keep explicit fail-fast comment for zero dictionary id API contract

- derive Clone for encoder HuffmanTable to support cache reuse

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

- clarify set_dictionary docs for uncompressed and non-priming matchers
- track retained dictionary budget separately from advertised live window
- shrink matcher capacity as primed dictionary bytes are evicted
- add regression tests for simple and dfast budget retirement

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

- seed huffman table directly via Option::clone_from from cached entropy
- cache FSE previous tables as PreviousFseTable to avoid per-frame reboxing
- remove temporary clone/map allocations in dictionary seeding path

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 196-209: The current seeding creates an extra temporary by calling
cached_entropy.and_then(|cache| cache.ll_previous.clone()) before clone_from;
instead, avoid the intermediate clone by matching borrowed cached_entropy and
passing a reference to the inner Option to clone_from: e.g. if let Some(cache) =
&cached_entropy {
self.state.fse_tables.ll_previous.clone_from(&cache.ll_previous); } else {
self.state.fse_tables.ll_previous.clone_from(&None); } and do the same for
ml_previous and of_previous so clone_from operates on a borrowed Option
reference rather than a freshly cloned value.
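The pattern the comment suggests can be shown standalone: `Option::clone_from` on a borrowed `Option` can reuse the destination's existing allocation, whereas `and_then(...).clone()` materializes a throwaway value first. The `Cache` struct and field names below are illustrative:

```rust
// Standalone illustration of the suggested clone_from pattern; `Cache` and the
// field names are hypothetical stand-ins for the compressor's cached entropy.
struct Cache {
    ll_previous: Option<Vec<u8>>,
}

fn main() {
    // Destination slot that may already hold an allocation worth reusing.
    let mut ll_previous: Option<Vec<u8>> = Some(vec![0u8; 64]);

    let cached_entropy: Option<Cache> = Some(Cache {
        ll_previous: Some(vec![1, 2, 3]),
    });

    // Suggested form: clone_from a borrowed inner Option, no intermediate clone.
    if let Some(cache) = &cached_entropy {
        ll_previous.clone_from(&cache.ll_previous);
    } else {
        ll_previous = None;
    }

    assert_eq!(ll_previous, Some(vec![1, 2, 3]));
}
```

`clone_from` is the `Clone` trait's in-place counterpart to `clone`; for `Option<Vec<u8>>` it can overwrite the existing buffer instead of allocating a fresh one.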

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 206b40f6-cc75-4c1d-8bb3-91e38f82131c

📥 Commits

Reviewing files that changed from the base of the PR and between 5da6696 and 702c64d.

📒 Files selected for processing (1)
  • zstd/src/encoding/frame_compressor.rs

- keep dfast eviction callbacks on logical slice length, not vec capacity
- add regression tests for add_data/trim_to_window eviction length semantics
- remove intermediate Option clones in FSE dictionary seeding path

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

@polaz polaz requested a review from Copilot March 29, 2026 14:56
@polaz
Member Author

polaz commented Mar 29, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai bot commented Mar 29, 2026

@polaz Sure, I'll re-review the PR now!


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.
✅ Actions performed

Full review triggered.


Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/dictionary.rs`:
- Around line 233-245: Update the test decode_dict_rejects_zero_repeat_offsets
to assert the exact error variant instead of just is_err(): call
Dictionary::decode_dict(&raw), unwrap the Err and match it against
DictionaryDecodeError::ZeroRepeatOffsetInDictionary { index: 0 } (or use
assert_eq! on the error) so the test fails for unrelated parse errors; reference
the existing test name, Dictionary::decode_dict, offset_history_start, and the
enum variant DictionaryDecodeError::ZeroRepeatOffsetInDictionary when making the
change.
- Around lines 43-61: The `from_raw_content` constructor currently accepts an empty `dict_content`, which yields a dictionary with no usable primed entropy. Add a guard at the start of `Dictionary::from_raw_content` that rejects empty content (e.g. via `dict_content.is_empty()`) and returns an appropriate `DictionaryDecodeError` variant (add a new one such as `EmptyDictionaryDecode`, or reuse an existing suitable error) so callers cannot create empty raw-content dictionaries; update the error enum and tests accordingly.
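A minimal sketch of the suggested guard. `EmptyDictionaryContent` is a hypothetical variant name (the review proposes something like `EmptyDictionaryDecode`), and the types here are stand-ins for the crate's real ones:

```rust
// Sketch of the suggested guard. `EmptyDictionaryContent` is a hypothetical
// variant (the review suggests a name like `EmptyDictionaryDecode`); the
// crate's real types differ.
#[derive(Debug, PartialEq)]
enum DictionaryDecodeError {
    EmptyDictionaryContent,
}

#[derive(Debug)]
struct Dictionary {
    dict_content: Vec<u8>,
}

impl Dictionary {
    fn from_raw_content(dict_content: &[u8]) -> Result<Self, DictionaryDecodeError> {
        // Reject empty content up front so callers can never build a
        // raw-content dictionary with no usable primed entropy.
        if dict_content.is_empty() {
            return Err(DictionaryDecodeError::EmptyDictionaryContent);
        }
        Ok(Dictionary { dict_content: dict_content.to_vec() })
    }
}

fn main() {
    assert_eq!(
        Dictionary::from_raw_content(&[]).unwrap_err(),
        DictionaryDecodeError::EmptyDictionaryContent
    );
    assert!(Dictionary::from_raw_content(b"sample content").is_ok());
}
```

Rejecting at the constructor keeps the input-boundary validation consistent with the rejections `decode_dict` already applies.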

In `@zstd/src/encoding/frame_compressor.rs`:
- Around lines 624-678: The test `dictionary_compression_roundtrips_with_dict_builder_dictionary` only verifies correctness and large-payload compression. Add a no-dictionary baseline plus an assertion that the dict-trained compressor actually reduces size on a small input: build a small representative payload (1-10 KB, e.g. the first N bytes of `payload` or a shorter variant), compress it twice — once with a `FrameCompressor` that never calls `set_dictionary`, and once after `set_dictionary(encoder_dict)` — then assert the dict-compressed buffer is strictly smaller than the no-dict one. Use the existing `FrameCompressor`, `set_dictionary`, `set_source`, `set_drain`, and `compress` symbols, and keep the roundtrip/decoder checks unchanged.
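The rationale behind that baseline comparison can be illustrated with a toy LZ-style matcher. This is not the PR's zstd encoder — just a self-contained sketch of why a window pre-seeded with dictionary content shrinks small payloads that resemble the training data:

```rust
// Toy LZ-style "compressor": greedy longest-match against a window that may
// be pre-seeded with a dictionary. It returns a pretend output size
// (3 bytes per match, 1 per literal) rather than real compressed bytes.
fn toy_compress(dict: &[u8], input: &[u8]) -> usize {
    let mut window: Vec<u8> = dict.to_vec();
    let mut cost = 0usize;
    let mut i = 0;
    while i < input.len() {
        // Find the longest match of input[i..] anywhere in the window.
        let mut best = 0;
        for start in 0..window.len() {
            let mut len = 0;
            while start + len < window.len()
                && i + len < input.len()
                && window[start + len] == input[i + len]
            {
                len += 1;
            }
            best = best.max(len);
        }
        if best >= 4 {
            cost += 3; // a match reference
            window.extend_from_slice(&input[i..i + best]);
            i += best;
        } else {
            cost += 1; // a literal byte
            window.push(input[i]);
            i += 1;
        }
    }
    cost
}

fn main() {
    let dict = b"GET /api/v1/users HTTP/1.1\r\nHost: example.com\r\n";
    let payload = b"GET /api/v1/users HTTP/1.1\r\nHost: example.com\r\nX: 1\r\n";
    let no_dict = toy_compress(&[], payload);
    let with_dict = toy_compress(dict, payload);
    // A small payload resembling the dictionary compresses strictly smaller
    // with the pre-seeded window -- the property the suggested test asserts.
    assert!(with_dict < no_dict);
}
```

On a small payload there is too little self-similarity for the matcher to exploit, so the dictionary-seeded run wins; that gap is exactly what the strict-inequality assertion in the suggested test would catch.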
- Around lines 681-711: The assertions currently run after `compressor.compress()`, which only checks the final encoder state. Instead, verify seeding immediately after the dictionary and source are set but before any block is encoded: after calling `FrameCompressor::new(...)`, `compressor.set_dictionary_from_bytes(...)`, `compressor.set_source(...)`, and `compressor.set_drain(...)`, move the assertions on `compressor.state.last_huff_table` and `compressor.state.fse_tables.{ll_previous, ml_previous, of_previous}` so they run before `compressor.compress()`. This proves the tables were seeded prior to the first block being encoded, not merely left in that state afterwards.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c996e1dc-f635-412b-a6cd-d76e2a7f8301

📥 Commits

Reviewing files that changed from the base of the PR and between 29de24a and f11ac36.

📒 Files selected for processing (9)
  • zstd/Cargo.toml
  • zstd/src/decoding/dictionary.rs
  • zstd/src/decoding/errors.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/fse/fse_decoder.rs
  • zstd/src/huff0/huff0_decoder.rs
  • zstd/src/huff0/huff0_encoder.rs

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@polaz
Member Author

polaz commented Mar 29, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai bot commented Mar 29, 2026

@polaz Sure, I'll re-review the PR now!


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: the `set_dictionary` Result-based validation contract (quoted in full in the earlier full-review comment).
✅ Actions performed

Full review triggered.

@polaz polaz merged commit 13b0866 into main Mar 29, 2026
14 of 15 checks passed
@polaz polaz deleted the feat/#8-dictionary-compression branch March 29, 2026 16:52
@sw-release-bot sw-release-bot bot mentioned this pull request Mar 28, 2026


Development

Successfully merging this pull request may close these issues.

feat: dictionary compression support

2 participants