feat: dictionary compression support #8

@polaz

Summary

Dictionary decompression and dictionary building (the dict_builder feature) both work, but dictionary compression is not implemented. This is critical for CoordiNode's per-label trained dictionaries in the LSM-tree, where small values compress much better against a shared dictionary.

Current state

  • frame_compressor.rs:148 — dictionary ID field set to None, no dict integration
  • encoding/blocks/compressed.rs:27 — offset history hardcoded to [1, 4, 8], not loaded from dictionary
  • encoding/blocks/compressed.rs:54 — FSE table reuse not implemented

C reference implementation

Dictionary compression flow (zstd_compress.c)

  1. Load dictionary — parse magic, extract Huffman table, FSE tables, offset history, raw content
  2. Initialize matcher — prefill hash/chain tables with dictionary content positions
  3. Set initial state — offset history from dict (rep[0..3]), entropy tables from dict
  4. Frame header — write dictionary ID field
  5. First block — can reference dictionary content via offsets
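Step 1 of the flow above can be sketched in Rust as follows. This is a minimal illustration, not the crate's existing decoder-side parser: `DictPrefix`, `DictError`, `parse_dict_prefix`, and `read_rep_offsets` are hypothetical names. It only handles the fixed-size parts of the Zstandard dictionary format (the 4-byte magic `0xEC30A437`, the 4-byte dictionary ID, and the three 4-byte repeat offsets that follow the entropy tables). Decoding the Huffman and FSE tables themselves is omitted, so the caller is assumed to know where the entropy section ends.

```rust
/// Magic number that opens a formatted zstd dictionary (stored little-endian).
const DICT_MAGIC: u32 = 0xEC30_A437;

/// Hypothetical result type; the field names are illustrative only.
#[derive(Debug, PartialEq)]
struct DictPrefix<'a> {
    id: u32,
    /// Everything after the 8-byte prefix: entropy tables, then the three
    /// repeat offsets, then the raw dictionary content.
    rest: &'a [u8],
}

#[derive(Debug, PartialEq)]
enum DictError {
    TooShort,
    BadMagic(u32),
    ZeroId,
}

/// Validates the magic number and extracts the dictionary ID.
fn parse_dict_prefix(raw: &[u8]) -> Result<DictPrefix<'_>, DictError> {
    if raw.len() < 8 {
        return Err(DictError::TooShort);
    }
    let magic = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
    if magic != DICT_MAGIC {
        return Err(DictError::BadMagic(magic));
    }
    let id = u32::from_le_bytes([raw[4], raw[5], raw[6], raw[7]]);
    if id == 0 {
        // The format reserves 0 to mean "no dictionary ID".
        return Err(DictError::ZeroId);
    }
    Ok(DictPrefix { id, rest: &raw[8..] })
}

/// Reads the three little-endian repeat offsets that follow the entropy
/// tables. `after_entropy` must start at the first byte past those tables.
fn read_rep_offsets(after_entropy: &[u8]) -> Option<[u32; 3]> {
    let bytes = after_entropy.get(..12)?;
    let mut reps = [0u32; 3];
    for (rep, chunk) in reps.iter_mut().zip(bytes.chunks_exact(4)) {
        *rep = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    Some(reps)
}
```

The offsets returned by `read_rep_offsets` would replace the hardcoded `[1, 4, 8]` as the encoder's initial offset history.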

Key functions

  • ZSTD_compress_insertDictionary() — main entry point
  • ZSTD_loadCEntropy() — parse entropy tables from dict header
  • ZSTD_loadDictionaryContent() — fill hash/chain tables with dict positions
  • ZSTD_CCtx_refCDict() — reference pre-built dictionary
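The role of ZSTD_loadDictionaryContent() can be illustrated with a deliberately simplified Rust sketch. It is not the C implementation and not this crate's matcher: `PrefilledTable` and its methods are hypothetical, and a `HashMap` keyed on 4-byte sequences stands in for the real hash/chain tables. The point it shows is that dictionary positions are inserted before any input is seen, and that a match found in the dictionary is expressed as an offset measured as if the dictionary content sat directly in front of the input.

```rust
use std::collections::HashMap;

/// Shortest sequence the sketch indexes (an assumption for this example).
const MIN_MATCH: usize = 4;

/// Hypothetical match table that has been prefilled with dictionary content.
struct PrefilledTable {
    /// Maps each 4-byte sequence to its most recent position in the dictionary.
    positions: HashMap<[u8; MIN_MATCH], usize>,
    dict_len: usize,
}

impl PrefilledTable {
    /// Inserts every 4-byte window of the dictionary content into the table.
    fn from_dict_content(content: &[u8]) -> Self {
        let mut positions = HashMap::new();
        for (pos, w) in content.windows(MIN_MATCH).enumerate() {
            // Later positions overwrite earlier ones, so the table prefers
            // the occurrence closest to the start of the input.
            positions.insert([w[0], w[1], w[2], w[3]], pos);
        }
        Self { positions, dict_len: content.len() }
    }

    /// Returns the backward offset from `input_pos` to a dictionary match,
    /// treating the dictionary as if it immediately preceded the input.
    fn offset_for(&self, input: &[u8], input_pos: usize) -> Option<usize> {
        let w = input.get(input_pos..input_pos + MIN_MATCH)?;
        let dict_pos = *self.positions.get(&[w[0], w[1], w[2], w[3]])?;
        Some(self.dict_len - dict_pos + input_pos)
    }
}
```

For a dictionary containing `hello world`, an input starting with `world` would resolve to an offset of 5, which is how the first block can reference dictionary content.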

What needs to be implemented

  1. Dictionary loading in encoder — parse dict format, extract tables + content
  2. Matcher prefill — insert dictionary content positions into hash tables
  3. Initial offset history — load [rep0, rep1, rep2] from dictionary instead of [1, 4, 8]
  4. Initial entropy tables — use Huffman/FSE tables from dictionary for first block
  5. Frame header dict ID — write dictionary ID when dict is used
  6. FrameCompressor API — method to attach dictionary before compression
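Item 5 is the most mechanical of these and can be sketched on its own. In the Zstandard frame format the two lowest bits of the Frame_Header_Descriptor hold the Dictionary_ID_flag, which selects a field size of 0, 1, 2, or 4 bytes, and the ID itself is written little-endian. The function name `encode_dict_id_field` is hypothetical, and choosing the smallest field that fits is one reasonable policy, not something the issue prescribes.

```rust
/// Returns the Dictionary_ID_flag (bits 0-1 of the Frame_Header_Descriptor)
/// together with the little-endian bytes of the Dictionary_ID field.
///
/// Hypothetical helper: picks the smallest field size that can hold the ID.
fn encode_dict_id_field(dict_id: Option<u32>) -> (u8, Vec<u8>) {
    match dict_id {
        // No dictionary, or ID 0 (which the format treats as "no ID"):
        // flag 0 and an empty field.
        None | Some(0) => (0, Vec::new()),
        // Flag 1: one byte.
        Some(id) if id <= 0xFF => (1, vec![id as u8]),
        // Flag 2: two bytes.
        Some(id) if id <= 0xFFFF => (2, (id as u16).to_le_bytes().to_vec()),
        // Flag 3: four bytes.
        Some(id) => (3, id.to_le_bytes().to_vec()),
    }
}
```

The returned flag would be OR-ed into the descriptor byte, and the field bytes written after the Window_Descriptor (when present) and before the Frame_Content_Size.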

Acceptance criteria

  • FrameCompressor accepts dictionary via new API method
  • Compressed output includes dictionary ID in frame header
  • C zstd can decompress dict-compressed output
  • structured-zstd can decompress own dict-compressed output
  • Compression ratio on small values (1-10 KB) improves significantly with a trained dictionary compared with compression without one
  • Roundtrip test with dict_builder-generated dictionaries

Time estimate

3d

Labels

P1-high (high priority, core functionality), enhancement (new feature or request), performance (performance optimization)
