v0.7.0 — dict-preserving BYTE_ARRAY column reader

@ryan-evans-git ryan-evans-git released this 16 May 16:57
· 83 commits to main since this release
8801a7d

Highlights

  • New API: read_column_byte_array_dict_preserved (plus an _into variant) returns (dict_bytes, dict_offsets, indices) from a single chunk-decode pass, with no per-row materialisation.

This lets Arrow consumers assemble DictionaryArray<UInt32, Utf8|Binary> directly, so downstream operators (filter / group-by / join) can stay on dict codes rather than paying for a gather and a hash at every operator boundary.
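
To make the shape of the result concrete, here is a minimal sketch of how the three buffers relate. It does not call the crate: the DictColumn struct, its field types (u32 offsets and indices), and the Arrow-style dict_len + 1 offsets layout are assumptions for illustration only, and the real return types may differ.

```rust
/// Hypothetical container for the three buffers named in the release notes.
/// Field types and the offsets layout are assumptions, not the crate's API.
struct DictColumn {
    dict_bytes: Vec<u8>,    // all dictionary values, concatenated
    dict_offsets: Vec<u32>, // assumed Arrow-style: dict_len + 1 entries
    indices: Vec<u32>,      // one dictionary code per row
}

impl DictColumn {
    /// Number of distinct dictionary entries.
    fn dict_len(&self) -> usize {
        self.dict_offsets.len().saturating_sub(1)
    }

    /// Bytes of the dictionary entry with the given code.
    fn entry(&self, code: u32) -> &[u8] {
        let start = self.dict_offsets[code as usize] as usize;
        let end = self.dict_offsets[code as usize + 1] as usize;
        &self.dict_bytes[start..end]
    }

    /// Group-by count that works purely on codes: the value bytes are never
    /// read or hashed. Assumes indices were already validated (< dict_len).
    fn group_count(&self) -> Vec<u64> {
        let mut counts = vec![0u64; self.dict_len()];
        for &code in &self.indices {
            counts[code as usize] += 1;
        }
        counts
    }
}

/// Small hand-built column: dictionary ["apple", "kiwi", "fig"], five rows.
fn example_column() -> DictColumn {
    DictColumn {
        dict_bytes: b"applekiwifig".to_vec(),
        dict_offsets: vec![0, 5, 9, 12],
        indices: vec![0, 2, 0, 1, 0],
    }
}
```

The point of the sketch is that group_count touches only indices; the string bytes are looked up via entry only when a result has to be rendered.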

Behaviour

  • Errors deterministically when the chunk has no DictionaryPage, or when any data page falls back to PLAIN after a dictionary page (the writer fell back, so the chunk can't be expressed as one dict + indices). Callers can react by falling back to read_column_byte_array_offsets.
  • Validates every index < dict_len once in the cold path so downstream gather can stay branch-free.
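
The two behaviours above can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: DictReadError, its variants, validate_indices, Decoded, and read_with_fallback are all hypothetical names, and the two readers are passed in as closures so the sketch does not guess at the real function signatures.

```rust
/// Hypothetical error type; the crate's real error variants are not shown
/// in the release notes.
#[derive(Debug, PartialEq)]
enum DictReadError {
    NoDictionaryPage,
    PlainFallback,
    IndexOutOfRange { row: usize, index: u32, dict_len: usize },
}

/// One-time bounds check over all indices. Once this passes, later gathers
/// can index the dictionary without a per-row range branch of their own.
fn validate_indices(indices: &[u32], dict_len: usize) -> Result<(), DictReadError> {
    for (row, &index) in indices.iter().enumerate() {
        if index as usize >= dict_len {
            return Err(DictReadError::IndexOutOfRange { row, index, dict_len });
        }
    }
    Ok(())
}

/// Which reader path produced the column.
#[derive(Debug, PartialEq)]
enum Decoded<D, O> {
    Dict(D),
    Offsets(O),
}

/// Try the dict-preserving read first. If the chunk cannot be expressed as
/// one dictionary plus indices, fall back to the plain offsets read.
/// Any other error is passed through unchanged.
fn read_with_fallback<D, O>(
    dict_read: impl FnOnce() -> Result<D, DictReadError>,
    offsets_read: impl FnOnce() -> O,
) -> Result<Decoded<D, O>, DictReadError> {
    match dict_read() {
        Ok(dict) => Ok(Decoded::Dict(dict)),
        Err(DictReadError::NoDictionaryPage | DictReadError::PlainFallback) => {
            Ok(Decoded::Offsets(offsets_read()))
        }
        Err(other) => Err(other),
    }
}
```

In a real caller, dict_read would wrap read_column_byte_array_dict_preserved and offsets_read would wrap read_column_byte_array_offsets. Because the error is deterministic, the fallback decision is the same on every run for a given chunk.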

Why

Pairs with parquet-rs's ArrowReaderOptions::with_schema(Dictionary(...)) path on consumers like ematix-flow's FastParquet, giving the Emat reader dict-preservation parity. Unblocks Σ.E3b downstream operators (DictGroupCountExec, DictFilterExec) to receive Arrow-surface dict columns from either reader path.

Test coverage

  • New oracle: dict + Uncompressed roundtrip, dict + Snappy roundtrip, PLAIN-only column → error.
  • Full cargo test workspace passes; no regressions on existing oracles.