v0.7.0 — dict-preserving BYTE_ARRAY column reader
Highlights
- New API: read_column_byte_array_dict_preserved (+ _into variant) returning (dict_bytes, dict_offsets, indices) from a single chunk-decode pass, with no per-row materialisation.
Enables Arrow consumers to assemble DictionaryArray<UInt32, Utf8|Binary> directly so downstream operators (filter / group-by / join) can stay on dict codes rather than paying gather + hash at every operator boundary.
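
For illustration only, here is a minimal sketch of how a consumer might work with the returned triple. The DictColumn struct, its field types (u8 / u32), and the Arrow-style offsets convention (dict_len + 1 entries) are assumptions made for this example, not the crate's actual signatures.

```rust
/// Hypothetical container mirroring the (dict_bytes, dict_offsets, indices)
/// triple. Field types are assumed, not taken from the crate.
struct DictColumn {
    /// All dictionary values concatenated back to back.
    dict_bytes: Vec<u8>,
    /// Assumed Arrow-style offsets: dict_len + 1 entries, value k spans
    /// dict_bytes[dict_offsets[k]..dict_offsets[k + 1]].
    dict_offsets: Vec<u32>,
    /// One dictionary code per row.
    indices: Vec<u32>,
}

impl DictColumn {
    /// Number of distinct dictionary entries.
    fn dict_len(&self) -> usize {
        self.dict_offsets.len().saturating_sub(1)
    }

    /// Bytes of the dictionary entry for a given code.
    fn dict_value(&self, code: u32) -> &[u8] {
        let start = self.dict_offsets[code as usize] as usize;
        let end = self.dict_offsets[code as usize + 1] as usize;
        &self.dict_bytes[start..end]
    }

    /// Group-by count that stays on dict codes: no value bytes are
    /// touched and nothing is hashed.
    fn count_by_code(&self) -> Vec<u64> {
        let mut counts = vec![0u64; self.dict_len()];
        for &code in &self.indices {
            counts[code as usize] += 1;
        }
        counts
    }
}

/// Tiny hand-built column: dictionary ["apple", "pear"], five rows.
fn example_column() -> DictColumn {
    DictColumn {
        dict_bytes: b"applepear".to_vec(),
        dict_offsets: vec![0, 5, 9],
        indices: vec![0, 1, 1, 0, 1],
    }
}
```

The count runs over the u32 codes alone; value bytes are only needed when a result is finally rendered, which is the gather + hash cost the dict-preserving path is meant to avoid.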
Behaviour
- Errors deterministically when the chunk has no DictionaryPage or when any data page falls back to PLAIN after a dict (the writer fell back, so the chunk can't be expressed as one dict + indices). Callers can react by falling back to read_column_byte_array_offsets.
- Validates every index < dict_len once in the cold path so downstream gather can stay branch-free.
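
The validate-once idea can be sketched roughly as below. IndexOutOfRange, validate_indices, and gather_lengths are hypothetical names for this example; the crate's real error type and internals may differ. Note that safe Rust slice indexing still carries its own bounds checks, so this only shows where the range check lives, not a literally branch-free loop.

```rust
/// Hypothetical error describing the first out-of-range dictionary code.
#[derive(Debug, PartialEq)]
struct IndexOutOfRange {
    position: usize,
    index: u32,
    dict_len: usize,
}

/// Cold path: a single pass that checks every index < dict_len.
fn validate_indices(indices: &[u32], dict_len: usize) -> Result<(), IndexOutOfRange> {
    for (position, &index) in indices.iter().enumerate() {
        if index as usize >= dict_len {
            return Err(IndexOutOfRange { position, index, dict_len });
        }
    }
    Ok(())
}

/// Hot path: per-row gather with no validity branch of its own.
/// Assumes validate_indices already succeeded and that dict_offsets
/// has dict_len + 1 entries (Arrow-style, assumed for this sketch).
fn gather_lengths(dict_offsets: &[u32], indices: &[u32]) -> Vec<u32> {
    indices
        .iter()
        .map(|&i| dict_offsets[i as usize + 1] - dict_offsets[i as usize])
        .collect()
}
```

Doing the check once up front means every later operator can treat the codes as trusted rather than re-checking them at each boundary.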
Why
Pairs with parquet-rs's ArrowReaderOptions::with_schema(Dictionary(...)) path on consumers like ematix-flow's FastParquet, giving the Emat reader dict-preservation parity. Unblocks Σ.E3b downstream operators (DictGroupCountExec, DictFilterExec) to receive Arrow-surface dict columns from either reader path.
Test coverage
- New oracle: dict + Uncompressed roundtrip, dict + Snappy roundtrip, PLAIN-only column → error.
- Full cargo test workspace passes; no regressions on existing oracles.