Release v0.7.0 — dict-preserving BYTE_ARRAY column reader · ryan-evans-git/ematix-parquet

Highlights

New API: read_column_byte_array_dict_preserved (plus an _into variant) returns (dict_bytes, dict_offsets, indices) from a single chunk-decode pass, with no per-row materialisation.
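To illustrate how the three returned buffers relate, here is a minimal sketch. The `DictPreserved` struct, its field types (`u32` offsets and indices), and the Arrow-style layout with `dict_len + 1` offsets are assumptions made for illustration; the release note does not state the actual signature.

```rust
/// Hypothetical container for the three buffers named in the release note.
/// Field types are assumptions for illustration, not the crate's signature.
pub struct DictPreserved {
    /// Concatenated bytes of every dictionary entry.
    pub dict_bytes: Vec<u8>,
    /// Assumed Arrow-style offsets: entry k spans
    /// dict_offsets[k]..dict_offsets[k + 1].
    pub dict_offsets: Vec<u32>,
    /// One dictionary code per row.
    pub indices: Vec<u32>,
}

impl DictPreserved {
    /// Number of dictionary entries under the "len + 1 offsets" assumption.
    pub fn dict_len(&self) -> usize {
        self.dict_offsets.len().saturating_sub(1)
    }

    /// Bytes of dictionary entry `code`, or None if the code is out of range.
    pub fn entry(&self, code: u32) -> Option<&[u8]> {
        let k = code as usize;
        let start = *self.dict_offsets.get(k)? as usize;
        let end = *self.dict_offsets.get(k + 1)? as usize;
        self.dict_bytes.get(start..end)
    }

    /// Value of row `row`, resolved through its dictionary code.
    pub fn row(&self, row: usize) -> Option<&[u8]> {
        self.entry(*self.indices.get(row)?)
    }
}

/// Hand-built example: dictionary ["apple", "kiwi"], rows [kiwi, apple, kiwi].
pub fn example() -> DictPreserved {
    DictPreserved {
        dict_bytes: b"applekiwi".to_vec(),
        dict_offsets: vec![0, 5, 9],
        indices: vec![1, 0, 1],
    }
}
```

Under these assumptions, three rows cost three `u32` codes plus one copy of each distinct value, rather than three separate byte strings.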

Enables Arrow consumers to assemble DictionaryArray<UInt32, Utf8|Binary> directly so downstream operators (filter / group-by / join) can stay on dict codes rather than paying gather + hash at every operator boundary.
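As a rough sketch of what "staying on dict codes" means for a group-by count: rows are counted per code, and the value bytes are touched only once per distinct entry at the end. The function names and the `u32` buffer types below are hypothetical and are not taken from ematix-parquet or ematix-flow.

```rust
/// Count rows per dictionary code without touching the value bytes.
/// Assumes every code is < dict_len; the release note says the reader
/// validates this before returning.
pub fn count_by_code(indices: &[u32], dict_len: usize) -> Vec<u64> {
    let mut counts = vec![0u64; dict_len];
    for &code in indices {
        counts[code as usize] += 1;
    }
    counts
}

/// Resolve per-code counts to (value, count) pairs, reading each dictionary
/// entry once. Assumes Arrow-style offsets with dict_len + 1 entries.
pub fn resolve_counts<'a>(
    dict_bytes: &'a [u8],
    dict_offsets: &[u32],
    counts: &[u64],
) -> Vec<(&'a [u8], u64)> {
    let mut out = Vec::new();
    for (k, &n) in counts.iter().enumerate() {
        if n > 0 {
            let start = dict_offsets[k] as usize;
            let end = dict_offsets[k + 1] as usize;
            out.push((&dict_bytes[start..end], n));
        }
    }
    out
}
```

The hot loop in `count_by_code` does integer indexing only; no hashing of string bytes is needed because the dictionary code already identifies the group.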

Behaviour

Errors deterministically when the chunk has no DictionaryPage, or when any data page is PLAIN-encoded after the dictionary page (the writer fell back mid-chunk, so the chunk can't be expressed as one dict + indices). Callers can react by falling back to read_column_byte_array_offsets.
Validates that every index is < dict_len once, in the cold path, so downstream gather can stay branch-free.
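The validate-once idea can be sketched as follows: a single pass checks all codes up front, after which the gather loop can skip per-element bounds checks. This is an illustrative sketch under assumed `u32` buffers and Arrow-style offsets; `first_invalid_index` and `gather_lengths` are hypothetical names, and the reader's real implementation may differ.

```rust
/// One-time validation pass (the "cold path"): returns the position of the
/// first out-of-range code, if any.
pub fn first_invalid_index(indices: &[u32], dict_len: usize) -> Option<usize> {
    indices.iter().position(|&i| i as usize >= dict_len)
}

/// Per-row value lengths, gathered through the dictionary offsets.
/// Returns Err(pos) if the code at `pos` is out of range.
/// Assumes dict_offsets has dict_len + 1 entries.
pub fn gather_lengths(indices: &[u32], dict_offsets: &[u32]) -> Result<Vec<u32>, usize> {
    let dict_len = dict_offsets.len().saturating_sub(1);
    if let Some(pos) = first_invalid_index(indices, dict_len) {
        return Err(pos);
    }
    let mut out = Vec::with_capacity(indices.len());
    for &code in indices {
        let k = code as usize;
        // SAFETY: validation above guarantees k < dict_len, so both k and
        // k + 1 are in bounds for dict_offsets (dict_len + 1 entries).
        let (start, end) = unsafe {
            (
                *dict_offsets.get_unchecked(k),
                *dict_offsets.get_unchecked(k + 1),
            )
        };
        out.push(end.wrapping_sub(start));
    }
    Ok(out)
}
```

Paying for the check once per chunk keeps the per-row loop free of range branches, which is the trade-off the note describes.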

Why

Pairs with parquet-rs's ArrowReaderOptions::with_schema(Dictionary(...)) path on consumers like ematix-flow's FastParquet, giving the Emat reader dict-preservation parity. Unblocks Σ.E3b downstream operators (DictGroupCountExec, DictFilterExec) to receive Arrow-surface dict columns from either reader path.

Test coverage

New oracle: dict + Uncompressed roundtrip, dict + Snappy roundtrip, PLAIN-only column → error.
Full cargo test workspace passes; no regressions on existing oracles.
