v0.9.1 — u8 dict + BYTE_ARRAY adaptive + bloom builder

ryan-evans-git released this 17 May 13:06
c6dafca

Highlights

Three additive opportunistic features bundled as a patch on top of v0.9.0. No API breaks, no behaviour changes for existing callers.

u8 dict-indices reader for columns with bit width (bw) ≤ 8

  • read::DictPreservedColumnU8 { dict_bytes, dict_offsets, indices: Vec<u8> }
  • read::read_column_byte_array_dict_preserved_u8 + ..._u8_into
  • dict::decode_rle_dictionary_indices_u8 + ..._u8_into

Saves 3 bytes/row relative to the u32 indices variant on dict-encoded columns with ≤ 256 unique values (e.g. low-cardinality TPC-H string columns such as l_returnflag and l_linestatus, status enums, etc.). Unlocks Arrow DictionaryArray<UInt8, T> materialisation in ematix-flow with a 4× smaller indices buffer. Errors deterministically when (a) the dictionary has more than 256 entries, (b) any data page has bw > 8, or (c) the column has no dictionary page; in each case the caller falls back to the u32 variant.
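The size argument above can be illustrated with a minimal sketch. It is not the crate's implementation: the release notes say the real reader decodes RLE pages straight into u8, whereas this example narrows already-decoded u32 indices. NarrowError, narrow_indices_to_u8 and bytes_saved are hypothetical names introduced only for illustration.

```rust
/// Hypothetical error type for this sketch; not the crate's real error enum.
#[derive(Debug, PartialEq)]
enum NarrowError {
    /// The dictionary has more entries than a u8 index can address.
    DictTooLarge { entries: usize },
    /// An index points past the end of the dictionary.
    IndexOutOfRange { row: usize, index: u32 },
}

/// Narrow u32 dictionary indices to u8 when the dictionary has at most 256 entries.
/// A caller that receives an error would keep using the u32 representation.
fn narrow_indices_to_u8(dict_len: usize, indices: &[u32]) -> Result<Vec<u8>, NarrowError> {
    if dict_len > 256 {
        return Err(NarrowError::DictTooLarge { entries: dict_len });
    }
    let mut out = Vec::with_capacity(indices.len());
    for (row, &index) in indices.iter().enumerate() {
        // index < dict_len <= 256 guarantees the cast below is lossless.
        if (index as usize) >= dict_len {
            return Err(NarrowError::IndexOutOfRange { row, index });
        }
        out.push(index as u8);
    }
    Ok(out)
}

/// Bytes saved by storing one byte per index instead of four.
fn bytes_saved(rows: usize) -> usize {
    rows * (std::mem::size_of::<u32>() - std::mem::size_of::<u8>())
}
```

For a column of 1,000 rows the indices buffer shrinks from 4,000 bytes to 1,000 bytes, which is the 3 bytes/row (4× smaller) figure quoted above.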

BYTE_ARRAY adaptive façade (closes the v0.8 gap)

  • read::read_column_byte_array_predicate_adaptive(file, rg, col, predicate, opts, telemetry)
  • read::AdaptiveByteArrayChunkOutput + read::AdaptiveByteArrayOutputKind { Bitmap, Values { bytes, offsets } }

Same dispatch contract as the scalar adaptive entry points (i32/i64/f64) introduced in v0.8.0. Materialised output is Arrow-style (bytes, offsets), so consumers get the same shape as read_column_byte_array_offsets. The predicate is an Fn(&[u8]) -> bool evaluated against dictionary entries, so it runs at most dict.len() times per chunk.
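Why evaluating the predicate against dictionary entries bounds the call count can be sketched as follows. This is an assumption-laden illustration, not the crate's code: dict_predicate_bitmap is a hypothetical helper, the u32 offset type and the Vec<bool> output (instead of a packed bitmap) are choices made for brevity, and the sketch panics on an out-of-range index where a real reader would return an error.

```rust
/// Evaluate `predicate` once per dictionary entry, then expand the per-entry
/// verdicts to one flag per row by looking up each row's dictionary index.
///
/// `dict_offsets` is Arrow-style: entry `i` is
/// `dict_bytes[dict_offsets[i]..dict_offsets[i + 1]]`, so it holds `entries + 1` values.
fn dict_predicate_bitmap<P>(
    dict_bytes: &[u8],
    dict_offsets: &[u32],
    indices: &[u32],
    predicate: P,
) -> Vec<bool>
where
    P: Fn(&[u8]) -> bool,
{
    // The predicate runs here and nowhere else: exactly one call per entry.
    let verdicts: Vec<bool> = dict_offsets
        .windows(2)
        .map(|w| predicate(&dict_bytes[w[0] as usize..w[1] as usize]))
        .collect();

    // Per-row work is a table lookup, independent of the predicate's cost.
    indices.iter().map(|&i| verdicts[i as usize]).collect()
}
```

With a three-entry dictionary ("A", "N", "R") and six rows, the predicate is called three times regardless of the row count.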

Split-Block Bloom Filter builder

  • bloom::SplitBlockBloomFilterBuilder with insert_hash / insert_bytes / into_bytes
  • bloom::optimal_num_blocks(n, fpp) -> u32 (rounds to next power of two)

Symmetric to the Π.6c decoder. Round-trips byte-stable through SplitBlockBloomFilter::from_bytes. Full writer-integration (emitting filters into a parquet file's body + setting ColumnMetaData.bloom_filter_offset) is a deferred follow-up that needs format-crate metadata-writer work.
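For readers unfamiliar with the layout, here is a minimal sketch of a split-block Bloom filter as described in the Parquet format specification: 256-bit blocks of eight u32 words, with one bit set per word. SketchBloom and its methods are hypothetical and are not the crate's SplitBlockBloomFilterBuilder. The sketch accepts precomputed 64-bit hashes only (Parquet specifies xxHash64 with seed 0 for hashing values), and the little-endian word serialisation follows the specification's layout as an assumption about what "byte-stable" means here.

```rust
/// Salt constants from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Illustrative filter; each block is eight 32-bit words (32 bytes).
struct SketchBloom {
    blocks: Vec<[u32; 8]>,
}

impl SketchBloom {
    fn new(num_blocks: u32) -> Self {
        assert!(num_blocks > 0, "a filter needs at least one block");
        Self { blocks: vec![[0u32; 8]; num_blocks as usize] }
    }

    /// Upper 32 bits of the hash select the block (multiply-shift, no modulo).
    fn block_index(&self, hash: u64) -> usize {
        (((hash >> 32) * self.blocks.len() as u64) >> 32) as usize
    }

    /// Lower 32 bits of the hash select one bit in each of the eight words.
    fn mask(hash: u64) -> [u32; 8] {
        let x = hash as u32;
        let mut m = [0u32; 8];
        for i in 0..8 {
            m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
        }
        m
    }

    fn insert_hash(&mut self, hash: u64) {
        let idx = self.block_index(hash);
        let mask = Self::mask(hash);
        for (word, bit) in self.blocks[idx].iter_mut().zip(mask.iter()) {
            *word |= *bit;
        }
    }

    /// False means "definitely absent"; true means "possibly present".
    fn check_hash(&self, hash: u64) -> bool {
        let block = &self.blocks[self.block_index(hash)];
        let mask = Self::mask(hash);
        block.iter().zip(mask.iter()).all(|(word, bit)| word & bit != 0)
    }

    /// Serialise blocks in order, each word little-endian.
    fn into_bytes(self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.blocks.len() * 32);
        for block in &self.blocks {
            for word in block {
                out.extend_from_slice(&word.to_le_bytes());
            }
        }
        out
    }

    /// Inverse of `into_bytes`; rejects empty or non-block-aligned input.
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.is_empty() || bytes.len() % 32 != 0 {
            return None;
        }
        let blocks = bytes
            .chunks_exact(32)
            .map(|chunk| {
                let mut block = [0u32; 8];
                for (word, b) in block.iter_mut().zip(chunk.chunks_exact(4)) {
                    *word = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
                }
                block
            })
            .collect();
        Some(Self { blocks })
    }
}
```

Inserting a hash and then checking it always returns true (no false negatives), and into_bytes followed by from_bytes reproduces the same bytes, which is the round-trip property the release notes describe for the real builder and decoder.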

Crates published

  • ematix-parquet-format 0.9.1
  • ematix-parquet-io 0.9.1
  • ematix-parquet-crypto 0.9.1
  • ematix-parquet-codec 0.9.1
  • ematix-parquet-async 0.9.1

🤖 Generated with Claude Code