v0.9.2 — bloom-filter writer end-to-end (i32/i64/f64/byte_array)
Highlights
Closes the bloom-filter story. v0.9.1 shipped the in-memory SplitBlockBloomFilterBuilder; v0.9.2 wires it through the codec write path so emitted Parquet files carry consultable bloom filters that downstream readers — including the upstream Rust Parquet reader — discover via ColumnMetaData and apply automatically.
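For readers unfamiliar with the structure being written, here is a minimal sketch of a split-block bloom filter as the Parquet format specification describes it: 256-bit blocks of eight 32-bit words, with one bit set per word on insert. The type name `SketchSbbf` and its methods are hypothetical and are not this crate's SplitBlockBloomFilterBuilder API; the input is assumed to be a 64-bit hash that has already been computed (the spec uses xxHash64 with seed 0).

```rust
/// Salt constants from the Parquet bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Hypothetical stand-in for an in-memory SBBF; not the crate's real type.
struct SketchSbbf {
    blocks: Vec<[u32; 8]>, // each block is 8 x 32 bits = 256 bits
}

impl SketchSbbf {
    fn new(num_blocks: usize) -> Self {
        assert!(num_blocks > 0);
        SketchSbbf { blocks: vec![[0u32; 8]; num_blocks] }
    }

    /// One bit per word, chosen by the top 5 bits of (low 32 hash bits * salt).
    fn mask(hash: u64) -> [u32; 8] {
        let x = hash as u32;
        let mut m = [0u32; 8];
        for i in 0..8 {
            m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
        }
        m
    }

    /// The upper 32 bits of the hash select the block (multiply-shift, no modulo).
    fn block_index(&self, hash: u64) -> usize {
        (((hash >> 32) * self.blocks.len() as u64) >> 32) as usize
    }

    fn insert(&mut self, hash: u64) {
        let idx = self.block_index(hash);
        let m = Self::mask(hash);
        for i in 0..8 {
            self.blocks[idx][i] |= m[i];
        }
    }

    /// Returns false only if the hash was definitely never inserted.
    fn check(&self, hash: u64) -> bool {
        let idx = self.block_index(hash);
        let m = Self::mask(hash);
        (0..8).all(|i| self.blocks[idx][i] & m[i] != 0)
    }
}
```

A filter built this way has no false negatives: every inserted hash reports present, which is the property the interop tests rely on.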
Format crate
metadata_writer::encode_column_meta_data now writes bloom_filter_offset (field 14, i64) and bloom_filter_length (field 15, i32) when set. Previously both panicked.
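As an illustration of what writing these two optional fields involves, the sketch below encodes them with the Thrift compact protocol that Parquet metadata uses: a field header carrying the id delta and type, followed by a zigzag varint. The function names and the `last_id` bookkeeping are assumptions made for this example and do not reflect the actual internals of metadata_writer.

```rust
/// Thrift compact-protocol type ids for the two integer widths used here.
const TYPE_I32: u8 = 5;
const TYPE_I64: u8 = 6;

/// Unsigned LEB128 varint, 7 bits per byte, high bit marks continuation.
fn write_varint(out: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Zigzag maps signed values to unsigned so small magnitudes stay short.
fn zigzag64(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Short form packs (id delta, type) in one byte when the delta is 1..=15;
/// otherwise the type byte is followed by the zigzag-encoded field id.
fn write_field_header(out: &mut Vec<u8>, last_id: i16, id: i16, ty: u8) {
    let delta = id - last_id;
    if (1..=15).contains(&delta) {
        out.push(((delta as u8) << 4) | ty);
    } else {
        out.push(ty);
        write_varint(out, zigzag64(id as i64));
    }
}

/// Hypothetical helper: appends fields 14 and 15 only when they are set,
/// and returns the id of the last field written.
fn write_bloom_fields(
    out: &mut Vec<u8>,
    last_id: i16,
    offset: Option<i64>,
    length: Option<i32>,
) -> i16 {
    let mut last = last_id;
    if let Some(off) = offset {
        write_field_header(out, last, 14, TYPE_I64);
        write_varint(out, zigzag64(off));
        last = 14;
    }
    if let Some(len) = length {
        write_field_header(out, last, 15, TYPE_I32);
        write_varint(out, zigzag64(len as i64));
        last = 15;
    }
    last
}
```

When both values are `None` nothing is emitted, so files written without a bloom filter keep the same metadata bytes as before.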
Codec writer entry points
| Function | Hash input |
|---|---|
| write::write_i32_column_dict_with_bloom_to_path | 4-byte LE |
| write::write_i64_column_dict_with_bloom_to_path | 8-byte LE |
| write::write_f64_column_dict_with_bloom_to_path | 8-byte LE (raw bit pattern, per spec) |
| write::write_byte_array_column_dict_with_bloom_to_path | raw bytes (no length prefix, per spec) |
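The hash inputs in the table can be sketched with standard-library conversions alone. The helper names below are hypothetical and exist only for illustration; the resulting bytes would then be fed to the hash function (xxHash64 with seed 0 in the Parquet specification), which is omitted here.

```rust
/// i32 values hash as their 4-byte little-endian encoding.
fn hash_input_i32(v: i32) -> [u8; 4] {
    v.to_le_bytes()
}

/// i64 values hash as their 8-byte little-endian encoding.
fn hash_input_i64(v: i64) -> [u8; 8] {
    v.to_le_bytes()
}

/// f64 values hash as the little-endian bytes of the raw IEEE 754 bit
/// pattern, with no normalization of special values.
fn hash_input_f64(v: f64) -> [u8; 8] {
    v.to_bits().to_le_bytes()
}

/// Byte arrays hash as the payload only, without the 4-byte length
/// prefix that PLAIN encoding puts in front of each value.
fn hash_input_byte_array(v: &[u8]) -> &[u8] {
    v
}
```

One consequence of hashing the raw bit pattern is that `0.0` and `-0.0` produce different inputs even though they compare equal as floats, so a reader probing for one will not necessarily match the other.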
All four functions take (path, name, values, codec, target_fpp), build an SBBF over the column's distinct values (sized via optimal_num_blocks), and emit it inline with the column chunk.
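To give a feel for how a target false-positive probability translates into a filter size, here is a sketch using the commonly cited split-block sizing formula, bits = -8 * ndv / ln(1 - fpp^(1/8)), rounded up to a power-of-two byte count. The function `sketch_num_blocks` is hypothetical; the crate's optimal_num_blocks may use different rounding or bounds.

```rust
/// One SBBF block is 256 bits = 32 bytes.
const BLOCK_BYTES: usize = 32;

/// Hypothetical sizing helper: number of 32-byte blocks for `ndv` distinct
/// values at a target false-positive probability `fpp` (0 < fpp < 1).
fn sketch_num_blocks(ndv: u64, fpp: f64) -> usize {
    assert!(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)");
    // Bits needed for a split-block filter with 8 bits set per value.
    let bits = -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
    let bytes = (bits / 8.0).ceil() as usize;
    // At least one block, rounded up to a power of two (assumed here).
    let bytes = bytes.max(BLOCK_BYTES).next_power_of_two();
    bytes / BLOCK_BYTES
}
```

Under these assumptions, 1000 distinct values at a target_fpp of 0.01 need about 1.2 KiB, which rounds up to 2 KiB, or 64 blocks. Tightening target_fpp grows the filter, so the choice trades file size against how many row groups a reader can skip.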
Interop
5 round-trip tests, 4 of them parquet-rs cross-checks. Each cross-check writes a column with our writer, opens it with parquet-rs's SerializedFileReader::new_with_options + ReaderProperties::set_read_bloom_filter(true), loads the filter via RowGroupReader::get_column_bloom_filter, and confirms every distinct value reports present via bf.check(&v).
Constraints / deferred follow-ups
- Single-column primitive only. Multi-column / multi-row-group bloom writes are follow-ups.
- Dict-write paths only. Bloom on PLAIN write paths is a separate item.
Crates published
- ematix-parquet-format 0.9.2
- ematix-parquet-io 0.9.2
- ematix-parquet-crypto 0.9.2
- ematix-parquet-codec 0.9.2
- ematix-parquet-async 0.9.2
🤖 Generated with Claude Code