v0.9.2 — bloom-filter writer end-to-end (i32/i64/f64/byte_array)

@ryan-evans-git ryan-evans-git released this 17 May 14:03
· 66 commits to main since this release
2792b8f

Highlights

Closes the bloom-filter story. v0.9.1 shipped the in-memory SplitBlockBloomFilterBuilder; v0.9.2 wires it through the codec write path so emitted Parquet files carry consultable bloom filters that downstream readers — including the upstream Rust Parquet reader — discover via ColumnMetaData and apply automatically.

Format crate

  • metadata_writer::encode_column_meta_data now writes bloom_filter_offset (field 14, i64) and bloom_filter_length (field 15, i32) when set. Previously both panicked.
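As a rough illustration of what writing these two optional fields involves, the sketch below encodes them with the Thrift compact protocol that Parquet metadata uses: a one-byte field header (id delta in the high nibble, type id in the low nibble) followed by a zigzag varint. The helper names (`write_varint`, `zigzag64`, `write_field_header`, `encode_bloom_fields`) are hypothetical and are not the crate's actual internals; only the field ids and types come from the note above.

```rust
// Thrift compact-protocol type ids for the two integer widths used here.
const CT_I32: u8 = 5;
const CT_I64: u8 = 6;

// Unsigned LEB128 varint, as used by the compact protocol.
fn write_varint(out: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        out.push(((v as u8) & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

// Zigzag maps signed integers to unsigned so small magnitudes stay short.
fn zigzag64(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

// Short form: (delta << 4) | type when the id delta is 1..=15;
// otherwise the type byte followed by the zigzag-varint field id.
fn write_field_header(out: &mut Vec<u8>, field_id: i16, last_id: &mut i16, ty: u8) {
    let delta = field_id - *last_id;
    if (1..=15).contains(&delta) {
        out.push(((delta as u8) << 4) | ty);
    } else {
        out.push(ty);
        write_varint(out, zigzag64(field_id as i64));
    }
    *last_id = field_id;
}

// Hypothetical stand-in for the bloom-filter part of a ColumnMetaData encoder:
// each field is written only when it is set, and skipped otherwise.
fn encode_bloom_fields(
    out: &mut Vec<u8>,
    last_id: &mut i16,
    offset: Option<i64>,
    length: Option<i32>,
) {
    if let Some(off) = offset {
        write_field_header(out, 14, last_id, CT_I64);
        write_varint(out, zigzag64(off));
    }
    if let Some(len) = length {
        write_field_header(out, 15, last_id, CT_I32);
        write_varint(out, zigzag64(len as i64));
    }
}
```

With the previous field id at 13, an offset of 4 and a length of 100 encode to the five bytes `16 08 15 c8 01`; with both fields unset nothing is written.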

Codec writer entry points

Function                                                  Hash input
write::write_i32_column_dict_with_bloom_to_path           4-byte LE
write::write_i64_column_dict_with_bloom_to_path           8-byte LE
write::write_f64_column_dict_with_bloom_to_path           8-byte LE (raw bit pattern, per spec)
write::write_byte_array_column_dict_with_bloom_to_path    raw bytes (no length prefix, per spec)

All four take (path, name, values, codec, target_fpp), build an SBBF over the column's distinct values (sized via optimal_num_blocks), and emit it inline with the column chunk.
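To make the hash inputs and the sizing step concrete, here is a minimal sketch. The `hash_input_*` helpers and `sketch_num_blocks` are hypothetical names, not the crate's API. The sizing formula is the commonly used SBBF approximation, bits = -8 * ndv / ln(1 - fpp^(1/8)), divided into 256-bit blocks; the real optimal_num_blocks may round or clamp differently.

```rust
// Bytes fed to the hash for each supported physical type.
fn hash_input_i32(v: i32) -> [u8; 4] {
    v.to_le_bytes()
}

fn hash_input_i64(v: i64) -> [u8; 8] {
    v.to_le_bytes()
}

// Raw IEEE-754 bit pattern, little-endian: 0.0 and -0.0 hash differently.
fn hash_input_f64(v: f64) -> [u8; 8] {
    v.to_bits().to_le_bytes()
}

// Byte arrays are hashed as-is, without the 4-byte length prefix
// that PLAIN encoding would put in front of them.
fn hash_input_byte_array(v: &[u8]) -> &[u8] {
    v
}

// Hypothetical sizing helper (assumption: not the crate's optimal_num_blocks).
// ndv is the number of distinct values, fpp the target false-positive rate.
fn sketch_num_blocks(ndv: u64, fpp: f64) -> u64 {
    assert!(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)");
    let bits = -8.0 * (ndv as f64) / (1.0 - fpp.powf(1.0 / 8.0)).ln();
    // One SBBF block is 256 bits; always allocate at least one block.
    (bits / 256.0).ceil().max(1.0) as u64
}
```

A lower target_fpp or a larger number of distinct values yields more blocks, which is the trade-off the target_fpp parameter exposes.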

Interop

Five round-trip tests, four of them cross-checks against parquet-rs. Each cross-check writes a column with our writer, opens the file with parquet-rs's SerializedFileReader::new_with_options + ReaderProperties::set_read_bloom_filter(true), loads the filter via RowGroupReader::get_column_bloom_filter, and confirms that every distinct value reports present via bf.check(&v).
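The property these checks rely on is that a split-block bloom filter never reports a false negative: every inserted value must test as present. The sketch below shows the block-and-mask logic from the Parquet spec under simplifying assumptions. It takes a precomputed 64-bit hash (the spec uses xxHash64 with seed 0, which is not in the standard library), and `SketchSbbf` is a hypothetical type, not this crate's SplitBlockBloomFilterBuilder or parquet-rs's filter type.

```rust
// Salt constants from the Parquet split-block bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

// One block is eight 32-bit words (256 bits).
type Block = [u32; 8];

// Hypothetical illustration type.
struct SketchSbbf {
    blocks: Vec<Block>,
}

impl SketchSbbf {
    fn new(num_blocks: usize) -> Self {
        SketchSbbf { blocks: vec![[0u32; 8]; num_blocks.max(1)] }
    }

    // The upper 32 bits of the hash select the block (multiply-shift, no modulo).
    fn block_index(&self, hash: u64) -> usize {
        (((hash >> 32) * self.blocks.len() as u64) >> 32) as usize
    }

    // The lower 32 bits select one bit in each of the eight words.
    fn mask(hash: u64) -> Block {
        let x = hash as u32;
        let mut m = [0u32; 8];
        for i in 0..8 {
            m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
        }
        m
    }

    fn insert(&mut self, hash: u64) {
        let idx = self.block_index(hash);
        let m = Self::mask(hash);
        for i in 0..8 {
            self.blocks[idx][i] |= m[i];
        }
    }

    // Present only if all eight mask bits are set in the selected block.
    fn check(&self, hash: u64) -> bool {
        let idx = self.block_index(hash);
        let m = Self::mask(hash);
        (0..8).all(|i| (self.blocks[idx][i] & m[i]) != 0)
    }
}
```

Because insert sets exactly the bits that check later tests, any inserted hash always reports present; only absent values can be misreported, at roughly the target false-positive rate.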

Constraints / deferred follow-ups

  • Single-column primitive only. Multi-column / multi-row-group bloom writes are follow-ups.
  • Dict-write paths only. Bloom on PLAIN write paths is a separate item.

Crates published

  • ematix-parquet-format 0.9.2
  • ematix-parquet-io 0.9.2
  • ematix-parquet-crypto 0.9.2
  • ematix-parquet-codec 0.9.2
  • ematix-parquet-async 0.9.2