feat: port zonemap package and wire into Parquet pipeline #6295

Open
g-talbot wants to merge 11 commits into gtt/row-keys-extraction from gtt/zonemap-regexes

Conversation

@g-talbot
Contributor

Summary

  • Port Go logs-event-store/zonemap package to Rust: DFA-based prefix-preserving superset regex builder with weighted pruning, MinMax FNV-1a hash builder
  • Wire zonemap extraction into prepare_write() pipeline: for each string-valued sort schema column, generate a compact regex from distinct values
  • Store regexes in Parquet KV metadata (qh.zonemap_regexes as JSON) and propagate to MetricsSplitMetadata.zonemap_regexes via split writer
  • 34 tests: algorithm (automaton regex, pruning, escaping, character classes, disjunctive clauses, long strings, post-prune), regex builder (basic, pruning, progressive), MinMax (string/int hashing), Arrow extraction (nulls, missing columns, special chars), and full pipeline integration (KV metadata roundtrip, split writer, write path consistency)

Architecture

extract_zonemap_regexes(sort_fields, sorted_batch, opts)
    ├─ parse sort schema → identify string columns
    ├─ for each string column:
    │   ├─ PrefixPreservingRegexBuilder::reset()
    │   ├─ register all distinct non-null values
    │   └─ build() → prune DFA → generate ^regex$
    └─ return HashMap<column_name, regex_string>

Pipeline: prepare_write()
    ├─ append_sorted_series_column()
    ├─ sort_batch() → reorder_columns()
    ├─ extract_row_keys()
    ├─ extract_zonemap_regexes()     ← NEW
    ├─ build KV metadata (qh.zonemap_regexes)
    └─ return PreparedWrite { sorted_batch, props, row_keys_proto, zonemap_regexes }
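The extraction step above can be sketched in miniature. This is a hypothetical stand-in (the type and function names are illustrative, not the actual crate API), and the alternation-of-literals builder here is a simplification of the real DFA-pruning PrefixPreservingRegexBuilder:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a parsed sort-schema field.
struct SortField {
    name: String,
    is_string: bool,
}

// Escape regex metacharacters in a literal value.
fn regex_escape(s: &str) -> String {
    s.chars()
        .flat_map(|c| {
            if c.is_ascii_alphanumeric() || c == '_' {
                vec![c]
            } else {
                vec!['\\', c]
            }
        })
        .collect()
}

// Stand-in for PrefixPreservingRegexBuilder: anchor an alternation of the
// escaped distinct values. The real builder prunes a DFA to stay compact.
fn build_superset_regex(values: &[&str]) -> String {
    let mut alts: Vec<String> = values.iter().map(|v| regex_escape(v)).collect();
    alts.sort();
    alts.dedup();
    format!("^(?:{})$", alts.join("|"))
}

// For each string-valued sort column, map column name -> superset regex.
fn extract_zonemap_regexes(
    sort_fields: &[SortField],
    column_values: &HashMap<String, Vec<&str>>,
) -> HashMap<String, String> {
    let mut out = HashMap::new();
    for field in sort_fields.iter().filter(|f| f.is_string) {
        if let Some(values) = column_values.get(&field.name) {
            out.insert(field.name.clone(), build_superset_regex(values));
        }
    }
    out
}
```

Non-string columns are skipped entirely, so the returned map only ever contains entries for string-valued sort columns that were present in the batch.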

Test plan

  • All 29 algorithm tests pass (automaton, regex builder, minmax, extraction)
  • All 5 pipeline integration tests pass (KV metadata, write_to_bytes, split_writer, JSON roundtrip, consistency)
  • All 244 crate tests pass (0 regressions)
  • cargo clippy --all-features --tests — 0 warnings
  • cargo +nightly fmt --check — clean
  • cargo doc --no-deps — clean
  • cargo machete — no unused deps
  • License headers present on all new files

🤖 Generated with Claude Code

g-talbot and others added 11 commits April 13, 2026 07:24
Port the Go zonemap package (DFA-based prefix-preserving superset regex
builder) to Rust and integrate it into the Parquet write pipeline. For
each string-valued sort schema column, a compact regex is generated that
accepts all column values (and possibly more). These regexes enable
query-time split pruning: if a predicate cannot match any string
accepted by the regex, the split can be skipped entirely.
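The query-time pruning idea can be illustrated for an equality predicate. This sketch assumes the stored regex is a plain anchored alternation of literals such as `^(?:prod|staging)$`; the real zonemap regexes also contain character classes and would be matched with a full regex engine:

```rust
// If the predicate value cannot be accepted by the superset regex, the
// split cannot contain a matching row and may be skipped.
fn split_may_match(zonemap_regex: &str, predicate_value: &str) -> bool {
    match zonemap_regex
        .strip_prefix("^(?:")
        .and_then(|s| s.strip_suffix(")$"))
    {
        Some(alts) => alts.split('|').any(|alt| alt == predicate_value),
        None => true, // unrecognized shape: conservatively keep the split
    }
}
```

Because the regex is a superset, a match is never proof the value exists in the split; a non-match is proof it does not, which is exactly what safe pruning requires.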

Zonemap module (quickwit-parquet-engine/src/zonemap/):
- automaton.rs: DFA with weighted pruning and deterministic regex
  generation, including character class collapsing and suffix factoring
- regex_builder.rs: PrefixPreservingRegexBuilder with progressive
  pruning during registration and final prune at build time
- minmax.rs: FNV-1a hash-based MinMax builder for future range pruning
- mod.rs: extract_zonemap_regexes() public API for Arrow RecordBatches
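The MinMax builder hashes values with FNV-1a and tracks the extremes. A minimal std-only sketch, with the 64-bit width and update semantics assumed from the description (the real minmax.rs may differ):

```rust
// FNV-1a 64-bit constants (standard values).
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

fn fnv1a(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(FNV_OFFSET, |h, &b| (h ^ u64::from(b)).wrapping_mul(FNV_PRIME))
}

// Tracks min/max of the hashed values seen so far.
#[derive(Default)]
struct MinMaxBuilder {
    min: Option<u64>,
    max: Option<u64>,
}

impl MinMaxBuilder {
    fn update(&mut self, value: &[u8]) {
        let h = fnv1a(value);
        self.min = Some(self.min.map_or(h, |m| m.min(h)));
        self.max = Some(self.max.map_or(h, |m| m.max(h)));
    }

    fn reset(&mut self) {
        self.min = None;
        self.max = None;
    }
}
```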

Pipeline integration:
- writer.rs: extract zonemap regexes in prepare_write() after row_keys,
  store as JSON in qh.zonemap_regexes Parquet KV metadata
- split_writer.rs: capture zonemap_regexes in MetricsSplitMetadata
- WriteMetadata type alias carries both row_keys and zonemap through
  write_to_bytes() and write_to_file_with_metadata()

34 tests covering automaton (regex, pruning, escaping, character classes,
disjunctive clauses, long strings, post-prune behavior), regex builder
(basic, pruning, progressive), MinMax (string/int hashing, reset, empty),
Arrow extraction (basic, nulls, missing columns, special chars, disabled),
and full pipeline integration (KV metadata, split writer, JSON roundtrip,
write path consistency).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pub(crate) re-export was unused in non-test builds, causing
-D unused-imports to fail in CI. The constant is only needed by
zonemap pipeline_tests, so gate it behind #[cfg(test)].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reorder pub use re-exports and use crate:: imports to satisfy
cargo +nightly fmt with group_imports = StdExternalCrate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
serde_json::to_string on HashMap<String, String> cannot fail — silently
swallowing the error with let Ok() would hide a real bug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port remaining Go builder_test.go cases:
- TestBuildFragmentZoneMap: exact regex string verification
- TestNilSortSchema: empty sort fields → no regexes
- TestNonMutatedResult: builder reuse independence
- TestInvalidUTF8: Unicode BMP handling (Rust variant)
- TestZoneMapForIntColumns: int columns produce no regex
- TestResetWithLsmComparisonCutoff: LSM cutoff truncation

Also fixes extract_zonemap_regexes to respect lsm_comparison_cutoff
from the sort schema — only columns before the cutoff get zonemaps,
matching Go FragmentZoneMapBuilder.Reset() behavior.
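The cutoff behavior amounts to truncating the candidate column list. A sketch with hypothetical names (the real code works on parsed sort-schema fields, not bare strings):

```rust
// Only sort columns before lsm_comparison_cutoff are considered for
// zonemap regexes; None means no cutoff.
fn zonemap_candidates<'a>(
    sort_columns: &'a [&'a str],
    lsm_comparison_cutoff: Option<usize>,
) -> &'a [&'a str] {
    match lsm_comparison_cutoff {
        Some(cutoff) => &sort_columns[..cutoff.min(sort_columns.len())],
        None => sort_columns,
    }
}
```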

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Include the complete benchmark_data/integrations file from the Go
zonemap package and use it via include_str!() in the benchmark test.
Verifies that 584 real integration names with a cap of 64 transitions
produce a valid pruned regex.


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add exact regex string verification for long service name (matches Go
TestBuildFragmentZoneMap: "^a_very_very_very_very_long_long_.+$") and
metric_name single-value case ("^cpu\.usage$"). Also verifies service
and env dict columns produce identical regexes for identical values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark completed items in GAP-002 (sort schema parser, configurable
directions, timeseries_id, schema-driven sort, metadata storage) and
GAP-004 (MetricsSplitMetadata fields, RowKeys, zonemap regexes,
sorting_columns, KV metadata). Update ADR-002 Implementation Status
to reflect the full PR stack (#6287-#6295).

Remaining open items: per-index metastore storage (Phase 32), null
ordering fix, Parquet column/offset index enabling, PostgreSQL
migration for row_keys + zonemap columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change nulls_first from true to false in sort_batch(), sorting_columns()
metadata, and the SS-1 verification sort. Nulls now sort after all
non-null values for both ascending and descending columns.

This simplifies compaction: when a sort column is absent from a split,
all rows are treated as null. With nulls-last, these rows cluster at the
end and don't interfere with key-range comparisons between splits.

Adds test_nulls_sort_last_ascending_and_descending which writes through
the full pipeline with ascending and descending service columns,
verifying null rows appear last in both cases.

Updates ADR-002 and GAP-002 to mark the null ordering item resolved.
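The nulls-last semantics can be stated as a comparator: None orders after Some for both directions. This is a sketch of the ordering only, not the Arrow sort kernel the pipeline actually uses:

```rust
use std::cmp::Ordering;

// Nulls sort last regardless of ascending/descending direction.
fn cmp_nulls_last<T: Ord>(a: &Option<T>, b: &Option<T>, descending: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => Ordering::Greater, // null sorts last
        (Some(_), None) => Ordering::Less,
        (Some(x), Some(y)) => {
            if descending {
                y.cmp(x)
            } else {
                x.cmp(y)
            }
        }
    }
}
```

With this ordering, rows from splits that lack the sort column (all-null) cluster at the end under either direction, which is what makes cross-split key-range comparison simple.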

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes clippy::field_reassign_with_default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@alanfgates left a comment


I didn't go over automaton.rs in full detail as it's large. Let me know if you think I should before approving the PR.


while let Some(entry) = heap.pop() {
// SAFETY: pointer is valid and uniquely accessed (popped from heap).
let state = unsafe { &mut *entry.state };
Collaborator

Comment from Claude:

The pruning algorithm uses raw *mut State pointers in a BinaryHeap to enable mutable access to tree nodes during traversal. The safety comments state:

  1. All pointers come from our owned tree
  2. We only mutate transitions on states we pop (removing from heap)
  3. The tree structure is never modified during iteration

These invariants appear to hold based on code inspection — a state is only mutated after being popped from the heap, and the tree ownership is entirely within Automaton. However, this is inherently fragile:

  • If future modifications push child states back into the heap after mutating a parent, the invariants could silently break.
  • There are no runtime checks or debug assertions validating the "only mutate after pop" invariant.

Recommendation: Consider adding a debug_assert! in the loop body that validates the state hasn't been pruned yet when popped. Alternatively, consider an arena-based approach using indices instead of raw pointers, which would eliminate the unsafe block entirely.

Adding the debug_assert! seems like a reasonable precaution.
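The arena-based alternative mentioned above can be sketched like this. State fields and the pruning policy here are illustrative, not the actual automaton.rs layout; the point is that the heap stores indices rather than raw pointers, so mutation goes through safe indexing and the "only prune once" invariant gets a debug_assert!:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// States live in a slice (the arena); transitions are child indices.
struct State {
    weight: u64,
    pruned: bool,
    transitions: Vec<usize>,
}

// Prune lowest-weight states until at most max_states remain.
fn prune_lowest_weight(states: &mut [State], max_states: usize) {
    // Min-heap on weight via Reverse.
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> = states
        .iter()
        .enumerate()
        .map(|(i, s)| Reverse((s.weight, i)))
        .collect();
    let mut remaining = states.len();
    while remaining > max_states {
        let Some(Reverse((_, idx))) = heap.pop() else { break };
        debug_assert!(!states[idx].pruned, "state {idx} pruned twice");
        states[idx].pruned = true;
        states[idx].transitions.clear(); // collapse this subtree
        remaining -= 1;
    }
}
```

Nothing is ever pushed back into the heap after mutation, so the invariant the unsafe version documents by comment is enforced here by construction, plus a debug-time check.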

}
}
}
DataType::Utf8 => {
Collaborator

I'm not clear on why the previous functions match on Dictionary, Utf8, and LargeUtf8, but this one only Dictionary and Utf8 (flagged by Claude review as well).

panic!(
"bug: max_num_transitions({}) > max_depth({})",
max_num_transitions, self.max_depth
);
Collaborator

From Claude:

A debug_assert! would be more appropriate for an invariant check that should never be triggered by external input. A panic in production for what's described as a "bug" is harsh.
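The suggested shape, sketched with illustrative names: keep the invariant check, but let it compile out in release builds instead of panicking in production.

```rust
// The invariant itself, factored out so it can be tested directly.
fn budget_ok(max_num_transitions: usize, max_depth: usize) -> bool {
    max_num_transitions <= max_depth
}

// debug_assert! is a no-op in release builds; in debug builds it panics
// with the same message the original panic! carried.
fn check_budget(max_num_transitions: usize, max_depth: usize) {
    debug_assert!(
        budget_ok(max_num_transitions, max_depth),
        "bug: max_num_transitions({}) > max_depth({})",
        max_num_transitions,
        max_depth
    );
}
```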
