feat: enable page-level Parquet stats + add rg_partition_prefix_len marker #6377
Conversation
…arker

Foundation for the streaming column-major merge engine workstream.

Switches the writer's default from `EnabledStatistics::Chunk` to `EnabledStatistics::Page` so every newly-written file carries a Column Index and Offset Index in its footer. Without this, single-RG files produced by future PRs would have one min/max per file — useless for selective queries. The default is exposed as a knob (`ParquetWriterConfig::with_page_statistics`) so callers can opt out when the footer overhead isn't worth it.

Adds a numeric marker `qh.rg_partition_prefix_len` in the file's KV metadata and a matching `rg_partition_prefix_len: u32` field on `ParquetSplitMetadata`. The marker records how many leading sort schema columns RG boundaries align with: 0 = no claim (legacy default), N = aligned with the first N sort columns. Single-RG files vacuously satisfy any prefix; future writers will set N = sort_schema.len().

Compaction scope now includes `rg_partition_prefix_len`. Splits with different prefix values land in different buckets; the merge engine validates that input files agree on the prefix and rejects mismatches at both the metastore-struct layer and the on-disk KV layer. Until the streaming engine lands, the merge writer demotes the output's prefix to 0 because it cannot enforce alignment.

New developer tooling:

- `quickwit_parquet_engine::storage::inspect_parquet_page_stats` library function returning a structured per-RG / per-column / per-page report, plus `verify_partition_prefix` for the strong-form alignment check.
- `inspect_parquet` binary in the parquet-engine crate with `--json`, `--all-pages`, and `--verify-prefix` flags.

Footer-size delta on a representative shape (100K rows × 6 cols): +19.5% (672 KB → 804 KB). The page index scales with column count, not data volume, so production-sized 50 MB splits show < 0.3% overhead.

Test count: 367 → 382 (15 new). Clippy/doc/license/log/machete clean.
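The marker's read/validate semantics described above can be sketched in plain Rust. This is a minimal sketch, not the PR's actual code: it assumes the `qh.rg_partition_prefix_len` value has already been pulled out of the footer KV metadata as an `Option<&str>`, and the helper names are illustrative.

```rust
/// Absent or unparsable marker means "no claim" — the legacy default of 0.
/// (Hypothetical helper; the real extraction from footer KV metadata is
/// done inside the quickwit-parquet-engine crate.)
fn parse_prefix_marker(raw: Option<&str>) -> u32 {
    raw.and_then(|v| v.parse::<u32>().ok()).unwrap_or(0)
}

/// Merge inputs must agree on the prefix; a mismatch is rejected up front,
/// mirroring the validation at the metastore-struct and on-disk KV layers.
fn validate_input_prefixes(prefixes: &[u32]) -> Result<u32, String> {
    let first = *prefixes.first().ok_or("no input splits")?;
    if prefixes.iter().any(|&p| p != first) {
        return Err(format!("rg_partition_prefix_len mismatch: {prefixes:?}"));
    }
    Ok(first)
}
```

The "absent = 0" fallback is what keeps legacy files (written before this PR) valid merge inputs without a rewrite.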
Avoids a compaction-bucket leak that would otherwise appear once PR-3 ships single-RG ingest before PR-6 ships the streaming column-major merge engine.

Previously, every merge unconditionally set the output's `rg_partition_prefix_len` to 0, even when the writer happened to produce a single-RG output that vacuously satisfies any alignment claim. With single-RG ingest active and merge demoting on every operation, post-PR-3 ingest splits would leak out of the `prefix = sort_len` bucket on their first merge and never rejoin it — newer ingests would not merge with merge outputs.

New rule: predict the output's row group count via `num_rows.div_ceil(row_group_size)`. If ≤ 1 RG, propagate the inputs' prefix; otherwise demote to 0. Both the metastore split metadata (`merge_parquet_split_metadata`) and the file's KV metadata (`build_merge_kv_metadata`) follow the same rule, so they always agree about what's on disk. A `debug_assert!` checks that the prediction matches the actual row group count returned by `ArrowWriter::close()` — this catches a future config change that adds a byte-based RG threshold and silently invalidates the KV claim. `MergeOutputFile` gains a `num_row_groups: usize` field so the metastore-side rule can be applied without re-parsing the file.

Test changes:

- Rename `test_output_prefix_len_demoted_to_zero` to `test_output_prefix_len_demoted_when_multi_rg`; pin the demotion to the `num_row_groups > 1` case.
- New `test_output_prefix_len_preserved_when_single_rg` asserting the propagation case.
- New `test_merge_demotes_prefix_when_output_is_multi_rg` exercising the real writer with `row_group_size = 2` and verifying the file's KV records 0 via the inspector.
- Extend `test_merge_accepts_matching_rg_partition_prefix_len` to inspector-verify that the single-RG output's KV preserves the prefix.

Test count: 382 → 384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
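The propagate-or-demote rule above is small enough to sketch directly. A minimal sketch, assuming the writer's row-group size is expressed in rows (the function name is illustrative, not the PR's exact API):

```rust
/// Decide the merge output's rg_partition_prefix_len before writing.
/// `row_group_size` must be nonzero (rows per RG, the writer's rollover size).
fn output_prefix_len(num_rows: u64, row_group_size: u64, input_prefix: u32) -> u32 {
    // Predict the output's row group count, mirroring num_rows.div_ceil(row_group_size).
    let predicted_rgs = num_rows.div_ceil(row_group_size);
    if predicted_rgs <= 1 {
        // A single-RG output vacuously satisfies any alignment claim:
        // propagate the inputs' agreed prefix so the split stays in its bucket.
        input_prefix
    } else {
        // Multi-RG output: alignment cannot be enforced yet, so demote to "no claim".
        0
    }
}
```

Because the same pure function of `(num_rows, row_group_size)` can be evaluated on both the metastore side and the writer side, the two layers stay in agreement by construction; the `debug_assert!` against `ArrowWriter::close()` only exists to catch a future writer config whose RG boundaries stop being a pure function of row count.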
Gate-A verification before PR-3 (single-RG ingest cutover): proves that page-level statistics written by PR-1 are actually consumed by the production query path for pruning, not just embedded inertly in the footer.

Findings:

- The metrics read path at `MetricsParquetTableProvider::scan` already calls `ParquetSource::with_enable_page_index(true)`, so DataFusion loads the column index + offset index when reading. No new wiring needed on the reader side.
- DataFusion's `PruningMetrics` (`page_index_rows_pruned`) counter on `DataSourceExec` is the testable signal — pruned > 0 means pages were eliminated using their min/max from the column index.

The new integration test (`quickwit-datafusion/tests/metrics.rs::test_page_index_pruning_via_query`) builds a single split with two metric_names interleaved, forces the metric_name column into ~16 pages within one row group, runs `WHERE metric_name = 'cpu.usage'`, walks the executed plan, and asserts `page_index_rows_pruned >= 4096` (the rows from the *other* metric) plus correctness of the returned rows.

Plumbing change: `ParquetWriterConfig::with_data_page_row_count_limit` exposes Parquet's per-page row count rollover threshold. The size-based `data_page_size` knob alone can't force multi-page output when dictionary-encoded columns RLE-compress to a handful of bytes regardless of row count. Default 0 = unbounded; production behavior unchanged.

Tests: 14/14 metrics integration tests pass (was 13).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
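The "~16 pages" and ">= 4096" figures follow from simple arithmetic. A back-of-envelope sketch (not the test's actual code), under assumed illustrative numbers: 8192 total rows, two metric_names split evenly, and a `data_page_row_count_limit` of 512 rows:

```rust
/// Pages a column produces when the writer rolls over a page every `limit` rows
/// (the behavior exposed by the new with_data_page_row_count_limit knob).
fn expected_pages(num_rows: u64, limit: u64) -> u64 {
    num_rows.div_ceil(limit)
}

/// Lower bound on prunable rows when the filter keeps only one of two
/// evenly-split metric_names and the interleaving runs are long enough that
/// whole pages contain only the other name.
fn min_prunable_rows(total_rows: u64) -> u64 {
    total_rows / 2
}
```

With these assumptions, 8192 rows at 512 rows/page give 16 pages, and the 4096 rows belonging to the other metric_name are the floor the test asserts against: every page holding only those rows is eliminated by its column-index min/max before decoding.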
Summary
PR-1 of a 7-PR stack producing a memory-bounded, streaming, column-major Parquet pipeline (the reformulated Phase 4a). This PR lays the on-disk format foundation and verifies it works end-to-end through the production query path. Subsequent PRs add the streaming reader, writer, merge engine, and pipeline wiring.
Three changes:
1. **Page-level statistics by default.** The writer flips from `EnabledStatistics::Chunk` to `EnabledStatistics::Page`, so every new file carries Parquet Column Index + Offset Index in the footer. The level is now a config knob — `ParquetWriterConfig::with_page_statistics(false)` falls back to chunk-level for callers that want a smaller footer. A second knob (`with_data_page_row_count_limit`) exposes Parquet's per-page row count rollover threshold, useful for tests and for future writers that want to control page granularity directly.
2. **`qh.rg_partition_prefix_len` marker.** A numeric KV in the Parquet footer (and a matching `u32` field on `ParquetSplitMetadata`) recording how many leading sort schema columns row group boundaries align with. `0` (or absent) = no claim, the legacy default. `N` = aligned with the first `N` sort columns. The marker is read by the merge engine and added to the compaction scope so files with different prefix values stay in separate buckets. The merge engine preserves the inputs' prefix on a single-RG output (vacuously aligned) and demotes to 0 only when the output is genuinely multi-RG, so post-PR-3 ingest splits stay in their bucket through small merges.
3. **Developer tooling.** `inspect_parquet_page_stats(...)` library + `inspect_parquet` CLI binary in `quickwit-parquet-engine`. Reads the footer (including the page indexes), pretty-prints a per-RG / per-column / per-page report, and supports `--json`, `--all-pages`, and `--verify-prefix` (which enforces the strong form of the prefix claim: every column in the prefix must be constant within each RG).

End-to-end verification (Gate-A)
A new integration test (`test_page_index_pruning_via_query` in `quickwit-datafusion/tests/metrics.rs`) confirms page-level pruning fires through the production query path:

- Runs `WHERE metric_name = 'cpu.usage'` via the regular SQL session builder (no shortcuts; same path the REST API uses).
- Walks the executed `ExecutionPlan` for the `PruningMetrics::page_index_rows_pruned` counter exposed by DataFusion's `ParquetSource`.

This proves that:

- The `EnabledStatistics::Page` writer config makes the column index + offset index land in the footer (already covered by `inspect.rs` unit tests).
- The production read path (`MetricsParquetTableProvider::scan`) loads the page index — `ParquetSource::with_enable_page_index(true)` was already wired at line 207, no change needed.

This is the gate that had to pass before PR-3 cuts ingest over to single-RG: without page-level pruning, single-RG would collapse query pruning to one min/max per file. The gate is now passed.
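The strong form of the prefix claim that `--verify-prefix` enforces (every prefix column constant within each RG) reduces to a min == max check over per-RG column statistics. A simplified sketch, with statistics flattened to string pairs; the real `verify_partition_prefix` in `quickwit_parquet_engine::storage` works on typed Parquet statistics:

```rust
/// Check the strong-form alignment claim: within each row group, each of the
/// first `prefix_len` sort columns must be constant (min == max).
/// `rg_stats[row_group][sort_column]` holds that column's (min, max) pair.
fn verify_partition_prefix(
    rg_stats: &[Vec<(&str, &str)>],
    prefix_len: usize,
) -> Result<(), String> {
    for (rg_idx, cols) in rg_stats.iter().enumerate() {
        for (col_idx, (min, max)) in cols.iter().take(prefix_len).enumerate() {
            if min != max {
                // A per-RG diagnostic, like the one the inspector CLI emits.
                return Err(format!(
                    "row group {rg_idx}: prefix column {col_idx} not constant ({min} .. {max})"
                ));
            }
        }
    }
    Ok(())
}
```

Note that a file claiming prefix N can always be read as if it claimed any smaller prefix: constancy of the first N columns implies constancy of the first N-1.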
Why now
Single-row-group files — coming in PR-3 — have only one RG worth of chunk-level statistics, which collapses query pruning to one min/max per file unless page-level data is in the footer. Page indexes have to land in the footer before the writer cuts over to single-RG, which is what this PR sets up.
The marker is the contract between the streaming reader and the merge engine: only files that claim the same prefix length can be merged together. Defining and threading it through compaction scope + validation now means later PRs in the stack just turn on writes — no retroactive plumbing.
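Adding the prefix to the compaction scope amounts to keying merge candidacy on it. A minimal sketch under a simplified split record (the real compaction scope carries many more fields, and `SplitMeta`/`bucket_by_prefix` are hypothetical names):

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct SplitMeta {
    split_id: &'static str,
    rg_partition_prefix_len: u32,
}

/// Splits with different prefix values land in different merge buckets;
/// only splits within one bucket are ever merged together.
fn bucket_by_prefix(splits: Vec<SplitMeta>) -> HashMap<u32, Vec<SplitMeta>> {
    let mut buckets: HashMap<u32, Vec<SplitMeta>> = HashMap::new();
    for s in splits {
        buckets.entry(s.rg_partition_prefix_len).or_default().push(s);
    }
    buckets
}
```

This is what makes the leak described in the merge-rule commit visible: a merge that demotes its output to prefix 0 moves that output into a different bucket than fresh ingest splits claiming `prefix = sort_len`, so the two populations never merge with each other again.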
Footer overhead
Measured by writing the same data twice — once with `EnabledStatistics::Chunk`, once with `Page` — under the default writer config (zstd-3) and comparing total file size:

| Shape | Chunk | Page | Delta |
|---|---|---|---|
| 100K rows × 6 cols | 672 KB | 804 KB | +19.5% |
| production-sized 50 MB split | — | — | < 0.3% |

The page index scales with column count × page count, not row count, so on production-sized splits it's a rounding error. Pinned by an integration test (`test_footer_size_delta_for_page_level_stats`) that fails if the delta exceeds 30% on a small synthetic file — a generous bound that catches regressions without flapping on the absolute size.

Test plan
- `cargo nextest run -p quickwit-parquet-engine --all-features` — 384 / 384 (17 new for this PR)
- `cargo nextest run -p quickwit-datafusion --test metrics` — 14 / 14 (1 new: end-to-end page-index pruning)
- `cargo clippy -p quickwit-parquet-engine --tests` — clean
- `cargo clippy -p quickwit-datafusion --tests` — clean
- `cargo doc -p quickwit-parquet-engine --no-deps` — clean
- `cargo machete` — clean
- `bash quickwit/scripts/check_license_headers.sh` — clean
- `bash quickwit/scripts/check_log_format.sh` — clean
- `cargo run -p quickwit-parquet-engine --bin inspect_parquet -- <file>.parquet` produces a useful human report; `--json` round-trips through serde; `--verify-prefix` errors with a per-RG diagnostic when the prefix claim is violated
- `cargo check --workspace --all-features` — no breakage from the new `ParquetSplitMetadata` field (default `0`, additive JSON Serde) or the new `ParquetWriterConfig` field (default `0` = unbounded, behavior unchanged)

What's not in this PR (deferred to later in the stack)
- Setting `rg_partition_prefix_len > 0` from any production code path. PR-3 (single-RG ingest) is the first writer that will set it; PR-6 (streaming column-major merge engine) is where multi-RG-by-metric_name output lands.

🤖 Generated with Claude Code