Skip to content

feat: extract and populate RowKeys from sorted batches#6292

Merged
g-talbot merged 7 commits intomainfrom
gtt/row-keys-extraction
Apr 21, 2026
Merged

feat: extract and populate RowKeys from sorted batches#6292
g-talbot merged 7 commits intomainfrom
gtt/row-keys-extraction

Conversation

@g-talbot
Copy link
Copy Markdown
Contributor

Summary

  • New row_keys module: extract_row_keys() reads sort schema column values at the first and last rows of a sorted batch, building a sortschema::RowKeys proto with min_row_values, max_row_values, and all_inclusive_max_row_values
  • prepare_write() now computes RowKeys after sorting and injects qh.row_keys (base64) + qh.row_keys_json into Parquet KV metadata
  • write_to_bytes / write_to_file_with_metadata return the RowKeys proto bytes alongside the write result
  • split_writer captures the returned bytes and stores them on MetricsSplitMetadata.row_keys_proto
  • Column type mapping: Dictionary(Int32,Utf8) / Utf8TypeString, Int64TypeInt, UInt64TypeInt (cast), Float64TypeFloat; null values → ColumnValue { value: None }

Stacked on #6290 (sorted_series column).

Test plan

  • 11 row_keys tests: empty batch, single row min==max, multi-row first/last, null encoding, missing columns, column count, proto round-trip (with and without nulls), Parquet KV metadata round-trip, split_writer integration
  • All 209 quickwit-parquet-engine tests pass
  • Clippy clean, formatted, license headers pass

🤖 Generated with Claude Code

@g-talbot g-talbot requested a review from mattmkim April 10, 2026 20:59
@mattmkim
Copy link
Copy Markdown
Contributor

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b444d2da2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-parquet-engine/src/row_keys/mod.rs Outdated
Comment thread quickwit/quickwit-parquet-engine/src/storage/writer.rs
@mattmkim mattmkim force-pushed the gtt/sorted-series-key branch from 652d128 to 069a5a1 Compare April 15, 2026 22:30
@g-talbot g-talbot force-pushed the gtt/sorted-series-key branch from 069a5a1 to 7444e26 Compare April 21, 2026 13:14
@g-talbot g-talbot force-pushed the gtt/row-keys-extraction branch 10 times, most recently from 7c7beb1 to edb1aec Compare April 21, 2026 19:09
Base automatically changed from gtt/sorted-series-key to main April 21, 2026 19:39
g-talbot and others added 7 commits April 21, 2026 16:43
Compute sort-key boundaries (first/last row values) during Parquet
write and store them as RowKeys proto in both the Parquet KV metadata
and MetricsSplitMetadata. The compactor uses these ranges to determine
key-range overlap between splits and to select merge boundaries.

- New row_keys module: extract_row_keys() reads sort schema column
  values at rows 0 and N-1, mapping Arrow types to ColumnValue proto
- prepare_write() now computes row_keys after sorting and injects
  qh.row_keys (base64) + qh.row_keys_json into Parquet KV metadata
- write_to_bytes/write_to_file_with_metadata return row_keys proto
  bytes alongside the write result
- split_writer captures and stores row_keys_proto on metadata

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
End-to-end test with 8 rows across 4 series (including null tags,
mixed metrics, wide timestamp range). Writes through the full
split_writer pipeline and verifies the RowKeys proto is byte-identical
in all three places:

1. MetricsSplitMetadata.row_keys_proto (feeds Postgres)
2. Parquet file KV metadata (qh.row_keys, base64-encoded)
3. InsertableMetricsSplit.row_keys (Postgres column)

Also checks: min/max column values match expected sort order,
all_inclusive_max equals max, JSON metadata is parseable, and
the split is not marked expired.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…data API

- Keep DDSketch sketch validation in prepare_write
- Fix ParquetSplitWriter::new to pass ParquetSplitKind
- Remove Postgres path test (InsertableParquetSplit moved to quickwit-metastore)
- Update MetricsSplitMetadata references to ParquetSplitMetadata

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…row_keys tests

These constructors now return Result after the sorted-series-key merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…raction

The sort schema parser accepts "timestamp" as a legacy alias, but the
Arrow batch only contains "timestamp_secs". Without normalization,
index_of fails and the code encodes None for the timestamp boundary,
losing time range information from RowKeys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
build_compaction_key_value_metadata already emits qh.row_keys when
split_metadata.row_keys_proto is present. prepare_write then appended
another entry from freshly computed values, producing duplicates.

Now strips any pre-existing qh.row_keys and qh.row_keys_json before
pushing the authoritative values from the sorted batch. Also adds a
test assertion verifying exactly one qh.row_keys entry exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/row-keys-extraction branch from df99c89 to c6d1681 Compare April 21, 2026 20:44
Comment thread quickwit/quickwit-parquet-engine/src/row_keys/mod.rs
Comment thread quickwit/quickwit-parquet-engine/src/row_keys/mod.rs
@g-talbot g-talbot merged commit a6b5b73 into main Apr 21, 2026
8 checks passed
@g-talbot g-talbot deleted the gtt/row-keys-extraction branch April 21, 2026 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants