feat: enforce physical column ordering in Parquet files #6287

Open
g-talbot wants to merge 1 commit into gtt/docs-claude-md from gtt/parquet-column-ordering-v2

Conversation

@g-talbot
Contributor

Summary

  • Sort schema columns are placed first (in their configured sort order), followed by remaining data columns alphabetically
  • This layout enables a two-GET streaming merge during compaction: the first GET reads the footer, the second streams from the start of the row group — sort columns arrive first, allowing the compactor to compute the global merge order before data columns arrive
  • Clean cherry-pick of the column-ordering work from #6281 ("feat: enforce physical column ordering in Parquet files for two-GET streaming merge"), which was accidentally merged into the docs branch instead of main
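
The ordering rule in the first bullet can be sketched as follows. This is a minimal illustration working on plain column names, not the code in this PR: the signature shown for `reorder_columns` is an assumption, as is the choice to skip sort columns that are absent from the input.

```rust
use std::collections::HashSet;

/// Hypothetical sketch: returns column names in the physical order described
/// in the summary. Sort columns come first, in their configured sort order;
/// every remaining column follows in alphabetical order.
fn reorder_columns(columns: &[&str], sort_columns: &[&str]) -> Vec<String> {
    let present: HashSet<&str> = columns.iter().copied().collect();
    let sort_set: HashSet<&str> = sort_columns.iter().copied().collect();

    // Keep sort columns in their configured order. Skipping sort columns
    // that the input does not contain is an assumption of this sketch.
    let mut ordered: Vec<String> = sort_columns
        .iter()
        .filter(|name| present.contains(*name))
        .map(|name| name.to_string())
        .collect();

    // Everything that is not a sort column goes after them, alphabetically.
    let mut rest: Vec<String> = columns
        .iter()
        .filter(|name| !sort_set.contains(*name))
        .map(|name| name.to_string())
        .collect();
    rest.sort();

    ordered.extend(rest);
    ordered
}
```

For example, with a sort order of `service`, `host`, `timestamp` and an input of `host`, `value`, `service`, `timestamp`, `env` (illustrative column names), the sketch yields `service`, `host`, `timestamp`, `env`, `value`.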

Test plan

  • reorder_columns unit test verifies sort columns first, then alphabetical
  • Round-trip test verifies column order preserved through Parquet write/read
  • All 154 quickwit-parquet-engine tests pass
  • Clippy clean

🤖 Generated with Claude Code

@g-talbot g-talbot requested a review from mattmkim April 10, 2026 12:30
@g-talbot g-talbot changed the base branch from main to gtt/docs-claude-md April 10, 2026 14:15
@g-talbot g-talbot force-pushed the gtt/parquet-column-ordering-v2 branch from 9e5c6ef to cc4492e Compare April 10, 2026 14:18
@g-talbot g-talbot force-pushed the gtt/docs-claude-md branch from cc4492e to 4006b20 Compare April 10, 2026 14:18
…treaming merge (#6281)

* feat: enforce physical column ordering in Parquet files

Sort schema columns are written first (in their configured sort order),
followed by all remaining data columns in alphabetical order. This
physical layout enables a two-GET streaming merge during compaction:
the footer GET provides the schema and offsets, then a single streaming
GET from the start of the row group delivers sort columns first —
allowing the compactor to compute the global merge order before data
columns arrive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: verify input column order is actually scrambled

The sanity check only asserted presence, not ordering. Now it
verifies that host appears before service in the input (scrambled)
which is the opposite of the sort-schema order (service before host).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: rustfmt test code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: collapse nested if to satisfy clippy::collapsible_if

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
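
The two-GET flow in the first commit message relies on the sort columns forming a prefix of the row group. As a rough sketch of how a compactor might use the footer offsets to know when all sort columns have arrived on the streaming GET: the `ColumnChunk` struct and `sort_prefix_range` function below are hypothetical, and the sketch assumes column chunks are laid out contiguously in physical column order, which this PR does not state.

```rust
/// Hypothetical, simplified view of one column chunk as described by the
/// footer: its column name, starting byte offset, and length in bytes.
struct ColumnChunk {
    name: &'static str,
    offset: u64,
    len: u64,
}

/// Returns the byte range [start, end) that covers the leading sort columns
/// of a row group, or None if the sort columns are not a contiguous prefix
/// in the configured sort order.
fn sort_prefix_range(chunks: &[ColumnChunk], sort_columns: &[&str]) -> Option<(u64, u64)> {
    if sort_columns.is_empty() || chunks.len() < sort_columns.len() {
        return None;
    }
    let prefix = &chunks[..sort_columns.len()];
    let mut expected = prefix[0].offset;
    for (chunk, want) in prefix.iter().zip(sort_columns) {
        // Each leading chunk must be the next sort column and must start
        // exactly where the previous chunk ended (contiguity assumption).
        if chunk.name != *want || chunk.offset != expected {
            return None;
        }
        expected = chunk.offset + chunk.len;
    }
    Some((prefix[0].offset, expected))
}
```

Once the streaming GET has delivered bytes up to the end of that range, the merge order can be computed while the data columns are still in flight. A `None` result would indicate a file that does not follow the layout enforced here.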
@g-talbot g-talbot force-pushed the gtt/parquet-column-ordering-v2 branch from cc4492e to 946c229 Compare April 10, 2026 14:42