
feat: Phase 3c — ParquetMergeExecutor + full downloader#6358

Open
g-talbot wants to merge 1 commit into gtt/parquet-merge-pipeline-3b from gtt/parquet-merge-pipeline-3c

@g-talbot
Contributor

Summary

Stacked on #6357 (Phase 3b).

  • ParquetMergeSplitDownloader: replaces the stub from Phase 3b with real download logic. It downloads each input split's Parquet file from object storage to a local temp directory via storage.copy_to_file(), then forwards a ParquetMergeScratch to the executor.
  • ParquetMergeExecutor: runs merge_sorted_parquet_files via run_cpu_intensive, builds the output ParquetSplitMetadata via merge_parquet_split_metadata, renames the output files to match the generated split IDs, and sends a ParquetSplitBatch with replaced_split_ids to the uploader.
  • checkpoint_delta_opt: ParquetSplitBatch.checkpoint_delta is renamed to checkpoint_delta_opt and is now Option<IndexCheckpointDelta>, because a merge only reorganizes existing data and has no checkpoint delta. The ingest path passes Some(delta); the merge path passes None.
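
To illustrate the checkpoint_delta_opt change, here is a minimal sketch of how one batch type can serve both the ingest and merge paths. The struct layouts, the source_id field, and the helper functions ingest_batch, merge_batch, and describe_publish are hypothetical stand-ins invented for this example; only the names ParquetSplitBatch, IndexCheckpointDelta, replaced_split_ids, and checkpoint_delta_opt come from the PR, and the real types likely carry more fields.

```rust
// Hypothetical stand-in: the real IndexCheckpointDelta tracks much more
// than a source name. It is reduced to one field for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexCheckpointDelta {
    pub source_id: String,
}

// Hypothetical, simplified view of the batch handed to the uploader.
#[derive(Debug)]
pub struct ParquetSplitBatch {
    pub split_ids: Vec<String>,
    // Splits that the new splits supersede; empty on the ingest path.
    pub replaced_split_ids: Vec<String>,
    // Some(..) for ingest, None for merges (data reorganization only).
    pub checkpoint_delta_opt: Option<IndexCheckpointDelta>,
}

// Ingest path: new data was consumed, so a checkpoint delta is attached.
pub fn ingest_batch(split_id: &str, delta: IndexCheckpointDelta) -> ParquetSplitBatch {
    ParquetSplitBatch {
        split_ids: vec![split_id.to_string()],
        replaced_split_ids: Vec::new(),
        checkpoint_delta_opt: Some(delta),
    }
}

// Merge path: existing splits are rewritten into one, so there is no delta.
pub fn merge_batch(merged_split_id: &str, input_split_ids: &[&str]) -> ParquetSplitBatch {
    ParquetSplitBatch {
        split_ids: vec![merged_split_id.to_string()],
        replaced_split_ids: input_split_ids.iter().map(|id| id.to_string()).collect(),
        checkpoint_delta_opt: None,
    }
}

// A consumer branches on the Option instead of requiring a dummy delta.
pub fn describe_publish(batch: &ParquetSplitBatch) -> String {
    match &batch.checkpoint_delta_opt {
        Some(delta) => format!(
            "publish {} split(s), advance checkpoint for {}",
            batch.split_ids.len(),
            delta.source_id
        ),
        None => format!(
            "publish {} split(s), replace {} split(s), checkpoint unchanged",
            batch.split_ids.len(),
            batch.replaced_split_ids.len()
        ),
    }
}
```

With this shape, the merge path never has to fabricate an empty delta, and any consumer that matches on the Option is forced by the compiler to handle the None case explicitly.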

Test plan

  • 4 existing uploader tests pass with the checkpoint_delta_opt change
  • 4 existing packager tests pass
  • Compiles with and without metrics feature
  • cargo clippy clean

🤖 Generated with Claude Code

@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 3b171a0 to 84c6dd3 Compare April 29, 2026 15:30
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from ceba410 to 5937440 Compare April 29, 2026 15:31
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 84c6dd3 to a23011c Compare April 29, 2026 18:10
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 5937440 to 0b1c9cc Compare April 29, 2026 18:10
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from a23011c to 86fd55a Compare April 29, 2026 18:16
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch 2 times, most recently from 66b97e0 to 0f051bc Compare April 29, 2026 18:24
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 86fd55a to ef391c9 Compare April 29, 2026 18:39
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 0f051bc to 16b46d7 Compare April 29, 2026 18:40
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from ef391c9 to 90c5589 Compare April 29, 2026 18:51
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 16b46d7 to 8e19b6b Compare April 29, 2026 18:51
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 90c5589 to 49c6c19 Compare April 29, 2026 19:04
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from 8e19b6b to f32bd64 Compare April 29, 2026 19:05
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 49c6c19 to 93a0a20 Compare April 29, 2026 20:54
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from f32bd64 to de17c0e Compare April 29, 2026 20:54
…ase 3c)

Phase 3 pipeline integration, third PR:

- ParquetMergeSplitDownloader: downloads each input split's Parquet file
  from object storage to a local temp directory, forwards ParquetMergeScratch
  to the executor. Replaces the stub from PR 3b.

- ParquetMergeExecutor: runs merge_sorted_parquet_files via run_cpu_intensive,
  builds output ParquetSplitMetadata via merge_parquet_split_metadata, renames
  output files to match generated split IDs, sends ParquetSplitBatch with
  replaced_split_ids to the uploader.

- ParquetSplitBatch.checkpoint_delta -> checkpoint_delta_opt: now Option to
  support merge operations (no checkpoint delta for data reorganization).
  Ingest path passes Some(delta), merge path passes None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3b branch from 93a0a20 to 3a91a31 Compare April 30, 2026 02:30
@g-talbot g-talbot force-pushed the gtt/parquet-merge-pipeline-3c branch from de17c0e to 1f6512e Compare April 30, 2026 02:30