
Conversation

@burmecia
Member

What kind of change does this PR introduce?

This PR optimises large-dataset insertion for the Iceberg wrapper.

What is the current behavior?

When inserting data into an Iceberg foreign table, the whole dataset is first buffered in local memory for partitioning and sorting before being sent to remote Iceberg. This can be a problem when the insert dataset is larger than local memory.

What is the new behavior?

Although local partitioning and sorting cannot be avoided when they are defined on the Iceberg table, an optimisation is still possible in one special case: when the Iceberg table has no partition or sort spec defined. In that case, we can split the insert dataset into batches and insert them into remote Iceberg sequentially without buffering the whole dataset, which avoids local memory exhaustion.

The number of rows in a batch is controlled by the foreign server option batch_size (default 4096). Each batch is saved to an individual Parquet file, so a larger batch size is recommended when inserting a large dataset.
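The batching described above can be sketched as follows. This is a minimal illustration, not the actual Wrappers code: `split_into_batches` and the element type are hypothetical names, and the real implementation streams batches rather than collecting them.

```rust
const DEFAULT_BATCH_SIZE: usize = 4096;

/// Split the rows to insert into batches of at most `batch_size` rows.
/// In the real wrapper, each batch would be written to its own Parquet
/// file and sent to remote Iceberg sequentially, so peak memory stays
/// bounded by one batch instead of the whole dataset.
fn split_into_batches<T: Clone>(rows: &[T], batch_size: usize) -> Vec<Vec<T>> {
    rows.chunks(batch_size.max(1)) // guard against a zero batch size
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

For example, inserting 10,000 rows with the default batch size produces three batches (4096, 4096, and 1808 rows), and hence three Parquet files; raising batch_size reduces the file count.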

Additional context

N/A

Copilot AI review requested due to automatic review settings November 20, 2025 07:23
@burmecia burmecia changed the title perf(iceberg): optimise direct insert for large dataset perf(iceberg): [WRA-11] optimise direct insert for large dataset Nov 20, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR optimizes memory usage for large dataset insertions into Iceberg foreign tables by introducing a direct insert path for tables without partitioning or sorting specifications. Instead of buffering the entire dataset in memory, data is now written in configurable batches when these conditions are met.

Key Changes:

  • Added batched insertion logic that writes data in chunks when tables have no partition or sort specs defined
  • Changed batch_size field from Option<usize> to usize with a default value of 4096
  • Refactored insert logic by extracting write operations into a reusable write_rows_to_iceberg() method
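The batch_size change in the second bullet can be pictured like this. The function name and option-parsing shape are assumptions for illustration; only the `Option<usize>` to `usize` change and the 4096 default come from the PR itself.

```rust
const DEFAULT_BATCH_SIZE: usize = 4096;

/// Resolve the `batch_size` foreign server option, falling back to the
/// default when the option is absent or not a positive integer. This
/// mirrors moving the field from `Option<usize>` to a plain `usize`
/// with a default of 4096.
fn resolve_batch_size(option: Option<&str>) -> usize {
    option
        .and_then(|s| s.parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_BATCH_SIZE)
}
```

Storing a resolved `usize` instead of an `Option<usize>` means downstream batching code never has to re-handle the "option not set" case.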


@burmecia burmecia merged commit 0b5b9a9 into main Nov 20, 2025
6 of 7 checks passed
@burmecia burmecia deleted the bo/perf/iceberg-direct-insert branch November 20, 2025 08:50
