perf(iceberg): [WRA-11] optimise direct insert for large dataset #526
What kind of change does this PR introduce?
This PR optimises large dataset insertion for the Iceberg wrapper.
What is the current behavior?
When data is inserted into an Iceberg foreign table, the whole dataset is first buffered in local memory for partitioning and sorting before being sent to the remote Iceberg table. This can be a problem when the inserted dataset is larger than local memory.
What is the new behavior?
Although local partitioning and sorting cannot be avoided when they are defined on the Iceberg table, the insert can still be optimised in one special case: when the Iceberg table has no partition or sort spec defined. In this case the inserted dataset is split into batches that are sent directly to the remote Iceberg table one after another, without buffering the whole dataset locally, thus avoiding local memory exhaustion.
The number of rows per batch can be set with the foreign server option batch_size (default 4096). Each batch is written to an individual Parquet file, so a larger batch size is recommended when inserting large datasets.
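As a rough sketch of how this might be used (the server name iceberg_server and the tables orders_iceberg / orders_staging are hypothetical placeholders; use SET instead of ADD if batch_size was already defined on the server):

```sql
-- Raise the batch size so fewer, larger Parquet files are written per insert.
ALTER SERVER iceberg_server OPTIONS (ADD batch_size '100000');

-- With no partition or sort spec on the target Iceberg table, rows are
-- streamed to the remote table in batches instead of being buffered locally.
INSERT INTO orders_iceberg
SELECT * FROM orders_staging;
```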
Additional context
N/A