perf(iceberg): [WRA-11] optimise direct insert for large dataset #526
What kind of change does this PR introduce?
This PR optimises large dataset insertion for the Iceberg wrapper.
What is the current behavior?
When data is inserted into an Iceberg foreign table, the whole dataset is first buffered in local memory for partitioning and sorting before being sent to the remote Iceberg table. This can be a problem when the inserted dataset is larger than local memory.
What is the new behavior?
Although local partitioning and sorting cannot be avoided when they are defined on the Iceberg table, the insert can still be optimised in one special case: when the Iceberg table has no partition or sort spec defined. In this case the inserted dataset is split into batches that are sent directly to the remote Iceberg table one after another, without buffering the whole dataset locally, thus avoiding local memory exhaustion.
The number of rows per batch can be set with the foreign server option batch_size (default 4096). Each batch is written to an individual Parquet file, so a larger batch size is recommended when inserting large datasets.
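As a rough sketch of how this might be used (the server name iceberg_server and the tables orders_iceberg / orders_staging are hypothetical placeholders; use SET instead of ADD if batch_size was already defined on the server):

```sql
-- Raise the batch size so fewer, larger Parquet files are written per insert.
ALTER SERVER iceberg_server OPTIONS (ADD batch_size '100000');

-- With no partition or sort spec on the target Iceberg table, rows are
-- streamed to the remote table in batches instead of being buffered locally.
INSERT INTO orders_iceberg
SELECT * FROM orders_staging;
```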
Additional context
N/A