Flink: Dynamic Iceberg Sink: Optimise RowData evolution #13340


Merged: 6 commits from optimise-row-data-conversion into apache:main, Jun 26, 2025

Conversation

@aiborodin (Contributor):

RowDataEvolver recomputes Flink RowType and field getters for every input record that needs to match a destination Iceberg table schema. Cache field getters and column converters to optimise RowData conversion.
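For illustration, a minimal sketch of the caching idea (CachedRowDataConverter and its wiring are invented names, not the PR's code): build Flink's RowData.FieldGetter instances once per schema and reuse them for every record:

```java
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;

class CachedRowDataConverter {
  // Computed once per schema, reused for every record.
  private final RowData.FieldGetter[] getters;

  CachedRowDataConverter(RowType sourceType) {
    this.getters = new RowData.FieldGetter[sourceType.getFieldCount()];
    for (int i = 0; i < getters.length; i++) {
      getters[i] = RowData.createFieldGetter(sourceType.getTypeAt(i), i);
    }
  }

  RowData convert(RowData source) {
    // A real evolver would also apply per-column conversions to match the
    // destination Iceberg table schema; this sketch only copies fields.
    GenericRowData converted = new GenericRowData(getters.length);
    for (int i = 0; i < getters.length; i++) {
      converted.setField(i, getters[i].getFieldOrNull(source));
    }
    return converted;
  }
}
```

Constructing the getters once removes the per-record RowType traversal and getter allocation that the PR description refers to.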

@mxm (Contributor) left a comment:

Thanks for improving the performance on the conversion write path @aiborodin! It looks like this PR contains two separate changes:

  1. Adding caching to the conversion write path
  2. Refactoring RowDataEvolver to dynamically instantiate converter classes (quasi code generation)

I wonder if we can do (1) as a first step. RowDataEvolver so far has been static, and I understand that it needs to become an object in order to add the cache, but perhaps we can start with a central RowDataEvolver instance holding a cache keyed by source and target schema. I'm not sure the code generation yields much of a performance benefit, and I would like to minimize the number of objects being created.
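For illustration, a sketch of this first step (all names here are hypothetical, not the PR's code): a single, central evolver that looks up a cached converter per (source schema, target schema) pair:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;

class CachingRowDataEvolver {
  // Cache key: the (source, target) schema pair. A record supplies
  // equals()/hashCode() based on LogicalType equality.
  record SchemaPair(RowType source, RowType target) {}

  private final Map<SchemaPair, Function<RowData, RowData>> converters = new HashMap<>();

  RowData evolve(RowData row, RowType source, RowType target) {
    // The converter is built at most once per schema pair, then reused.
    return converters
        .computeIfAbsent(new SchemaPair(source, target), this::createConverter)
        .apply(row);
  }

  private Function<RowData, RowData> createConverter(SchemaPair pair) {
    // Precompute field getters and per-column conversions here (see the
    // CachedRowDataConverter sketch above); identity pass-through for brevity.
    return row -> row;
  }
}
```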

@aiborodin force-pushed the optimise-row-data-conversion branch from 913c0c6 to 0a6af3a on June 19, 2025 08:35.
@aiborodin (Contributor, Author):

According to the profile in my previous comment #13340 (comment), schema caching alone would not be sufficient; we also need to cache field accessors and converters to minimise the CPU overhead. The object overhead is minimal, as each converter only stores field accessors and conversion lambdas. The cache overhead is minimal because it is an identity cache, and the same schema objects are already cached in TableMetadataCache.
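To make the identity-cache point concrete, a minimal sketch (the class and wiring are hypothetical): since TableMetadataCache hands out the same Schema instances, the converter cache can compare keys by reference, so a hit costs a pointer comparison instead of a structural schema comparison:

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.Schema;

class IdentityConverterCache {
  // IdentityHashMap compares keys with ==, not equals(), so lookups never
  // traverse the schema structure. Not thread-safe; a real implementation
  // would need bounding and/or synchronization.
  private final Map<Schema, Function<RowData, RowData>> converters = new IdentityHashMap<>();

  Function<RowData, RowData> converterFor(
      Schema target, Function<Schema, Function<RowData, RowData>> factory) {
    return converters.computeIfAbsent(target, factory);
  }
}
```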

@aiborodin force-pushed the optimise-row-data-conversion branch from 0a6af3a to 5c63747 on June 20, 2025 07:12.
@mxm (Contributor) left a comment:

Thanks for explaining the rationale behind the change. This is an excellent contribution!

@aiborodin force-pushed the optimise-row-data-conversion branch 3 times, most recently from c918919 to 8e45f21, on June 25, 2025 04:27.
@pvary (Contributor) left a comment:

LGTM +1
A few small changes, and we are ready

@aiborodin force-pushed the optimise-row-data-conversion branch 3 times, most recently from eeb0687 to a888dc3, on June 25, 2025 09:25.
Commit: RowDataEvolver recomputes Flink RowType and field getters for every input record that needs to match a destination Iceberg table schema. Cache field getters and column converters to optimise RowData conversion.

Commit: TableMetadataCache already contains an identity cache to store schema comparison results. Let's move the row data converter cache into SchemaInfo and make it configurable.
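A hedged sketch of what "configurable" could mean here (the Caffeine-based wiring and the maxSize parameter are assumptions, not the merged code): a bounded cache whose weak keys are compared by identity, matching the identity-cache semantics above:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.function.Function;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.Schema;

class BoundedConverterCache {
  private final Cache<Schema, Function<RowData, RowData>> cache;

  BoundedConverterCache(long maxSize) { // maxSize would come from sink configuration
    // weakKeys() makes Caffeine compare keys by identity (==) and lets entries
    // be collected once the schema object is no longer referenced elsewhere.
    this.cache = Caffeine.newBuilder().weakKeys().maximumSize(maxSize).build();
  }

  Function<RowData, RowData> converterFor(
      Schema schema, Function<Schema, Function<RowData, RowData>> factory) {
    return cache.get(schema, factory);
  }
}
```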
@aiborodin force-pushed the optimise-row-data-conversion branch from 83e1150 to 2339e78 on June 26, 2025 04:11.
@mxm (Contributor) left a comment:

Thanks a lot @aiborodin!

@mxm (Contributor) commented on Jun 26, 2025:

Nice last commits 😂

@pvary merged commit 7443e54 into apache:main on Jun 26, 2025. 18 checks passed.
@pvary changed the title from "Optimise RowData evolution" to "Flink: Dynamic Iceberg Sink: Optimise RowData evolution" on Jun 26, 2025.
@pvary (Contributor) commented on Jun 26, 2025:

Merged to main.
Thanks @aiborodin for the optimization and @mxm for the review.

@aiborodin: Could you please create a backport PR to port these changes to Flink 1.20 and 1.19?
This sed command could help:

git diff HEAD~1 HEAD flink/v2.0 | sed "s/v2.0/v1.20/g" > /tmp/patch
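(The rewritten diff can then be applied from the repository root, for example with git apply /tmp/patch.)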

Also, if you need to change anything beyond cleanly applying the patch, please highlight it, so it is easier to review.

Thanks for all of your work on this! Happy to have you as a contributor!

@aiborodin deleted the optimise-row-data-conversion branch on June 27, 2025 06:00.
@aiborodin (Contributor, Author):

Thank you for merging and reviewing the change @pvary!
I appreciate your and @mxm's valuable feedback, and it's a pleasure to have you as reviewers.
I raised this PR to backport the changes to Flink 1.19 / 1.20: #13401.
