Scale writers when write is partitioned #10791

Open
findepi opened this issue Jan 25, 2022 · 6 comments
Labels: enhancement (New feature or request), performance

Comments

@findepi (Member) commented Jan 25, 2022

Support writer scaling when the write is partitioned, including writes to partitioned or bucketed Hive or Iceberg tables, and OPTIMIZE (which may force repartitioning, overriding the preferred-partitioning tuning knobs).

cc @sopel39

findepi added the enhancement (New feature or request) and performance labels on Jan 25, 2022
@vincentpoon (Member) commented:

+1 to this idea

@sopel39 (Member) commented May 19, 2022

This might be implemented by introducing a special adaptive partitioned exchange. Normally, a row is assigned to a single partition based on its hash. When there is skew, however, many rows are assigned to the same partition, and that partition's buffer fills up because the writes are slow.
The idea behind an adaptive partitioned exchange is that it would keep track of hash -> partition assignments. When it gets a signal that some partition buffer is blocked, it would rebalance those assignments. For example, it might decide that a particular skewed hash should be distributed between two (deterministic) partitions in round-robin fashion.
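A minimal sketch of that rebalancing idea (the class name, the blocked-buffer callback, and the split-to-two-partitions policy are all hypothetical, not Trino's actual exchange API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: routes rows to partitions by hash, but when a
// partition's output buffer reports back-pressure, splits every hash that
// targets only that partition across two deterministic partitions.
public class AdaptivePartitionFunction
{
    private final int partitionCount;
    // hash -> the partition(s) it may be routed to (starts with exactly one)
    private final Map<Long, int[]> assignments = new ConcurrentHashMap<>();
    // per-hash round-robin counters, created once a hash is split
    private final Map<Long, AtomicLong> roundRobin = new ConcurrentHashMap<>();

    public AdaptivePartitionFunction(int partitionCount)
    {
        this.partitionCount = partitionCount;
    }

    public int getPartition(long hash)
    {
        int[] targets = assignments.computeIfAbsent(hash,
                h -> new int[] {(int) Math.floorMod(h, (long) partitionCount)});
        if (targets.length == 1) {
            return targets[0];
        }
        // Skewed hash: alternate deterministically between its targets
        long counter = roundRobin.get(hash).getAndIncrement();
        return targets[(int) (counter % targets.length)];
    }

    // Called when the buffer for blockedPartition reports back-pressure;
    // gives every hash currently pinned to that partition a second target.
    public void onPartitionBlocked(int blockedPartition)
    {
        assignments.replaceAll((hash, targets) -> {
            if (targets.length == 1 && targets[0] == blockedPartition) {
                roundRobin.putIfAbsent(hash, new AtomicLong());
                int second = (int) Math.floorMod(hash + 1, (long) partitionCount);
                return new int[] {targets[0], second};
            }
            return targets;
        });
    }
}
```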

This solution is suitable for pipelined mode. For Tardigrade, I think insert tasks for a partition would probably need to be scaled differently (cc @losipiuk @arhimondr)

@arhimondr (Contributor) commented:

For fault-tolerant execution, it should be possible to split a partition dynamically. This mechanism could also be useful for handling skew in joins, though it is unclear when we will be able to start working on it.
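To illustrate what dynamic splitting could mean at the task level (everything here is a hypothetical sketch, not Trino's fault-tolerant scheduler): an oversized output partition is fanned out into several sub-partitions, each runnable and retryable as its own insert task.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: when one output partition is estimated to be much
// larger than the rest, fan it out into multiple sub-partitions, each
// handled by its own insert task.
public class PartitionSplitPlanner
{
    record TaskAssignment(int partition, int subPartition) {}

    public static List<TaskAssignment> planTasks(long[] partitionSizes, long targetBytesPerTask)
    {
        List<TaskAssignment> tasks = new ArrayList<>();
        for (int partition = 0; partition < partitionSizes.length; partition++) {
            // Oversized partitions get ceil(size / target) sub-partitions
            int subTasks = (int) Math.max(1,
                    (partitionSizes[partition] + targetBytesPerTask - 1) / targetBytesPerTask);
            for (int sub = 0; sub < subTasks; sub++) {
                tasks.add(new TaskAssignment(partition, sub));
            }
        }
        return tasks;
    }

    public static void main(String[] args)
    {
        // One hot partition (8 GB) among small ones; target 1 GB per task.
        long gb = 1L << 30;
        System.out.println(planTasks(new long[] {gb / 4, 8 * gb, gb / 2}, gb));
        // partition 1 is split into 8 sub-partition tasks
    }
}
```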

@damnMeddlingKid (Member) commented:

Ran into this recently and thought it would be helpful to add some context on our case. We write to Iceberg using transforms in our partition columns (hour(), day(), etc.).

When a partition column uses a transform, Trino uses Iceberg's bucket function for partitioning, which means the planner cannot choose writer scaling or redistributed writes.

This leads to writer skew when we write a single large partition.
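For example, a partition spec like the following (a minimal sketch using the Iceberg Java API; the schema is made up) gives every row from the same hour the same partition value, so one hot hour funnels through a single writer:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class TransformPartitionExample
{
    public static void main(String[] args)
    {
        // Table schema with a timestamp column used for partitioning
        Schema schema = new Schema(
                Types.NestedField.required(1, "id", Types.LongType.get()),
                Types.NestedField.required(2, "event_time", Types.TimestampType.withoutZone()));

        // hour() transform: all rows from the same hour share one partition value
        PartitionSpec spec = PartitionSpec.builderFor(schema)
                .hour("event_time")
                .build();

        System.out.println(spec); // prints the spec's transform fields
    }
}
```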

Here's a breakdown of what is happening:

  1. IcebergMetadata sees that the partition spec contains transform columns, so it creates a partitioning handle: https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L700
  2. LogicalPlanner sees that the connector has specified a partitioning handle, so it creates a PartitioningScheme based on it: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/sql/planner/LogicalPlanner.java#L591-L595
  3. ApplyPreferredTableWriterPartitioning does not match this node because there is no preferred partitioning.
  4. AddExchanges creates an exchange node based on the Iceberg bucket function: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/sql/planner/optimizations/AddExchanges.java#L634-L642

Here's a diagram that maps out all code paths related to partitioned writes in Iceberg.
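To make the resulting skew concrete, here is a toy illustration (not Trino code; the writer count and row count are arbitrary) of why a fixed hash exchange sends every row carrying the same partition value to a single writer:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.IntStream;

public class WriterSkewDemo
{
    public static void main(String[] args)
    {
        int writers = 8;
        Map<Integer, Integer> rowsPerWriter = new TreeMap<>();
        // 1,000,000 rows, all from the same day -> same partition value
        String partitionValue = "2022-01-25";
        IntStream.range(0, 1_000_000).forEach(i -> {
            int writer = Math.floorMod(partitionValue.hashCode(), writers);
            rowsPerWriter.merge(writer, 1, Integer::sum);
        });
        // Prints a single entry: one writer received every row
        System.out.println(rowsPerWriter);
    }
}
```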

@damnMeddlingKid (Member) commented:

I'd like to work on this issue, and I'm curious how the adaptive exchange idea differs from the current implementation of writer scaling. I've tried a change that forces writer scaling, and it works pretty well for Iceberg; however, we sometimes run into OOMs for Hive-based writes.
