
Fix partitioned output page flushing #19762

Merged

Conversation

@pettyjamesm (Member) commented Nov 15, 2023

Description

Fixes a number of issues with the flushing behavior of PositionsAppenderPageBuilder and related classes. The fixes are:

  1. Consider PositionsAppenderPageBuilder full after at least 4x PageProcessor.MAX_BATCH_SIZE (4 * 8192 = 32,768) positions have been inserted, regardless of the current size in bytes. The previous behavior of only considering the builder full when declaredPositions == Integer.MAX_VALUE was unsafe, since PositionsAppenderPageBuilder#isFull() is only checked after inserting an entire page, not after each row, so integer overflows could occur.
  2. Accounting for the size of at least the current dictionary ids in UnnestingPositionsAppender when in dictionary mode, even though the size of the actually referenced entries is still not tracked. Previously, repeated insertions that used the same underlying dictionary would report no increase in the size of the builder and never flush. If at some later point a page were inserted with a different dictionary, the builder's transition to direct mode could then create very large memory allocations due to how many positions were buffered.
  3. Setting an upper bound on the resulting size of an UnnestingPositionsAppender in RLE mode if it were to be transitioned to direct mode. Before this change, RLE mode appenders could accumulate an unbounded number of positions and report no size changes so long as the same value kept being inserted. At the point at which a different value was inserted, the transition to direct mode could cause huge spikes in memory usage and cause query failures and/or worker node heap OOMs.

In particular, issues 2 and 3 are exceedingly common when performing CROSS JOIN UNNEST queries since the UnnestOperator will emit RLE and dictionary blocks that repeat the same value or use the same dictionary for extended periods before suddenly transitioning to new values or dictionaries.
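The interplay of the three fixes can be sketched as a single fullness check. This is a hypothetical simplification, not the actual Trino code: only PageProcessor.MAX_BATCH_SIZE (8192), the 4x position factor, and the 8x direct-size factor come from this PR; the class, method, and limit names below are illustrative.

```java
// Hypothetical sketch of the combined fullness check described above.
class FlushCheckSketch
{
    static final int MAX_BATCH_SIZE = 8192;                   // PageProcessor.MAX_BATCH_SIZE
    static final int MAX_POSITION_COUNT = 4 * MAX_BATCH_SIZE; // fix 1: 32,768 positions
    static final long MAX_SIZE_IN_BYTES = 1024 * 1024;        // illustrative byte limit
    static final long MAX_DIRECT_SIZE_IN_BYTES = 8 * MAX_SIZE_IN_BYTES; // fix 3: 8x bound

    static boolean isFull(int declaredPositions, long sizeInBytes, long directSizeInBytes)
    {
        // this check runs only after appending a whole page, so the position
        // bound must sit far below Integer.MAX_VALUE to leave overflow headroom
        return declaredPositions >= MAX_POSITION_COUNT
                || sizeInBytes >= MAX_SIZE_IN_BYTES
                || directSizeInBytes >= MAX_DIRECT_SIZE_IN_BYTES;
    }

    public static void main(String[] args)
    {
        // 40,000 buffered RLE positions: tiny in bytes, but over the position cap
        System.out.println(FlushCheckSketch.isFull(40_000, 1024, 1024));
    }
}
```

The third disjunct is what bounds RLE-mode appenders: even when the current size stays flat, the projected size after a forced transition to direct mode triggers a flush.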

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Fix an issue that could cause sudden increases in PagePartitioner memory consumption in some scenarios

@cla-bot cla-bot bot added the cla-signed label Nov 15, 2023
@pettyjamesm pettyjamesm changed the title Fix positions appender flushing Fix partitioned output page flushing Nov 15, 2023
@pettyjamesm pettyjamesm marked this pull request as ready for review November 15, 2023 22:09
@lukasz-stec left a comment

lgtm % tests

@sopel39 left a comment

lgtm % comments

@sopel39 left a comment

lgtm % comments % benchmarks (will have results soon)

@@ -26,6 +28,9 @@
public class PositionsAppenderPageBuilder
{
private static final int DEFAULT_INITIAL_EXPECTED_ENTRIES = 8;
@VisibleForTesting
not used in testing

pettyjamesm (Author) replied:
The tests do assert on this value, since the test logic requires knowing what the value is in order to trigger the flushing based on position count without reaching the size limit.

}

@Test
public void testFullOnDirectSizeInBytes()
could we test with both RLE and direct?

pettyjamesm (Author) replied:
To be honest, I'm not sure what we would really be asserting there. maxDirectSizeInBytes must be >= maxSizeInBytes so, we would always be considered full based on maxSizeInBytes and wouldn't be able to tell the difference except when appending in RLE mode.

@pettyjamesm pettyjamesm force-pushed the fix-positions-appender-flushing branch 2 times, most recently from 8d691e5 to d75e68b Compare November 21, 2023 14:53
@sopel39 (Member) commented Nov 21, 2023

benchmarks look good

| | label | TPCH wall time | TPC-DS wall time | TPCH CPU time | TPC-DS CPU time | TPCH peak mem | TPC-DS peak mem | TPCH network | TPC-DS network | TPCH input | TPC-DS input |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | baseline | 477.684667 | 731.906833 | 44121.5 | 67848.357333 | 1.217579e+09 | 1.032452e+09 | 1.441352e+09 | 1.401036e+09 | 1.383788e+09 | 1.294938e+09 |
| 1 | baseline | 480.248667 | 725.994167 | 43740.0 | 67811.382833 | 1.218880e+09 | 1.028290e+09 | 1.441439e+09 | 1.397992e+09 | 1.383787e+09 | 1.292503e+09 |
| 2 | after | 479.253667 | 741.316167 | 44653.8 | 68609.902667 | 1.209194e+09 | 1.025864e+09 | 1.431125e+09 | 1.396864e+09 | 1.383787e+09 | 1.292680e+09 |
| 3 | after | 492.526167 | 730.921333 | 43796.9 | 67022.969000 | 1.212940e+09 | 1.025419e+09 | 1.431126e+09 | 1.395533e+09 | 1.383787e+09 | 1.292576e+09 |

@@ -211,6 +211,11 @@ public long getSizeInBytes()
(rleValue != null ? rleValue.getSizeInBytes() : 0);
}

public long getDirectSizeInBytes()
@sopel39 commented Nov 21, 2023:
this should most likely return OptionalLong and non-empty only for DIRECT, RLE

pettyjamesm (Author) replied:
I think we still want to count the current size in bytes for direct appenders towards the total direct size, but with a special case for RLE where we want to report the size "as if it were flattened". Dictionaries of course are still unavoidably under-reported here.
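The semantics described in this reply can be sketched as follows. This is a hypothetical simplification of UnnestingPositionsAppender, with assumed state names and parameters, not the actual implementation:

```java
// Sketch: RLE appenders report size "as if flattened"; direct and dictionary
// appenders report their current size (dictionary entries stay under-reported).
class DirectSizeSketch
{
    enum State { DIRECT, RLE, DICTIONARY }

    static long directSizeInBytes(State state, long currentSizeInBytes,
            long rleValueSizeInBytes, int positionCount)
    {
        if (state == State.RLE) {
            // one copy of the single RLE value per buffered position
            return rleValueSizeInBytes * positionCount;
        }
        return currentSizeInBytes;
    }

    public static void main(String[] args)
    {
        // an 8-byte RLE value repeated 100,000 times flattens to ~800 KB
        System.out.println(DirectSizeSketch.directSizeInBytes(State.RLE, 8, 8, 100_000));
    }
}
```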

@pettyjamesm pettyjamesm force-pushed the fix-positions-appender-flushing branch from d75e68b to ee7ce59 Compare November 21, 2023 15:46
Flushes PositionsAppenderPageBuilder after reaching 4x the
PageProcessor.MAX_BATCH_SIZE positions. The previous limit of
Integer.MAX_VALUE positions could easily overflow since append
operations can insert more than a single row and isFull() is only
checked after inserting an entire page into the builder. Additionally,
buffering too many RLE rows in the builder can prevent it from reaching
the size limit and starve downstream tasks from receiving any input at
all.
Although the exact total size in bytes of UnnestingPositionsAppender
dictionaries is expensive to track, we should at least account for the
dictionary ids size. Otherwise, repeated dictionary insertions don't
increase the reported size at all.
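The ids-only accounting in this commit can be illustrated with a toy calculation (hypothetical method, not the actual UnnestingPositionsAppender code): each appended position buffers one int dictionary id, so reported size now grows with every insertion even though the shared dictionary itself is never re-measured.

```java
class DictionaryIdsSizeSketch
{
    // Illustrative: each buffered position holds one 4-byte int id.
    // Referenced dictionary entries are still excluded, so this remains an
    // under-estimate, but it is no longer zero growth on repeated inserts.
    static long dictionaryIdsSizeInBytes(int positionCount)
    {
        return (long) positionCount * Integer.BYTES;
    }

    public static void main(String[] args)
    {
        System.out.println(DictionaryIdsSizeSketch.dictionaryIdsSizeInBytes(8192)); // 32768
    }
}
```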
@pettyjamesm pettyjamesm force-pushed the fix-positions-appender-flushing branch from ee7ce59 to 00f4269 Compare November 21, 2023 17:04
Flushes PositionsAppenderPageBuilder entries if the cost of converting
all RLE channels into direct channels would exceed the maximum page size
by a factor of 8x.

Previously, page builders could buffer a very large number of RLE
positions without being considered full and then suddenly expand to huge
sizes when forced to transition to a direct representation as a result
of the RLE input value changing across one or more columns. In
particular, this can easily happen when pages are produced from CROSS
JOIN UNNEST operations.
@pettyjamesm pettyjamesm force-pushed the fix-positions-appender-flushing branch from 00f4269 to a3a6a25 Compare November 21, 2023 19:24
@sopel39 left a comment

lgtm % comments

@@ -211,6 +211,14 @@ public long getSizeInBytes()
(rleValue != null ? rleValue.getSizeInBytes() : 0);
}

void addSizesToAccumulator(PositionsAppenderSizeAccumulator accumulator)
@sopel39 commented Nov 22, 2023:

I think this could just return a record (and possibly OptionalLong for directSizeInBytes). IMO it would be cleaner and easier to understand than accumulator.

pettyjamesm (Author) replied:
I don’t think the extra allocation per appender per check is worthwhile for the small bit of extra clarity here. At that point, you can’t box the direct size into an OptionalLong because RLEs have both a “size” and a separate “direct size” value that need to be summed independently of each other.
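The accumulator shape under discussion can be sketched like this (hypothetical, simplified): one mutable instance visits every channel appender per check, avoiding a record allocation per appender, and each appender contributes to the two running totals independently.

```java
class PositionsAppenderSizeAccumulatorSketch
{
    private long sizeInBytes;
    private long directSizeInBytes;

    // Each appender adds both values: for a direct appender the two are equal,
    // for an RLE appender the direct size is the larger flattened estimate.
    void accumulate(long size, long directSize)
    {
        sizeInBytes += size;
        directSizeInBytes += directSize;
    }

    long getSizeInBytes()
    {
        return sizeInBytes;
    }

    long getDirectSizeInBytes()
    {
        return directSizeInBytes;
    }

    public static void main(String[] args)
    {
        PositionsAppenderSizeAccumulatorSketch acc = new PositionsAppenderSizeAccumulatorSketch();
        acc.accumulate(100, 100);      // a direct appender
        acc.accumulate(8, 80_000);     // an RLE appender: one 8-byte value, 10,000 positions
        System.out.println(acc.getDirectSizeInBytes()); // 80100
    }
}
```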

void addSizesToAccumulator(PositionsAppenderSizeAccumulator accumulator)
{
long sizeInBytes = getSizeInBytes();
// dictionary size is not included due to the expense of the calculation, so this will under-report for dictionaries
> dictionary size is not included due to the expense of the calculation

nit: actually there is io.trino.operator.output.UnnestingPositionsAppender#dictionary, so it's easy to account for it.

pettyjamesm (Author) replied:
It’s easy to account for the retained size, but not for which positions are actually in the output or the size of those positions, without expensive bookkeeping.

@pettyjamesm pettyjamesm merged commit ddae2ec into trinodb:master Nov 22, 2023
90 checks passed
@pettyjamesm pettyjamesm deleted the fix-positions-appender-flushing branch November 22, 2023 15:22
@github-actions github-actions bot added this to the 434 milestone Nov 22, 2023