
Mitigate Writer skewness when writing partitioned data with preferred partitioning enabled #14718

Merged · 2 commits · Nov 19, 2022
Conversation

@gaurav8297 (Member) commented on Oct 23, 2022

Description

Issue: #13379

Problem

  • Improve the performance of partitioned writes, specifically when writers/partitions are skewed.

  • Right now, preferred partitioning only kicks in if statistics are available and the number of partitions is greater than preferred-write-partitioning-min-number-of-partitions (defaults to 50). However, statistics are not always present, in which case partitioned writes go through an inefficient route. With scaling, we can enable preferred partitioning for any number of partitions and thus no longer have to rely on statistics (see the sketch after this list).
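
For illustration only, here is a minimal sketch of the planner decision described above. All names and signatures are hypothetical, not the actual Trino planner code; only the threshold constant mirrors the config property mentioned above.

```java
import java.util.OptionalDouble;

// Hypothetical sketch of the heuristic described in the bullet above.
final class PreferredPartitioningHeuristic
{
    // Mirrors preferred-write-partitioning-min-number-of-partitions (default 50).
    private static final int MIN_NUMBER_OF_PARTITIONS = 50;

    static boolean usePreferredPartitioning(OptionalDouble estimatedPartitionCount, boolean scaleWritersEnabled)
    {
        if (scaleWritersEnabled) {
            // With local partition scaling (this PR), a skewed partition no longer
            // bottlenecks a single writer, so preferred partitioning can be used
            // without relying on statistics.
            return true;
        }
        // Without scaling, fall back to the statistics-based threshold: if stats
        // are missing, partitioned writes take the less efficient route.
        return estimatedPartitionCount.isPresent()
                && estimatedPartitionCount.getAsDouble() > MIN_NUMBER_OF_PARTITIONS;
    }
}
```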

Benchmarks

Cluster with 6 worker nodes

Times are in minutes:seconds.

| Scenario | Without preferred partitioning | With preferred partitioning (before) | With preferred partitioning (after) |
|---|---|---|---|
| 1. Single partition (260M rows) | 1:31 | 18:34 | 2:55 |
| 2. 8 partitions (600M rows) | 2:11 | 6:23 | 1:48 |
| 3. 2000+ partitions with almost no skewness (600M rows) | 13:23 | 11:32 | 10:55 |
| 4. 2000+ partitions with 6 skewed partitions (2.74B rows) | 22:18 (400GB peak memory) | 55:26 (76.1GB peak memory) | 20:33 (79.5GB peak memory) |

In experiments 3 and 4, the finishing time is substantial (almost 9 additional minutes) and is included in the measurements.

@gaurav8297 gaurav8297 changed the title Local scale writers for partitioned data Scale partitions to multiple writers locally Oct 23, 2022
```diff
@@ -103,7 +103,7 @@ public void testScaledWritersUsedAndTargetSupportsIt()
         PlanNode root = planBuilder.output(
                 outputBuilder -> outputBuilder
                         .source(planBuilder.tableWithExchangeCreate(
-                                planBuilder.createTarget(catalogSupportingScaledWriters, schemaTableName, true),
+                                planBuilder.createTarget(catalogSupportingScaledWriters, schemaTableName, true, true),
```
Member:

nit: consider introducing builder

```java
import static io.trino.sql.planner.SystemPartitioningHandle.SINGLE_DISTRIBUTION;
import static java.util.Objects.requireNonNull;
import static java.util.function.Function.identity;

@ThreadSafe
public class LocalExchange
{
    private static final int MIN_PARTITION_COUNT_PER_WRITER = 128;
```
Member:

The name is a bit generic. Could you also add javadoc?

Contributor:

It feels like this level of granularity is a bit of an overkill. With a task concurrency of 32, the implementation would have to re-balance 4096 partitions and keep state for each partition in memory. Generally, only a few partitions are skewed. I wonder if we should start with something lower, maybe 8? (32 * 8 = 256)

Member Author:

With 8 or some other low value, it's possible that a few small partitions get scaled along with skewed ones because they belong to the same bucket. But it might be okay since 256 is per node. WDYT? @sopel39
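
As a rough illustration of the trade-off discussed in this thread, here is a hedged sketch of the bucket math; the class and method names are made up and this is not the actual LocalExchange code.

```java
// Hypothetical sketch of how the number of tracked rebalancing buckets follows
// from MIN_PARTITION_COUNT_PER_WRITER.
final class PartitionBucketing
{
    private static final int MIN_PARTITION_COUNT_PER_WRITER = 128;

    // Each local writer contributes MIN_PARTITION_COUNT_PER_WRITER buckets, so a
    // task writer concurrency of 32 yields 32 * 128 = 4096 buckets of per-partition
    // state per node. A smaller constant (e.g. 8 -> 256 buckets) keeps less state
    // but hashes more table partitions into the same bucket, so small partitions
    // may get scaled together with a skewed one.
    static int bucketCount(int localWriterCount)
    {
        return localWriterCount * MIN_PARTITION_COUNT_PER_WRITER;
    }

    static int bucketFor(long partitionHash, int localWriterCount)
    {
        return Math.floorMod(Long.hashCode(partitionHash), bucketCount(localWriterCount));
    }
}
```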

@gaurav8297 gaurav8297 marked this pull request as ready for review October 28, 2022 05:41
@gaurav8297 gaurav8297 requested review from arhimondr and kabunchi and removed request for kabunchi November 2, 2022 10:28
@gaurav8297 gaurav8297 changed the title Scale partitions to multiple writers locally Mitigate Writer skewness when writing partitioned data with preferred partitioning enabled Nov 2, 2022
@sopel39 (Member) commented on Nov 2, 2022

Up to "Introduce ScaleWriterPartitioningHandle " lgtm % comments

@gaurav8297 (Member Author) commented on Nov 7, 2022

> Do you know the reason of it?

I think there are two reasons:

  • No global scaling (Major): We don't have global scaling for preferred partitioning yet. However, with the tardigrade skewness work, this problem might get solved.
  • Local scaling speed (Minor): With preferred partitioning, we scale every 100MB, whereas on the unpartitioned route we do it on every page.

sopel39 previously approved these changes Nov 8, 2022

@sopel39 (Member) left a comment:

lgtm % Add scaleWriters flag in PartitioningHandle

@sopel39 sopel39 dismissed their stale review November 8, 2022 12:55

lgtm until Add scaleWriters flag in PartitioningHandle

```java
    }

    // Update the partitions row count state which will help with scaling partitions across writers
    partitionRebalancer.addPartitionsRowCount(writerPartitionIdToRowCount.buildOrThrow());
```
Member:

MAX_PARTITIONS_TO_REBALANCE_PER_WRITER * 32 = 4096, which means that in the worst case writerPartitionIdToRowCount will be as big as the page.

Consider reversing the relationship, e.g. partitionRebalancer fetching writerPartitionIdToRowCount from ScaleWriterPartitioningExchanger instead.

Member Author:

I thought about this but it just makes the code more complex. I wonder if it's worth the effort.

Member Author:

I reversed the relationship. PTAL
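
For context, a hedged sketch of what a pull-based relationship like the one suggested above could look like; the class and method names are illustrative and this is not the code in the PR.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative only: the exchanger increments shared counters on the hot path,
// and the rebalancer reads them when it runs, instead of receiving a freshly
// built map on every page.
final class WriterPartitionRowCounts
{
    private final int partitionCount;
    private final AtomicLongArray rowCounts; // index = writerId * partitionCount + partitionId

    WriterPartitionRowCounts(int writerCount, int partitionCount)
    {
        this.partitionCount = partitionCount;
        this.rowCounts = new AtomicLongArray(writerCount * partitionCount);
    }

    // Called by the exchanger while partitioning a page; no per-page map allocation.
    void add(int writerId, int partitionId, long rows)
    {
        rowCounts.addAndGet(writerId * partitionCount + partitionId, rows);
    }

    // Called by the rebalancer when deciding whether to scale a partition.
    long get(int writerId, int partitionId)
    {
        return rowCounts.get(writerId * partitionCount + partitionId);
    }
}
```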

@gaurav8297 (Member Author):

> @gaurav8297 Have you captured the CPU utilization as well by any chance?

In Trino, the CPU time is almost the same for all related queries. So, to understand this better, I created this doc containing the CPU utilization from the AWS console:

https://docs.google.com/document/d/1Bg5z-EzavtkXnBfhNLdJE3ngXRgrYrmmTVRJwiMhpTE/edit?usp=sharing

cc @arhimondr @sopel39

@gaurav8297 (Member Author):

@arhimondr @sopel39 PTAL again

@arhimondr (Contributor) left a comment:

LGTM % a couple of nits

@gaurav8297 (Member Author):

@sopel39 PTAL again

@sopel39 (Member) left a comment:

lgtm % comments

@sopel39 (Member) left a comment:

% comments

```java
{
    private final List<Integer> writerAssignments;
    // Partition estimated physical written bytes at the end of last rebalance cycle
    private final AtomicLong physicalWrittenBytesAtLastRebalance = new AtomicLong(0);
```
Member:

nit: it might be clearer if you had a field like lastScaleUpPhysicalWrittenBytes and didn't set physicalWrittenBytesAtLastRebalance to 0

First, figure out if there is skewness across the writers on a node; then find the biggest partitions in the skewed writers and scale them across the writers on the lower end, i.e. those that have written the smallest amount of data.

Scaling only happens if the skewness is above 70% and the partition to be scaled has written at least writerMinSize since the last scale-up.
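
To make the rule above concrete, here is a minimal sketch of the decision, assuming hypothetical inputs (per-writer written bytes, bytes written by the candidate partition since its last scale-up, and writerMinSize). It is not the actual rebalancer code, and the skewness formula is one possible definition, not necessarily the one used in the PR.

```java
import java.util.List;

final class SkewCheck
{
    // Scale only when the relative skewness across local writers exceeds 70%.
    private static final double SKEWNESS_THRESHOLD = 0.7;

    static boolean shouldScalePartition(List<Long> writtenBytesPerWriter, long partitionBytesSinceLastScaleUp, long writerMinSize)
    {
        long max = writtenBytesPerWriter.stream().mapToLong(Long::longValue).max().orElse(0L);
        long min = writtenBytesPerWriter.stream().mapToLong(Long::longValue).min().orElse(0L);
        if (max == 0) {
            return false;
        }
        // One possible way to quantify skewness: the relative gap between the most
        // and least loaded writers on this node.
        double skewness = (double) (max - min) / max;
        return skewness > SKEWNESS_THRESHOLD
                // ...and the candidate (one of the biggest partitions on the skewed
                // writer) has written at least writerMinSize since its last scale-up.
                && partitionBytesSinceLastScaleUp >= writerMinSize;
    }
}
```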