Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide number of clients basing on average request size of client #15369

Merged
merged 2 commits into from
Dec 15, 2022

Conversation

radek-kondziolka
Copy link
Contributor

@radek-kondziolka radek-kondziolka commented Dec 12, 2022

Description

Let's define the class of queries that are run on skewed data and as a result on some downstream stage there is one (or few nodes) that gather data from empty or nearly empty nodes at upstream stage - i.e. at some upstream stage most nodes are empty (serving no data) or nearly empty (serving little data). We observed that such queries are executed very poorly on Trino.
This PR addresses that issue.

The query listed below is a representant of that class:

SELECT
    count(*),
    cc.cc_name,
    cc.cc_class,
    cc.cc_manager,
    cc.cc_mkt_desc,
    cc.cc_market_manager,
    cc.cc_division_name
FROM catalog_sales cs
    RIGHT JOIN call_center cc ON cc.cc_call_center_sk =  cs.cs_call_center_sk
    RIGHT JOIN store s ON cc.cc_closed_date_sk = s.s_closed_date_sk
GROUP BY 2, 3, 4, 5, 6, 7
ORDER BY 1, 2, 3, 4, 5, 6, 7
LIMIT 100;

Before the change it takes 11,5 minutes to finish that query on 32 nodes r6g.4xlarge cluster. After the change it takes 4 minutes to finish on same cluster.
No regresion in serial and throughput benchmarks.

Release notes

( *) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Dec 12, 2022
@radek-kondziolka radek-kondziolka force-pushed the rk/avarage_per_client branch 2 times, most recently from 5d5fe25 to b520ecf Compare December 12, 2022 12:36
@radek-kondziolka radek-kondziolka marked this pull request as ready for review December 12, 2022 12:37
@radek-kondziolka radek-kondziolka changed the title Rk/avarage per client Decide number of clients basing on average request size of client Dec 12, 2022
.filter(client -> !queuedClients.contains(client) && !completedClients.contains(client))
.mapToLong(HttpPageBufferClient::getAverageRequestSizeInBytes)
.sum();
long bytesToBeRequested = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

projectedBytesToBeRequested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let it be

@@ -106,7 +108,9 @@ public DirectExchangeClient(
ScheduledExecutorService scheduledExecutor,
LocalMemoryContext memoryContext,
Executor pageBufferClientCallbackExecutor,
TaskFailureListener taskFailureListener)
TaskFailureListener taskFailureListener,
ConcurrentHashMap<URI, HttpPageBufferClient> allClients,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better to expose package private getter for testing

Copy link
Contributor Author

@radek-kondziolka radek-kondziolka Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think so in that case. If I do it in that way I need to make the queuedClients collection to be concurrent or I need another mechanism of synchronization. I saw in the codebase that we use @VisibleForTesting constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visible for testing (package private)
getAllClients()
getQueuedClients()

synchronized void updateAverageRequestSize(int successfulRequests, long responseSize)
{
if (successfulRequests > 0) {
averageRequestSizeInBytes = (averageRequestSizeInBytes * (successfulRequests - 1) + responseSize) / successfulRequests;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in current could this is using doubles for computations:

            // AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n
            averageBytesPerRequest = (long) (1.0 * averageBytesPerRequest * (successfulRequests - 1) / successfulRequests + responseSize / successfulRequests);

I think its more accurate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

long responseSize = pages.stream().mapToLong(Slice::length).sum();
synchronized (HttpPageBufferClient.this) {
requestsCompleted.incrementAndGet();
updateAverageRequestSize(Math.max(0, requestsCompleted.get() - requestsFailed.get()), responseSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can requestsCompleted.get() - requestsFailed.get() become negative?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not. It is kind of defensive programming.

@@ -511,8 +532,8 @@ public void onSuccess(@Nullable StatusResponse result)
future = null;
}
lastUpdate = DateTime.now();
requestsCompleted.incrementAndGet();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

undo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need that. I am doing artihmetic on requestsFailed and requestsCompleted. Without synchronization my average would be invalid.

Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

@@ -106,7 +108,9 @@ public DirectExchangeClient(
ScheduledExecutorService scheduledExecutor,
LocalMemoryContext memoryContext,
Executor pageBufferClientCallbackExecutor,
TaskFailureListener taskFailureListener)
TaskFailureListener taskFailureListener,
ConcurrentHashMap<URI, HttpPageBufferClient> allClients,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visible for testing (package private)
getAllClients()
getQueuedClients()

@radek-kondziolka
Copy link
Contributor Author

radek-kondziolka commented Dec 13, 2022

It gives some gain in our TPCDS / TPCH benchmarks:

  • 32 nodes r6g.4xlarge, orc, unpart, sf1000:
    TPCDS walltime: -4.69%
    TPCH walltime: -6.87%

@radek-kondziolka
Copy link
Contributor Author

@sopel39 all comments addressed

Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment

Change the way how DirectExchangeClient.scheduleRequestIfNecessary calculates
the number of clients to be requested on the exchange phase to use an average
request size of specific client instead of aggregated average of all clients.
Tests for a new approach to calculate the number of clients to be
requested was added. Beyond that, the test for calculating average
size of request was added.
@radek-kondziolka
Copy link
Contributor Author

small comment

commit structure repaired

@sopel39 sopel39 merged commit ac661b9 into trinodb:master Dec 15, 2022
@sopel39 sopel39 mentioned this pull request Dec 15, 2022
@github-actions github-actions bot added this to the 404 milestone Dec 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants