Automatically configure BigQuery scan parallelism #22279
Conversation
Force-pushed from 06b922a to 0437630
plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryConfig.java (outdated, resolved)
plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQuerySplitManager.java (outdated, resolved)
// At least 100 to cater for cluster scale up
int desiredParallelism = Math.min(nodeManager.getRequiredWorkerNodes().size() * 3, 100);
@findepi Did you mean to clamp this to at most 100? The current code is min(worker * 3, 100), which can be less than 100. Did you mean max(100, min(worker * 3, 100))? Or change the comment to "Limit to 100 for very large clusters".
EDIT: Never mind, this number is now fed to a different parameter in the client, which sets the minimum count of streams, not the actual requested count.
Notice that the ReadSession creation takes two parameters: a maximal number of streams, and a preferred minimal number of streams, which the API tries to accommodate but treats as a recommendation only, not a binding limit. This logic is similar to what the Spark BigQuery connector does.
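For reference, a minimal sketch of such a request with the Java Storage Read API client — the project and table paths are placeholders, and the wiring into the connector is assumed here, not quoted from this PR:

```java
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadSession;

import java.io.IOException;

public class ReadSessionSketch
{
    public static ReadSession createSession(int desiredParallelism)
            throws IOException
    {
        try (BigQueryReadClient client = BigQueryReadClient.create()) {
            return client.createReadSession(CreateReadSessionRequest.newBuilder()
                    .setParent("projects/my-project") // placeholder project
                    .setReadSession(ReadSession.newBuilder()
                            .setTable("projects/my-project/datasets/my_dataset/tables/my_table") // placeholder table
                            .setDataFormat(DataFormat.AVRO)
                            .build())
                    // advisory lower bound: the service tries to accommodate it
                    .setPreferredMinStreamCount(desiredParallelism)
                    // 0 leaves the upper bound to the server
                    .setMaxStreamCount(0)
                    .build());
        }
    }
}
```

The returned session's getStreamsCount() gives the parallelism actually granted, which can differ from both bounds.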
By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?
@davidrabinowitz thank you for your input!
By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?
I think this is a very important question, but I definitely am not in a position to answer it. Please allow me to gloss over it.
Notice that the ReadSession creation takes two parameters: a maximal number of streams, and a preferred minimal number of streams, which the API tries to accommodate but treats as a recommendation only, not a binding limit.
Indeed. And this PR switches us from setting the max to setting the min.
The idea is that we actually don't want to have a static max. For example, if the data set is very large, we want to have a very large number of "reasonably sized" splits.
However, I don't know whether this is how it works.
@davidrabinowitz do we need to set both the 'recommended min' and 'strict max' parameters, or can we just set the 'recommended min'?
The Spark BigQuery connector seems to set both (https://github.com/search?q=repo%3AGoogleCloudDataproc%2Fspark-bigquery-connector+setPreferredMinStreamCount&type=code). They request at least 3 and at most 20k by default (which is odd, since IIRC 1k is the actual limit, as documented at https://cloud.google.com/php/docs/reference/cloud-bigquery-storage/latest/V1.CreateReadSessionRequest#_Google_Cloud_BigQuery_Storage_V1_CreateReadSessionRequest__setPreferredMinStreamCount__).
Setting a max might be useful, for example, if we want to avoid having too many splits.
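To make that concrete, a hedged sketch of clamping both ends, reusing the request builder from the sketch above; the floor of 3 mirrors the Spark default cited above and the 1,000 ceiling is the documented limit — neither number is taken from this PR:

```java
static CreateReadSessionRequest boundedRequest(int workerCount)
{
    // Hypothetical bounds, for illustration only.
    int preferredMin = Math.max(3, Math.min(workerCount * 3, 100)); // advisory floor
    int hardMax = 1_000; // binding ceiling, matching the documented stream limit
    return CreateReadSessionRequest.newBuilder()
            .setPreferredMinStreamCount(preferredMin)
            .setMaxStreamCount(hardMax)
            .build();
}
```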
Sure, but what should the max value be? 10k?
If the actual limit is 1k, we're fine.
By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?
@davidrabinowitz @anoopj I sent a DM in the Trino community Slack. Let's continue the discussion there.
plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/ReadSessionCreator.java (resolved)
Please rebase @findepi
Happy to, provided we have directional agreement about the change.
Let's go ahead with the change. It makes sense to me. If we ever see that BigQuery gives errors when requesting a stream count in high-concurrency workloads, then we can introduce some config to adjust the "multiplier value" so that we don't run into quotas/limits. For now though, no changes requested.
@findepi I'm in favor of having less configuration in connectors, so 👍🏼 from me.
I don't see any requests for changes except the editorial one: #22279 (comment).
pls fix conflicts
just rebased
Similar to what we do in other connectors. There is no reason to create multiple splits just to return row count information.
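A rough sketch of that idea, with hypothetical names — BigQuerySplit.forEmptyProjection, getRowCount, and createReadSessionSplits are illustrative, not the connector's actual API:

```java
// For a query that projects no columns (e.g. SELECT count(*)), a single
// split carrying the row count is enough; parallel read streams gain nothing.
private ConnectorSplitSource createSplits(BigQueryTableHandle table, List<BigQueryColumnHandle> columns)
{
    if (columns.isEmpty()) {
        long rowCount = getRowCount(table); // hypothetical row-count lookup
        return new FixedSplitSource(List.of(BigQuerySplit.forEmptyProjection(rowCount)));
    }
    return createReadSessionSplits(table, columns); // hypothetical parallel path
}
```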