
Automatically configure BigQuery scan parallelism #22279

Merged · 6 commits · Jun 11, 2024

Conversation

findepi (Member) commented Jun 5, 2024

No description provided.

Comment on lines 169 to 168
// At least 100 to cater for cluster scale up
int desiredParallelism = Math.min(nodeManager.getRequiredWorkerNodes().size() * 3, 100);

hashhar (Member) commented Jun 6, 2024

@findepi Did you mean to clamp this to 100 at-most? Current code is min(worker * 3, 100) which can be less than 100.
Did you mean max(100, min(worker * 3, 100))?

Or change the comment to Limit to 100 for very large clusters.

EDIT: Never mind, this number is now fed to a different parameter in the client, which sets the min count of streams, not the actual requested count.
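The min-vs-clamp distinction raised above can be shown with a small standalone sketch. The numbers (3 streams per worker, bounds of 100 and 1000) are illustrative, and `clamped` shows what an "at least 100" version would look like, which is not what the reviewed code computes:

```java
// Standalone illustration of the min-vs-clamp distinction discussed above.
public class ParallelismExample
{
    // What the reviewed code computes: at MOST 100, possibly far less.
    static int capped(int workerCount)
    {
        return Math.min(workerCount * 3, 100);
    }

    // What "at least 100" would look like instead (hypothetical upper bound of 1000).
    static int clamped(int workerCount)
    {
        return Math.max(100, Math.min(workerCount * 3, 1000));
    }

    public static void main(String[] args)
    {
        System.out.println(capped(10));  // 30 -- a 10-worker cluster stays well below 100
        System.out.println(clamped(10)); // 100 -- the lower bound kicks in
    }
}
```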

davidrabinowitz (Contributor)

Notice that the ReadSession creation takes two parameters - a maximal number of streams, and a preferred minimal number of streams, which the API tries to accommodate but treats as a recommendation only, not a binding limit. This logic is similar to what the Spark BigQuery connector is doing.

By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?
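A toy model can make the two parameters described above concrete. This is not the actual BigQuery service logic (which is not public) - just a sketch of the documented semantics, where the preferred minimum is an advisory hint and the max is a binding cap:

```java
// Toy model of the two CreateReadSession parameters discussed above.
// NOT the real BigQuery service logic -- just the documented semantics:
// preferredMinStreamCount is advisory; maxStreamCount (when non-zero)
// is a binding upper limit.
public class StreamCountModel
{
    static int pickStreamCount(int dataDrivenCount, int preferredMin, int maxStreamCount)
    {
        // Try to honor the advisory lower bound...
        int count = Math.max(dataDrivenCount, preferredMin);
        // ...but the hard cap, when set, always wins.
        if (maxStreamCount > 0) {
            count = Math.min(count, maxStreamCount);
        }
        return count;
    }

    public static void main(String[] args)
    {
        // Small table, min hint of 100, no cap: the hint raises the count.
        System.out.println(pickStreamCount(40, 100, 0));  // 100
        // Same hint, but a cap of 64: the cap overrides the hint.
        System.out.println(pickStreamCount(40, 100, 64)); // 64
    }
}
```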

findepi (Member, Author)

@davidrabinowitz thank you for your input!

By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?

I think this is a very important question, but I definitely am not in a position to answer it. Please allow me to gloss over it.

Notice that the ReadSession creation takes two parameters - a maximal number of streams, and a preferred minimal number of streams, which the API tries to accommodate but treats it as a recommendation only, not binding limit.

Indeed. And this PR switches us from setting the max, to setting the min.
The idea is that we actually don't want to have a static max. For example, if the data set is very large, we want to have a very large number of "reasonably sized" splits.
However, I don't know whether this is how it works.

@davidrabinowitz do we need to set both 'recommended min' and 'strict max' parameters? or can we just set the 'recommended min' ?

hashhar (Member) commented Jun 7, 2024

The Spark BigQuery connector seems to set both (https://github.com/search?q=repo%3AGoogleCloudDataproc%2Fspark-bigquery-connector+setPreferredMinStreamCount&type=code). They request at least 3 and at most 20k by default (which is odd since, IIRC, 1k is the actual limit as documented at https://cloud.google.com/php/docs/reference/cloud-bigquery-storage/latest/V1.CreateReadSessionRequest#_Google_Cloud_BigQuery_Storage_V1_CreateReadSessionRequest__setPreferredMinStreamCount__).

Setting max might be useful for example if we want to avoid having too many splits.

findepi (Member, Author)

Sure, but what should the max value be? 10k?
If the actual limit is 1k, we're fine.

Member

By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?

@davidrabinowitz @anoopj I sent a DM in the Trino community Slack. Let's continue the discussion there.

wendigo (Contributor) commented Jun 6, 2024

Please rebase @findepi

findepi (Member, Author) commented Jun 6, 2024

Please rebase @findepi

Happy to, provided we have directional agreement about the change.
That's not clear to me yet.

hashhar (Member) commented Jun 6, 2024

Let's go ahead with the change. It makes sense to me.

If we ever see that BigQuery gives errors when requesting a stream count in high concurrency workloads then we can introduce some config to adjust the "multiplier value" so that we don't run into quotas/limits. For now though, no changes requested.

wendigo (Contributor) commented Jun 6, 2024

@findepi I'm in favor of having less configuration in connectors so 👍🏼 from me

findepi (Member, Author) commented Jun 7, 2024

I don't see any requests for changes except the editorial one in #22279 (comment).
Can I get more review comments, or maybe an approval?
Of course, I am also hoping for a clarification in #22279 (comment).

hashhar (Member) commented Jun 7, 2024

Please fix conflicts.

findepi (Member, Author) commented Jun 7, 2024

just rebased

findepi merged commit 6c4c1d9 into master on Jun 11, 2024
25 checks passed
findepi deleted the findepi/bq-auto branch on June 11, 2024
github-actions bot added this to the 450 milestone on Jun 11, 2024
7 participants