Use Storage write API in BigQuery connector #18897

Merged
merged 1 commit into from Nov 22, 2023

@ebyhr ebyhr commented Sep 2, 2023

Release notes

(x) Release notes are required, with the following suggested text:

# BigQuery
* Improve performance when writing rows. ({issue}`18897`)

@cla-bot cla-bot bot added the cla-signed label Sep 2, 2023
@github-actions github-actions bot added the bigquery BigQuery connector label Sep 2, 2023
@ebyhr ebyhr marked this pull request as draft September 4, 2023 02:05
}

@Override
public CompletableFuture<?> appendPage(Page page)
{
- InsertAllRequest.Builder batch = InsertAllRequest.newBuilder(tableId);
+ JSONArray batch = new JSONArray();
for (int position = 0; position < page.getPositionCount(); position++) {

Should the data be batched based on a configurable request size limit? See https://cloud.google.com/bigquery/quotas#write-api-limits

Good to do, but this is a pre-existing issue.
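
A minimal sketch of the size-based batching raised in this thread, assuming rows are already serialized to strings. `RowBatcher`, `splitBySize`, and `maxBatchBytes` are hypothetical names, not part of the connector or the BigQuery client; the actual limit would come from the quota page linked above or from a connector config property.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Trino BigQuery connector.
final class RowBatcher
{
    private RowBatcher() {}

    // Splits serialized rows into batches whose total UTF-8 size stays at or below maxBatchBytes.
    // A single row larger than the limit is emitted alone so the caller can decide how to handle it.
    static List<List<String>> splitBySize(List<String> serializedRows, long maxBatchBytes)
    {
        if (maxBatchBytes <= 0) {
            throw new IllegalArgumentException("maxBatchBytes must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String row : serializedRows) {
            long rowBytes = row.getBytes(StandardCharsets.UTF_8).length;
            // Close the current batch before it would exceed the limit
            if (!current.isEmpty() && currentBytes + rowBytes > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(row);
            currentBytes += rowBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Each resulting batch could then be sent as its own append request. The sketch measures only the row payload, so a real implementation would likely need headroom for request overhead.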

@ebyhr ebyhr force-pushed the ebi/bigquery-storage-write branch 5 times, most recently from 6d1e190 to 3b10a3f Compare November 12, 2023 23:05
@ebyhr ebyhr marked this pull request as ready for review November 13, 2023 00:43
return NOT_BLOCKED;
}

private void insertWithCommitted(JSONArray batch)
{
WriteStream stream = WriteStream.newBuilder().setType(COMMITTED).build();

We are creating a new stream per Page.
BigQuery has a limit of 1k streams open at a time and 30k/4 hours (7.5k creations per hour).

Let's either document this if we think it shouldn't be a problem, or consider creating a single stream per page sink/table writer task.

Also, should we consider using "pending mode" long term to provide proper isolation? With the current mode ("committed"), if a single stream fails, writes from other streams still succeed and are visible, i.e. it's not ACID and there's no way to roll back. Note that this is the same behaviour we already had, so it's not a regression in that sense.
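
The "single stream per page sink" idea can be sketched as follows. This is an illustration under stated assumptions, not the connector's code: `RowStream` and `SingleStreamSink` are hypothetical stand-ins for the BigQuery write stream and the Trino page sink, and the stream factory represents whatever call actually creates the stream.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a BigQuery write stream; not the real client API.
interface RowStream
{
    void append(List<String> rows);
}

// Hypothetical sink that creates its stream lazily on the first page and reuses it
// for every later page, so stream creations scale with sinks rather than with pages.
final class SingleStreamSink
{
    private final Supplier<RowStream> streamFactory;
    private RowStream stream;

    SingleStreamSink(Supplier<RowStream> streamFactory)
    {
        this.streamFactory = streamFactory;
    }

    // Not thread-safe: assumes one thread drives a given sink.
    void appendPage(List<String> rows)
    {
        if (stream == null) {
            stream = streamFactory.get();
        }
        stream.append(rows);
    }
}
```

With this shape, the number of stream creations is bounded by the number of sinks, which is what matters for the creation quota mentioned above. It does not change the visibility semantics of "committed" mode.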


@hashhar hashhar left a comment

looks good to me

.orElseGet(remoteTableName::toTableName);
// TODO: Consider using PENDING mode
WriteStream stream = WriteStream.newBuilder().setType(COMMITTED).build();
CreateWriteStreamRequest createWriteStreamRequest = CreateWriteStreamRequest.newBuilder().setParent(tableName.toString()).setWriteStream(stream).build();

nit: multi-line

Suggested change
- CreateWriteStreamRequest createWriteStreamRequest = CreateWriteStreamRequest.newBuilder().setParent(tableName.toString()).setWriteStream(stream).build();
+ CreateWriteStreamRequest createWriteStreamRequest = CreateWriteStreamRequest.newBuilder()
+         .setParent(tableName.toString())
+         .setWriteStream(stream)
+         .build();

plugin/trino-bigquery/pom.xml (resolved)
@ebyhr ebyhr merged commit deb8ae0 into master Nov 22, 2023
15 of 17 checks passed
@ebyhr ebyhr deleted the ebi/bigquery-storage-write branch November 22, 2023 08:18
@github-actions github-actions bot added this to the 434 milestone Nov 22, 2023
hashhar commented Nov 22, 2023

@ebyhr Do the docs need updating for any new IAM permissions that are now required?
