
Deal with partial BigQuery failures more elegantly #53

Closed
mtagle opened this issue Oct 27, 2016 · 2 comments
mtagle commented Oct 27, 2016

BigQuery write requests can partially fail: some rows in a request may be written successfully, while others are not. If this is the case, the whole flush will be considered a failure and we'll end up with a stack trace like this:

Caused by: java.util.concurrent.ExecutionException: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: table insertion failed for the following rows:
    [row index 3000]: backendError: null
    [row index 3001]: backendError: null
    [row index 3002]: backendError: null

Since the whole flush is considered a failure, Kafka Connect will rebalance and end up re-writing all the rows to BQ, which results in duplicated rows.

This is not a huge issue (BQ views can be written to deduplicate the rows), but, if possible, it would be nice to take advantage of the fact that some rows were successfully written and only attempt to write the unsuccessful rows.
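For context, with the google-cloud-bigquery Java client (which this connector uses), a partial failure doesn't throw: it is reported as per-row errors on the response, so failing the whole batch is a choice made by the caller. A minimal sketch, assuming that client (the method and exception here are illustrative, not the connector's actual code):

```java
import java.util.List;
import java.util.Map;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryError;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;

// Sketch: the client reports partial failures on the response object instead of
// throwing, so treating any per-row error as a total failure happens here.
static void flush(BigQuery bigquery, InsertAllRequest request) {
  InsertAllResponse response = bigquery.insertAll(request);
  if (response.hasErrors()) {
    // Keyed by failed row index; rows absent from this map were written
    // successfully, but this code discards that fact and fails the whole flush.
    Map<Long, List<BigQueryError>> errors = response.getInsertErrors();
    throw new RuntimeException("table insertion failed for the following rows: " + errors);
  }
}
```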

mtagle changed the title from "Deal with partial BigQuery Failures more elegantly" to "Deal with partial BigQuery failures more elegantly" on Oct 27, 2016
whynick1 commented:
According to https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll:

[screenshot of the insertAll response documentation]

When there is a partial BQ write failure, a list of insertErrors is returned, which contains the indexes of the failed rows (I think "index" here means the index within the inserted row list, rather than a row number in the BQ table). So we can always filter out the rows that succeeded in our retry logic to eliminate duplication.
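A sketch of what that retry logic could look like with the google-cloud-bigquery Java client; the helper name, retry bound, and error handling are illustrative, not the connector's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryError;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllRequest.RowToInsert;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

// Hypothetical helper: re-send only the rows that failed on the previous attempt.
static void insertWithPartialRetry(BigQuery bigquery, TableId table,
                                   List<RowToInsert> rows, int maxAttempts) {
  List<RowToInsert> remaining = rows;
  for (int attempt = 0; attempt < maxAttempts && !remaining.isEmpty(); attempt++) {
    InsertAllResponse response =
        bigquery.insertAll(InsertAllRequest.newBuilder(table).setRows(remaining).build());
    if (!response.hasErrors()) {
      return; // every remaining row was written
    }
    // The keys of getInsertErrors() are indexes into *this* request's row list,
    // not positions in the BigQuery table.
    Map<Long, List<BigQueryError>> errors = response.getInsertErrors();
    List<RowToInsert> failed = new ArrayList<>();
    for (Long index : errors.keySet()) {
      failed.add(remaining.get(index.intValue()));
    }
    remaining = failed; // rows that succeeded are never re-sent
  }
  if (!remaining.isEmpty()) {
    throw new RuntimeException(remaining.size() + " rows still failing after " + maxAttempts + " attempts");
  }
}
```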

whynick1 commented:
Another interesting source, provided by @criccomini, about how BQ internally handles deduplication:
https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert

[screenshot from the "Life of a BigQuery streaming insert" blog post]
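The post describes best-effort deduplication keyed on each row's insertId: if a retried row arrives with the same insertId within a short window, the streaming backend drops the duplicate. With the Java client an insertId can be attached per row; a minimal sketch (the UUID scheme and field content are illustrative):

```java
import java.util.Map;
import java.util.UUID;

import com.google.cloud.bigquery.InsertAllRequest.RowToInsert;

// Attach an insertId to each row. Retrying the same rows with the same ids lets
// BigQuery's streaming backend drop the duplicates (best-effort, short window).
Map<String, Object> content = Map.of("field", "value");
RowToInsert row = RowToInsert.of(UUID.randomUUID().toString(), content);
```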
