We currently have a hard-coded sleep of 1000ms (BigQuerySinkTask.TABLE_WRITE_INTERVAL) between writes to each table. This is an expensive performance penalty when bootstrapping a large amount of data (i.e., when we're behind in the log). During load testing, I saw 30,000 rows (212 bytes each) take > 3 seconds to write to BigQuery using the streaming API. Single-writer throughput with the BQ streaming API seems to be in the 1-2 MB/sec range.
Eliminating this config will expose us to quota issues with BigQuery, which only allows 100,000 rows/sec/table. It's going to be really difficult for us to tune this properly via configuration in a distributed environment, since we'll have some number of writers spread across multiple machines. I think the right approach is to add back-off-and-retry logic when we receive a quota_exceeded error. The tasks would then handle quota errors automatically, and we could go full throttle when bootstrapping data.
Resolve Issue #14
Remove hard-coded throttling in between requests to stay under quota requirements.
If a request attempt returns a quotaExceeded error, pause and then retry the request, mirroring the existing 500/503 exception-handling logic.
(Migrated from internal Jira issue DI-448)
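The retry behavior described above could be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: `WriteAttempt`, `QuotaExceededException`, and `writeWithBackoff` are all hypothetical names, and the real change would hook into the connector's existing 500/503 retry path.

```java
import java.util.concurrent.ThreadLocalRandom;

public class QuotaBackoff {
    // Hypothetical stand-in for BigQuery's quotaExceeded error response.
    static class QuotaExceededException extends RuntimeException {}

    // Hypothetical stand-in for one streaming-insert request.
    interface WriteAttempt {
        void run() throws QuotaExceededException;
    }

    /**
     * Runs the write, retrying on quotaExceeded with exponential back-off
     * plus random jitter, up to maxRetries retries. Returns the number of
     * retries that were needed before the write succeeded.
     */
    static int writeWithBackoff(WriteAttempt write, int maxRetries,
                                long baseWaitMs) throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                write.run();
                return attempt;
            } catch (QuotaExceededException e) {
                if (attempt >= maxRetries) {
                    throw e;  // give up and surface the error to the task
                }
                // Exponential back-off with jitter, to de-synchronize the
                // distributed writers all hitting the same per-table quota.
                long wait = (baseWaitMs << attempt)
                        + ThreadLocalRandom.current().nextLong(baseWaitMs);
                Thread.sleep(wait);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a write that hits the quota twice before succeeding.
        int[] calls = {0};
        int retries = writeWithBackoff(() -> {
            if (calls[0]++ < 2) throw new QuotaExceededException();
        }, 5, 10);
        System.out.println("retries=" + retries);  // prints "retries=2"
    }
}
```

With jitter, concurrently throttled tasks retry at staggered times instead of re-colliding on the quota, which matters here because the writers are spread across multiple machines.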