Description
What happened?
We are reporting an issue where BigQueryIO failed due to maxRequestSize
being exceeded. This occurred when attempting to append ProtoRows where the serialised size of a single row pushed the overall request beyond the BigQuery API's defined limit, despite the code's internal checks and a "TODO" comment suggesting this scenario is "nearly impossible".
Problematic Code Location
This issue originates around line 885 in StorageApiWritesShardedRecords.
Specifically, the code block:
// Handle the case where the request is too large.
if (inserts.getSerializedSize() >= maxRequestSize) {
  if (inserts.getSerializedRowsCount() > 1) {
    // TODO(reuvenlax): Is it worth trying to handle this case by splitting the protoRows?
    // Given that we split
    // the ProtoRows iterable at 2MB and the max request size is 10MB, this scenario seems
    // nearly impossible.
    LOG.error(
        "A request containing more than one row is over the request size limit of {}. "
            + "This is unexpected. All rows in the request will be sent to the failed-rows PCollection.",
        maxRequestSize);
  }
However, we found that the same problematic code also exists in StorageApiWriteUnshardedRecords, around line 640.
Context and Steps to Reproduce:
- Pipeline: We utilise an Apache Beam pipeline (running on version 2.62.0) that processes and writes data to BigQuery using BigQueryIO via the Storage Write API.
- Data Trigger: The incident was triggered by a specific, albeit rare, data pattern within the input stream. A particular record generated an exceptionally large serialised payload when converted to ProtoRows. This large payload, when included in its micro-batch, caused the total size of the request to exceed the 10MB BigQuery API limit.
Crucially, this micro-batch was constructed and submitted before any definitive check or pre-validation mechanism could prevent it from exceeding the BigQuery API's hard limit. This suggests a reliance on post-submission validation, or a lack of proactive size management for dynamically growing ProtoRows payloads.
- Observed Behaviour: Despite the internal BigQueryIO logic and comments suggesting this would not occur, our production pipeline encountered this exact issue. The described LOG.error message was triggered, leading to the failure of the affected micro-batch and downstream data synchronisation problems. A minimal sketch approximating this trigger follows below.
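For illustration only, here is a minimal sketch of the kind of write that can approach the limit. The destination table, schema, and payload size are hypothetical (my-project:my_dataset.my_table with a single STRING column named payload), and the exact size needed to trip the check depends on how rows end up batched into a single AppendRows request:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class OversizedRowRepro {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Build one row whose serialised payload is close to the 10MB AppendRows
    // limit; together with other rows in the same micro-batch it can push the
    // request past maxRequestSize.
    StringBuilder big = new StringBuilder(9_000_000);
    for (int i = 0; i < 9_000_000; i++) {
      big.append('x');
    }
    TableRow oversized = new TableRow().set("payload", big.toString());
    TableRow normal = new TableRow().set("payload", "small value");

    TableSchema schema =
        new TableSchema()
            .setFields(
                Collections.singletonList(
                    new TableFieldSchema().setName("payload").setType("STRING")));

    p.apply(Create.of(oversized, normal).withCoder(TableRowJsonCoder.of()))
        .apply(
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // hypothetical destination
                .withSchema(schema)
                .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}

In our production pipeline the oversized payload came from real streaming data rather than a synthetic string, but the effect on the serialised request size is the same.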
Expected Behaviour
The logic should robustly handle scenarios where serialised ProtoRows within a micro-batch, or even a single very large row, approach or exceed the maxRequestSize (10MB). This should involve:
- More sophisticated pre-validation or dynamic splitting mechanisms to ensure that individual AppendRows requests always remain within BigQuery API limits (one possible shape for this is sketched after this list).
- More resilient error handling that minimises the impact of an oversized record, ideally allowing the successful processing of other valid records in the batch.
- A re-evaluation and implementation of a solution for the "TODO" comment, acknowledging that this scenario is indeed possible in real-world production data.
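As a sketch of what the splitting could look like (our own illustration, not Beam's implementation; the names ProtoRowsSplitter, split, and tooLarge are invented here, and per-row ByteString size is used as an approximation of each row's contribution to the serialised request size):

import com.google.cloud.bigquery.storage.v1.ProtoRows;
import com.google.protobuf.ByteString;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: split an oversized ProtoRows batch into sub-batches that
 * each stay under maxRequestSize, and collect rows that are individually too
 * large so they can be routed to the failed-rows PCollection instead of
 * failing the whole request.
 */
final class ProtoRowsSplitter {

  static List<ProtoRows> split(
      ProtoRows oversized, long maxRequestSize, List<ByteString> tooLarge) {
    List<ProtoRows> batches = new ArrayList<>();
    ProtoRows.Builder current = ProtoRows.newBuilder();
    long currentSize = 0;

    for (ByteString row : oversized.getSerializedRowsList()) {
      // A single row that cannot fit into any request on its own: hand it to
      // the caller for failed-rows handling rather than retrying it forever.
      if (row.size() >= maxRequestSize) {
        tooLarge.add(row);
        continue;
      }
      // Start a new sub-batch when adding this row would push the current one
      // over the limit.
      if (currentSize + row.size() >= maxRequestSize && current.getSerializedRowsCount() > 0) {
        batches.add(current.build());
        current = ProtoRows.newBuilder();
        currentSize = 0;
      }
      current.addSerializedRows(row);
      currentSize += row.size();
    }
    if (current.getSerializedRowsCount() > 0) {
      batches.add(current.build());
    }
    return batches;
  }
}

Rows that are individually larger than maxRequestSize could then be routed to the failed-rows PCollection while the remaining sub-batches are appended normally, so one pathological record no longer fails the whole micro-batch.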
Actual Behaviour
The pipeline failed to write data to BigQuery, logging the LOG.error regarding the maxRequestSize being exceeded. This resulted in:
- Data sync failure for the affected micro-batch
- Interruption of critical data flows
- Operational effort due to manual intervention and remediation.
Proposed Solution / Request
- Prioritise the implementation of a robust solution for the "TODO" within StorageApiWritesShardedRecords (and StorageApiWriteUnshardedRecords), as our experience confirms it is a real-world edge case with production impact.
- Implement more resilient splitting or batching logic to prevent single large records or small groups of records from causing the entire micro-batch to fail against the 10MB BigQuery limit.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner