feat(sink): support bigquery sink upsert #15780

xxhZs · 2024-03-19T08:18:35Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

#14882
Refactor the bigquery sink to use the Storage Write API, and support upsert in the bigquery sink.
https://cloud.google.com/bigquery/docs/write-api

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added test labels as necessary. See details.
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

The bigquery sink now supports upsert.
Users need to set the corresponding permissions and primary key in bigquery, as described in the bigquery documentation:
https://cloud.google.com/bigquery/docs/change-data-capture?hl=zh-cn

wenym1 · 2024-03-19T11:25:55Z

src/connector/src/sink/encoder/proto.rs

@@ -364,18 +409,61 @@ fn encode_field<D: MaybeData>(
                    Ok(Value::Message(message.transcode_to_dynamic()))
                })?
            }
+            (false, Kind::String) if is_big_query => {


It seems that the newly added custom_proto_type is only used to generate the is_big_query flag, and the flag is only used to control the logic when seeing different datatypes. If so, instead of adding this new parameter, we can just add new methods like on_timestamptz, on_jsonb ... to the MaybeData trait and then call the corresponding trait methods here. And then for bigquery we can implement its own MaybeData with new customized logic and then pass it to the encoder.

There is a type match between rw types and proto types. bigquery has some special matches (many of which are converted to string), and is_big_query is required to determine whether such a match holds.

The newly added logic does not look specific to bigquery. It's more like an extension of the original type compatibility and can be generalized for proto encoding used in sinks other than bigquery (cc @xiangjinwu ).

If so, we can remove the is_big_query flag and custom_proto_type and make it a general way of processing.

I agree it can be more general. It is better to introduce a TimestamptzHandlingMode and bigquery can just select one of the string formats, similar to the json encoder.

However it may not be another implementation of MaybeData (for bigquery). That trait is only meant to be implemented twice: once with type info alone (for validation), and once with concrete datum (for encoding). Given that we want to affect both validation and encoding here, it is supposed to be in encode_field here.
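
To make the point above concrete, here is a minimal standalone sketch, not the actual code in proto.rs. It assumes a simplified `MaybeData` trait with one implementation for type info alone and one for a concrete datum, plus a hypothetical `TimestamptzHandlingMode` enum whose variants (`Micros`, `SecondsString`) are invented for illustration. The `Kind` and `Value` enums are stand-ins for the proto reflection types, and `encode_timestamptz` is a hypothetical slice of `encode_field`. Because the mode is matched inside the shared function, it affects validation and encoding alike.

```rust
// Stand-ins for the proto reflection types (assumption: simplified for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Int64,
    String,
}

#[derive(Debug, PartialEq)]
enum Value {
    I64(i64),
    String(String),
}

// Hypothetical mode enum; the variant names are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TimestamptzHandlingMode {
    Micros,
    SecondsString,
}

// Simplified version of the trait: implemented exactly twice.
trait MaybeData {
    type Out;
    fn on_base(
        self,
        f: impl FnOnce(i64) -> Result<Value, String>,
    ) -> Result<Self::Out, String>;
}

// Validation: only type info is available, so the closure is never run.
struct TypeOnly;
impl MaybeData for TypeOnly {
    type Out = ();
    fn on_base(
        self,
        _f: impl FnOnce(i64) -> Result<Value, String>,
    ) -> Result<(), String> {
        Ok(())
    }
}

// Encoding: a concrete timestamptz datum, as microseconds since the epoch.
struct Datum(i64);
impl MaybeData for Datum {
    type Out = Value;
    fn on_base(
        self,
        f: impl FnOnce(i64) -> Result<Value, String>,
    ) -> Result<Value, String> {
        f(self.0)
    }
}

// The (kind, mode) match lives here, so an unsupported combination is
// rejected for both TypeOnly (validation) and Datum (encoding).
fn encode_timestamptz<D: MaybeData>(
    maybe: D,
    kind: Kind,
    mode: TimestamptzHandlingMode,
) -> Result<D::Out, String> {
    match (kind, mode) {
        (Kind::Int64, TimestamptzHandlingMode::Micros) => {
            maybe.on_base(|us| Ok(Value::I64(us)))
        }
        (Kind::String, TimestamptzHandlingMode::SecondsString) => maybe.on_base(|us| {
            // Placeholder string format: seconds with a six-digit fraction.
            let secs = us.div_euclid(1_000_000);
            let frac = us.rem_euclid(1_000_000);
            Ok(Value::String(format!("{}.{:06}", secs, frac)))
        }),
        _ => Err(format!("cannot encode timestamptz as {:?} in mode {:?}", kind, mode)),
    }
}
```

With this shape, a sink would pick a mode once and pass it down, and no sink-specific flag such as is_big_query would be needed at the match site.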

src/connector/src/sink/encoder/proto.rs

wenym1 · 2024-03-21T05:48:11Z

src/connector/src/sink/encoder/proto.rs


remove index fix fix

wenym1 · 2024-04-09T05:00:35Z

src/connector/src/sink/encoder/proto.rs

+                                         * Group C: experimental */
+        },
+        DataType::Int16 => match (expect_list, proto_field.kind()) {
+            (false, Kind::Int64) => maybe.on_base(|s| Ok(Value::I64(s.into_int16() as i64)))?,


We may also add support for casting to Int32 and Int16 while we are at it.
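
One possible reading of this suggestion is that an int16 column could also be accepted by narrower proto integer fields, not only by Int64. The sketch below is standalone and hypothetical: `TargetKind`, `Encoded`, and `encode_int16` are invented names and do not mirror the real `encode_field` signature. It only shows that widening an `i16` to `i32` or `i64` is lossless, so an extra match arm is cheap.

```rust
// Hypothetical stand-ins for the proto field kind and encoded value.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TargetKind {
    Int32,
    Int64,
    String,
}

#[derive(Debug, PartialEq)]
enum Encoded {
    I32(i32),
    I64(i64),
}

// Widen an int16 value to whichever integer kind the proto field declares.
// `From` conversions are used because widening never loses information.
fn encode_int16(v: i16, kind: TargetKind) -> Result<Encoded, String> {
    match kind {
        TargetKind::Int32 => Ok(Encoded::I32(i32::from(v))),
        TargetKind::Int64 => Ok(Encoded::I64(i64::from(v))),
        other => Err(format!("int16 cannot be encoded as {:?}", other)),
    }
}
```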

Cargo.lock

src/connector/with_options_sink.yaml

wenym1 · 2024-04-09T05:12:40Z