
Add support for topics with multiple schemas #238

Closed
mkubala wants to merge 7 commits

Conversation

@mkubala (Contributor) commented Jan 9, 2020

This contribution makes kafka-connect-bigquery compatible with topics that carry different types of messages, especially when Kafka is used for event sourcing.

The approach of putting several event types in a single topic is already supported by Kafka and Schema Registry (see https://www.confluent.io/blog/put-several-event-types-kafka-topic/).
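For context, the mechanism behind several event types per topic is Confluent's subject name strategy. A minimal, illustrative producer configuration might look like the sketch below; the hosts are placeholders, and the classes named are standard Confluent/Kafka ones, not anything introduced by this PR.

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://localhost:8081");
// Register value schemas under "<topic>-<record full name>" instead of "<topic>-value",
// so one topic can carry several Avro record types.
props.put("value.subject.name.strategy",
    "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");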

This PR has been derived from #219 and rebased on top of the recent changes made by @bingqinzhou

@jaroslawZawila

@mtagle @criccomini @C0urante Any progress on this one?

@mkubala (Contributor, Author) commented Jan 23, 2020

@mtagle @criccomini @C0urante Is there any chance for this feature to be merged and included in the next release?

I do not want to exert pressure on anyone here, but it's been almost 3 months since I started work on this contribution after receiving positive feedback in #175, and my team is blocked with their work on ML & reporting.

I'm between a rock and a hard place, and if I should start looking for another solution (writing my own custom service or connector?), I'd like to know before the team gets really mad.

@mtagle (Contributor) commented Jan 23, 2020

Hey @mkubala, I've asked @whynick1 to take a look at your PR. I'm eager to get this merged as well!

@skyzyx (Contributor) commented Jan 30, 2020

Codecov Report

Merging #238 into master will increase coverage by 0.25%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master     #238      +/-   ##
============================================
+ Coverage     68.69%   68.95%   +0.25%     
- Complexity      267      273       +6     
============================================
  Files            32       32              
  Lines          1460     1498      +38     
  Branches        152      153       +1     
============================================
+ Hits           1003     1033      +30     
- Misses          409      417       +8     
  Partials         48       48              
Impacted Files | Coverage Δ | Complexity Δ
...a/connect/bigquery/utils/TopicToTableResolver.java | 73.33% <0.00%> (-18.34%) | 11.00% <0.00%> (-1.00%)

mkubala requested a review from @whynick1 on January 30, 2020
@mkubala (Contributor, Author) commented Jan 30, 2020

Most of the comments have been addressed. I left some food for thought regarding moving the getSingleMatch method back from the config to the TopicToTableResolver.

Also, I'd be glad if you could take a look at #237 - it's another feature, essential to start using multi-schema topics. When they finally get merged, I'll open a third PR, which brings support for resolving datasets based on the topic and record names.

@whynick1 (Contributor) left a comment

Overall, it looks really good! Added a few suggested changes.

* @param schemaType schema type used to resolve full subject when recordName is absent.
* @return corresponding schema registry subject.
*/
public String toSubject(KafkaSchemaRecordType schemaType) {
Contributor

This is leaking Confluent Schema Registry stuff into the API layer, which must be agnostic to specific schema implementations. The concept of a subject is specific to Confluent.

In fact, the entire concept of a record name is specific to the Avro/Confluent implementation. Protobufs will probably work with this since they also have record names, but what about JSON? IIRC, there is another PR that is about supporting JSON schemas.

I am concerned that this PR is far too tied to the Confluent Schema Registry, especially in the API layer.

I am open to changing the API interfaces to make it easier on the Confluent Schema Registry, but not in a way that excludes other potential implementations (especially ones like JSON, where there is active interest). Can you suggest how this can be altered to not leak Confluent Schema Registry assumptions into the API?

@mkubala (Contributor, Author) commented Jan 31, 2020

Good point!
I'm working on decoupling this plugin from that particular schema implementation. The rough idea is to extract and encapsulate all the SR-related logic in specialized classes, so that it should be easy to provide support for different schemas.

@mkubala (Contributor, Author) commented Feb 3, 2020

The responsibility for assembling Confluent Schema Registry subject names has been moved to SchemaRegistrySchemaRetriever.
Also, the recordName field stays optional, so in the case of alternative implementations (like the aforementioned JSON) developers won't have to bother with it.
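A rough sketch of the kind of subject resolution described above, assuming Confluent's usual "<topic>-key"/"<topic>-value" and "<topic>-<record full name>" conventions; the method signature and body here are illustrative, not the actual code in SchemaRegistrySchemaRetriever.

// Illustrative only; assumes KafkaSchemaRecordType renders as "key" or "value".
String toSubject(String topic, String recordName, KafkaSchemaRecordType schemaType) {
  if (recordName == null) {
    // No record name: fall back to the classic "<topic>-key" / "<topic>-value" subject.
    return topic + "-" + schemaType.toString();
  }
  // Record name present: use the "<topic>-<record full name>" subject, which is what
  // allows several schemas to live in one topic.
  return topic + "-" + recordName;
}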

Contributor

Ok, so @whynick1 and I have done some talking and investigating. I think we have a good idea about the best approach to achieve not only what you're trying to do, but to evolve KCBQ to support things in a more generic way in general. This amounts to:

  1. Adding a pluggable TableRouter that takes a SinkRecord and returns which table it should be written to. (Default: RegexTableRouter)
  2. Changing SchemaRetriever interface to have two methods: Schema getKeySchema(SinkRecord) and Schema getValueSchema(SinkRecord). (Default: IdentitySchemaRetriever, which just returns sinkRecord.keySchema() and sinkRecord.valueSchema())
  3. Changing the way that schema updates are handled in the AdaptiveBigQueryWriter.

This approach should give us a ton of flexibility including:

  1. It will allow you to pick which table each individual message is routed to, based on all of the information in the SinkRecord (topic, key schema, value schema, message payload, etc.)
  2. It will allow us to fix some known schema-evolution bugs in cases where one field is added and another is dropped.
  3. It will allow us to use SinkRecord's .keySchema and .valueSchema rather than talking to the schema registry for schemas.
  4. It will make it easy to support JSON messages, even those that return null for the .keySchema() and .valueSchema() methods--you can implement a custom retriever for this case.
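To make the proposal concrete, here is a minimal sketch of what these interfaces might look like. The names are taken from the points above (TableRouter, SchemaRetriever, IdentitySchemaRetriever); the final shape is whatever the follow-up issue specifies, and each type would live in its own source file.

import com.google.cloud.bigquery.TableId;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.sink.SinkRecord;

// Pluggable router: decides which BigQuery table a record is written to.
public interface TableRouter {
  TableId getTable(SinkRecord sinkRecord);
}

// Pluggable schema source: key and value schemas for a record.
public interface SchemaRetriever {
  Schema getKeySchema(SinkRecord sinkRecord);
  Schema getValueSchema(SinkRecord sinkRecord);
}

// Default retriever: hand back the schemas already attached to the SinkRecord,
// without talking to any external schema registry.
public class IdentitySchemaRetriever implements SchemaRetriever {
  @Override
  public Schema getKeySchema(SinkRecord sinkRecord) {
    return sinkRecord.keySchema();
  }

  @Override
  public Schema getValueSchema(SinkRecord sinkRecord) {
    return sinkRecord.valueSchema();
  }
}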

I am going to write up a GH issue today that documents in detail what needs to be done to implement these changes, and we can go from there.

Does this sound good?

@whynick1 (Contributor) commented Feb 4, 2020

Some notes regarding @criccomini's points 1, 2, 3:

  1. As an example for your case (note that SinkRecord has no recordName() method; it stands in here for however the record name is obtained):

import java.util.Map;
import com.google.cloud.bigquery.TableId;
import org.apache.kafka.connect.sink.SinkRecord;

public class MultiSchemaRegexTableRouter implements TableRouter {
  private Map<TopicAndRecordName, TableId> topicsToBaseTableIds;

  @Override
  public TableId getTable(SinkRecord sinkRecord) {
    TopicAndRecordName key = new TopicAndRecordName(sinkRecord.topic(), sinkRecord.recordName());
    if (topicsToBaseTableIds.containsKey(key)) {
      return topicsToBaseTableIds.get(key);
    }
    // otherwise, return the BQ table matched from the topic & record name
    return null;
  }
}
  2. Splitting into getKeySchema and getValueSchema also provides an option to load a SinkRecord's key schema (if it has one) into BQ for various purposes, including deduplication, debugging, etc.
  3. Automatic schema evolution is currently the default (it should be tunable). If a new record is inserted into BQ with a new schema, KCBQ will try to update the BQ schema with the latest one retrieved from Schema Registry. Now we want to decouple from that and instead derive the new schema from the record itself, bearing in mind that a JSON-like record might not have a schema.

Contributor

See #245

Contributor (Author)

I really like this idea!

Yesterday I was struggling to change existing code on my local branch, responsible for picking datasets based on both topic and record name, so that it would keep the API part agnostic and completely unaware of different dataset routing strategies. At the end of the day I was a bit frustrated, because I had to either add yet another configurable class (similar to SchemaRetriever but for datasets) and require plugin users to remember to combine the right schema retriever with the right dataset resolver, or add responsibility for routing datasets to SchemaRetriever, thus breaking the single-responsibility principle...

The proposed TableRouter.getTable method solves all the problems!

@criccomini (Contributor)

@mkubala are you planning to update this with TableRouter?

@mkubala (Contributor, Author) commented Feb 24, 2020

Hi @criccomini!
Sorry for the late response. I cannot tell you if and when I'll find time to update this PR.

I've been working on it as a part of my daily job, and since we are behind the original schedule for the BigQuery integration, I found myself between a rock and a hard place.

My client, who paid for the contribution made so far, does not want to spend more money and wants to see the working integration in our product ASAP.
On the other hand, you guys want to keep the codebase as clean and maintainable as possible.
Due to personal reasons I cannot sacrifice my spare time to introduce TableRouter to this PR within a reasonable timeframe. Also, I have yet another PR to open (datasets based on both topic & record names).

I wonder if there is any chance that you could approve & merge this PR "as it is", with TableRouter introduced as a separate PR in the near future?

@criccomini (Contributor)

No, this PR adds too much complexity to the codebase, and isn't the right approach--the refactor is. I don't think it's much more work than this PR itself. I am reaching out to @rhauch to see if you all can allocate time for the refactor that's outlined in issue #245.

@mkubala (Contributor, Author) commented Feb 25, 2020

I understand. @rhauch please let me know if you start working on that.

@rhauch commented Feb 25, 2020

@mkubala sorry, I won’t have time to work on this.

@mkubala (Contributor, Author) commented Feb 26, 2020

There is light at the end of the tunnel! I spoke to my client this morning and they will let me finish the feature as part of my daily job if I give them a kind of hard estimate / deadline.

How much time will you need to review this PR and release a new version of the plugin?
Can you guarantee that after this PR gets merged (and optionally the other one, with support for matching datasets by both topic and record name), a new version of the connector plugin will be released?

@mustosm commented Apr 15, 2020

Hello,
We are very interested in this feature. Can you tell us when you would consider adding @mkubala's contribution to your release?

@mkubala (Contributor, Author) commented Apr 15, 2020

@mustosm Maybe I'll find some spare time in the next couple of weeks to finish this feature.
However, I have no idea how much time it will take to re-review this PR and release a new version, so if you cannot wait, build the connector from this branch on your own - that's what we did.

@mustosm commented Apr 15, 2020

@mkubala thank you for your answer. It would be better if the feature were released by wepay.
Are you using this feature in production today?

@mkubala (Contributor, Author) commented Apr 15, 2020

It's been used in our staging environment.

@OuesFa commented Nov 17, 2021

Hi @mkubala 👋
Any update on this please?
Are you using this in prod env?
Did you make a release somewhere?

@OuesFa commented Nov 29, 2021

👋
I'm sharing a solution for this using SMTs with the current version of the connector.
confluentinc/kafka-connect-bigquery#114 (comment)

@mkubala closed this Nov 28, 2022