Implement in-memory schema retriever #101
Conversation
This sounds great, but I'm a little confused at a high level. Can you explain how the memory retriever will return the most recent schema, and how we can be sure the cache isn't stale?
For every incoming message, we insert the schema for the message's table and topic into the cache. If we see invalid errors when trying to write rows to the database, we attempt a schema update. In BigQuery, any fields that are added to a schema must be nullable or repeated, which means we should only try to update the schema when writing a message that has new fields.
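The caching behavior described above can be sketched as follows. This is a hypothetical minimal version, not the actual MemorySchemaRetriever: String stands in for the real Kafka Connect Schema type, and the key format is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an in-memory schema cache. Every incoming message
// overwrites the cached entry for its table/topic pair, so a lookup always
// returns the most recently seen schema.
class MemorySchemaCache {
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  private static String key(String table, String topic) {
    return table + ":" + topic;
  }

  // Called for every incoming message.
  void setLastSeenSchema(String table, String topic, String schema) {
    cache.put(key(table, topic), schema);
  }

  // Returns the latest cached schema, or null on a cache miss.
  String retrieveSchema(String table, String topic) {
    return cache.get(key(table, topic));
  }
}
```

Because every message overwrites the entry, the cache is never older than the last message processed for that table/topic pair; staleness only matters between the last write and a schema change, which is exactly when the invalid-error path kicks in.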
  return schema;
}

return SchemaBuilder.struct().build();
So if we don't have a record, we just return an empty schema? Is that safe?
Yeah, by returning an empty schema the calling code will create a table without a schema. When we receive our first message and try to add it, we'll hit the invalid-schema case from the original code and update the schema with the schema from the message.
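A hedged sketch of that fallback, again with String standing in for the Kafka Connect Schema type and "EMPTY_STRUCT" standing in for SchemaBuilder.struct().build(); class and method names here are illustrative, not the actual connector API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the cache-miss fallback discussed above. Returning an empty
// schema lets the caller create a schemaless table; the first message then
// triggers the invalid-schema path and a real schema update.
class FallbackSchemaRetriever {
  private static final String EMPTY_STRUCT = "EMPTY_STRUCT";
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  void cacheSchema(String topic, String schema) {
    cache.put(topic, schema);
  }

  // Never returns null: a miss yields the empty struct schema instead.
  String retrieveSchema(String topic) {
    return cache.getOrDefault(topic, EMPTY_STRUCT);
  }
}
```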
Could you leave a comment to this effect, either on or in the function?
@@ -150,6 +153,10 @@ public void put(Collection<SinkRecord> records) {
for (SinkRecord record : records) {
  if (record.value() != null) {
    PartitionedTableId table = getRecordTable(record);
    if (schemaRetriever != null) {
will the schema retriever ever be null, or is this just for ease of testing?
The schemaRetriever is optional configuration and is only required when you want the connector to create the table or update the schema.
@@ -62,6 +62,10 @@ public AdaptiveBigQueryWriter(BigQuery bigQuery,
  this.schemaManager = schemaManager;
}

private boolean isTableMissingSchema(BigQueryException e) {
  return e.getReason().equalsIgnoreCase("invalid");
Let's have a comment linking to https://cloud.google.com/bigquery/troubleshooting-errors and a short explanation, so it's clear to anyone reading this code what "invalid" means to BigQuery.
Also, does this really need to use equalsIgnoreCase?
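The commented version the reviewer is asking for could look like this sketch. The reason string is taken here as a plain String for illustration rather than from a real BigQueryException, and whether equalsIgnoreCase is actually needed is exactly the open question above.

```java
class SchemaErrorCheck {
  // BigQuery reports writes against a table with a missing/empty schema
  // with an error whose reason is "invalid"; see
  // https://cloud.google.com/bigquery/troubleshooting-errors for the full
  // list of error reasons BigQuery can return.
  static boolean isTableMissingSchema(String reason) {
    return reason != null && reason.equalsIgnoreCase("invalid");
  }
}
```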
}
} catch (BigQueryException e) {
  if (isTableMissingSchema(e)) {
    attemptSchemaUpdate(tableId, topic);
Are there really two different types of responses we could get that imply we require a schema update? We have two paths here leading to the same result (attemptSchemaUpdate).
If a table has no schema, it will raise a BigQueryException and the writeResponse will never be set. If the table has an incorrect schema, bigQuery.insertAll will return a writeResponse containing the errors encountered.
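The two failure paths can be sketched with stand-in types (these are not the real BigQuery client classes, just a model of the control flow described above):

```java
// Stand-ins modelling the two failure modes: a schemaless table throws
// before any response exists, while an incorrect schema returns a response
// that carries per-row errors.
class WriteFlow {
  static class InsertException extends RuntimeException {
    final String reason;
    InsertException(String reason) { this.reason = reason; }
  }

  // insert() returns true when the (simulated) response has row errors.
  interface Inserter { boolean insert(); }

  // Returns which recovery path runs for a given insert outcome.
  static String classify(Inserter inserter) {
    try {
      // Path 1: a response exists; schema errors show up inside it.
      return inserter.insert() ? "update-schema-from-response" : "ok";
    } catch (InsertException e) {
      // Path 2: no response at all; the table had no schema.
      if ("invalid".equalsIgnoreCase(e.reason)) {
        return "update-schema-then-retry";
      }
      throw e; // unrelated failures still percolate up
    }
  }
}
```

So both branches converge on a schema update, but they genuinely arrive there through different channels: one via an exception, one via error details in the response.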
}

// If the table was missing its schema, we never received a writeResponse
writeResponse = bigQuery.insertAll(request);
This seems like a duplicate from the try block.
If a table has no schema, it will raise a BigQueryException and the writeResponse will never be set in the try block, so we need to retry inserting after adding a schema to the table.
@@ -106,6 +120,18 @@ public void performWriteRequest(PartitionedTableId tableId,
  logger.debug("table insertion completed successfully");
}

private boolean hasBigQueryResponseErrors(InsertAllResponse writeResponse, InsertAllRequest request) {
  return request != null && writeResponse != null && writeResponse.hasErrors();
When is request/writeResponse ever going to be null? I also feel like this method is hiding some functionality for not much gain (rather than just having this line in the one place where this method seems to be used).
You're right, I'll get rid of this function. writeResponse should not be null.
    attemptSchemaUpdate(tableId, topic);
  }
} catch (BigQueryException e) {
  if (isTableMissingSchema(e)) {
If I understand correctly, there are two possibilities where we need to update the table schema:
- the table exists and has a schema, but is missing some new columns (and thus needs to be updated)
- the table exists but has no/empty schema; this is new because it's only possible with the MemorySchemaRetriever.
Is my understanding accurate?
Yes, that's correct
}

// If the table was missing its schema, we never received a writeResponse
writeResponse = bigQuery.insertAll(request);
It still seems to me like we could be doing this twice. If this is only needed when the table was missing its schema, shouldn't this be in the catch block?
It is in the catch block; I can move it under the if (isTableMissingSchema(e)) check to make it more readable, though.
attemptSchemaUpdate(tableId, topic);

// If the table was missing its schema, we never received a writeResponse
writeResponse = bigQuery.insertAll(request);
Actually, I wonder if this should be reformatted a bit.
If we have a write response with invalid schema errors, then we retry the insert with an attempt count (lines 106-121) because the schema update might be delayed. I assume this is also a possibility if there is no schema at all.
So I feel like this insertAll should be covered in that block, and not here.
If we have no schema, the writeResponse will be null, so we'll run into issues during the while loop's writeResponse.hasErrors check. Just to clarify, are you thinking we should update the while loop to handle the case where the writeResponse is null? That seems reasonable to me, but I want to verify before implementing.
Yeah, I think the retry-after-schema-related-failure while loop should be updated to cover this type of schema-related failure.
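A sketch of that suggestion, with an Iterator<Boolean> standing in for successive insertAll outcomes (the real loop calls the BigQuery client; this just models the null-handling): a null response, which is what the exception path leaves behind, is treated the same as a response with schema errors, so both cases keep retrying until the schema update has propagated or the attempt budget runs out.

```java
import java.util.Iterator;

// Sketch of the suggested retry loop. Each call to outcomes.next()
// simulates one insertAll: TRUE = response with schema errors,
// FALSE = clean response, null = no response at all (exception path).
class RetryLoop {
  static int attemptsUntilClean(Iterator<Boolean> outcomes, int maxAttempts) {
    Boolean hasErrors = outcomes.next(); // initial insert
    int attempts = 1;
    // A null response is retried just like an error-bearing one.
    while ((hasErrors == null || hasErrors) && attempts < maxAttempts) {
      hasErrors = outcomes.next(); // re-attempt after the schema update
      attempts++;
    }
    return attempts;
  }
}
```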
One last thing!
  logger.trace("insertion failed");
- if (onlyContainsInvalidSchemaErrors(writeResponse.getInsertErrors())) {
+ if (writeResponse == null || onlyContainsInvalidSchemaErrors(writeResponse.getInsertErrors())) {
    // If the table was missing its schema, we never received a writeResponse
    logger.debug("re-attempting insertion");
    writeResponse = bigQuery.insertAll(request);
We should have a try/catch around this, right? Because we could get an exception if we have a null/empty schema. It's fine to just catch it and do nothing, but we should make sure the error doesn't percolate up if we intend to just retry.
Good catch, added!
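The fix could look like this sketch (a hypothetical helper, not the actual commit): the re-attempted insert is wrapped so an exception just leaves the response null for the next loop iteration instead of escaping the retry loop.

```java
import java.util.function.Supplier;

class SafeRetry {
  // Wraps one re-attempted insert. If the table still has a null/empty
  // schema the insert may throw again; swallow the exception and return
  // null so the surrounding retry loop simply tries once more instead of
  // letting the error percolate up.
  static Boolean safeInsert(Supplier<Boolean> insert) {
    try {
      return insert.get();
    } catch (RuntimeException e) {
      return null; // treated as "no response yet" by the retry loop
    }
  }
}
```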
Looks good!
This change implements an in-memory schema retriever which caches the last seen schema for a given topic (inspiration drawn from how the JDBC sink connector handles schema updates).
Our use case for this schema retriever is to be able to support creating and updating table schemas for serialization formats that do not use the schema registry.