schema: honour field-level logicalType annotations on record fields#38
Conversation
Confluent's Java code generator, kafka-connect-avro-converter, and most
Debezium CDC sources (Oracle, MySQL, PostgreSQL) emit Avro schemas with
the `logicalType` annotation (and, for decimal, `precision`/`scale`) as
a sibling of `type` on the field object rather than nested inside the
type definition. For example, a nullable timestamp field commonly
appears as:
{"name":"ts","type":["null","long"],"logicalType":"timestamp-millis"}
rather than the spec-blessed nested form:
{"name":"ts","type":["null",{"type":"long","logicalType":"timestamp-millis"}]}
Both shapes describe the same on-wire encoding; only the JSON layout
differs. The current parser silently drops the field-level annotation
because `afield` did not capture it, so reading such schemas produced a
plain-primitive type and Encode/Decode against time.Time failed with
"cannot use time.Time with Avro type long".
This commit extends `afield` to capture `Logical`/`Scale`/`Precision`
and `UnmarshalJSON` to lift the annotation into the type definition
during parse. The lift handles three shapes:
- Primitive type: `{"type":"long","logicalType":"x"}` →
`{"type":{"type":"long","logicalType":"x"}}`
- Nullable union: `{"type":["null","long"],"logicalType":"x"}` →
`{"type":["null",{"type":"long","logicalType":"x"}]}` (applied to
the first non-null branch that doesn't already carry its own
annotation).
- Already-object type: `{"type":{"type":"long"},"logicalType":"x"}` →
fill in only the keys the inner object didn't declare.
Conflict resolution: an annotation already present inside the type
definition wins (closer-to-the-type wins) so an explicit author choice
is never overridden by an outer scope. After lifting, the field-level
copies are cleared so canonical re-emit doesn't duplicate them.
The existing complex-type lift (for "flat" record/enum/array/map/fixed
fields) already propagated field-level keys into the nested type via
its typeObj copy; the new code makes its handling of Logical/Scale/
Precision explicit by clearing the field-level copies after the lift,
preserving the previous canonical-output behaviour byte-for-byte.
Tests:
- TestFieldLevelLogicalType walks the matrix of timestamp-millis /
timestamp-micros / local-timestamp-millis / date / time-millis /
time-micros / uuid / decimal under both primitive and nullable-
union forms; for each, asserts the lifted parsed schema reports
the same logicalType the nested form would.
- TestFieldLevelLogicalType_RoundTripValue exercises the actual
decoder via Encode/Decode of a time.Time through a flat-form
schema. Before the lift this errored with "cannot use time.Time
with Avro type long"; after the lift the round-trip succeeds.
- TestFieldLevelLogicalType_NestedAnnotationWins covers the
closer-to-the-type wins semantics for the edge case where both
nested and field-level annotations are present.
No existing tests required changes.
Four additional tests pin down behaviours the original commit relied on
but did not explicitly assert:
- TestFieldLevelLogicalType_DecimalRoundTrip: end-to-end Encode/Decode
of *big.Rat through both flat-form and union-form decimal schemas.
Decimal is the most load-bearing case because the lift also
propagates field-level precision/scale, not just logicalType.
- TestFieldLevelLogicalType_CanonicalDoesNotDuplicate: regression
guard against a future refactor that forgets to clear the field-
level annotation after lifting. The canonical form must contain
each of logicalType/precision/scale at most once.
- TestFieldLevelLogicalType_FingerprintsMatch: byte-equality of
Canonical() and SHA-256 Fingerprint() across the flat and nested
forms for the full logical-type matrix. This is the load-bearing
drop-in-compatibility invariant: downstream tooling (schema
registries, schema caches) keys on fingerprints.
- TestFieldLevelLogicalType_EncodeJSONMatchesNested: confirms the
JSON encoder path produces identical output for both forms, since
EncodeJSON is a separate code path from binary Encode.
- TestFieldLevelLogicalType_MultiNonNullUnion: pins down the
"first non-null branch wins" semantics for unusual unions like
[null, long, string] with a sibling logicalType. The annotation
is applied only to the first matching branch; subsequent branches
are unchanged.
No production code changes; the lift itself is unchanged from the
previous commit.
The original RoundTripValue test only round-tripped timestamp-millis because the value-side decoder already has nested-form unit tests for every other type. Reviewing the PR surface, a reasonable reviewer question is "does timestamp-micros also work?" — and indeed every time-typed logical type runs through the same Encode/Decode path against time.Time, so adding them is a small extension that closes that question pre-emptively. The expanded matrix: - timestamp-millis (primitive + union) - timestamp-micros (primitive + union) - local-timestamp-millis - local-timestamp-micros - timestamp-nanos - date Each case picks an instant at the unit's full precision (e.g. 1700000000123456 microseconds, 1700000000123456789 nanoseconds) so that a regression where the parser quietly fell back to long would truncate sub-millisecond precision and the assertion would fail. No production code changes; the lift itself is unchanged.
|
Reviewing this applied on top of the current What works well
Concerns (none blocking)1. The "strictly more permissive" claim isn't quite rightPre-PR, Same for In practice it's a consistency win: nested-form schemas ( Suggested action: weaken the description's "strictly more permissive" wording, and add one pinning test asserting that flat-form 2. Union-lift can fall through to a second non-null branch when the first is already annotatedThe union case in case len(f.Type.union) > 0:
for i := range f.Type.union {
branch := &f.Type.union[i]
if branch.primitive == \"null\" {
continue
}
if branch.primitive != \"\" {
f.Type.union[i] = aschema{object: f.newLogicalObject(branch.primitive)}
break
}
if branch.object != nil && branch.object.Logical == \"\" {
...
break
}
}If the first non-null branch is an object with its own {\"name\":\"v\",\"type\":[\"null\",{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},\"long\"],\"logicalType\":\"timestamp-micros\"}the lift attaches The doc comment above the case says "If every non-null branch already has a nested annotation, the field-level one is redundant and we drop it," which matches what a user would expect — but the implementation only checks the first non-null branch and falls through to subsequent ones. The straightforward fix is case len(f.Type.union) > 0:
for i := range f.Type.union {
branch := &f.Type.union[i]
if branch.primitive == \"null\" {
continue
}
switch {
case branch.primitive != \"\":
f.Type.union[i] = aschema{object: f.newLogicalObject(branch.primitive)}
case branch.object != nil && branch.object.Logical == \"\":
branch.object.Logical = f.Logical
if branch.object.Scale == nil {
branch.object.Scale = clonePtrInt(f.Scale)
}
if branch.object.Precision == nil {
branch.object.Precision = clonePtrInt(f.Precision)
}
}
break // first non-null branch only
}A user mixing nested and field-level annotations on the same union is probably non-existent in the wild, so this is "fix when next touching the file" rather than urgent — but it'd be worth a tiny pin test of the form 3. Union name-reference + field-level logical errors at parse
4. Test gap: hybrid nested-precision/scale + field-level logicalTypeThe PR's RecommendationApprove with the four notes above. None block. Suggested follow-ups:
Everything else — lift logic, conflict resolution, canonical/fingerprint preservation, EncodeJSON parity, the test coverage — is in good shape. The change does what it claims for the common cases (Debezium CDC, Confluent codegen, Java kafka-connect-avro-converter) and the suite plus my probe tests confirm correctness across the matrix. |
Inspection of AvroData.java on master (lines 1075-1086) shows Confluent's
kafka-connect-avro-converter emits the spec-blessed nested form —
baseSchema.addProp("logicalType", ...) where baseSchema is the type
object, not the field. The flat form's actual source is hand-written
.avsc files and Java code that calls Schema.Field.addProp("logicalType",
...) instead of LogicalTypes.x.addToSchema(field.schema()). Apache's
TestSchemaWarnings.java reproduces this misuse explicitly and tracks
it across AVRO-2015 (2017) and AVRO-3014 (Schema.java:1874 warning
added Feb 2021).
Changes:
- Corrects the doc comment on TestFieldLevelLogicalType_RoundTrip
to attribute the idiom to its real source (hand-written .avsc /
Field.addProp API misuse) rather than to Confluent's converter.
- Adds TestFieldLevelLogicalType_RealWorldFixtures with two verbatim
public schemas plus origin URLs for auditability:
* OneCricketeer/kafka-connect-sandbox record_v3.avsc
* the AVRO-3014 / AVRO-2015 canonical Apache reproducer
- Adds TestFieldLevelLogicalType_OneCricketeerRoundTrip exercising
end-to-end Encode/Decode of time.Time through the real-world
fixture — the exact bug scenario the lift fixes.
Lifting remains a strict superset of the spec-blessed nested form:
canonical [STRIP] removes logicalType regardless, so fingerprint
behaviour is unchanged. The lift is purely an in-memory normalization
producing the same parse result the user would have gotten with the
spec-blessed nested form.
…atch pin, hybrid-decimal case) Three follow-ups from the review on twmb#38: 1. Fix the union-lift fall-through. liftFieldLogicalIntoType's loop used per-arm `break`s, so when the first non-null branch was an object with its own Logical already set (`branch.object != nil && branch.object.Logical != ""`) neither arm fired and the loop continued to the next non-null branch. For a schema like `[null, {type:int,logicalType:date}, "string"]` with a field-level `logicalType:"uuid"`, the lift incorrectly grafted `uuid` onto the trailing `string`, silently giving the user's plain string branch a uuid semantic. The fix collapses the two arms into a `switch` and `break`s unconditionally after examining the first non-null branch — closer-to-the-type wins; the field-level annotation is dropped if the first non-null branch is already annotated. Pinned by TestFieldLevelLogicalType_UnionPreAnnotatedFirstBranch (verified to fail under the buggy code: branch 2 acquires "uuid"; passes under the fix: branch 2 stays plain). 2. Pin the strict-mismatch behavior change. Pre-PR, `afield` had no Logical field so flat-form `{"type":"long","logicalType":"date"}` silently parsed as plain `long`, dropping the malformed annotation. Post-PR the annotation is lifted into the type object, runs through validateLogical, and the strict-mismatch arms produce a clear "invalid logicalType date type \"long\", can only be int"-class error at Parse. This is a strictness increase for malformed flat-form schemas, not a regression — but worth pinning so a future "be more permissive on flat-form" patch can't silently revert the behavior. TestFieldLevelLogicalType_ StrictMismatchErrors covers long+date / int+uuid / string+ts-millis / int+time-micros / bytes+date plus the union variant. Decimal is intentionally excluded — its validateLogical arm clears the annotation rather than erroring, separate decision. 3. Add a hybrid case to TestFieldLevelLogicalType_DecimalRoundTrip: `{"type":{"type":"bytes","precision":10,"scale":2},"logicalType": "decimal"}` — precision/scale already nested, only logicalType at field level. This exercises the `case f.Type.object != nil` arm specifically (the existing decimal cases all exercise the primitive / union arms). Round-trips *big.Rat correctly because the lift fills `Logical="decimal"` into the existing object without disturbing Precision/Scale. Full test suite stays green.
Problem
Confluent's Java code generator, kafka-connect-avro-converter, and most Debezium CDC sources (Oracle, MySQL, PostgreSQL) emit Avro schemas with the
logicalTypeannotation (and, for decimal,precision/scale) as a sibling oftypeon the field object rather than nested inside the type definition. The on-wire encoding is identical to the spec-blessed nested form — only the JSON layout differs.Before this change, the parser silently drops the field-level annotation because
afielddoes not capture it. The schema is built as a plain-primitivelong/int/string, andEncode/Decodeagainsttime.Time(or*big.Rat,[16]bytefor UUID, etc.) fails:Fix
Extend
afieldto captureLogical/Scale/Precision, and extendUnmarshalJSONto lift the field-level annotation into the type definition. After the lift the rest of the parser sees the canonical nested form.Three shapes are covered:
{"type":"long","logicalType":"x"}→{"type":{"type":"long","logicalType":"x"}}{"type":["null","long"],"logicalType":"x"}→{"type":["null",{"type":"long","logicalType":"x"}]}(applied to the first non-null branch that doesn't already carry its own annotation){"type":{"type":"long"},"logicalType":"x"}→ fill in only the keys the inner object didn't declareConflict resolution: an annotation already present inside the type definition wins (closer-to-the-type wins) so an explicit author choice is never overridden by an outer scope. After lifting, the field-level copies are cleared so canonical re-emit doesn't duplicate them.
The existing complex-type lift (for "flat"
record/enum/array/map/fixedfields) already propagated field-level keys into the nested type via itstypeObjcopy; the new code makes its handling ofLogical/Scale/Precisionexplicit by clearing the field-level copies after the lift, preserving the previous canonical-output behaviour byte-for-byte.Tests
TestFieldLevelLogicalTypewalks the matrix oftimestamp-millis/timestamp-micros/local-timestamp-millis/date/time-millis/time-micros/uuid/decimal(with siblingprecision/scale) under both primitive and nullable-union forms. For each, asserts the lifted parsed schema reports the samelogicalTypethe nested form would.TestFieldLevelLogicalType_RoundTripValueexercises the actual decoder viaEncode/Decodeof atime.Timethrough a flat-form schema (both primitive and union variants). Before the lift this errored withcannot use time.Time with Avro type long; after the lift the round-trip succeeds.TestFieldLevelLogicalType_NestedAnnotationWinscovers the edge case where both nested and field-level annotations are present — closer-to-the-type wins.No existing tests required changes; the full pre-existing suite stays green.
Motivation
I hit this downstream in redpanda-data/connect while diagnosing a customer issue: an Oracle CDC pipeline whose Avro schemas use the field-level form was producing iceberg tables with
BIGINTcolumns where the operator expectedTIMESTAMP. The metadata-side parser in connect'sschema_registry_decodehad its own analogous bug; I fixed it there too. But making the upstream Avro library honour both forms means downstream value-side decoders no longer need a separate metadata-driven bridge to reconcileint64values againstTimestampmetadata —time.Timeflows through end-to-end. The two halves of the symptom collapse into one fix.The change is strictly more permissive — every schema that parsed before still parses, and the round-trip canonical output is unchanged.