schema: honour field-level logicalType annotations on record fields by Jeffail · Pull Request #38 · twmb/avro

Jeffail · 2026-05-13T14:50:03Z

Problem

Confluent's Java code generator, kafka-connect-avro-converter, and most Debezium CDC sources (Oracle, MySQL, PostgreSQL) emit Avro schemas with the logicalType annotation (and, for decimal, precision/scale) as a sibling of type on the field object rather than nested inside the type definition. The on-wire encoding is identical to the spec-blessed nested form — only the JSON layout differs.

// Java/JDBC/Debezium idiom (silently mis-parsed today):
{"name":"ts","type":"long","logicalType":"timestamp-millis"}
{"name":"ts","type":["null","long"],"logicalType":"timestamp-millis"}

// Spec-blessed nested form (parsed correctly today):
{"name":"ts","type":{"type":"long","logicalType":"timestamp-millis"}}
{"name":"ts","type":["null",{"type":"long","logicalType":"timestamp-millis"}]}

Before this change, the parser silently drops the field-level annotation because afield does not capture it. The schema is built as a plain-primitive long / int / string, and Encode/Decode against time.Time (or *big.Rat, [16]byte for UUID, etc.) fails:

avro: field ts: cannot use time.Time with Avro type long

Fix

Extend afield to capture Logical / Scale / Precision, and extend UnmarshalJSON to lift the field-level annotation into the type definition. After the lift the rest of the parser sees the canonical nested form.

Three shapes are covered:

Primitive type: {"type":"long","logicalType":"x"} → {"type":{"type":"long","logicalType":"x"}}
Nullable union: {"type":["null","long"],"logicalType":"x"} → {"type":["null",{"type":"long","logicalType":"x"}]} (applied to the first non-null branch that doesn't already carry its own annotation)
Already-object type: {"type":{"type":"long"},"logicalType":"x"} → fill in only the keys the inner object didn't declare

Conflict resolution: an annotation already present inside the type definition wins (closer-to-the-type wins) so an explicit author choice is never overridden by an outer scope. After lifting, the field-level copies are cleared so canonical re-emit doesn't duplicate them.

The existing complex-type lift (for "flat" record/enum/array/map/fixed fields) already propagated field-level keys into the nested type via its typeObj copy; the new code makes its handling of Logical/Scale/Precision explicit by clearing the field-level copies after the lift, preserving the previous canonical-output behaviour byte-for-byte.

Tests

TestFieldLevelLogicalType walks the matrix of timestamp-millis / timestamp-micros / local-timestamp-millis / date / time-millis / time-micros / uuid / decimal (with sibling precision/scale) under both primitive and nullable-union forms. For each, asserts the lifted parsed schema reports the same logicalType the nested form would.
TestFieldLevelLogicalType_RoundTripValue exercises the actual decoder via Encode/Decode of a time.Time through a flat-form schema (both primitive and union variants). Before the lift this errored with cannot use time.Time with Avro type long; after the lift the round-trip succeeds.
TestFieldLevelLogicalType_NestedAnnotationWins covers the edge case where both nested and field-level annotations are present — closer-to-the-type wins.

No existing tests required changes; the full pre-existing suite stays green.

Motivation

I hit this downstream in redpanda-data/connect while diagnosing a customer issue: an Oracle CDC pipeline whose Avro schemas use the field-level form was producing iceberg tables with BIGINT columns where the operator expected TIMESTAMP. The metadata-side parser in connect's schema_registry_decode had its own analogous bug; I fixed it there too. But making the upstream Avro library honour both forms means downstream value-side decoders no longer need a separate metadata-driven bridge to reconcile int64 values against Timestamp metadata — time.Time flows through end-to-end. The two halves of the symptom collapse into one fix.

The change is strictly more permissive — every schema that parsed before still parses, and the round-trip canonical output is unchanged.

Confluent's Java code generator, kafka-connect-avro-converter, and most Debezium CDC sources (Oracle, MySQL, PostgreSQL) emit Avro schemas with the `logicalType` annotation (and, for decimal, `precision`/`scale`) as a sibling of `type` on the field object rather than nested inside the type definition. For example, a nullable timestamp field commonly appears as: {"name":"ts","type":["null","long"],"logicalType":"timestamp-millis"} rather than the spec-blessed nested form: {"name":"ts","type":["null",{"type":"long","logicalType":"timestamp-millis"}]} Both shapes describe the same on-wire encoding; only the JSON layout differs. The current parser silently drops the field-level annotation because `afield` did not capture it, so reading such schemas produced a plain-primitive type and Encode/Decode against time.Time failed with "cannot use time.Time with Avro type long". This commit extends `afield` to capture `Logical`/`Scale`/`Precision` and `UnmarshalJSON` to lift the annotation into the type definition during parse. The lift handles three shapes: - Primitive type: `{"type":"long","logicalType":"x"}` → `{"type":{"type":"long","logicalType":"x"}}` - Nullable union: `{"type":["null","long"],"logicalType":"x"}` → `{"type":["null",{"type":"long","logicalType":"x"}]}` (applied to the first non-null branch that doesn't already carry its own annotation). - Already-object type: `{"type":{"type":"long"},"logicalType":"x"}` → fill in only the keys the inner object didn't declare. Conflict resolution: an annotation already present inside the type definition wins (closer-to-the-type wins) so an explicit author choice is never overridden by an outer scope. After lifting, the field-level copies are cleared so canonical re-emit doesn't duplicate them. The existing complex-type lift (for "flat" record/enum/array/map/fixed fields) already propagated field-level keys into the nested type via its typeObj copy; the new code makes its handling of Logical/Scale/ Precision explicit by clearing the field-level copies after the lift, preserving the previous canonical-output behaviour byte-for-byte. Tests: - TestFieldLevelLogicalType walks the matrix of timestamp-millis / timestamp-micros / local-timestamp-millis / date / time-millis / time-micros / uuid / decimal under both primitive and nullable- union forms; for each, asserts the lifted parsed schema reports the same logicalType the nested form would. - TestFieldLevelLogicalType_RoundTripValue exercises the actual decoder via Encode/Decode of a time.Time through a flat-form schema. Before the lift this errored with "cannot use time.Time with Avro type long"; after the lift the round-trip succeeds. - TestFieldLevelLogicalType_NestedAnnotationWins covers the closer-to-the-type wins semantics for the edge case where both nested and field-level annotations are present. No existing tests required changes.

Four additional tests pin down behaviours the original commit relied on but did not explicitly assert: - TestFieldLevelLogicalType_DecimalRoundTrip: end-to-end Encode/Decode of *big.Rat through both flat-form and union-form decimal schemas. Decimal is the most load-bearing case because the lift also propagates field-level precision/scale, not just logicalType. - TestFieldLevelLogicalType_CanonicalDoesNotDuplicate: regression guard against a future refactor that forgets to clear the field- level annotation after lifting. The canonical form must contain each of logicalType/precision/scale at most once. - TestFieldLevelLogicalType_FingerprintsMatch: byte-equality of Canonical() and SHA-256 Fingerprint() across the flat and nested forms for the full logical-type matrix. This is the load-bearing drop-in-compatibility invariant: downstream tooling (schema registries, schema caches) keys on fingerprints. - TestFieldLevelLogicalType_EncodeJSONMatchesNested: confirms the JSON encoder path produces identical output for both forms, since EncodeJSON is a separate code path from binary Encode. - TestFieldLevelLogicalType_MultiNonNullUnion: pins down the "first non-null branch wins" semantics for unusual unions like [null, long, string] with a sibling logicalType. The annotation is applied only to the first matching branch; subsequent branches are unchanged. No production code changes; the lift itself is unchanged from the previous commit.

The original RoundTripValue test only round-tripped timestamp-millis because the value-side decoder already has nested-form unit tests for every other type. Reviewing the PR surface, a reasonable reviewer question is "does timestamp-micros also work?" — and indeed every time-typed logical type runs through the same Encode/Decode path against time.Time, so adding them is a small extension that closes that question pre-emptively. The expanded matrix: - timestamp-millis (primitive + union) - timestamp-micros (primitive + union) - local-timestamp-millis - local-timestamp-micros - timestamp-nanos - date Each case picks an instant at the unit's full precision (e.g. 1700000000123456 microseconds, 1700000000123456789 nanoseconds) so that a regression where the parser quietly fell back to long would truncate sub-millisecond precision and the assertion would fail. No production code changes; the lift itself is unchanged.

twmb · 2026-05-13T18:02:02Z

Reviewing this applied on top of the current perf-and-audit branch (auto-merged schema.go and schema_test.go once, no conflicts). go build, go vet, and the full pre-existing test suite all stay green.

What works well

Surgical scope. All changes live inside afield.UnmarshalJSON + the new liftFieldLogicalIntoType / newLogicalObject / clonePtrInt helpers. The rest of the parser (buildPrimitive, validateLogical, the schema build, all encoders/decoders) sees the spec-blessed nested form unchanged.
Conflict resolution is per-field, closer-wins. For the {"type":{"type":"long"},"logicalType":"x"} shape the code only fills in Logical/Scale/Precision on the nested object when each is independently unset, and clears the field-level copies after — so canonical re-emit doesn't duplicate. TestFieldLevelLogicalType_NestedAnnotationWins and TestFieldLevelLogicalType_CanonicalDoesNotDuplicate lock both halves of this.
Fingerprint stability is asserted directly. TestFieldLevelLogicalType_FingerprintsMatch byte-compares flat-form vs nested-form Canonical() and SHA-256 fingerprints across primitive, union, and decimal shapes — the load-bearing invariant for schema-registry interop.
clonePtrInt correctly defensive-copies the int pointers across the lift, so a later mutation of the field-level copy cannot bleed back into the type.

Concerns (none blocking)

1. The "strictly more permissive" claim isn't quite right

Pre-PR, afield silently dropped unknown JSON keys (no Logical field, no extras map). That meant {"name":"x","type":"long","logicalType":"date"} parsed quietly as plain long — the bad annotation was thrown away and the field's data was decoded as raw int64 instead of the user's expected date. Post-PR the same schema now errors at parse:

invalid record field: invalid logicalType date type \"long\", can only be int

Same for int+uuid, string+uuid, int+timestamp-millis, bytes+date, every other strict-mismatch case in validateLogical. (decimal is an exception — its arm is intentionally lenient and silently clears the logical, so flat-form decimal-on-long still parses as plain long post-PR.)

In practice it's a consistency win: nested-form schemas ({\"type\":{\"type\":\"long\",\"logicalType\":\"date\"}}) have always errored on those mismatches, and post-PR flat-form behaves identically. Java itself silently ignores the mismatch (LogicalTypes.fromSchema returns null), so twmb is stricter than Java in both forms — but the PR brings the two twmb forms into alignment, which is what matters for users hitting this code path.

Suggested action: weaken the description's "strictly more permissive" wording, and add one pinning test asserting that flat-form {\"type\":\"long\",\"logicalType\":\"date\"} (and a couple of equivalents) now error. That way anyone considering reverting to the silent-tolerance has to confront it.

2. Union-lift can fall through to a second non-null branch when the first is already annotated

The union case in liftFieldLogicalIntoType:

case len(f.Type.union) > 0:
    for i := range f.Type.union {
        branch := &f.Type.union[i]
        if branch.primitive == \"null\" {
            continue
        }
        if branch.primitive != \"\" {
            f.Type.union[i] = aschema{object: f.newLogicalObject(branch.primitive)}
            break
        }
        if branch.object != nil && branch.object.Logical == \"\" {
            ...
            break
        }
    }

If the first non-null branch is an object with its own Logical already set, neither break fires and the loop continues to the next non-null branch. For the (admittedly weird) schema

{\"name\":\"v\",\"type\":[\"null\",{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},\"long\"],\"logicalType\":\"timestamp-micros\"}

the lift attaches timestamp-micros onto the second non-null branch (the trailing \"long\"), producing [null, long+timestamp-millis, long+timestamp-micros] — which then fails downstream with duplicate union type \"long\". Pre-PR this schema parsed as [null, long+timestamp-millis, long] (the field-level annotation was silently dropped).

The doc comment above the case says "If every non-null branch already has a nested annotation, the field-level one is redundant and we drop it," which matches what a user would expect — but the implementation only checks the first non-null branch and falls through to subsequent ones. The straightforward fix is break unconditionally after examining the first non-null branch:

case len(f.Type.union) > 0:
    for i := range f.Type.union {
        branch := &f.Type.union[i]
        if branch.primitive == \"null\" {
            continue
        }
        switch {
        case branch.primitive != \"\":
            f.Type.union[i] = aschema{object: f.newLogicalObject(branch.primitive)}
        case branch.object != nil && branch.object.Logical == \"\":
            branch.object.Logical = f.Logical
            if branch.object.Scale == nil {
                branch.object.Scale = clonePtrInt(f.Scale)
            }
            if branch.object.Precision == nil {
                branch.object.Precision = clonePtrInt(f.Precision)
            }
        }
        break // first non-null branch only
    }

A user mixing nested and field-level annotations on the same union is probably non-existent in the wild, so this is "fix when next touching the file" rather than urgent — but it'd be worth a tiny pin test of the form [null, long-with-its-own-logical, ...] + field-level to lock the corrected semantics.

3. Union name-reference + field-level logical errors at parse

{\"name\":\"f\",\"type\":[\"null\",\"SomeRecord\"],\"logicalType\":\"uuid\"} lifts to aschema{object: aobject{Type: \"SomeRecord\", Logical: \"uuid\"}}, which then errors at validateLogical: invalid logicalType uuid type \"SomeRecord\", must be string or fixed(16). That's a clear and actionable parse error — same as the equivalent nested form would produce. No fix needed; calling out so it's not surprising.

4. Test gap: hybrid nested-precision/scale + field-level logicalType

The PR's TestFieldLevelLogicalType_DecimalRoundTrip covers {\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":9,\"scale\":2} (everything at field level). It does not test the hybrid {\"type\":{\"type\":\"bytes\",\"precision\":10,\"scale\":2},\"logicalType\":\"decimal\"} shape (precision/scale already nested, only logicalType at field level). I verified that case round-trips a *big.Rat correctly post-PR — the lift fills Logical=\"decimal\" into the existing object without disturbing Precision/Scale. Worth adding because it exercises the case f.Type.object != nil arm specifically and is a natural "user added the logicalType later" shape.

Recommendation

Approve with the four notes above. None block. Suggested follow-ups:

(Required) Reword the description; add a pinning test asserting flat-form strict-mismatch logical now errors.
(Should) Tighten the union-lift loop to break unconditionally after the first non-null branch + add a pin.
(Nice-to-have) Add a TestFieldLevelLogicalType_DecimalRoundTrip hybrid case.

Everything else — lift logic, conflict resolution, canonical/fingerprint preservation, EncodeJSON parity, the test coverage — is in good shape. The change does what it claims for the common cases (Debezium CDC, Confluent codegen, Java kafka-connect-avro-converter) and the suite plus my probe tests confirm correctness across the matrix.

Inspection of AvroData.java on master (lines 1075-1086) shows Confluent's kafka-connect-avro-converter emits the spec-blessed nested form — baseSchema.addProp("logicalType", ...) where baseSchema is the type object, not the field. The flat form's actual source is hand-written .avsc files and Java code that calls Schema.Field.addProp("logicalType", ...) instead of LogicalTypes.x.addToSchema(field.schema()). Apache's TestSchemaWarnings.java reproduces this misuse explicitly and tracks it across AVRO-2015 (2017) and AVRO-3014 (Schema.java:1874 warning added Feb 2021). Changes: - Corrects the doc comment on TestFieldLevelLogicalType_RoundTrip to attribute the idiom to its real source (hand-written .avsc / Field.addProp API misuse) rather than to Confluent's converter. - Adds TestFieldLevelLogicalType_RealWorldFixtures with two verbatim public schemas plus origin URLs for auditability: * OneCricketeer/kafka-connect-sandbox record_v3.avsc * the AVRO-3014 / AVRO-2015 canonical Apache reproducer - Adds TestFieldLevelLogicalType_OneCricketeerRoundTrip exercising end-to-end Encode/Decode of time.Time through the real-world fixture — the exact bug scenario the lift fixes. Lifting remains a strict superset of the spec-blessed nested form: canonical [STRIP] removes logicalType regardless, so fingerprint behaviour is unchanged. The lift is purely an in-memory normalization producing the same parse result the user would have gotten with the spec-blessed nested form.

…atch pin, hybrid-decimal case) Three follow-ups from the review on twmb#38: 1. Fix the union-lift fall-through. liftFieldLogicalIntoType's loop used per-arm `break`s, so when the first non-null branch was an object with its own Logical already set (`branch.object != nil && branch.object.Logical != ""`) neither arm fired and the loop continued to the next non-null branch. For a schema like `[null, {type:int,logicalType:date}, "string"]` with a field-level `logicalType:"uuid"`, the lift incorrectly grafted `uuid` onto the trailing `string`, silently giving the user's plain string branch a uuid semantic. The fix collapses the two arms into a `switch` and `break`s unconditionally after examining the first non-null branch — closer-to-the-type wins; the field-level annotation is dropped if the first non-null branch is already annotated. Pinned by TestFieldLevelLogicalType_UnionPreAnnotatedFirstBranch (verified to fail under the buggy code: branch 2 acquires "uuid"; passes under the fix: branch 2 stays plain). 2. Pin the strict-mismatch behavior change. Pre-PR, `afield` had no Logical field so flat-form `{"type":"long","logicalType":"date"}` silently parsed as plain `long`, dropping the malformed annotation. Post-PR the annotation is lifted into the type object, runs through validateLogical, and the strict-mismatch arms produce a clear "invalid logicalType date type \"long\", can only be int"-class error at Parse. This is a strictness increase for malformed flat-form schemas, not a regression — but worth pinning so a future "be more permissive on flat-form" patch can't silently revert the behavior. TestFieldLevelLogicalType_ StrictMismatchErrors covers long+date / int+uuid / string+ts-millis / int+time-micros / bytes+date plus the union variant. Decimal is intentionally excluded — its validateLogical arm clears the annotation rather than erroring, separate decision. 3. Add a hybrid case to TestFieldLevelLogicalType_DecimalRoundTrip: `{"type":{"type":"bytes","precision":10,"scale":2},"logicalType": "decimal"}` — precision/scale already nested, only logicalType at field level. This exercises the `case f.Type.object != nil` arm specifically (the existing decimal cases all exercise the primitive / union arms). Round-trips *big.Rat correctly because the lift fills `Logical="decimal"` into the existing object without disturbing Precision/Scale. Full test suite stays green.

Ashley Jeffs added 3 commits May 13, 2026 15:48

twmb added 2 commits May 13, 2026 12:42

twmb merged commit 1e5c2a3 into twmb:main May 13, 2026

twmb mentioned this pull request May 13, 2026

test: pin flat-form lift composition with unknownLogical / cache rejection / CustomType #40

Merged

3 tasks

Jeffail mentioned this pull request May 14, 2026

confluent,iceberg,parquet: preserve avro logical types end-to-end across schema_metadata consumers redpanda-data/connect#4427

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

schema: honour field-level logicalType annotations on record fields#38

schema: honour field-level logicalType annotations on record fields#38
twmb merged 5 commits into
twmb:mainfrom
Jeffail:field-level-logical-type

Jeffail commented May 13, 2026

Uh oh!

twmb commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Jeffail commented May 13, 2026

Problem

Fix

Tests

Motivation

Uh oh!

twmb commented May 13, 2026

What works well

Concerns (none blocking)

1. The "strictly more permissive" claim isn't quite right

2. Union-lift can fall through to a second non-null branch when the first is already annotated

3. Union name-reference + field-level logical errors at parse

4. Test gap: hybrid nested-precision/scale + field-level logicalType

Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants