schema: honour field-level logicalType annotations on record fields #38

Merged: twmb merged 5 commits into twmb:main from Jeffail:field-level-logical-type on May 13, 2026

Conversation

@Jeffail Jeffail commented May 13, 2026

Problem

Confluent's Java code generator, kafka-connect-avro-converter, and most Debezium CDC sources (Oracle, MySQL, PostgreSQL) emit Avro schemas with the logicalType annotation (and, for decimal, precision/scale) as a sibling of type on the field object rather than nested inside the type definition. The on-wire encoding is identical to the spec-blessed nested form — only the JSON layout differs.

// Java/JDBC/Debezium idiom (silently mis-parsed today):
{"name":"ts","type":"long","logicalType":"timestamp-millis"}
{"name":"ts","type":["null","long"],"logicalType":"timestamp-millis"}

// Spec-blessed nested form (parsed correctly today):
{"name":"ts","type":{"type":"long","logicalType":"timestamp-millis"}}
{"name":"ts","type":["null",{"type":"long","logicalType":"timestamp-millis"}]}

Before this change, the parser silently drops the field-level annotation because afield does not capture it. The schema is built as a plain-primitive long / int / string, and Encode/Decode against time.Time (or *big.Rat, [16]byte for UUID, etc.) fails:

avro: field ts: cannot use time.Time with Avro type long
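
The silent drop can be reproduced with the standard library alone. The sketch below assumes the parser decodes fields with `encoding/json`, which ignores object keys that have no matching struct member. `fieldBefore`, `fieldAfter`, and `reencode` are hypothetical names for illustration, not the library's types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldBefore is a hypothetical field struct with no member for
// logicalType; encoding/json discards keys it cannot map.
type fieldBefore struct {
	Name string          `json:"name"`
	Type json.RawMessage `json:"type"`
}

// fieldAfter adds members for the sibling annotations so they
// survive decoding.
type fieldAfter struct {
	Name      string          `json:"name"`
	Type      json.RawMessage `json:"type"`
	Logical   string          `json:"logicalType,omitempty"`
	Precision *int            `json:"precision,omitempty"`
	Scale     *int            `json:"scale,omitempty"`
}

// reencode decodes src into both structs and marshals each back to
// JSON, showing what each struct retained.
func reencode(src string) (before, after string, err error) {
	var b fieldBefore
	if err = json.Unmarshal([]byte(src), &b); err != nil {
		return "", "", err
	}
	var a fieldAfter
	if err = json.Unmarshal([]byte(src), &a); err != nil {
		return "", "", err
	}
	bOut, err := json.Marshal(b)
	if err != nil {
		return "", "", err
	}
	aOut, err := json.Marshal(a)
	if err != nil {
		return "", "", err
	}
	return string(bOut), string(aOut), nil
}

func main() {
	src := `{"name":"ts","type":"long","logicalType":"timestamp-millis"}`
	before, after, err := reencode(src)
	if err != nil {
		panic(err)
	}
	fmt.Println("without Logical member:", before)
	fmt.Println("with Logical member:   ", after)
}
```

No error is reported in either case, which is why the missing annotation only surfaces later as an Encode/Decode failure.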

Fix

Extend afield to capture Logical / Scale / Precision, and extend UnmarshalJSON to lift the field-level annotation into the type definition. After the lift the rest of the parser sees the canonical nested form.

Three shapes are covered:

  • Primitive type: {"type":"long","logicalType":"x"} → {"type":{"type":"long","logicalType":"x"}}
  • Nullable union: {"type":["null","long"],"logicalType":"x"} → {"type":["null",{"type":"long","logicalType":"x"}]} (applied to the first non-null branch that doesn't already carry its own annotation)
  • Already-object type: {"type":{"type":"long"},"logicalType":"x"} → fill in only the keys the inner object didn't declare

Conflict resolution: an annotation already present inside the type definition wins (closer-to-the-type wins) so an explicit author choice is never overridden by an outer scope. After lifting, the field-level copies are cleared so canonical re-emit doesn't duplicate them.
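
The three shapes and the closer-to-the-type rule can be sketched on generic JSON values. This is a minimal illustration, not the library's implementation: `liftField` and `liftInto` are hypothetical helpers that work on `map[string]any` rather than the parser's own types, and the union arm looks only at the first non-null branch, which is one reading of the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// annotationKeys are the sibling keys moved from the field into its type.
var annotationKeys = []string{"logicalType", "precision", "scale"}

// liftInto merges the field-level annotations into typ. Keys already
// declared on a nested object are kept (closer-to-the-type wins).
func liftInto(typ any, ann map[string]any) any {
	switch t := typ.(type) {
	case string: // primitive: wrap it in an object carrying the annotation
		obj := map[string]any{"type": t}
		for k, v := range ann {
			obj[k] = v
		}
		return obj
	case map[string]any: // already an object: fill only undeclared keys
		for k, v := range ann {
			if _, declared := t[k]; !declared {
				t[k] = v
			}
		}
		return t
	case []any: // union: consider only the first non-null branch
		for i, br := range t {
			if s, ok := br.(string); ok && s == "null" {
				continue
			}
			if obj, ok := br.(map[string]any); ok {
				if _, annotated := obj["logicalType"]; annotated {
					break // nested annotation wins; drop the outer one
				}
			}
			t[i] = liftInto(br, ann)
			break
		}
		return t
	}
	return typ
}

// liftField rewrites one field object from the flat form to the
// nested form and clears the field-level copies.
func liftField(src string) (string, error) {
	var field map[string]any
	if err := json.Unmarshal([]byte(src), &field); err != nil {
		return "", err
	}
	if _, ok := field["logicalType"]; ok {
		ann := map[string]any{}
		for _, k := range annotationKeys {
			if v, present := field[k]; present {
				ann[k] = v
				delete(field, k)
			}
		}
		field["type"] = liftInto(field["type"], ann)
	}
	out, err := json.Marshal(field)
	return string(out), err
}

func main() {
	for _, src := range []string{
		`{"name":"ts","type":"long","logicalType":"timestamp-millis"}`,
		`{"name":"ts","type":["null","long"],"logicalType":"timestamp-millis"}`,
		`{"name":"d","type":{"type":"bytes","precision":10},"logicalType":"decimal","precision":9,"scale":2}`,
	} {
		out, err := liftField(src)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
}
```

Because `json.Marshal` sorts map keys, the printed key order differs from the input; only the nesting matters here.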

The existing complex-type lift (for "flat" record/enum/array/map/fixed fields) already propagated field-level keys into the nested type via its typeObj copy; the new code makes its handling of Logical/Scale/Precision explicit by clearing the field-level copies after the lift, preserving the previous canonical-output behaviour byte-for-byte.

Tests

  • TestFieldLevelLogicalType walks the matrix of timestamp-millis / timestamp-micros / local-timestamp-millis / date / time-millis / time-micros / uuid / decimal (with sibling precision/scale) under both primitive and nullable-union forms. For each, asserts the lifted parsed schema reports the same logicalType the nested form would.
  • TestFieldLevelLogicalType_RoundTripValue exercises the actual decoder via Encode/Decode of a time.Time through a flat-form schema (both primitive and union variants). Before the lift this errored with cannot use time.Time with Avro type long; after the lift the round-trip succeeds.
  • TestFieldLevelLogicalType_NestedAnnotationWins covers the edge case where both nested and field-level annotations are present — closer-to-the-type wins.

No existing tests required changes; the full pre-existing suite stays green.

Motivation

I hit this downstream in redpanda-data/connect while diagnosing a customer issue: an Oracle CDC pipeline whose Avro schemas use the field-level form was producing iceberg tables with BIGINT columns where the operator expected TIMESTAMP. The metadata-side parser in connect's schema_registry_decode had its own analogous bug; I fixed it there too. But making the upstream Avro library honour both forms means downstream value-side decoders no longer need a separate metadata-driven bridge to reconcile int64 values against Timestamp metadata — time.Time flows through end-to-end. The two halves of the symptom collapse into one fix.

The change is strictly more permissive — every schema that parsed before still parses, and the round-trip canonical output is unchanged.

Ashley Jeffs added 3 commits May 13, 2026 15:48
Confluent's Java code generator, kafka-connect-avro-converter, and most
Debezium CDC sources (Oracle, MySQL, PostgreSQL) emit Avro schemas with
the `logicalType` annotation (and, for decimal, `precision`/`scale`) as
a sibling of `type` on the field object rather than nested inside the
type definition. For example, a nullable timestamp field commonly
appears as:

    {"name":"ts","type":["null","long"],"logicalType":"timestamp-millis"}

rather than the spec-blessed nested form:

    {"name":"ts","type":["null",{"type":"long","logicalType":"timestamp-millis"}]}

Both shapes describe the same on-wire encoding; only the JSON layout
differs. The current parser silently drops the field-level annotation
because `afield` did not capture it, so reading such schemas produced a
plain-primitive type and Encode/Decode against time.Time failed with
"cannot use time.Time with Avro type long".

This commit extends `afield` to capture `Logical`/`Scale`/`Precision`
and `UnmarshalJSON` to lift the annotation into the type definition
during parse. The lift handles three shapes:

  - Primitive type: `{"type":"long","logicalType":"x"}` →
    `{"type":{"type":"long","logicalType":"x"}}`
  - Nullable union: `{"type":["null","long"],"logicalType":"x"}` →
    `{"type":["null",{"type":"long","logicalType":"x"}]}` (applied to
    the first non-null branch that doesn't already carry its own
    annotation).
  - Already-object type: `{"type":{"type":"long"},"logicalType":"x"}` →
    fill in only the keys the inner object didn't declare.

Conflict resolution: an annotation already present inside the type
definition wins (closer-to-the-type wins) so an explicit author choice
is never overridden by an outer scope. After lifting, the field-level
copies are cleared so canonical re-emit doesn't duplicate them.

The existing complex-type lift (for "flat" record/enum/array/map/fixed
fields) already propagated field-level keys into the nested type via
its typeObj copy; the new code makes its handling of Logical/Scale/
Precision explicit by clearing the field-level copies after the lift,
preserving the previous canonical-output behaviour byte-for-byte.

Tests:
  - TestFieldLevelLogicalType walks the matrix of timestamp-millis /
    timestamp-micros / local-timestamp-millis / date / time-millis /
    time-micros / uuid / decimal under both primitive and nullable-
    union forms; for each, asserts the lifted parsed schema reports
    the same logicalType the nested form would.
  - TestFieldLevelLogicalType_RoundTripValue exercises the actual
    decoder via Encode/Decode of a time.Time through a flat-form
    schema. Before the lift this errored with "cannot use time.Time
    with Avro type long"; after the lift the round-trip succeeds.
  - TestFieldLevelLogicalType_NestedAnnotationWins covers the
    closer-to-the-type wins semantics for the edge case where both
    nested and field-level annotations are present.

No existing tests required changes.

Five additional tests pin down behaviours the original commit relied on
but did not explicitly assert:

  - TestFieldLevelLogicalType_DecimalRoundTrip: end-to-end Encode/Decode
    of *big.Rat through both flat-form and union-form decimal schemas.
    Decimal is the most load-bearing case because the lift also
    propagates field-level precision/scale, not just logicalType.
  - TestFieldLevelLogicalType_CanonicalDoesNotDuplicate: regression
    guard against a future refactor that forgets to clear the field-
    level annotation after lifting. The canonical form must contain
    each of logicalType/precision/scale at most once.
  - TestFieldLevelLogicalType_FingerprintsMatch: byte-equality of
    Canonical() and SHA-256 Fingerprint() across the flat and nested
    forms for the full logical-type matrix. This is the load-bearing
    drop-in-compatibility invariant: downstream tooling (schema
    registries, schema caches) keys on fingerprints.
  - TestFieldLevelLogicalType_EncodeJSONMatchesNested: confirms the
    JSON encoder path produces identical output for both forms, since
    EncodeJSON is a separate code path from binary Encode.
  - TestFieldLevelLogicalType_MultiNonNullUnion: pins down the
    "first non-null branch wins" semantics for unusual unions like
    [null, long, string] with a sibling logicalType. The annotation
    is applied only to the first matching branch; subsequent branches
    are unchanged.

No production code changes; the lift itself is unchanged from the
previous commit.

The original RoundTripValue test only round-tripped timestamp-millis
because the value-side decoder already has nested-form unit tests for
every other type. Reviewing the PR surface, a reasonable reviewer
question is "does timestamp-micros also work?" — and indeed every
time-typed logical type runs through the same Encode/Decode path
against time.Time, so adding them is a small extension that closes
that question pre-emptively.

The expanded matrix:
  - timestamp-millis (primitive + union)
  - timestamp-micros (primitive + union)
  - local-timestamp-millis
  - local-timestamp-micros
  - timestamp-nanos
  - date

Each case picks an instant at the unit's full precision (e.g.
1700000000123456 microseconds, 1700000000123456789 nanoseconds) so
that a regression where the parser quietly fell back to long would
truncate sub-millisecond precision and the assertion would fail.
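
The choice of instant can be checked with plain `time` arithmetic. The sketch below only shows that passing the microsecond value quoted above through millisecond resolution loses information; `lostByMillis` is a hypothetical helper and does not touch the Avro encoder.

```go
package main

import (
	"fmt"
	"time"
)

// lostByMillis reports how many microseconds disappear when an instant
// given in Unix microseconds is forced through millisecond resolution.
func lostByMillis(unixMicros int64) int64 {
	full := time.UnixMicro(unixMicros)
	truncated := time.UnixMilli(full.UnixMilli())
	return full.Sub(truncated).Microseconds()
}

func main() {
	// A whole-millisecond instant survives truncation unchanged, so it
	// could not detect a precision regression.
	fmt.Println(lostByMillis(1700000000123000))
	// The instant from the commit message carries sub-millisecond
	// digits, so truncation changes it.
	fmt.Println(lostByMillis(1700000000123456))
}
```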

No production code changes; the lift itself is unchanged.
@twmb
Owner

twmb commented May 13, 2026

Reviewing this applied on top of the current perf-and-audit branch (auto-merged schema.go and schema_test.go once, no conflicts). go build, go vet, and the full pre-existing test suite all stay green.

What works well

  • Surgical scope. All changes live inside afield.UnmarshalJSON + the new liftFieldLogicalIntoType / newLogicalObject / clonePtrInt helpers. The rest of the parser (buildPrimitive, validateLogical, the schema build, all encoders/decoders) sees the spec-blessed nested form unchanged.
  • Conflict resolution is per-field, closer-wins. For the {"type":{"type":"long"},"logicalType":"x"} shape the code only fills in Logical/Scale/Precision on the nested object when each is independently unset, and clears the field-level copies after — so canonical re-emit doesn't duplicate. TestFieldLevelLogicalType_NestedAnnotationWins and TestFieldLevelLogicalType_CanonicalDoesNotDuplicate lock both halves of this.
  • Fingerprint stability is asserted directly. TestFieldLevelLogicalType_FingerprintsMatch byte-compares flat-form vs nested-form Canonical() and SHA-256 fingerprints across primitive, union, and decimal shapes — the load-bearing invariant for schema-registry interop.
  • clonePtrInt correctly defensive-copies the int pointers across the lift, so a later mutation of the field-level copy cannot bleed back into the type.

Concerns (none blocking)

1. The "strictly more permissive" claim isn't quite right

Pre-PR, afield silently dropped unknown JSON keys (no Logical field, no extras map). That meant {"name":"x","type":"long","logicalType":"date"} parsed quietly as plain long — the bad annotation was thrown away and the field's data was decoded as raw int64 instead of the user's expected date. Post-PR the same schema now errors at parse:

invalid record field: invalid logicalType date type "long", can only be int

Same for int+uuid, string+uuid, int+timestamp-millis, bytes+date, every other strict-mismatch case in validateLogical. (decimal is an exception — its arm is intentionally lenient and silently clears the logical, so flat-form decimal-on-long still parses as plain long post-PR.)

In practice it's a consistency win: nested-form schemas ({"type":{"type":"long","logicalType":"date"}}) have always errored on those mismatches, and post-PR flat-form behaves identically. Java itself silently ignores the mismatch (LogicalTypes.fromSchema returns null), so twmb is stricter than Java in both forms, but the PR brings the two twmb forms into alignment, which is what matters for users hitting this code path.

Suggested action: weaken the description's "strictly more permissive" wording, and add one pinning test asserting that flat-form {"type":"long","logicalType":"date"} (and a couple of equivalents) now error. That way anyone considering reverting to silent tolerance has to confront it.
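
A strict-mismatch check of the kind described can be sketched as a lookup from logical type to permitted underlying primitive. The table below follows the Avro specification for a handful of logical types; `allowedBase` and `checkLogical` are hypothetical names, the error text is illustrative, and the real validateLogical also handles cases (decimal, fixed(16) for uuid) that this sketch omits.

```go
package main

import "fmt"

// allowedBase lists, for a few logical types, the primitive each one
// may annotate. Hypothetical and deliberately incomplete.
var allowedBase = map[string]string{
	"date":             "int",
	"time-millis":      "int",
	"time-micros":      "long",
	"timestamp-millis": "long",
	"timestamp-micros": "long",
	"uuid":             "string",
}

// checkLogical returns an error when a known logical type sits on the
// wrong primitive. Logical types missing from the table are accepted.
func checkLogical(base, logical string) error {
	want, known := allowedBase[logical]
	if !known || want == base {
		return nil
	}
	return fmt.Errorf("invalid logicalType %s on type %q, can only be %s", logical, base, want)
}

func main() {
	// Once the flat form is lifted, both forms reach the same check,
	// so a mismatch that used to be dropped now surfaces.
	fmt.Println(checkLogical("long", "date"))
	fmt.Println(checkLogical("int", "date"))
}
```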

2. Union-lift can fall through to a second non-null branch when the first is already annotated

The union case in liftFieldLogicalIntoType:

case len(f.Type.union) > 0:
    for i := range f.Type.union {
        branch := &f.Type.union[i]
        if branch.primitive == "null" {
            continue
        }
        if branch.primitive != "" {
            f.Type.union[i] = aschema{object: f.newLogicalObject(branch.primitive)}
            break
        }
        if branch.object != nil && branch.object.Logical == "" {
            ...
            break
        }
    }

If the first non-null branch is an object with its own Logical already set, neither break fires and the loop continues to the next non-null branch. For the (admittedly weird) schema

{"name":"v","type":["null",{"type":"long","logicalType":"timestamp-millis"},"long"],"logicalType":"timestamp-micros"}

the lift attaches timestamp-micros onto the second non-null branch (the trailing "long"), producing [null, long+timestamp-millis, long+timestamp-micros], which then fails downstream with duplicate union type "long". Pre-PR this schema parsed as [null, long+timestamp-millis, long] (the field-level annotation was silently dropped).

The doc comment above the case says "If every non-null branch already has a nested annotation, the field-level one is redundant and we drop it," which matches what a user would expect, but the implementation only checks the first non-null branch and falls through to subsequent ones. The straightforward fix is to break unconditionally after examining the first non-null branch:

case len(f.Type.union) > 0:
    for i := range f.Type.union {
        branch := &f.Type.union[i]
        if branch.primitive == "null" {
            continue
        }
        switch {
        case branch.primitive != "":
            f.Type.union[i] = aschema{object: f.newLogicalObject(branch.primitive)}
        case branch.object != nil && branch.object.Logical == "":
            branch.object.Logical = f.Logical
            if branch.object.Scale == nil {
                branch.object.Scale = clonePtrInt(f.Scale)
            }
            if branch.object.Precision == nil {
                branch.object.Precision = clonePtrInt(f.Precision)
            }
        }
        break // first non-null branch only
    }

A user mixing nested and field-level annotations on the same union is probably non-existent in the wild, so this is "fix when next touching the file" rather than urgent — but it'd be worth a tiny pin test of the form [null, long-with-its-own-logical, ...] + field-level to lock the corrected semantics.
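
The difference between the two loop shapes can be run in isolation. The sketch below uses a hypothetical flattened `branch` type in place of the parser's aschema, so it models only the control flow: `liftFallThrough` mirrors the per-arm break structure and `liftFirstOnly` mirrors the unconditional break.

```go
package main

import "fmt"

// branch is a hypothetical stand-in for one union member.
type branch struct {
	Type    string // "null", "long", "string", ...
	Logical string // empty when the branch has no annotation
}

// liftFallThrough breaks only when it annotates a branch, so an
// already-annotated first branch lets the loop reach later branches.
func liftFallThrough(u []branch, logical string) {
	for i := range u {
		if u[i].Type == "null" {
			continue
		}
		if u[i].Logical == "" {
			u[i].Logical = logical
			break
		}
	}
}

// liftFirstOnly examines the first non-null branch and then stops,
// whether or not it annotated anything.
func liftFirstOnly(u []branch, logical string) {
	for i := range u {
		if u[i].Type == "null" {
			continue
		}
		if u[i].Logical == "" {
			u[i].Logical = logical
		}
		break // first non-null branch only
	}
}

// newUnion builds [null, long+timestamp-millis, long].
func newUnion() []branch {
	return []branch{{Type: "null"}, {Type: "long", Logical: "timestamp-millis"}, {Type: "long"}}
}

func main() {
	a := newUnion()
	liftFallThrough(a, "timestamp-micros")
	fmt.Printf("fall-through: %+v\n", a)

	b := newUnion()
	liftFirstOnly(b, "timestamp-micros")
	fmt.Printf("first-only:   %+v\n", b)
}
```

With the fall-through shape the trailing long picks up timestamp-micros; with the first-only shape it stays plain and the field-level annotation is dropped.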

3. Union name-reference + field-level logical errors at parse

{"name":"f","type":["null","SomeRecord"],"logicalType":"uuid"} lifts to aschema{object: aobject{Type: "SomeRecord", Logical: "uuid"}}, which then errors at validateLogical: invalid logicalType uuid type "SomeRecord", must be string or fixed(16). That's a clear and actionable parse error, the same as the equivalent nested form would produce. No fix needed; calling it out so it's not surprising.

4. Test gap: hybrid nested-precision/scale + field-level logicalType

The PR's TestFieldLevelLogicalType_DecimalRoundTrip covers {"type":"bytes","logicalType":"decimal","precision":9,"scale":2} (everything at field level). It does not test the hybrid {"type":{"type":"bytes","precision":10,"scale":2},"logicalType":"decimal"} shape (precision/scale already nested, only logicalType at field level). I verified that case round-trips a *big.Rat correctly post-PR: the lift fills Logical="decimal" into the existing object without disturbing Precision/Scale. Worth adding because it exercises the case f.Type.object != nil arm specifically and is a natural "user added the logicalType later" shape.

Recommendation

Approve with the four notes above. None block. Suggested follow-ups:

  1. (Required) Reword the description; add a pinning test asserting flat-form strict-mismatch logical now errors.
  2. (Should) Tighten the union-lift loop to break unconditionally after the first non-null branch + add a pin.
  3. (Nice-to-have) Add a TestFieldLevelLogicalType_DecimalRoundTrip hybrid case.

Everything else — lift logic, conflict resolution, canonical/fingerprint preservation, EncodeJSON parity, the test coverage — is in good shape. The change does what it claims for the common cases (Debezium CDC, Confluent codegen, Java kafka-connect-avro-converter) and the suite plus my probe tests confirm correctness across the matrix.

twmb added 2 commits May 13, 2026 12:42
Inspection of AvroData.java on master (lines 1075-1086) shows Confluent's
kafka-connect-avro-converter emits the spec-blessed nested form —
baseSchema.addProp("logicalType", ...) where baseSchema is the type
object, not the field. The flat form's actual source is hand-written
.avsc files and Java code that calls Schema.Field.addProp("logicalType",
...) instead of LogicalTypes.x.addToSchema(field.schema()). Apache's
TestSchemaWarnings.java reproduces this misuse explicitly and tracks
it across AVRO-2015 (2017) and AVRO-3014 (Schema.java:1874 warning
added Feb 2021).

Changes:

  - Corrects the doc comment on TestFieldLevelLogicalType_RoundTrip
    to attribute the idiom to its real source (hand-written .avsc /
    Field.addProp API misuse) rather than to Confluent's converter.

  - Adds TestFieldLevelLogicalType_RealWorldFixtures with two verbatim
    public schemas plus origin URLs for auditability:
      * OneCricketeer/kafka-connect-sandbox record_v3.avsc
      * the AVRO-3014 / AVRO-2015 canonical Apache reproducer

  - Adds TestFieldLevelLogicalType_OneCricketeerRoundTrip exercising
    end-to-end Encode/Decode of time.Time through the real-world
    fixture — the exact bug scenario the lift fixes.

Lifting remains a strict superset of the spec-blessed nested form:
canonical [STRIP] removes logicalType regardless, so fingerprint
behaviour is unchanged. The lift is purely an in-memory normalization
producing the same parse result the user would have gotten with the
spec-blessed nested form.

…atch pin, hybrid-decimal case)

Three follow-ups from the review on twmb#38:

  1. Fix the union-lift fall-through. liftFieldLogicalIntoType's loop
     used per-arm `break`s, so when the first non-null branch was an
     object with its own Logical already set (`branch.object != nil &&
     branch.object.Logical != ""`) neither arm fired and the loop
     continued to the next non-null branch. For a schema like
     `[null, {type:int,logicalType:date}, "string"]` with a field-level
     `logicalType:"uuid"`, the lift incorrectly grafted `uuid` onto the
     trailing `string`, silently giving the user's plain string branch
     a uuid semantic. The fix collapses the two arms into a `switch`
     and `break`s unconditionally after examining the first non-null
     branch — closer-to-the-type wins; the field-level annotation is
     dropped if the first non-null branch is already annotated.

     Pinned by TestFieldLevelLogicalType_UnionPreAnnotatedFirstBranch
     (verified to fail under the buggy code: branch 2 acquires
     "uuid"; passes under the fix: branch 2 stays plain).

  2. Pin the strict-mismatch behavior change. Pre-PR, `afield` had no
     Logical field so flat-form `{"type":"long","logicalType":"date"}`
     silently parsed as plain `long`, dropping the malformed
     annotation. Post-PR the annotation is lifted into the type
     object, runs through validateLogical, and the strict-mismatch
     arms produce a clear "invalid logicalType date type \"long\", can
     only be int"-class error at Parse. This is a strictness increase
     for malformed flat-form schemas, not a regression — but worth
     pinning so a future "be more permissive on flat-form" patch
     can't silently revert the behavior. TestFieldLevelLogicalType_
     StrictMismatchErrors covers long+date / int+uuid / string+ts-millis
     / int+time-micros / bytes+date plus the union variant. Decimal is
     intentionally excluded — its validateLogical arm clears the
     annotation rather than erroring, separate decision.

  3. Add a hybrid case to TestFieldLevelLogicalType_DecimalRoundTrip:
     `{"type":{"type":"bytes","precision":10,"scale":2},"logicalType":
     "decimal"}` — precision/scale already nested, only logicalType at
     field level. This exercises the `case f.Type.object != nil` arm
     specifically (the existing decimal cases all exercise the
     primitive / union arms). Round-trips *big.Rat correctly because
     the lift fills `Logical="decimal"` into the existing object
     without disturbing Precision/Scale.

Full test suite stays green.