Skip to content

avro: resolve named-type references in nullable unions#4429

Merged
Jeffail merged 7 commits into
mainfrom
fix/avro-nullable-record-name-resolution
May 28, 2026
Merged

avro: resolve named-type references in nullable unions#4429
Jeffail merged 7 commits into
mainfrom
fix/avro-nullable-record-name-resolution

Conversation

@Jeffail
Copy link
Copy Markdown
Contributor

@Jeffail Jeffail commented May 14, 2026

Summary

Closes CON-468. The Avro JSON schema parser at internal/impl/confluent/ecs_avro.go was dropping named-type references in nullable unions — i.e. the ["null", "Fee"] idiom where Fee is a previously-defined record reused across more than one field — to schema.Any, so downstream sinks (notably iceberg) saw a VARCHAR column where the customer's Avro schema asked for a nested struct.

#4380 closed the inline-record form of this shape (["null", {"type":"record","name":"Fee",...}]). It did not address the more common name-reference form, which any non-trivial Avro schema with a reused record will produce — the Avro spec requires named types to be defined once and referenced by name thereafter.

What changed

ecsAvroConfig now carries a names map[string]schema.Common populated as the parser walks the schema. Every record, enum, and fixed definition is registered under both its simple name and its fully-qualified namespace.name form, matching Avro's lexical-scope resolution rules. String-form type references consult the map before falling back to the primitive-name lookup, so "Fee" resolves to the registered record's full structure instead of collapsing to schema.Any.

The two existing optional-union helpers (ecsAvroIsUnionJustOptional for primitives, ecsAvroIsUnionJustOptionalObject for inline objects) are replaced by a unified ecsAvroResolveOptionalUnion that handles all three branch forms (primitive, named reference, inline object) and accepts either ordering — ["null", X] and [X, null] are now equivalent, per CON-468 acceptance criterion 2.

Both the raw-union and lame-union (default raw_unions: false) hydrators route through the same ecsAvroResolveTypeRef helper, so the fix covers every customer regardless of their raw_unions setting.

Commit narrative

Commit Scope
4531c5171 Names map, unified resolver, named-type registration, raw-union path. Five tests covering CON-468 criterion 2 verbatim.
10000e8a4 Same fix routed through the lame-union path (default config). One additional test.

Scope notes

  • Self-referential records (a record whose own children reference it by name before its definition completes) remain unsupported. Closing that would require pre-registering a placeholder before walking children. Avro recursive types are rare; we can address as a follow-up if a customer hits it.
  • JSON Schema and Protobuf decode paths in the Schema Registry processor never produced schema.Common metadata to begin with — store_schema_metadata is wired only for the Avro arm. That is a pre-existing limitation, not a regression, and a feature request worth its own ticket if a customer asks for it.

Test plan

  • go test ./internal/impl/confluent/... — all green, including the six new tests below.
  • task lint0 issues.
  • Verified the existing TestEcsAvroRawUnionNestedRecord (the #4380 regression) still passes — the inline-record path is unchanged.

New tests:

  • TestEcsAvroRawUnionNullableRecordByName["null", "Fee"] name reference.
  • TestEcsAvroRawUnionNullableOrderIndependence (3 sub-tests) — [X, "null"] across inline records, primitives, and name references.
  • TestEcsAvroRawUnionNullableRecordNamespaced — short and fully-qualified namespace references.
  • TestEcsAvroRawUnionNullableRecordNested — record-containing-record where the inner is a name-reference union.
  • TestEcsAvroLameUnionNameResolution — same fix verified in the default-config (lame-union) path.

Comment thread internal/impl/confluent/ecs_avro.go Outdated
Comment on lines +156 to +161
case map[string]any:
inner, err := ecsAvroFromAnyMap(cfg, b)
if err != nil {
return schema.Common{}, false
}
return inner, true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor regression in error reporting: the err from ecsAvroFromAnyMap is dropped here, so the union hydrators downstream report "union ... child '...': could not resolve type map[string]interface {}" instead of the previous wrapped form "union ... child '...': decimal precision: not an integer: foo" (see the old map branch in ecsAvroHydrateRawUnion that previously called ecsAvroFromAnyMap directly with %w wrapping).

This violates the godev %w-wrapping pattern. Consider returning (Common, error) from ecsAvroResolveTypeRef and wrapping at the call site so malformed inline-record errors surface their root cause.

@claude
Copy link
Copy Markdown

claude Bot commented May 14, 2026

Commits
LGTM

Review
Focused, well-tested fix for CON-468. The names map threaded through ecsAvroConfig, the unified ecsAvroResolveOptionalUnion/ecsAvroResolveTypeRef helpers, and the parallel raw/lame routing all hang together cleanly. Aliasing of the registered schema.Common (and its Children slice) is correctly noted in the registration helper's comment and never mutated post-registration in the new code paths. Tests cover all four shapes called out in the PR description plus the lame-union path.

  1. Minor: ecsAvroResolveTypeRef drops the underlying err from ecsAvroFromAnyMap on malformed inline objects, replacing a previously-wrapped error chain with a generic "could not resolve type %T" message — see inline comment for details.

@Jeffail Jeffail force-pushed the fix/avro-nullable-record-name-resolution branch from 10000e8 to bbd78f9 Compare May 14, 2026 17:13
}
t = t2
}
return int32(t.UTC().Unix() / 86400), nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFC3339 string path computes t.UTC().Unix() / 86400 without the floor-vs-truncate adjustment that the time.Time arm explicitly added at line 175-179. For a pre-epoch RFC3339 string with a fractional-day time component (e.g. "1969-12-31T12:00:00Z"), Unix() is -43200 and Go's integer division truncates toward zero to 0 (Jan 1 1970) instead of the correct -1 (Dec 31 1969). The bare-date "2006-01-02" shape lands exactly on midnight so it's unaffected, but RFC3339 inputs land here. Apply the same secs < 0 && secs%86400 != 0 adjustment used in the time.Time arm.

@claude
Copy link
Copy Markdown

claude Bot commented May 14, 2026

Commits
LGTM

Review
Large multi-commit PR closing the schema-metadata / value-vs-metadata mismatch class across confluent, iceberg, and parquet. Test coverage is thorough and well-targeted. One narrow but real bug in the new parquet DATE string-coercion path.

  1. internal/impl/parquet/schema_coercion.go#L194 — RFC3339 string path for coerceDateForEncode misses the floor-vs-truncate adjustment that the sibling time.Time arm applies. Pre-epoch RFC3339 inputs with a fractional-day component land off-by-one (Go integer division truncates toward zero for negative Unix() values). Bare YYYY-MM-DD strings are unaffected since they land exactly on midnight.

Comment on lines +181 to +194
case string:
t, errRFC := time.Parse(time.RFC3339, v)
if errRFC != nil {
t2, errDate := time.Parse("2006-01-02", v)
if errDate != nil {
// Surface both attempts — a malformed bare date like
// "2024-13-99" would otherwise yield only the RFC3339
// error, which misleadingly suggests the user must add
// a time component.
return nil, fmt.Errorf("parsing DATE string %q: tried RFC3339 (%v) and YYYY-MM-DD (%v)", v, errRFC, errDate)
}
t = t2
}
return int32(t.UTC().Unix() / 86400), nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string→date path doesn't apply the same floor-toward-negative-infinity adjustment that the time.Time case above (lines 175-180) does for pre-epoch dates. For an RFC3339 string like "1969-12-31T23:59:59Z", t.UTC().Unix() is -1, and Go's truncated division gives -1 / 86400 == 0 → days 0 (1970-01-01), but the correct days-since-epoch for that date is -1 (1969-12-31). The bare YYYY-MM-DD form happens to be safe because its time component is always midnight (Unix is a multiple of 86400), so this only affects RFC3339 strings with non-midnight times in pre-epoch dates. To stay consistent with the time.Time arm, compute secs := t.UTC().Unix() then apply the same if secs < 0 && secs%86400 != 0 { days-- } adjustment before the int32 cast.

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Commits

LGTM

Review

End-to-end Avro→Iceberg logical-type preservation work, including the CON-468 named-type-reference fix, a parquet/iceberg/JSON-schema type-coverage audit, a strict-mode escape hatch (require_schema_metadata), and the AddDataFiles dup-check skip (T6692). Coverage and tests are substantial.

  1. coerceDateForEncode pre-epoch RFC3339 string handling: the string path at internal/impl/parquet/schema_coercion.go#L181-L194 returns int32(t.UTC().Unix() / 86400) without the floor-toward-negative-infinity adjustment that the time.Time arm just above applies. RFC3339 inputs with pre-epoch dates and non-midnight times (e.g. "1969-12-31T23:59:59Z") produce day 0 instead of -1. See inline comment for the proposed fix.

@Jeffail Jeffail force-pushed the fix/avro-nullable-record-name-resolution branch from 80ff730 to b0edf02 Compare May 18, 2026 13:57
@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Commits
LGTM

Review
Thorough end-to-end fix for Avro logical-type preservation through schema_registry_decode → iceberg/parquet sinks, with strong test coverage (table-driven units, integration tests for the auto-create, coerce-on-existing-BIGINT, and strict-mode reject paths). Comments are WHY-focused, error wrapping uses %w with field naming, license headers and field-name constants follow the project conventions, and the strict-mode escape hatch (require_schema_metadata) is validated against schema_metadata at config-parse time. Plumbing of preserve_logical_types and translate_kafka_connect_types into ecsAvroParseFromBytes is symmetric with the value-side decoder.

LGTM

@claude
Copy link
Copy Markdown

claude Bot commented May 26, 2026

Commits
LGTM

Review
Reviewed the Avro logical-type / iceberg / parquet schema-metadata work across all 18 commits. The core CON-468 named-type reference resolver, the field-level logicalType (sibling-of-type) fix, the Kafka Connect / Debezium connect.name mapping, the new strict require_schema_metadata mode, the temporal-to-numeric coerce bridge for rolling upgrades, and the type-coverage gaps closed in iceberg/type_resolver (Map/Union/Date/TimeOfDay/UUID) and parquet/processor_encode all read cleanly. Edge cases — NaN/Inf rejection, pre-epoch DATE floor-rounding, namespace inheritance, FQN-vs-short-name resolution, self-referential record stubs, both [null, X] and [X, null] orderings, names-map clone-on-retrieval isolation, error wrapping through union resolvers — are covered by tests. The upstream-pinning test (TestUpstreamTwmbHonoursSiblingFormLogicalType) is a nice regression guard for the twmb/avro bump.

LGTM

Jeffail added 5 commits May 28, 2026 13:12
Avro JSON schemas may reference a previously-defined record/enum/fixed by
name rather than inlining the full definition — the Java/JDBC idiom for
any record reused across more than one field, e.g.

    {"name": "secondary_fee", "type": ["null", "Fee"]}

where "Fee" was defined inline by an earlier field. The metadata parser
in ecsAvroParseFromBytes was treating the string branch as an unknown
type and falling through to schema.Any, so the resulting common-schema
metadata reported the field as VARCHAR rather than the registered record
structure. Downstream sinks (notably iceberg) then created a string
column where the customer expected a nested struct.

Thread a names map through ecsAvroConfig, register every record/enum/
fixed by both its simple name and its fully-qualified namespace.name
form, and have the string-form type resolver consult the map before
falling back to the primitive-name lookup. Also generalise the optional-
union helper to accept either ordering -- [null, X] and [X, null] -- since
the Avro spec doesn't constrain branch order.

The lexical-scope assumption -- a name must be defined before it is
referenced -- is the Avro spec's, so a single forward-only pass suffices.
Self-referential records remain unsupported and would need pre-
registration with a placeholder; flagged in the registration helper's
comment.

Tests cover the four shapes CON-468's acceptance criteria call out:
nullable inline record (already green via #4380), nullable record by
name reference, both branch orderings, fully-qualified vs short-name
references, and record-with-nested-record where the inner level is
itself a name-reference union.
The lame-union path -- raw_unions: false on schema_registry_decode, which
is the documented default -- carried the same bug as the raw path: string
branches like "Fee" in ["null", "Fee"] went through ecsAvroTypeToCommon
directly and collapsed to schema.Any, even when "Fee" was a previously-
defined record. The tagged-JSON envelope around each branch then wrapped
an Any inner, producing a structureless metadata tree.

Reroute the lame hydrator through the same ecsAvroResolveTypeRef helper
the raw path now uses, then re-apply the lame-specific wrapping
(tagged-Object envelope, type-name preserved as Common.Name to match the
wire-form tag). The non-Avro behavior of the lame envelope is unchanged;
only the inner Common is now correctly populated for name references.

This closes the same CON-468 bug class for the default-config path that
commit 4531c51 closed for the raw-unions path.
Polish pass on the Avro JSON metadata parser following local review:

- Implement the Avro spec's namespace-inheritance rule. A name with no
  dot and no `namespace` field inherits the most-tightly-enclosing
  namespace; the new ecsAvroAssignFullname helper handles all three
  spelling forms (dot-in-name, explicit namespace, inheritance) and is
  mirrored by an inheritance-aware ecsAvroLookupName that tries
  `<enclosing>.<ref>` before the bare name.

- Pre-register a structural-stub placeholder before walking a record's
  children so a self-reference (linked-list style) resolves to a one-
  level Object stub instead of collapsing to schema.Any. Mutual
  recursion across distinct records remains out of scope.

- Deep-copy entries on retrieval from the names map (cloneCommon) so
  callers can mutate the returned Common without corrupting later
  look-ups. Removes a latent aliasing footgun.

- Restore the %w-wrapping contract through the union resolvers.
  ecsAvroResolveTypeRef now returns (Common, error) and
  ecsAvroResolveOptionalUnion returns (Common, bool, error); both union
  hydrators wrap the inner cause, so a malformed inline decimal surfaces
  as "decimal precision: not an integer: ..." rather than the type-
  stringifier fallback "could not resolve type map[string]interface{}".

Tests cover namespace inheritance with short and inherited-FQN refs,
dot-form names overriding the namespace field, self-reference stubs,
names-map immutability after caller mutation, and %w propagation across
raw / lame union paths and nullable / general-union shapes.
coerceDateForEncode's string path used Go's integer division, which
truncates toward zero. For a pre-epoch RFC3339 input with a non-midnight
component (e.g. "1969-12-31T23:59:59Z", Unix = -1) this rounded up to day
0 (1970-01-01) instead of the correct day -1 (1969-12-31). The time.Time
arm already had the floor-toward-negative-infinity correction; the string
arm did not.

Extract the rounding logic into a shared unixDaysFloor helper and call it
from both arms so the time.Time and string entry points produce
bit-identical days-since-epoch for the same instant. The bare-date
YYYY-MM-DD form parses to midnight UTC so its Unix is a multiple of
86400 — truncate and floor agree there — but it goes through the helper
too for consistency.

The regression test covers time.Time and RFC3339 inputs across pre-epoch
non-midnight, pre-epoch midnight, epoch, post-epoch non-midnight, and
bare-date shapes, asserting all paths agree on the floored day.
…ions

Kafka Connect / Debezium emit nullable string fields as
`[{"type":"string","connect.default":""}, "null"]` — an inline-object
non-null branch with extension annotations alongside the primitive type
name, rather than the bare `"null","string"` two-string spelling. A
field bug report claimed this shape failed to collapse to a nullable
STRING under `raw_unions: true`; on the current build it does collapse
correctly (the unknown annotations are ignored per Avro spec). Pinning
the behaviour with a regression test so any future change to the
optional-union resolver or the type-ref dispatcher trips on it.

Covers both branch orderings ([inline, null] and [null, inline]) — the
former was the shape from the bug report, the latter is the canonical
Avro JSON spelling.
@Jeffail Jeffail force-pushed the fix/avro-nullable-record-name-resolution branch from 8490a38 to 1480520 Compare May 28, 2026 12:28
@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Commits
LGTM

Review
Targeted fix for CON-468 (Avro JSON named-type references collapsing to schema.Any in nullable unions). Adds a names map registry, unifies the optional-union resolver to handle [null, X] and [X, null] across primitive/named-ref/inline-object branches, implements Avro namespace inheritance and self-reference stubs via cloned values, and routes both raw- and lame-union paths through a shared ecsAvroResolveTypeRef with %w-wrapped error propagation. Sibling parquet commit floors pre-epoch DATE strings via a shared unixDaysFloor helper so the time.Time and RFC3339 paths agree on the day. Tests pin down the CON-468 acceptance criteria plus namespace inheritance, dot-form names, self-reference stubs, names-map immutability, and error wrapping.

LGTM

Three review comments on PR #4427 that landed after merge:

- shredder: reject the schema/column mismatch where a Timestamp common
  reaches an Int32 column. coerceTemporalToNumeric returns int64
  UnixMilli/Micros/Nanos (~10^12), which the Int32Type arm cast to
  int32 with no bounds check, silently truncating into a garbage year.
  The arm is intended for Date / TimeOfDay coercions whose values do
  fit; bound-check post-coerce and fail loudly when they don't, with a
  message pointing the operator at BIGINT or a schema-metadata fix.

- output_iceberg: drop the redundant conf.Contains guard around
  require_schema_metadata parsing. The field declares Default(false)
  in the spec so FieldBool returns false-without-error on absence and
  the inline form fits cleanly.

- shredder: expose StrictTemporalMode as an exported struct field
  instead of the SetStrictTemporalMode setter. It's a single bool with
  no validation hook, so the setter added ceremony without value.
  Updates the sole writer.go caller to assign the field directly.

New regression test TestCoerceTemporalInt32OverflowGuard pins the
Int32 overflow rejection and confirms in-range Date / TimeOfDay
coercions still succeed.
@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Commits
LGTM

Review
Avro JSON name-resolution fix is well-structured: the new names map is threaded by reference through a value-copied cfg, namespace propagation correctly forks per child via the value-copy, and cloneCommon is exercised by an explicit isolation test. Self-reference handling is documented as a one-level stub. Error wrapping through the unified union resolvers preserves %w chains. The iceberg/parquet/shredder follow-ups (unixDaysFloor, int32 overflow guard, exported StrictTemporalMode field) are minimal and well-covered by new tests.

LGTM

Comment thread internal/impl/confluent/ecs_avro.go Outdated
placeholder := ecsAvroPlaceholder(typeName, shortName)
cfg.names[fullname] = placeholder
if shortName != "" && shortName != fullname {
cfg.names[shortName] = placeholder
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude highlighted this one, not sure if it's a risk? Same for line 647.

ecsAvroFromAnyMap registers every named type under both its fullname (com.a.Foo) and its short name (Foo). When two records share a short name in different
namespaces — e.g. com.a.Fee and com.b.Fee — the second parse overwrites cfg.names["Fee"]. Any unqualified reference to "Fee" that falls through to the bare-name
lookup (i.e. from root scope or a namespace that has no Fee of its own) silently resolves to whichever record was defined last.

The qualified lookup in ecsAvroLookupName (cfg.namespace + "." + ref) saves you when the reference is made from within the correct namespace — but references from
root scope or a third namespace skip that path and hit the stale entry directly.

ecsAvroFromAnyMap registered each record/enum/fixed under both its
fullname and an unqualified short-name shortcut, with the second
declaration silently overwriting the first when two types in different
namespaces shared a short name. An unqualified reference that missed the
enclosing-namespace prefix would then bind to whichever fullname
registered last — silent column corruption when downstream sinks key off
the resolved structure.

Route every registration through a new putName helper backed by a
nameOwners map that tracks which fullname currently owns each key. The
arbitration rules:

  - A canonical fullname binding (key == owner) always wins. Two
    different fullnames can never share a fullname key.
  - A short-name claim that collides with an existing canonical fullname
    for the same key is dropped (the fullname keeps the slot, regardless
    of registration order).
  - Two short-name claims from different fullnames mark the key as
    ambiguous: nameOwners becomes "" and the names entry is deleted, so
    the bare-name lookup falls through to schema.Any rather than guessing.

Unqualified references from inside the correct namespace still resolve
through the enclosing-prefix lookup (cfg.namespace + "." + ref) unchanged
— the ambiguity guard only affects the bare-name shortcut.

Two new tests pin the contract:

  - TestEcsAvroSharedShortNameAcrossNamespaces: com.a.Fee + com.b.Fee +
    bare "Fee" reference → bare reference resolves to schema.Any (loud
    ambiguity), fully-qualified references resolve to their respective
    types.
  - TestEcsAvroRootShortNameWinsOverNamespacedCollision: root-scope Fee
    + com.a.Fee, registered in both orders → bare "Fee" always resolves
    to the canonical root-scope Fee.
@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Commits

  1. Commit 042d53e ("iceberg: address PR confluent,iceberg,parquet: preserve avro logical types end-to-end across schema_metadata consumers #4427 follow-up review") bundles three logically independent changes: the Int32 overflow bug fix in convertLeafValue, the conf.Contains guard cleanup in parseSchemaEvolutionConfig, and the SetStrictTemporalMode setter → exported StrictTemporalMode field refactor. Per the commit-granularity rule ("each commit is one small, self-contained, logical change"), these should be split into separate commits — particularly the bug fix, which deserves its own focused commit so it can be cherry-picked or reverted independently of the refactors.

Review
The Avro name-resolution work threads a names registry and namespace through the recursion correctly, the short-name collision arbitration in putName is sound, and the optional-union resolver handles both branch orderings via a single helper. The placeholder-before-walk pattern correctly bottoms out self-references rather than collapsing to schema.Any. The Int32 overflow guard in convertLeafValue, the iceberg config Contains cleanup, and the unixDaysFloor extraction all look correct. Tests are comprehensive across the new behaviors (namespace inheritance, FQN vs short-name references, collision arbitration, self-reference stubs, %w propagation, pre-epoch DATE flooring, Int32 overflow rejection).

LGTM

@Jeffail Jeffail merged commit 5d2a90a into main May 28, 2026
7 of 8 checks passed
@Jeffail Jeffail deleted the fix/avro-nullable-record-name-resolution branch May 28, 2026 19:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants