avro: resolve named-type references in nullable unions #4429
case map[string]any:
	inner, err := ecsAvroFromAnyMap(cfg, b)
	if err != nil {
		return schema.Common{}, false
	}
	return inner, true
Minor regression in error reporting: the err from ecsAvroFromAnyMap is dropped here, so the union hydrators downstream report "union ... child '...': could not resolve type map[string]interface {}" instead of the previous wrapped form "union ... child '...': decimal precision: not an integer: foo" (see the old map branch in ecsAvroHydrateRawUnion that previously called ecsAvroFromAnyMap directly with %w wrapping).
This violates the godev %w-wrapping pattern. Consider returning (Common, error) from ecsAvroResolveTypeRef and wrapping at the call site so malformed inline-record errors surface their root cause.
10000e8 to bbd78f9
		}
		t = t2
	}
	return int32(t.UTC().Unix() / 86400), nil
The RFC3339 string path computes t.UTC().Unix() / 86400 without the floor-vs-truncate adjustment that the time.Time arm explicitly added at line 175-179. For a pre-epoch RFC3339 string with a fractional-day time component (e.g. "1969-12-31T12:00:00Z"), Unix() is -43200 and Go's integer division truncates toward zero to 0 (Jan 1 1970) instead of the correct -1 (Dec 31 1969). The bare-date "2006-01-02" shape lands exactly on midnight so it's unaffected, but RFC3339 inputs land here. Apply the same secs < 0 && secs%86400 != 0 adjustment used in the time.Time arm.
case string:
	t, errRFC := time.Parse(time.RFC3339, v)
	if errRFC != nil {
		t2, errDate := time.Parse("2006-01-02", v)
		if errDate != nil {
			// Surface both attempts; a malformed bare date like
			// "2024-13-99" would otherwise yield only the RFC3339
			// error, which misleadingly suggests the user must add
			// a time component.
			return nil, fmt.Errorf("parsing DATE string %q: tried RFC3339 (%v) and YYYY-MM-DD (%v)", v, errRFC, errDate)
		}
		t = t2
	}
	return int32(t.UTC().Unix() / 86400), nil
The string→date path doesn't apply the same floor-toward-negative-infinity adjustment that the time.Time case above (lines 175-180) does for pre-epoch dates. For an RFC3339 string like "1969-12-31T23:59:59Z", t.UTC().Unix() is -1, and Go's truncated division gives -1 / 86400 == 0 → days 0 (1970-01-01), but the correct days-since-epoch for that date is -1 (1969-12-31). The bare YYYY-MM-DD form happens to be safe because its time component is always midnight (Unix is a multiple of 86400), so this only affects RFC3339 strings with non-midnight times in pre-epoch dates. To stay consistent with the time.Time arm, compute secs := t.UTC().Unix() then apply the same if secs < 0 && secs%86400 != 0 { days-- } adjustment before the int32 cast.
Commits LGTM Review End-to-end Avro→Iceberg logical-type preservation work, including the CON-468 named-type-reference fix, a parquet/iceberg/JSON-schema type-coverage audit, a strict-mode escape hatch (
80ff730 to b0edf02
Avro JSON schemas may reference a previously-defined record/enum/fixed by
name rather than inlining the full definition — the Java/JDBC idiom for
any record reused across more than one field, e.g.
{"name": "secondary_fee", "type": ["null", "Fee"]}
where "Fee" was defined inline by an earlier field. The metadata parser
in ecsAvroParseFromBytes was treating the string branch as an unknown
type and falling through to schema.Any, so the resulting common-schema
metadata reported the field as VARCHAR rather than the registered record
structure. Downstream sinks (notably iceberg) then created a string
column where the customer expected a nested struct.
Thread a names map through ecsAvroConfig, register every record/enum/
fixed by both its simple name and its fully-qualified namespace.name
form, and have the string-form type resolver consult the map before
falling back to the primitive-name lookup. Also generalise the optional-
union helper to accept either ordering -- [null, X] and [X, null] -- since
the Avro spec doesn't constrain branch order.
The lexical-scope assumption -- a name must be defined before it is
referenced -- is the Avro spec's, so a single forward-only pass suffices.
Self-referential records remain unsupported and would need pre-
registration with a placeholder; flagged in the registration helper's
comment.
Tests cover the four shapes CON-468's acceptance criteria call out:
nullable inline record (already green via #4380), nullable record by
name reference, both branch orderings, fully-qualified vs short-name
references, and record-with-nested-record where the inner level is
itself a name-reference union.
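The approach described in this commit message can be sketched roughly as below. It is an illustration under stated assumptions: `Common` is a much-reduced stand-in for `schema.Common`, the type names `"OBJECT"`, `"STRING"`, `"ANY"` and the method names are invented for the example, and union branches are restricted to string-form types.

```go
package main

import "fmt"

// Common is a hypothetical, much-reduced stand-in for schema.Common.
type Common struct {
	Type     string
	Name     string
	Optional bool
	Children []Common
}

// config carries the names map that is threaded through the parse.
type config struct {
	names map[string]Common
}

// primitives is an illustrative subset of the primitive-name lookup.
var primitives = map[string]string{
	"string": "STRING",
	"long":   "INT64",
	"double": "FLOAT64",
}

// register stores a named type under its simple name and, when a
// namespace is present, under its fully-qualified namespace.name form.
func (c *config) register(namespace, name string, t Common) {
	c.names[name] = t
	if namespace != "" {
		c.names[namespace+"."+name] = t
	}
}

// resolveString consults the names map before the primitive lookup and
// falls back to ANY for unknown names.
func (c *config) resolveString(ref string) Common {
	if t, ok := c.names[ref]; ok {
		return t
	}
	if p, ok := primitives[ref]; ok {
		return Common{Type: p}
	}
	return Common{Type: "ANY"}
}

// resolveOptionalUnion accepts a two-branch union in either order,
// [null, X] or [X, null], and returns X marked optional.
func (c *config) resolveOptionalUnion(branches []string) (Common, bool) {
	if len(branches) != 2 {
		return Common{}, false
	}
	var other string
	switch {
	case branches[0] == "null" && branches[1] != "null":
		other = branches[1]
	case branches[1] == "null" && branches[0] != "null":
		other = branches[0]
	default:
		return Common{}, false
	}
	t := c.resolveString(other)
	t.Optional = true
	return t, true
}

func main() {
	cfg := &config{names: map[string]Common{}}
	fee := Common{Type: "OBJECT", Name: "Fee", Children: []Common{{Type: "INT64", Name: "amount"}}}
	cfg.register("com.acme", "Fee", fee)

	for _, u := range [][]string{{"null", "Fee"}, {"com.acme.Fee", "null"}, {"null", "Unknown"}} {
		t, ok := cfg.resolveOptionalUnion(u)
		fmt.Println(u, ok, t.Type, t.Optional)
	}
}
```

A single forward pass is enough here because a reference is only looked up after its definition has been registered.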
The lame-union path -- raw_unions: false on schema_registry_decode, which is the documented default -- carried the same bug as the raw path: string branches like "Fee" in ["null", "Fee"] went through ecsAvroTypeToCommon directly and collapsed to schema.Any, even when "Fee" was a previously-defined record. The tagged-JSON envelope around each branch then wrapped an Any inner, producing a structureless metadata tree. Reroute the lame hydrator through the same ecsAvroResolveTypeRef helper the raw path now uses, then re-apply the lame-specific wrapping (tagged-Object envelope, type-name preserved as Common.Name to match the wire-form tag). The non-Avro behavior of the lame envelope is unchanged; only the inner Common is now correctly populated for name references. This closes the same CON-468 bug class for the default-config path that commit 4531c51 closed for the raw-unions path.
Polish pass on the Avro JSON metadata parser following local review:
- Implement the Avro spec's namespace-inheritance rule. A name with no
dot and no `namespace` field inherits the most-tightly-enclosing
namespace; the new ecsAvroAssignFullname helper handles all three
spelling forms (dot-in-name, explicit namespace, inheritance) and is
mirrored by an inheritance-aware ecsAvroLookupName that tries
`<enclosing>.<ref>` before the bare name.
- Pre-register a structural-stub placeholder before walking a record's
children so a self-reference (linked-list style) resolves to a one-
level Object stub instead of collapsing to schema.Any. Mutual
recursion across distinct records remains out of scope.
- Deep-copy entries on retrieval from the names map (cloneCommon) so
callers can mutate the returned Common without corrupting later
look-ups. Removes a latent aliasing footgun.
- Restore the %w-wrapping contract through the union resolvers.
ecsAvroResolveTypeRef now returns (Common, error) and
ecsAvroResolveOptionalUnion returns (Common, bool, error); both union
hydrators wrap the inner cause, so a malformed inline decimal surfaces
as "decimal precision: not an integer: ..." rather than the type-
stringifier fallback "could not resolve type map[string]interface{}".
Tests cover namespace inheritance with short and inherited-FQN refs,
dot-form names overriding the namespace field, self-reference stubs,
names-map immutability after caller mutation, and %w propagation across
raw / lame union paths and nullable / general-union shapes.
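The namespace-inheritance rule in the first bullet can be sketched like this. The function names and signatures are hypothetical (they are not the PR's `ecsAvroAssignFullname` or `ecsAvroLookupName`), the names map holds plain strings for brevity, and an absent namespace is treated the same as an empty one, which is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// assignFullname returns the fullname of a definition and the namespace
// that its children inherit. The three spelling forms are:
//  1. a dot in the name: the name is already the fullname,
//  2. an explicit namespace field,
//  3. inheritance from the enclosing namespace.
func assignFullname(name, namespace, enclosing string) (fullname, ns string) {
	if i := strings.LastIndex(name, "."); i >= 0 {
		// Dot-form names override any namespace field.
		return name, name[:i]
	}
	if namespace == "" {
		namespace = enclosing
	}
	if namespace == "" {
		return name, ""
	}
	return namespace + "." + name, namespace
}

// lookupName tries <enclosing>.<ref> before the bare name.
func lookupName(names map[string]string, enclosing, ref string) (string, bool) {
	if enclosing != "" {
		if v, ok := names[enclosing+"."+ref]; ok {
			return v, true
		}
	}
	v, ok := names[ref]
	return v, ok
}

func main() {
	names := map[string]string{}

	// A record declared as {"name": "Fee"} inside namespace com.acme.
	full, ns := assignFullname("Fee", "", "com.acme")
	names[full] = "record Fee"
	fmt.Println(full, ns)

	// A short reference from inside the same namespace resolves through
	// the enclosing-prefix lookup.
	fmt.Println(lookupName(names, ns, "Fee"))
}
```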
coerceDateForEncode's string path used Go's integer division, which truncates toward zero. For a pre-epoch RFC3339 input with a non-midnight component (e.g. "1969-12-31T23:59:59Z", Unix = -1) this rounded up to day 0 (1970-01-01) instead of the correct day -1 (1969-12-31). The time.Time arm already had the floor-toward-negative-infinity correction; the string arm did not. Extract the rounding logic into a shared unixDaysFloor helper and call it from both arms so the time.Time and string entry points produce bit-identical days-since-epoch for the same instant. The bare-date YYYY-MM-DD form parses to midnight UTC so its Unix is a multiple of 86400 — truncate and floor agree there — but it goes through the helper too for consistency. The regression test covers time.Time and RFC3339 inputs across pre-epoch non-midnight, pre-epoch midnight, epoch, post-epoch non-midnight, and bare-date shapes, asserting all paths agree on the floored day.
…ions
Kafka Connect / Debezium emit nullable string fields as
`[{"type":"string","connect.default":""}, "null"]` — an inline-object
non-null branch with extension annotations alongside the primitive type
name, rather than the bare `"null","string"` two-string spelling. A
field bug report claimed this shape failed to collapse to a nullable
STRING under `raw_unions: true`; on the current build it does collapse
correctly (the unknown annotations are ignored per Avro spec). Pinning
the behaviour with a regression test so any future change to the
optional-union resolver or the type-ref dispatcher trips on it.
Covers both branch orderings ([inline, null] and [null, inline]) — the
former was the shape from the bug report, the latter is the canonical
Avro JSON spelling.
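The shape pinned by this regression test can be modelled with the small sketch below. `collapseOptional` and `parseUnion` are hypothetical helpers, not the PR's resolver; the sketch only reads the `type` attribute of an inline-object branch and ignores every other attribute, mirroring the behaviour the commit message describes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseUnion decodes an Avro JSON union into its raw branches.
func parseUnion(raw string) ([]any, error) {
	var union []any
	if err := json.Unmarshal([]byte(raw), &union); err != nil {
		return nil, err
	}
	return union, nil
}

// collapseOptional reports the non-null branch's type name when the
// union is exactly one "null" branch plus one other branch, in either
// order. The other branch may be a bare string or an inline object.
func collapseOptional(union []any) (string, bool) {
	if len(union) != 2 {
		return "", false
	}
	typeName := ""
	sawNull := false
	for _, b := range union {
		switch v := b.(type) {
		case string:
			if v == "null" {
				sawNull = true
				continue
			}
			typeName = v
		case map[string]any:
			// Unknown attributes such as connect.default are ignored;
			// only "type" is read.
			if t, ok := v["type"].(string); ok {
				typeName = t
			}
		}
	}
	if !sawNull || typeName == "" {
		return "", false
	}
	return typeName, true
}

func main() {
	for _, raw := range []string{
		`[{"type":"string","connect.default":""}, "null"]`,
		`["null", {"type":"string","connect.default":""}]`,
	} {
		union, err := parseUnion(raw)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		name, ok := collapseOptional(union)
		fmt.Println(name, ok)
	}
}
```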
8490a38 to 1480520
Three review comments on PR #4427 that landed after merge:
- shredder: reject the schema/column mismatch where a Timestamp common reaches an Int32 column. coerceTemporalToNumeric returns int64 UnixMilli/Micros/Nanos (~10^12), which the Int32Type arm cast to int32 with no bounds check, silently truncating into a garbage year. The arm is intended for Date / TimeOfDay coercions whose values do fit; bound-check post-coerce and fail loudly when they don't, with a message pointing the operator at BIGINT or a schema-metadata fix.
- output_iceberg: drop the redundant conf.Contains guard around require_schema_metadata parsing. The field declares Default(false) in the spec so FieldBool returns false-without-error on absence and the inline form fits cleanly.
- shredder: expose StrictTemporalMode as an exported struct field instead of the SetStrictTemporalMode setter. It's a single bool with no validation hook, so the setter added ceremony without value. Updates the sole writer.go caller to assign the field directly.
New regression test TestCoerceTemporalInt32OverflowGuard pins the Int32 overflow rejection and confirms in-range Date / TimeOfDay coercions still succeed.
placeholder := ecsAvroPlaceholder(typeName, shortName)
cfg.names[fullname] = placeholder
if shortName != "" && shortName != fullname {
	cfg.names[shortName] = placeholder
Claude highlighted this one, not sure if it's a risk? Same for line 647.
ecsAvroFromAnyMap registers every named type under both its fullname (com.a.Foo) and its short name (Foo). When two records share a short name in different
namespaces — e.g. com.a.Fee and com.b.Fee — the second parse overwrites cfg.names["Fee"]. Any unqualified reference to "Fee" that falls through to the bare-name
lookup (i.e. from root scope or a namespace that has no Fee of its own) silently resolves to whichever record was defined last.
The qualified lookup in ecsAvroLookupName (cfg.namespace + "." + ref) saves you when the reference is made from within the correct namespace — but references from
root scope or a third namespace skip that path and hit the stale entry directly.
ecsAvroFromAnyMap registered each record/enum/fixed under both its
fullname and an unqualified short-name shortcut, with the second
declaration silently overwriting the first when two types in different
namespaces shared a short name. An unqualified reference that missed the
enclosing-namespace prefix would then bind to whichever fullname
registered last — silent column corruption when downstream sinks key off
the resolved structure.
Route every registration through a new putName helper backed by a
nameOwners map that tracks which fullname currently owns each key. The
arbitration rules:
- A canonical fullname binding (key == owner) always wins. Two
different fullnames can never share a fullname key.
- A short-name claim that collides with an existing canonical fullname
for the same key is dropped (the fullname keeps the slot, regardless
of registration order).
- Two short-name claims from different fullnames mark the key as
ambiguous: nameOwners becomes "" and the names entry is deleted, so
the bare-name lookup falls through to schema.Any rather than guessing.
Unqualified references from inside the correct namespace still resolve
through the enclosing-prefix lookup (cfg.namespace + "." + ref) unchanged
— the ambiguity guard only affects the bare-name shortcut.
Two new tests pin the contract:
- TestEcsAvroSharedShortNameAcrossNamespaces: com.a.Fee + com.b.Fee +
bare "Fee" reference → bare reference resolves to schema.Any (loud
ambiguity), fully-qualified references resolve to their respective
types.
- TestEcsAvroRootShortNameWinsOverNamespacedCollision: root-scope Fee
+ com.a.Fee, registered in both orders → bare "Fee" always resolves
to the canonical root-scope Fee.
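The arbitration rules above can be sketched as follows. The names `putName` and `nameOwners` come from the commit message; the `registry` type, the `register` helper, the signatures, and the use of plain strings for the resolved type are assumptions made for illustration.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for the parser config.
type registry struct {
	names      map[string]string // key -> resolved type (a string here for brevity)
	nameOwners map[string]string // key -> owning fullname; "" marks ambiguity
}

func newRegistry() *registry {
	return &registry{names: map[string]string{}, nameOwners: map[string]string{}}
}

// putName registers value under key on behalf of the fullname owner.
func (r *registry) putName(key, owner, value string) {
	prev, seen := r.nameOwners[key]
	switch {
	case key == owner:
		// Canonical fullname binding: always wins.
		r.names[key] = value
		r.nameOwners[key] = owner
	case !seen, prev == owner:
		// First short-name claim, or the same owner updating its entry.
		r.names[key] = value
		r.nameOwners[key] = owner
	case prev == key:
		// A canonical fullname already holds this key: drop the claim.
	default:
		// Competing short-name claims: mark the key ambiguous so the
		// bare-name lookup misses instead of guessing.
		delete(r.names, key)
		r.nameOwners[key] = ""
	}
}

// register stores a type under its fullname and its short-name shortcut.
func (r *registry) register(fullname, shortName, value string) {
	r.putName(fullname, fullname, value)
	if shortName != "" && shortName != fullname {
		r.putName(shortName, fullname, value)
	}
}

func main() {
	r := newRegistry()
	r.register("com.a.Fee", "Fee", "record com.a.Fee")
	r.register("com.b.Fee", "Fee", "record com.b.Fee")

	_, bare := r.names["Fee"]
	fmt.Println("bare Fee resolvable:", bare)
	fmt.Println(r.names["com.a.Fee"], "/", r.names["com.b.Fee"])
}
```

Deleting the entry on ambiguity means an unqualified reference misses the map, which the description above maps to the schema.Any fallback.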
Summary
Closes CON-468. The Avro JSON schema parser at internal/impl/confluent/ecs_avro.go was dropping named-type references in nullable unions (i.e. the ["null", "Fee"] idiom where Fee is a previously-defined record reused across more than one field) to schema.Any, so downstream sinks (notably iceberg) saw a VARCHAR column where the customer's Avro schema asked for a nested struct.
#4380 closed the inline-record form of this shape (["null", {"type":"record","name":"Fee",...}]). It did not address the more common name-reference form, which any non-trivial Avro schema with a reused record will produce: the Avro spec requires named types to be defined once and referenced by name thereafter.
What changed
ecsAvroConfig now carries a names map[string]schema.Common populated as the parser walks the schema. Every record, enum, and fixed definition is registered under both its simple name and its fully-qualified namespace.name form, matching Avro's lexical-scope resolution rules. String-form type references consult the map before falling back to the primitive-name lookup, so "Fee" resolves to the registered record's full structure instead of collapsing to schema.Any.
The two existing optional-union helpers (ecsAvroIsUnionJustOptional for primitives, ecsAvroIsUnionJustOptionalObject for inline objects) are replaced by a unified ecsAvroResolveOptionalUnion that handles all three branch forms (primitive, named reference, inline object) and accepts either ordering: ["null", X] and [X, null] are now equivalent, per CON-468 acceptance criterion 2.
Both the raw-union and lame-union (default raw_unions: false) hydrators route through the same ecsAvroResolveTypeRef helper, so the fix covers every customer regardless of their raw_unions setting.
Commit narrative
4531c517110000e8a4
Scope notes
schema.Common metadata to begin with; store_schema_metadata is wired only for the Avro arm. That is a pre-existing limitation, not a regression, and a feature request worth its own ticket if a customer asks for it.
Test plan
- go test ./internal/impl/confluent/...: all green, including the six new tests below.
- task lint: 0 issues.
- TestEcsAvroRawUnionNestedRecord (the #4380 regression) still passes; the inline-record path is unchanged.
New tests:
- TestEcsAvroRawUnionNullableRecordByName: ["null", "Fee"] name reference.
- TestEcsAvroRawUnionNullableOrderIndependence (3 sub-tests): [X, "null"] across inline records, primitives, and name references.
- TestEcsAvroRawUnionNullableRecordNamespaced: short and fully-qualified namespace references.
- TestEcsAvroRawUnionNullableRecordNested: record-containing-record where the inner is a name-reference union.
- TestEcsAvroLameUnionNameResolution: same fix verified in the default-config (lame-union) path.