all: cumulative perf, parity, spec-compliance, and DRY pass#39
Merged
Conversation
Decode-side perf vs main (Apple M1): DecodeAny -58%,
DecodeAnyTaggedUnions -54%, DecodeJSON_Any -57%, DecodeJSON_Struct
-46%, DeserializeGeneric -42%, Deserialize/cold -41%,
CustomTypeDecodeAny -52%, EncodeJSONTagged -34%. Encode regression:
SerializeGeneric +15% (771ns → 891ns at -benchtime=5s; the cost is
spread across reflect-package virtual calls in the new shared
helpers and the Float64→Float32 Inf-clamp guard, which can't be
inlined back without abandoning reflect). MapEncode +35% from
typed-string-key map support (the mapKeyAs helper that fixed the
panic on `type UserID string; map[UserID]V` shapes). The unsafe-
struct fast path is unaffected — encode hot paths with concrete
struct types don't touch these helpers. Decode B/op deltas are
small in absolute terms (+56 bytes on the new alias-aware JSON
record path). Every behavioral fix has a regression test under
TestRegression_* / TestSpec_*.
Correctness / parity:
- safe vs unsafe parity: udInt/udLong narrow-signed truncation,
udDouble Float32 Inf-clamp, usFloat Float64 Inf-clamp,
usLong Uint/Uint64 overflow rejection
- serArray/serMap specialized serFloat/serDouble paths match the
scalar paths (float32 Inf-clamp + integer coercion); factored
into appendAvroFloat32/64 helpers
- text-codec parity for "string": serArray/serMap go through
appendAvroString; JSON encoder routes via avroStringValue; JSON
decodeString gets TextUnmarshaler-via-Addr branch
- depth tracking on deserRecordFast, udArray*, usArray* paths
- 32-bit length-narrowing guards (deser/skip/promote/unsafe/ocf)
and array start+n overflow
- deserFixedUUIDReflect no longer aliases input buffer
- serTimeMicros accepts time.Time (parity with serTimeMillis);
JSON "long" arm gets matching time-micros case
- serArray/serMap primitive specializations chase multi-level
pointer elements (**T, ***T) via unwrapElemPtr; parity with
unsafe fast path
- JSON encoder "bytes" arm accepts [N]byte (parity with "fixed"
and binary serBytes)
- timeToDate uses calendar-date interpretation (Java LocalDate.
toEpochDay / fastavro parity), not UTC instant. Re-anchors
wall-clock year/month/day at UTC for the day count, ignoring
t's zone offset. Pre-fix a time.Time whose wall-clock date was
D in a non-UTC zone encoded to D-1 or D+1 depending on offset
sign — and the same calendar date reached different wire
values whether the user pre-converted to UTC or passed a
TZ-offset string. Returns (int32, error) for overflow (kept
from the prior change)
- JSON encoder bytes/fixed Go-string input → raw UTF-8 (matches
serBytes/serSize); pre-fix Encode("é"/bytes) gave c3a9 binary
vs e9 JSON. Defaults unaffected (convertDefaultBytes runs first)
- findUnionBranch logical-tag fallback matches (kind, logical)
pair, not kind alone; "fixed" added to participating kind list
so TagLogicalTypes() output for fixed-with-logical round-trips
- JSON decoder accepts bare NaN/Infinity/-Infinity tokens
(fastavro/Python json.dumps form); casing-lenient via
parseSpecialFloat delegation. Dispatch in decodeFloat/Double
extended to lowercase first letters via isBareSpecialFloat-
Start helper — pre-fix the gate only checked uppercase N/I/-I
so "nan", "inf", "-inf" routed to the wrong arm and failed
with confusing errors, contradicting the documented casing-
parity claim with the quoted-string form. null still routes
to the null arm via isJSONNullStart (peek-2 disambiguation)
- decodeUnionObject/Bare wrap the underlying branch-decode error
in their "no union branch matched" message so the real cause
(e.g. target-type mismatch) is visible
- String-target decode arms for the seven string-accepting time
logicals (date + timestamp-{millis,micros,nanos} + local-
timestamp-{millis,micros,nanos}). Encoder accepted RFC 3339 /
DateOnly Go strings (extractTime/tryParseDateString) but the
decoder rejected string targets with "cannot use string with
Avro type {int,long}"; Encode(string)+Decode(*string) on the
same schema would break. Fix lands in deserDate,
deserTimeAsLong, decodeInt's "date" arm, decodeLong's timestamp
arm — all four sites emit the canonical form (DateOnly for
date, RFC3339Nano for the long timestamps) so re-encode is
wire-stable.
- Promotion path now chains the reader's logical-type conversion
on top of the wire widening (Java Resolver.Action carries
logicalType + conversion orthogonally to Promote). Pre-fix,
Resolve(writer=int, reader=long+timestamp-millis) produced raw
int64 instead of time.Time at every nesting (top-level, record
field, array item, map value, reader-union branch).
promotionDeserForLogical wraps the int→long, string↔bytes
promotion deser when the reader has a recoverable logical
(timestamp-*, local-timestamp-*, time-micros, decimal, uuid).
Factored setTimeAsLongTarget shared with natural deserTimeAsLong.
- resolveUnionUnion's TaggedUnions tag now uses the READER branch
name, not the writer's. Sibling resolveReaderUnion already
used rb. Pre-fix Resolve(["null","int"], ["null","long"])
decoded into *any with TaggedUnions emitted {"int":42} instead
of {"long":42} — the tag named a branch absent from the
reader's schema.
- SchemaCache + CustomType: a cached named-type reference now
errors LOUD at Parse time when the current Parse registers a
CustomType that would match a node in the cached subtree but
the cached entry wasn't itself built with a matching CT.
Pre-fix the CT silently failed to fire on the cached fields:
the cached *schemaNode's ser/deser/decodeJSON were baked at
the original parse and reused verbatim. Detection requires
preserving the original (unknown-at-original-parse) logical
type via a new schemaNode.unknownLogical field, plus tracking
namedType.hadCustomType so the documented remediation path
"re-parse Inner with the CT first" succeeds. buildPrimitive's
bare-string and namespace-qualified arms plus buildComplex's
wrapped-form ref arm all go through the same rejection check.
Spec compliance / Java interop:
- decimal precision enforced on encode; invalid precision/scale
rejected at parse time
- decimal encode rejects values not exactly representable at the
schema's scale (Java RoundingMode.UNNECESSARY parity);
tryCoerceToRat uses shortest-decimal FormatFloat so 12.34 stays
expressible at scale 2
- local-timestamp-* uses wall-clock-as-UTC (Java parity)
- long-typed schema defaults > 2^53 preserved via UseNumber()
- JSON bytes decode uses codepoint semantics (spec compliance)
- decimal JSON encode emits spec form (codepoint-mapped string);
decoder remains lenient on bare-number / decimal-value-string
- time-micros decode rejects values that would overflow Duration;
guard inside timeMicrosToDuration covers all four call sites
- timestamp-millis/micros/nanos + local-timestamp-* encode return
error on int64 overflow; helpers return (int64, error)
- parseJSONInt64 rejects > MaxInt64 (pre-multiply cutoff guard)
- JSON DecodeJSON concurrency-safe across goroutines via
closure-captured custom-type dispatch
- JSON EncodeJSON of nil into no-null-branch union errors instead
of silently emitting "null"
- JSON DecodeJSON of bytes/fixed into [N]byte rejects length
mismatch (Java parity)
- union-default coercion no longer overrides earlier branches
(Java/fastavro/hamba parity)
- JSON DecodeJSON of decimal accepts same Go targets as binary
(*big.Rat / json.Number / *float32 / *float64 / *string / *any)
with float-overflow guards; centralized in setDecimalRat
- Maps with named-string key types no longer panic; mapKeyAs
helper centralizes typed-key construction
- JSON fixed length validated on decode
- union resolution: exact-kind before promotion (Java bestBranch)
- writer-union resolution uniformly fail-fast (deliberate, eager;
documented divergence from Java's per-branch ErrorAction)
- null-union fast paths read full varint, accept non-canonical
multi-byte index encodings
- Canonical() output byte-for-byte equal to Java
- OCF reader validates the sync marker on count=0 blocks before
signalling io.EOF (spec compliance + Java/fastavro parity).
Pre-fix readBlock bailed at count==0 without reading size +
sync, so a tail-truncated file whose count byte happened to
read as 0 was silently accepted as clean EOF. Now the block
envelope (count + size + data + sync) is fully validated; the
EOF signal only fires after a valid empty block boundary.
- OCF Writer.Close() always closes the codec, even when the
writer is in a poisoned w.err state. Pre-fix the early
`return w.err` skipped codec.Close() — zstd and similar codecs
leaked internal goroutines + buffers. Mirrors Java
DataFileWriter.close's try { flush } finally { codec.close }.
- OCF NewReader / NewAppendWriter close the codec on every
error path after resolveCodec succeeds (readerSchemaFn error,
Resolve error, Seek error). Named-return + deferred-close
pattern. Pre-fix the codec was orphaned on these paths.
- Schema parser accepts {"type":"Node"} wrapped name reference
- JSON DecodeJSON honors record-field aliases
DRY / structural-guarantee refactors:
- setFloatValue / setBytesValue / setStringValue helpers capture
the accepted-Go-target set for Avro float/double, bytes, string.
Natural deser (deserFloat/Double/Bytes/String), promotion deser
(promote{Int,Long}To{Float,Double}, promoteFloatToDouble,
promoteStringToBytes, promoteBytesToString), and JSON decoders
all route through them. assignFloatToTarget removed (was near-
duplicate). decodeString stays parallel to setStringValue by
design (slab-string optimization) — lock-step enforced by
comments + TestRegression_StringTargetParityBinaryJSON.
Pre-factoring promotion target sets were a strict subset of
natural-deser targets.
- JSON encoder custom-type dispatch closure-captured at schema
build (parallel to binary path's wrapDeser wrapping); drops the
customDecoders/customSNs maps + inProgress recursion guard
Documentation:
- EncodeJSON/TaggedUnions doc the bare-union default rejection
by Java/fastavro/avro-tools (Jira AVRO-2899)
- validateDefault/encodeDefault doc Avro 1.12 union-default
relaxation (Jira AVRO-3649)
- README gains big-decimal entry + carrier-limitation note
Test infrastructure:
- TestParity_RoundTripMatrix: NaN/+Inf/-Inf cells (float +
double); non-ASCII Go-string cells for bytes ("héllo",
"€uros") and fixed; equal helper gained NaN-aware float arms
- TestParity_EncoderOptionMatrix: every cell now asserts
DecodeJSON(EncodeJSON(input, opts), opts) round-trips to a
documented decodeBack value (not just substring match). Cells
for TagLogicalTypes + TaggedUnions over fixed-uuid /
fixed-duration; option-combo cells (LinkedinFloats ×
TaggedUnions / TagLogicalTypes; TaggedUnions × TagLogicalTypes
record-nested). Recursive floatEqual handles NaN in nested
map/slice decodeBack
- TestRegression_PromotionTargetSetMatchesNatural: 28 cells
(every promotion × supported target)
- TestRegression_JSONDecodeBareNaNInfinityTokens: top-level +
nested (record / array / union) + casing-parity coverage
- TestRegression_EncodeBytesStringBinaryJSONParity: 6 cells
(2-byte / 3-byte runes; sized fixed)
- TestRegression_TaggedUnionLogicalDisambiguation: positive +
negative cases for (kind, logical) pair-match
- TestRegression_TagLogicalTypesFixedRoundTrip: uuid / decimal /
duration fixed-with-logical branches
- TestRegression_StringTargetParityBinaryJSON: binary +
JSON + promotion paths all accept TextUnmarshaler targets
- TestRegression_DefaultValueMaterializationParity: int / string
/ bytes-codepoint / fixed-codepoint / enum defaults produce
semantically-equal binary + JSON output
- TestRegression_TimeLogicalStringRoundTrip: 10 sub-tests
covering Encode(string)→Decode(*string) for date and the six
long-typed timestamps, including struct-field shapes.
TestParity_RoundTripMatrix gains seven /string cells for the
same logicals so any binary/JSON encoder divergence on the
Go-string input shape fails the matrix
- TestRegression_LenientInputAudit: 43 cells enumerating every
documented encoder-lenient Go-input shape (numeric coercions,
string/bytes/fixed, enum, UUID, decimal, duration, time
logicals, opaque-bytes pass-through for decimal-bytes /
decimal-fixed / big-decimal / duration). Asserts symmetric
Decode(Encode(input)) on both the binary AND JSON paths.
Documented intentional asymmetries are explicit skipReason
cells. Structural guarantee against the "encoder accepts X
but decoder can't read it back into the same Go type" class
- TestRegression_UnionDispatchAmbiguous: 9 cells locking the
union-encode dispatch rule (1: tagged-map unwrap, 2: Go-type
natural Avro kind via unionTypeNameForValue, 3: fall-through
schema-order try-each). Pre-test the rule was implicit; tests
now pin natural-kind-vs-schema-order distinctions like
int64(42) against ["int","long"] always picking long
- TestRegression_ArrayMapElementSpecializationParity: 16 sub-
tests asserting every per-primitive serArray/serMap
specialization accepts the same lenient Go-element shapes
as its scalar counterpart (multi-level pointers, whole-
number-float-as-int, json.Number into long). Locks the
sibling-sweep that previously had to be done by hand for
each leniency added to scalar serializers
- README's "Encode/decode behavior contract" section
documents every deliberate encode/decode divergence: lossy-
by-design time.Time → time-millis/date, spec/interop choices
vs Java/fastavro, decoder-leniency-only cases. doc.go
references the section; no inline godoc duplication
- TestRegression_PromoteLogical: 13 sub-tests covering the
full cross-product of promotion×reader-logical. int→long
against all 6 long-typed timestamps (timestamp-{millis,
micros,nanos} + local-timestamp-{millis,micros,nanos}) +
time-micros at top-level/record/union/array nestings;
bytes→string+uuid into [16]byte; string→bytes+decimal and
string→bytes+big-decimal (the latter added a missing wrapper
promoteStringToBytesBigDecimal — the original fix had a gap
for big-decimal vs plain decimal)
- TestRegression_TaggedUnionTagAfterPromotionBothUnion:
Resolve([null,int]→[null,long]) decode with TaggedUnions
must emit reader-side {"long":42}, not writer-side
{"int":42}
- TestRegression_ResolverTagNameParity: 11 cells walking every
(writer-union × reader-union × promotion) combination with
TaggedUnions, asserting reader-side branch names emit
everywhere. Catches any future drift between resolveReader-
Union, resolveUnionUnion, or a new third site
- TestRegression_BlockCountZeroValidatesSync: count=0 block
with corrupt sync must error with "sync marker mismatch",
not return io.EOF silently
- TestRegression_OCFBlockEnvelopeInvariant: 3 cells covering
spec-bypass sibling cases to the count=0 finding — negative
count, negative size, size > safety limit must all error
loudly without fall-through
- TestRegression_OCFWriterCloseClosesCodecWhenPoisoned: codec
is closed even when writer is in poisoned state
- TestRegression_OCFNewReaderClosesCodecOn{ReaderSchemaFn,
Resolve}Error: NewReader's error paths after resolveCodec
succeed always close the codec
- TestRegression_CheckCompatVsResolveParity: 24 cells walking
compatible and incompatible (writer, reader) pairs.
CheckCompatibility and Resolve are parallel implementations
sharing the compat rules; this matrix asserts they agree on
every pair so future drift in either site fails the test
- TestRegression_LogicalTypedDefaults: 3 cells (timestamp-
millis, date, decimal) covering the "logical type + default
value" composition that the basic DefaultValueMaterialization
matrix didn't reach — same shape as Finding 1 but for the
defaults path instead of promotion
- TestRegression_FieldAliasWithPromotionLogical: aliased field
+ writer→reader promotion + reader logical type. Composition
of two evolution mechanisms not covered before
- TestRegression_RecordFieldReorderWithPromotion: writer
[a:int, b:string] → reader [b:string, a:long] decodes by
name AND applies int→long promotion. Composition test
- TestRegression_ResolveDoesNotMutateInputs: 32 concurrent
Resolve calls against shared writer + readers; canonical
forms remain stable. Sibling of the JSON DecodeJSON
concurrency fix — would catch a shared-state mutation in
the resolve path
- TestRegression_SingleObjectRoundTrip: AppendSingleObject /
DecodeSingleObject magic-byte frame round-trip across
primitive/record/union schemas + corrupt-magic-byte rejection
- TestRegression_SchemaCacheCustomTypeRefRejected: 5 cells
locking the cached-named-type-ref + CT silent-drop rejection
(Parse errors on unrelated CT add; re-parsing Inner with CT
enables clean Outer ref; no-match cases pass; wildcard CTs
pass; namespace-qualified refs also rejected)
- TestRegression_LookupCIDeterminism: parses a schema with
case-colliding keys; Root() and Canonical() must be stable
across 100 calls. Locks the lookupCI fuzzer finding (smallest-
matching-key picked deterministically instead of Go's
randomized map iteration)
- TestRegression_DateEncodeWallClock: 7 cells (UTC, +05, -05
zones × calendar dates + string TZ-offset matches date-only)
locking calendar-date interpretation. Pre-fix Java parity was
broken — same calendar date encoded to different epoch days
depending on zone offset
- TestRegression_JSONDecodeBareNaNInfinityCasingParity: 27
cells (13 casings × 2 schemas + null arm guard) locking the
bare-token dispatch-gate fix
- TestRegression_BigDecimalJSONOpaquePassThrough: locks JSON's
assignBytes "big-decimal" arm falling through to
setBytesValue for byte-like targets when the inner payload
fails to parse. Pre-fix the JSON arm returned the parse
error immediately, breaking JSON encode→decode round-trip
for any raw []byte the binary path accepts via
serBigDecimal's serBytes fall-through. Structured targets
(big.Rat / json.Number / float) still surface the parse error
- bench_perf_test.go captures focused decode benchmarks +
BenchmarkSpec(Array|Map)MultiLevelPointer
- ocf/testdata/bigdec.avro vendored from apache/avro-rs anchors
cross-impl big-decimal compatibility
- ~210+ TestRegression_* / TestSpec_* pin spec / parity / interop
fixes. TestSpec_* locks behaviors verified clean during audit
rounds but lacking explicit test coverage.
Process:
- README's "Encode/decode behavior contract" section tracks
deliberate behavior choices (writer-union fail-fast resolution;
decimal JSON-decode leniency; non-canonical multi-byte varint
acceptance; local-timestamp wall-clock-as-UTC; whole-number-
float lenient encode; quoted special-floats default vs Java's
bare tokens; big-decimal canonical-scale carrier limitation;
cached named-type CT rejection at Parse; OCF Snappy CRC
verification on read; OCF codec memory-bound caveat for
untrusted input)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Single cumulative commit (
55787af) covering parity, spec-compliance, DRY refactors, and a perf pass. Full diff log is in the commit message; high-level:mapKeyAs(fixes panics on typed-string-key maps). Unsafe-struct fast path unaffected.setFloatValue/setBytesValue/setStringValuehelpers, factoredappendAvroFloat32/64,setTimeAsLongTarget,promotionDeserForLogicalfor chained promotion+logical conversion.Test plan
go test -count=1 -race ./...— passesgo vet ./...— clean