schema: adopt Decimal and BigDecimal common types across sources and converters #4358
josephwoodward merged 9 commits into main
Conversation
…and converters

Threads benthos's new Decimal and BigDecimal common-schema types end-to-end:

- Five CDC sources (postgres, mysql, mssqlserver, oracledb, mongodb) now emit Decimal(p, s) when precision and scale are declared and BigDecimal when they are not, with values normalised to canonical decimal strings via a new internal/sqlutil canonicaliser.
- Four format converters (iceberg, parquet, avro, json schema) honour Decimal natively. BigDecimal is rejected by the bounded-format encoders with an actionable error and accepted by JSON Schema as a permissive string-with-pattern.
- ecs_avro detects logicalType: decimal in Avro specs and the schema_registry_decode store_schema_metadata path normalises decoded big.Rat values to canonical strings.
- Shared Parquet decimal-byte helpers extracted into internal/impl/parquet/parquetdecimal so the parquet encoder and the iceberg shredder no longer carry duplicate implementations.

The adoption is wired through a temporary go.mod replace directive pointing at the local benthos checkout while a tagged release is prepared; that directive is the one remaining follow-up before merge.
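The canonicaliser itself isn't shown in this thread. A minimal sketch of the idea, assuming a hypothetical canonicaliseDecimal helper (the real internal/sqlutil API may differ): parse the driver's textual value exactly as a rational, check it fits the declared (precision, scale), and emit a canonical fixed-scale string.

```go
package main

import (
	"fmt"
	"math/big"
)

// canonicaliseDecimal sketches the internal/sqlutil idea. The name and
// signature are illustrative, not the real API.
func canonicaliseDecimal(text string, precision, scale int) (string, error) {
	r, ok := new(big.Rat).SetString(text)
	if !ok {
		return "", fmt.Errorf("decimal %q: not a number", text)
	}
	// Multiply by 10^scale; a non-integer result means the input carries
	// more fractional digits than the column's declared scale.
	pow := new(big.Int).Exp(big.NewInt(10), big.NewInt(int64(scale)), nil)
	scaled := new(big.Rat).Mul(r, new(big.Rat).SetInt(pow))
	if !scaled.IsInt() {
		return "", fmt.Errorf("decimal %q has more fractional digits than scale %d", text, scale)
	}
	// The unscaled integer's digit count must fit the declared precision.
	if digits := new(big.Int).Abs(scaled.Num()).String(); digits != "0" && len(digits) > precision {
		return "", fmt.Errorf("decimal %q has more than %d significant digits", text, precision)
	}
	// FloatString is exact here and pads to exactly 'scale' fractional digits.
	return r.FloatString(scale), nil
}

func main() {
	fmt.Println(canonicaliseDecimal("1.5e-1", 10, 2)) // scientific notation canonicalises
	fmt.Println(canonicaliseDecimal("+7.5", 10, 2))   // leading + and a short fraction pad out
}
```

big.Rat represents decimal text exactly, which is what makes the "fits at the declared scale" check a real precision-loss test rather than a rounding step.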
b76af04 to 0956e4d
replace github.com/99designs/keyring => github.com/Jeffail/keyring v1.2.3
// TEMPORARY: replace benthos with local checkout while decimal common-type adoption is in flight.
This replace directive points at an absolute path on the author's machine (/Users/ash/src/ai/benthos-schema-decimal-types), which will break go mod download / go build for every other developer and CI. The commit message acknowledges this as the one remaining follow-up before merge — flagging it here so it isn't lost: must be flipped back to a tagged benthos version before this PR can land.
// License (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// https://github.com/redpanda-data/connect/v4/blob/main/licenses/rcl.md
RCL license URL has an erroneous /v4/ segment — https://github.com/redpanda-data/connect/v4/blob/main/licenses/rcl.md is a 404. Every other RCL header in the repo (e.g. public/components/all/package.go) uses https://github.com/redpanda-data/connect/blob/main/licenses/rcl.md. The same typo is in internal/sqlutil/decimal_test.go:7; please fix both.
	return int32(n), nil
case nil:
	return 0, fmt.Errorf("missing")
default:
Avro's scale is optional, isn't it? But it would fall into this default case?
Good catch — fixed in 8f2cc5b. The decimal logical-type handler now defaults scale to 0 when the field is absent, matching the Avro spec. precision remains required. Added TestEcsAvroFromBytesDecimalScaleDefaultsToZero covering the omitted-scale case.
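The fixed shape can be sketched as follows (the map layout mirrors a JSON-decoded Avro schema; the function name and surrounding decoder are assumptions, not the actual ecs_avro code):

```go
package main

import "fmt"

// decimalScale sketches the fix: in an Avro decimal logical type, scale is
// optional and defaults to 0, while precision stays required.
func decimalScale(props map[string]any) (int32, error) {
	switch n := props["scale"].(type) {
	case float64: // encoding/json decodes JSON numbers as float64
		return int32(n), nil
	case nil: // absent: the Avro spec's default
		return 0, nil
	default:
		return 0, fmt.Errorf("decimal scale: unexpected type %T", n)
	}
}

func main() {
	s, _ := decimalScale(map[string]any{"logicalType": "decimal", "precision": 10.0})
	fmt.Println(s) // omitted scale defaults to 0 per the spec
}
```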
…y integration tests

Two integration-test failures pinned during PR-readiness verification:

1. Oracle bare NUMBER columns (no declared precision and scale) were routed through the Decimal canonicaliser because go-ora's *sql.ColumnType.DecimalSize() reports (precision=38, scale=255, ok=true) for them — 255 is the driver's "any-scale" sentinel. The snapshot mapper treated that as a real (p, s) and called Decimal(38, 255), producing "decimal value has 255 significant digits" errors. The oracleNumberToCommon schema mapping had the same hole. Both now treat scale > precision as undeclared and fall back to BigDecimal so the schema cache and the value mapper agree, leaving the source lossless.

2. MSSQL CDC streaming scanned DECIMAL/NUMERIC columns into *any, which go-mssqldb coerced to a lossy float64. The streaming iterator now pre-allocates *sql.NullString scan targets for DECIMAL/NUMERIC and MONEY/SMALLMONEY so the driver hands back the lossless text representation. The stream-snapshot code path in replication/snapshot.go was also still wrapping DECIMAL/NUMERIC values in json.Number from the pre-Decimal era; it now routes through sqlutil.CanonicaliseDecimal / CanonicaliseBigDecimal in line with the regular snapshot and streaming paths.

Improves the snapshot mapper's error message in oracledb to include the column name and input text so future driver quirks are easier to spot, and updates the streaming-block fixture for NOLEADINGZERO_COL in the oracledb all-types integration test (previously asserted as a float64, now a canonical BigDecimal string).
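The undeclared-scale guard for the Oracle case can be sketched like this (go-ora's sentinel behaviour is as the commit describes; the function name is hypothetical):

```go
package main

import "fmt"

// declaredDecimal sketches the fixed routing: go-ora reports
// (precision=38, scale=255, ok=true) for bare NUMBER columns, 255 being its
// any-scale sentinel, so a scale greater than the precision is treated as
// "not actually declared" and the column falls back to BigDecimal.
func declaredDecimal(precision, scale int64, ok bool) bool {
	return ok && scale <= precision
}

func main() {
	fmt.Println(declaredDecimal(38, 255, true)) // bare NUMBER: false, route to BigDecimal
	fmt.Println(declaredDecimal(10, 2, true))   // NUMBER(10,2): true, route to Decimal(10, 2)
}
```

Keeping this predicate shared between the schema mapping and the value mapper is what makes the schema cache and the emitted values agree.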
// TEMPORARY: replace benthos with local checkout while decimal common-type adoption is in flight.
// Flip back to a tagged version once benthos releases with both Decimal and BigDecimal types.
replace github.com/redpanda-data/benthos/v4 => /Users/ash/src/ai/benthos-schema-decimal-types
Blocker for merge: this replace directive points at a local filesystem path (/Users/ash/src/ai/benthos-schema-decimal-types) that only exists on one developer's machine. Anyone else cloning the repo — including CI — will fail to resolve github.com/redpanda-data/benthos/v4 because that path doesn't exist for them.
The corresponding go.sum entries for github.com/redpanda-data/benthos/v4 were also removed at go.sum#L1554-L1558, so even temporarily reverting the replace would not produce a buildable tree.
Per CLAUDE.md → bump-benthos, benthos updates flow through task bump-benthos. This needs to be flipped to a tagged benthos release with the new Decimal/BigDecimal types (and the corresponding go.sum entries restored) before this PR can be merged. The commit body acknowledges this as a follow-up — flagging here so it isn't missed.
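For reference, a sketch of what the landed go.mod would look like once a release exists (vX.Y.Z is a placeholder, not a real tag; per CLAUDE.md the bump itself flows through task bump-benthos):

```
require (
	github.com/redpanda-data/benthos/v4 vX.Y.Z // placeholder for the tagged release
)
// ...with the local-path replace directive removed entirely and the
// corresponding go.sum entries restored.
```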
Review: One blocker:
Three small fixes to bring the repo to a clean lint/test state:
- internal/impl/confluent/ecs_avro.go: drop the unused ecsAvroFromBytes
wrapper (callers were migrated to ecsAvroParseFromBytes during the
decoder normalisation work) and replace a perfsprint-flagged
fmt.Errorf("missing") with errors.New.
- internal/impl/postgresql/pglogicalstream/schema_test.go: simplify the
redundant "((1 << 16) | 0) + 4" atttypmod fixture to "(1 << 16) + 4"
per staticcheck's SA4016.
- internal/impl/tigerbeetle/integration_test.go: switch the docker
container types import from "github.com/docker/docker/api/types/container"
to "github.com/moby/moby/api/types/container" to match the signature
testcontainers-go now expects (used elsewhere in this repo). The
docker module relocated these types upstream; the existing import
was a pre-existing typecheck failure that golangci-lint surfaces.
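For context on the atttypmod fixture: Postgres packs numeric(p, s) into pg_attribute.atttypmod as ((p << 16) | s) + 4 (the 4 is VARHDRSZ), which is why OR-ing in a zero scale was the no-op staticcheck flagged. A sketch of the round-trip, with helper names assumed rather than taken from the schema parser:

```go
package main

import "fmt"

// numericTypmod packs numeric(p, s) the way Postgres stores it in
// pg_attribute.atttypmod: ((p << 16) | s) + VARHDRSZ.
func numericTypmod(precision, scale int) int {
	return (precision<<16 | scale) + 4
}

// parseNumericTypmod is the inverse a schema parser performs.
func parseNumericTypmod(typmod int) (precision, scale int) {
	v := typmod - 4
	return (v >> 16) & 0xFFFF, v & 0xFFFF
}

func main() {
	// The simplified fixture: numeric(1, 0), i.e. (1 << 16) + 4.
	fmt.Println(numericTypmod(1, 0) == (1<<16)+4)
	p, s := parseNumericTypmod(numericTypmod(10, 2))
	fmt.Println(p, s)
}
```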
// Slower path: permit extended forms (scientific notation, leading +,
// fewer fractional digits than scale, etc.).
bf, _, err := new(big.Float).SetPrec(256).Parse(text, 10)
The big.Float fallback adds ±0.5 before truncating to get an integer. For an input like 1.56789 against a NUMBER(10,2) column, it silently produces "1.57" rather than rejecting the value.
Adding the following test case to TestCanonicaliseDecimal verifies this:
{name: "more fractional digits than scale should error not round", input: "1.56789", precision: 10, scale: 2, wantErr: true},
Good catch — fixed in 8f2cc5b. The big.Float fallback has been replaced with a big.Rat parse and an exact IsInt check on value × 10^scale. "1.56789" against NUMBER(10, 2) now returns decimal "1.56789" has more fractional digits than the column's scale 2 rather than silently rounding to "1.57". Scientific notation, leading +, and shorter-than-scale fractional inputs still canonicalise as before. Test cases covering both rejection paths ("1.56789" at (10, 2) and "1.5e-5" at (10, 2)) added.
// License (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// https://github.com/redpanda-data/connect/v4/blob/main/licenses/rcl.md
The RCL license URL has an extra /v4 segment — should be https://github.com/redpanda-data/connect/blob/main/licenses/rcl.md, not .../connect/v4/blob/.... Compare against neighboring RCL files like internal/impl/oracledb/replication/snapshot.go. License headers are CI-enforced; the same fix is needed in internal/sqlutil/decimal_test.go.
Fixed in 8f2cc5b — both internal/sqlutil/decimal.go:7 and internal/sqlutil/decimal_test.go:7 now use the headless URL form (https://github.com/redpanda-data/connect/blob/main/licenses/rcl.md) matching the rest of the repo's RCL headers.
Review: Two blocking issues; the rest of the work (sqlutil canonicaliser, parquetdecimal extraction, source/converter wiring, test coverage) looks solid.
Review: LGTM
…onicaliser rounding

Three review fixes:

- internal/impl/confluent/ecs_avro.go: per the Avro spec, scale is optional in the decimal logical type and defaults to 0 when absent. The reverse-direction reader was treating a missing scale as an error; it now returns Decimal(precision, 0) for those fields. (joseph.woodward)
- internal/sqlutil/decimal.go: replace the big.Float fallback with a big.Rat parse and an exact "fits at the declared scale" check. Previously an input like "1.56789" against a NUMBER(10, 2) column silently rounded to "1.57" because the big.Float path added 0.5 and truncated; rationals represent decimals exactly, so the check is now a real precision-loss test. Inputs that lose precision at the declared scale return an error. Scientific notation, leading +, and fewer-than-scale fractional digits continue to canonicalise as before. (joseph.woodward)
- License URLs in internal/sqlutil/decimal.go and decimal_test.go: drop the erroneous "/v4" segment to match the rest of the RCL headers in the repo. (claude[bot])

Adds tests for both the Avro scale default and the precision-loss rejection.
Review: LGTM
Review: Adoption of the new Decimal/BigDecimal common types. LGTM
9c53c36 to ae6b135
Review: The PR threads benthos's new Decimal/BigDecimal types end-to-end. LGTM
…shape
TestIntegrationOracleDBCDCStreaming uses tables with bare NUMBER columns
(id NUMBER GENERATED ALWAYS AS IDENTITY and val NUMBER), both of which
now fall through to BigDecimal under the new schema mapping. Their
values are emitted as canonical decimal strings rather than json.Number
integers, so the per-subtest content assertions move from
{"ID":1,"VAL":1} / {"ID":1,"VAL":2} to {"ID":"1","VAL":"1"} /
{"ID":"1","VAL":"2"}.
Review: Decimal/BigDecimal threading looks coherent end-to-end. LGTM
Review: The change cleanly threads the new Decimal/BigDecimal types. LGTM
Failing integration test appears to be due to a timeout; ran it locally to verify.
Summary
- Threads Decimal and BigDecimal common-schema types end-to-end across five CDC sources (postgres, mysql, mssqlserver, oracledb, mongodb) and four format converters (iceberg, parquet, avro, json schema), with values normalised to canonical decimal strings via a new internal/sqlutil canonicaliser.
- Adds *big.Rat → canonical-string normalisation in the schema_registry_decode store_schema_metadata path so downstream metadata-driven encoders see consistent inputs.
- Extracts shared decimal-byte helpers into internal/impl/parquet/parquetdecimal, dedupes against iceberg/icebergx, and tightens the iceberg shredder to prefer schema.ParseDecimal for canonical inputs.

Note for reviewers
The branch carries a temporary replace directive in go.mod pointing at a local benthos checkout while the matching benthos release is prepared. A follow-up commit will flip that directive to a tagged version once benthos publishes one — please flag if there's a preferred form for the interim.

Test plan
- task fmt clean.
- task lint clean for every touched package (the pre-existing internal/impl/tigerbeetle typecheck failure is unrelated).
- task test:unit green across iceberg, parquet, parquetdecimal, confluent, postgresql, mysql, mssqlserver, mongodb, oracledb, and the new internal/sqlutil.
- Unit tests cover the sqlutil and parquetdecimal helpers, every converter Decimal/BigDecimal case, the Avro reverse reader's logical-type detection, the normaliseAvroDecimals walker (including tagged-union dispatch), the Mongo Decimal128 walker (including scientific-notation inputs), the Postgres atttypmod parser, and the Postgres value-side decoder.
- Integration suites (task test:integration) once Docker is available in the reviewer's environment, particularly oracledb and mssqlserver where the value-shape change is most visible.

CHANGELOG
Per-source Added and Changed entries are included under ## Unreleased.