Skip to content

schema: adopt Decimal and BigDecimal common types across sources and converters#4358

Merged
josephwoodward merged 9 commits intomainfrom
schema-decimal-types
Apr 28, 2026
Merged

schema: adopt Decimal and BigDecimal common types across sources and converters#4358
josephwoodward merged 9 commits intomainfrom
schema-decimal-types

Conversation

@Jeffail
Copy link
Copy Markdown
Contributor

@Jeffail Jeffail commented Apr 28, 2026

Summary

  • Threads benthos's new Decimal and BigDecimal common-schema types end-to-end across five CDC sources (postgres, mysql, mssqlserver, oracledb, mongodb) and four format converters (iceberg, parquet, avro, json schema), with values normalised to canonical decimal strings via a new internal/sqlutil canonicaliser.
  • Adds Decimal logical-type detection in the Avro reverse reader and *big.Rat → canonical-string normalisation in the schema_registry_decode store_schema_metadata path so downstream metadata-driven encoders see consistent inputs.
  • Extracts shared Parquet decimal-byte helpers into internal/impl/parquet/parquetdecimal, dedupes against iceberg/icebergx, and tightens the iceberg shredder to prefer schema.ParseDecimal for canonical inputs.

Note for reviewers

The branch carries a temporary replace directive in go.mod pointing at a local benthos checkout while the matching benthos release is prepared. A follow-up commit will flip that directive to a tagged version once benthos publishes one — please flag if there's a preferred form for the interim.

Test plan

  • task fmt clean.
  • task lint clean for every touched package (the pre-existing internal/impl/tigerbeetle typecheck failure is unrelated).
  • task test:unit green across iceberg, parquet, parquetdecimal, confluent, postgresql, mysql, mssqlserver, mongodb, oracledb, and the new internal/sqlutil.
  • New unit-test coverage for the sqlutil and parquetdecimal helpers, every converter Decimal/BigDecimal case, the Avro reverse reader's logical-type detection, the normaliseAvroDecimals walker (including tagged-union dispatch), the Mongo Decimal128 walker (including scientific-notation inputs), the Postgres atttypmod parser, and the Postgres value-side decoder.
  • Parquet end-to-end smoke test exercising all three precision buckets (Int32, Int64, FixedLenByteArray) plus BigDecimal rejection.
  • Integration-test assertions in postgres, oracledb, and mssqlserver updated to match the new schema metadata and canonical-string value shape; mysql and mongodb integration tests already aligned.
  • Run integration suites (task test:integration) once Docker is available in the reviewer's environment, particularly oracledb and mssqlserver where the value-shape change is most visible.

CHANGELOG

Per-source Added and Changed entries are included under ## Unreleased.

…and converters

Threads benthos's new Decimal and BigDecimal common-schema types end-to-end:

- Five CDC sources (postgres, mysql, mssqlserver, oracledb, mongodb) now emit
  Decimal(p, s) when precision and scale are declared and BigDecimal when
  they are not, with values normalised to canonical decimal strings via a
  new internal/sqlutil canonicaliser.
- Four format converters (iceberg, parquet, avro, json schema) honour
  Decimal natively. BigDecimal is rejected by the bounded-format encoders
  with an actionable error and accepted by JSON Schema as a permissive
  string-with-pattern.
- ecs_avro detects logicalType: decimal in Avro specs and the
  schema_registry_decode store_schema_metadata path normalises decoded
  big.Rat values to canonical strings.
- Shared Parquet decimal-byte helpers extracted into
  internal/impl/parquet/parquetdecimal so the parquet encoder and the
  iceberg shredder no longer carry duplicate implementations.

The adoption is wired through a temporary go.mod replace directive
pointing at the local benthos checkout while a tagged release is
prepared; that directive is the one remaining follow-up before merge.
@Jeffail Jeffail force-pushed the schema-decimal-types branch from b76af04 to 0956e4d Compare April 28, 2026 16:00
Comment thread go.mod Outdated

replace github.com/99designs/keyring => github.com/Jeffail/keyring v1.2.3

// TEMPORARY: replace benthos with local checkout while decimal common-type adoption is in flight.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replace directive points at an absolute path on the author's machine (/Users/ash/src/ai/benthos-schema-decimal-types), which will break go mod download / go build for every other developer and CI. The commit message acknowledges this as the one remaining follow-up before merge — flagging it here so it isn't lost: must be flipped back to a tagged benthos version before this PR can land.

Comment thread internal/sqlutil/decimal.go Outdated
// License (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// https://github.com/redpanda-data/connect/v4/blob/main/licenses/rcl.md
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RCL license URL has an erroneous /v4/ segment — https://github.com/redpanda-data/connect/v4/blob/main/licenses/rcl.md is a 404. Every other RCL header in the repo (e.g. public/components/all/package.go) uses https://github.com/redpanda-data/connect/blob/main/licenses/rcl.md. The same typo is in internal/sqlutil/decimal_test.go:7; please fix both.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits
LGTM

Review
Single-commit PR threading the new benthos Decimal and BigDecimal common-schema types through five CDC sources (postgres/mysql/mssql/oracle/mongodb), four format converters (avro/parquet/iceberg/json-schema), and the schema-registry decoder, plus a shared internal/sqlutil canonicaliser and an extracted parquetdecimal helper package. Tests look thorough.

  1. The replace github.com/redpanda-data/benthos/v4 => /Users/ash/src/ai/... directive in go.mod points at an absolute local path and will break the build for everyone else / CI. The commit message flags this as the remaining follow-up before merge — must be reverted to a tagged benthos before landing.
  2. RCL license header URL in the new internal/sqlutil/decimal.go and internal/sqlutil/decimal_test.go files has an extra /v4/ segment, producing a 404. Should match the canonical RCL URL used elsewhere in the repo.

return int32(n), nil
case nil:
return 0, fmt.Errorf("missing")
default:
Copy link
Copy Markdown
Contributor

@josephwoodward josephwoodward Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avro's scale is optional isn't it but would fall into this default case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 8f2cc5b. The decimal logical-type handler now defaults scale to 0 when the field is absent, matching the Avro spec. precision remains required. Added TestEcsAvroFromBytesDecimalScaleDefaultsToZero covering the omitted-scale case.

…y integration tests

Two integration-test failures pinned during PR-readiness verification:

1. Oracle bare NUMBER columns (no declared precision and scale) were
   routed through the Decimal canonicaliser because go-ora's
   *sql.ColumnType.DecimalSize() reports (precision=38, scale=255, ok=true)
   for them — 255 is the driver's "any-scale" sentinel. The snapshot
   mapper treated that as a real (p, s) and called Decimal(38, 255),
   producing "decimal value has 255 significant digits" errors. The
   oracleNumberToCommon schema mapping had the same hole. Both now treat
   scale > precision as undeclared and fall back to BigDecimal so the
   schema cache and the value mapper agree, leaving the source lossless.

2. MSSQL CDC streaming scanned DECIMAL/NUMERIC columns into *any, which
   go-mssqldb coerced to a lossy float64. The streaming iterator now
   pre-allocates *sql.NullString scan targets for DECIMAL/NUMERIC and
   MONEY/SMALLMONEY so the driver hands back the lossless text
   representation. The stream-snapshot code path in replication/snapshot.go
   was also still wrapping DECIMAL/NUMERIC values in json.Number from the
   pre-Decimal era; it now routes through sqlutil.CanonicaliseDecimal /
   CanonicaliseBigDecimal in line with the regular snapshot and streaming
   paths.

Improves the snapshot mapper's error message in oracledb to include the
column name and input text so future driver quirks are easier to spot,
and updates the streaming-block fixture for NOLEADINGZERO_COL in the
oracledb all-types integration test (previously asserted as a float64,
now a canonical BigDecimal string).
Comment thread go.mod Outdated

// TEMPORARY: replace benthos with local checkout while decimal common-type adoption is in flight.
// Flip back to a tagged version once benthos releases with both Decimal and BigDecimal types.
replace github.com/redpanda-data/benthos/v4 => /Users/ash/src/ai/benthos-schema-decimal-types
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker for merge: this replace directive points at a local filesystem path (/Users/ash/src/ai/benthos-schema-decimal-types) that only exists on one developer's machine. Anyone else cloning the repo — including CI — will fail to resolve github.com/redpanda-data/benthos/v4 because that path doesn't exist for them.

The corresponding go.sum entries for github.com/redpanda-data/benthos/v4 were also removed at go.sum#L1554-L1558, so even temporarily reverting the replace would not produce a buildable tree.

Per CLAUDE.md → bump-benthos, benthos updates flow through task bump-benthos. This needs to be flipped to a tagged benthos release with the new Decimal/BigDecimal types (and the corresponding go.sum entries restored) before this PR can be merged. The commit body acknowledges this as a follow-up — flagging here so it isn't missed.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits
LGTM

Review
The PR threads benthos's new Decimal / BigDecimal common-schema types end-to-end across the five CDC sources (postgres, mysql, mssqlserver, oracledb, mongodb), the four format converters (iceberg, parquet, avro, json schema), and the schema-registry decode path, with a shared internal/sqlutil canonicaliser and a new internal/impl/parquet/parquetdecimal helper package. Tests look thorough — table-driven units for the canonicaliser, decimal-pattern regex coverage, parquet bucket smoke tests, and bson Decimal128 walk coverage.

One blocker:

  1. go.mod#L7-L10replace github.com/redpanda-data/benthos/v4 => /Users/ash/src/ai/benthos-schema-decimal-types points at a local-only filesystem path and the matching go.sum entries for benthos were removed. The tree is unbuildable for anyone but the author until this is flipped to a tagged benthos release with the new types. Author called this out in the commit body as a pre-merge follow-up; flagged inline so it isn't missed.

Three small fixes to bring the repo to a clean lint/test state:

- internal/impl/confluent/ecs_avro.go: drop the unused ecsAvroFromBytes
  wrapper (callers were migrated to ecsAvroParseFromBytes during the
  decoder normalisation work) and replace a perfsprint-flagged
  fmt.Errorf("missing") with errors.New.
- internal/impl/postgresql/pglogicalstream/schema_test.go: simplify the
  redundant "((1 << 16) | 0) + 4" atttypmod fixture to "(1 << 16) + 4"
  per staticcheck's SA4016.
- internal/impl/tigerbeetle/integration_test.go: switch the docker
  container types import from "github.com/docker/docker/api/types/container"
  to "github.com/moby/moby/api/types/container" to match the signature
  testcontainers-go now expects (used elsewhere in this repo). The
  docker module relocated these types upstream; the existing import
  was a pre-existing typecheck failure that golangci-lint surfaces.
Comment thread internal/sqlutil/decimal.go Outdated

// Slower path: permit extended forms (scientific notation, leading +,
// fewer fractional digits than scale, etc.).
bf, _, err := new(big.Float).SetPrec(256).Parse(text, 10)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big.Float fallback adds ±0.5 before truncating to get an integer. For an input like 1.56789 against a NUMBER(10,2) column, it silently produces "1.57" rather than rejecting the value.

Copy link
Copy Markdown
Contributor

@josephwoodward josephwoodward Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the following test case to TestCanonicaliseDecimal verifies this:

{name: "more fractional digits than scale should error not round", input: "1.56789", precision: 10, scale: 2, wantErr: true},

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 8f2cc5b. The big.Float fallback has been replaced with a big.Rat parse and an exact IsInt check on value × 10^scale. "1.56789" against NUMBER(10, 2) now returns decimal "1.56789" has more fractional digits than the column's scale 2 rather than silently rounding to "1.57". Scientific notation, leading +, and shorter-than-scale fractional inputs still canonicalise as before. Test cases covering both rejection paths ("1.56789" at (10, 2) and "1.5e-5" at (10, 2)) added.

Comment thread go.mod Outdated
Comment thread internal/sqlutil/decimal.go Outdated
// License (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// https://github.com/redpanda-data/connect/v4/blob/main/licenses/rcl.md
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RCL license URL has an extra /v4 segment — should be https://github.com/redpanda-data/connect/blob/main/licenses/rcl.md, not .../connect/v4/blob/.... Compare against neighboring RCL files like internal/impl/oracledb/replication/snapshot.go. License headers are CI-enforced; the same fix is needed in internal/sqlutil/decimal_test.go.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8f2cc5b — both internal/sqlutil/decimal.go:7 and internal/sqlutil/decimal_test.go:7 now use the headless URL form (https://github.com/redpanda-data/connect/blob/main/licenses/rcl.md) matching the rest of the repo's RCL headers.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits

  1. Commits 1 and 2 both have headlines that end in and continue into the body (schema: adopt Decimal and BigDecimal common types across CDC sources … → body …and converters; oracledb, mssqlserver: fix decimal value-shape regressions surfaced b… → body …y integration tests). The commit policy expects standalone, comprehensible headlines — these read as truncated mid-sentence and need to be rewritten so each headline is self-contained.
  2. Commit 2's oracledb, mssqlserver: prefix uses two comma-separated systems. The accepted formats are system: message, system(subsystem): message, chore: message, or sentence-case repo-wide. Either pick the dominant system or split into two commits.

Review

Two blocking issues; the rest of the work (sqlutil canonicaliser, parquetdecimal extraction, source/converter wiring, test coverage) looks solid.

  1. go.mod#L7-L9 — temporary replace directive points at a developer-local path (/Users/ash/src/ai/benthos-schema-decimal-types). The module graph cannot resolve anywhere else, and go.sum has been stripped of the benthos lines, so this PR is unmergeable until a tagged benthos release is wired in. Author flagged this in the PR description; surfacing as a hard merge blocker.
  2. internal/sqlutil/decimal.go#L1-L7 (and internal/sqlutil/decimal_test.go) — RCL license header URL has an extra /v4 segment (.../connect/v4/blob/...) that doesn't match the standard form used in neighboring RCL files. License headers are CI-enforced.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits
LGTM

Review
Threads benthos's new Decimal/BigDecimal common types end-to-end across the five CDC sources and four format converters, with a shared internal/sqlutil canonicaliser and an extracted parquetdecimal helper. The schema metadata, value normalisation, and converter rejection paths line up consistently, and test coverage is broad (config-spec parsing, walk-and-normalise, integration fixtures).

LGTM

…onicaliser rounding

Three review fixes:

- internal/impl/confluent/ecs_avro.go: per the Avro spec scale is
  optional in the decimal logical type and defaults to 0 when absent.
  The reverse-direction reader was treating a missing scale as an error;
  it now returns Decimal(precision, 0) for those fields. (joseph.woodward)

- internal/sqlutil/decimal.go: replace the big.Float fallback with a
  big.Rat parse and an exact "fits at the declared scale" check.
  Previously an input like "1.56789" against a NUMBER(10, 2) column
  silently rounded to "1.57" because the big.Float path added 0.5 and
  truncated; rationals represent decimals exactly, so the check is now
  a real precision-loss test. Inputs that lose precision at the
  declared scale return an error. Scientific notation, leading +, and
  fewer-than-scale fractional digits continue to canonicalise as
  before. (joseph.woodward)

- License URLs in internal/sqlutil/decimal.go and decimal_test.go:
  drop the erroneous "/v4" segment to match the rest of the RCL
  headers in the repo. (claude[bot])

Adds tests for both the scale-default Avro spec and the
precision-loss rejection.
@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits

  1. The fifth commit (8f2cc5b) uses the prefix review: (review: address PR feedback on ecs_avro scale default and decimal canonicaliser rounding). Per the project commit policy, the prefix must be a known system area (e.g. confluent: ..., sqlutil: ...) or chore: for housekeeping. review: is not a system. Since the commit touches internal/impl/confluent/ecs_avro.go and internal/sqlutil/decimal.go, a conformant header would be e.g. confluent, sqlutil: fix decimal scale default and canonicaliser rounding (or split into per-system commits, or squash into the parent). Consider squashing this and the Pin main for benthos (temporary) fixup-shaped commit before merge.

Review

LGTM

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits

  1. The final commit bc759c8 address unsigned decimal in mysql doesn't match any of the allowed commit message formats (system: message, system(subsystem): message, chore: message, or sentence-case for repo-wide). It has no system prefix, lowercase first word, and an empty body. Should be something like mysql: handle unsigned decimal raw type.

  2. Commit 8f2cc5b review: address PR feedback on ecs_avro scale default and decimal can…onicaliser rounding uses review: as a prefix, but review isn't a system/component area in the codebase. The commit also bundles three unrelated fixes (Avro scale default, sqlutil rounding, license URL header) that would normally be split. Consider splitting and using proper system prefixes (e.g., confluent(ecs_avro): default decimal scale to 0, sqlutil: parse decimals via big.Rat for exact scale check, chore: drop /v4 from RCL header URLs).

  3. Several commits use a Unicode ellipsis to fake a multi-line title (schema: …and converters, oracledb, mssqlserver: … surfaced b…y integration tests, review: … decimal can…onicaliser rounding). Titles should be a single line that fits on its own; the body covers the rest. The current form is hard to read in tooling that shows commit headlines.

Review

Adoption of the new Decimal/BigDecimal common-schema types is implemented cleanly across all five CDC sources (postgres, mysql, mssqlserver, oracledb, mongodb) and the four format converters (iceberg, parquet, avro, json schema). The shared internal/sqlutil canonicaliser and internal/impl/parquet/parquetdecimal extraction are well-scoped, the value contract (canonical decimal strings, no scientific notation) is consistent across sources, and the BigDecimal rejection paths in bounded-format encoders give actionable errors. Test coverage for the new helpers (sqlutil, parquetdecimal, ecs_avro decimal handling, mongo Decimal128 normalisation, pg atttypmod parsing) is thorough, including edge cases like the go-ora (38, 255) sentinel, scale-greater-than-precision, scientific notation, and inexact-rat fallback.

LGTM

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits

  1. Pin main for benthos (temporary) (ed9b74d) — the first commit's body explicitly says this go.mod replace is "the one remaining follow-up before merge". This temporary pin must be removed (and the benthos dependency repinned to a tagged release) before merging.
  2. address unsigned decimal in mysql (bc759c8) — message format violation. Project policy requires system: message (e.g. mysql: handle unsigned decimal columns). Also looks like a fixup of work introduced by 0956e4d and should likely be squashed into it.
  3. address special NUMERIC values in PostgreSQL (ae6b135) — same issue. Should be e.g. postgresql: pass through NaN/Infinity NUMERIC values and squashed into the originating commit.
  4. review: address PR feedback on ecs_avro scale default and decimal can… (8f2cc5b) — review is not a recognised system. The commit also bundles unrelated changes (avro scale default, sqlutil rounding behavior, license URL fixes) across multiple systems; should be split per system and squashed into their respective parent commits rather than carried as a separate review-fixup commit.

Review

The PR threads benthos's new Decimal/BigDecimal common-schema types end-to-end across the CDC sources and format converters, with new internal/sqlutil and internal/impl/parquet/parquetdecimal helpers, and includes solid unit-test coverage for the new code paths.

LGTM

…shape

TestIntegrationOracleDBCDCStreaming uses tables with bare NUMBER columns
(id NUMBER GENERATED ALWAYS AS IDENTITY and val NUMBER), both of which
now fall through to BigDecimal under the new schema mapping. Their
values are emitted as canonical decimal strings rather than json.Number
integers, so the per-subtest content assertions move from
{"ID":1,"VAL":1} / {"ID":1,"VAL":2} to {"ID":"1","VAL":"1"} /
{"ID":"1","VAL":"2"}.
@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits

  1. 8f2cc5b9 headline review: address PR feedback on ecs_avro scale default and decimal can…review is not a known system area and does not match any allowed pattern (system:, system(subsystem):, chore:, or sentence-case repo-wide). This is review fix-up so chore: or splitting per touched system would fit; see CLAUDE.md commit policy.
  2. bc759c86 headline address unsigned decimal in mysql — lowercase plain message with no system prefix; does not match any allowed format. Suggested: mysql: address unsigned decimal handling. Body is also empty so the why (driver behaviour for unsigned decimals) is not captured anywhere.
  3. ae6b135e headline address special NUMERIC values in PostgreSQL — same issue. Suggested: postgresql: handle special NUMERIC values (NaN, Infinity, -Infinity). Body is empty too.

Review

Decimal/BigDecimal threading looks coherent end-to-end: the new internal/sqlutil canonicaliser is exercised from every CDC source, the four format converters all gate on the bounded vs. unbounded distinction with actionable errors, and the parquet/iceberg byte-encoding helpers consolidate cleanly into internal/impl/parquet/parquetdecimal. New unit-test coverage for the walker, tagged-union dispatch, atttypmod parser, MySQL raw-type regex, MongoDB Decimal128 walker, and the parquet smoke test (Int32/Int64/FixedLenByteArray buckets + BigDecimal rejection) is thorough. Integration-test fixture changes track the new value shapes consistently across postgres / oracledb / mssqlserver.

LGTM

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Commits

  1. address unsigned decimal in mysql (bc759c8) — does not match an enforced commit message format. It is neither system: message (no system prefix), nor a sentence-case plain message (first word is lowercase). Suggest mysql: address unsigned decimal.
  2. address special NUMERIC values in PostgreSQL (ae6b135) — same format violation. Suggest postgresql: address special NUMERIC values.
  3. update failing test (0a847cf) — vague subject (which test, why) and lowercase plain-message format. Suggest scoping it (e.g. oracledb: update failing decimal test) and putting the why in the body.
  4. review: address PR feedback on ecs_avro scale default and decimal canonicaliser rounding (8f2cc5b) — review: is not a known system area; the body describes three independent fixes across confluent and sqlutil. Either split per-system or fold each fix into the commit it amends.

Review

The change cleanly threads the new Decimal / BigDecimal common-schema types end-to-end across the five CDC sources and four format converters, with thorough table-driven tests for each new conversion. Sharing parquetdecimal.{ByteWidth,Pow10,EncodeBytes} between the parquet encoder and iceberg shredder is a nice deduplication, and the special-case handling (Oracle's (38, 255) sentinel, MSSQL *sql.NullString scan targets, Postgres NaN/±Infinity, MongoDB Decimal128 walker) is well-justified in code comments.

LGTM

@josephwoodward
Copy link
Copy Markdown
Contributor

Failing integration test appears to be due to timeout, ran locally to verify.

@josephwoodward josephwoodward merged commit 5d37ef3 into main Apr 28, 2026
8 of 10 checks passed
@josephwoodward josephwoodward deleted the schema-decimal-types branch April 28, 2026 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants