mcp-data-platform-v1.66.0
This release ships two changes: observability and shutdown hardening so freshly-deployed pods are immediately profileable and drain cleanly on SIGTERM, and an OpenAPI 3.1 resolver fix so api_get_endpoint_schema no longer drops oneOf / anyOf / allOf branches, response headers blocks, or const values.
Highlights
Metrics enabled by default (#449)
OTEL_METRICS_ENABLED now defaults to true. A freshly-deployed pod exposes Prometheus metrics on :9090/metrics without an env flip. The metrics listener is on a dedicated port isolated from the MCP and admin paths; restrict access with a NetworkPolicy. Set OTEL_METRICS_ENABLED=false to opt out.
Per-tool latency, apigateway upstream latency, and tool-call counters are now scrapable out of the box during sizing exercises and incidents.
Idempotent meter-provider shutdown (#449)
The OTel SDK MeterProvider.Shutdown returns a "reader is shutdown" error on the second call. With the new default surfacing a real provider in tests, invoking Platform.Close twice would error. Metrics.Shutdown is now guarded by sync.Once with a cached error, and the call goes through an injectable shutdownFn so the error path is testable without forcing the SDK to misbehave.
Graceful shutdown wiring (#449)
cmd/mcp-data-platform/main.go was calling platform.Close() but never platform.Stop(). Every lifecycle.OnStop callback registered by the embed-jobs subsystem (worker, reaper, reconciler, LISTEN/NOTIFY listener) was being abandoned mid-flight on SIGTERM, leaving worker goroutines running while the DB pool closed underneath them.
Fixes:
- closeServer now calls platform.Stop(ctx) with a 10s bounded context before platform.Close().
- The embed-jobs OnStop closure is extracted into Platform.stopAPIGatewayEmbedJobs and runs each Stop call inside a new boundedStop helper that races the work against ctx.Done. A hung component cannot stall past the bounded deadline. Abandoned jobs are safe because their PostgreSQL leases expire and another replica reclaims them.
api_get_endpoint_schema surfaces full OpenAPI 3.1 contract (#450)
The schema flattener walked only properties and items, silently dropping three patterns that are canonical in OpenAPI 3.1:
- oneOf: [$ref, {type: "null"}] (the 3.1 replacement for the 3.0 nullable: true keyword) collapsed to an empty object. Both the inlined $ref and the null branch disappeared. Same for anyOf and allOf.
- Response headers blocks were stripped. Callers had no signal about Location on a 301, Retry-After on a 429, ETag / Last-Modified on cache-aware endpoints, or custom headers.
- const values on properties (common as JSON:API discriminator fields, e.g. type: "show") were dropped, leaving an unconstrained string. enum with a single value followed the same code path.
Fixes:
- ResponseDetail gains a Headers map[string]HeaderDetail field, marshaled as headers with omitempty.
- New flattenResponseHeaders helper emits per-header description, required flag, and schema.
- populateSchemaCompounds now recurses into oneOf / anyOf / allOf / not via a new flattenSchemaRefs helper. $ref branches get inlined by the existing schemaToValue recursion; type: "null" branches are preserved. The same maxSchemaDepth cap that protects against recursive schemas still applies.
- populateSchemaScalars now copies s.Const alongside the existing scalar keywords.
Five new tests against a JSON:API style OpenAPI 3.1 fixture exercise every reported pattern, including nested cases (anyOf inside array items).
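To illustrate the shape of the fix, the sketch below flattens a toy schema model while recursing into compound keywords under a depth cap. Everything here is hypothetical: the Schema struct, the flatten and flattenRefs functions, and the cap value of 8 are stand-ins, not the project's types or its actual maxSchemaDepth value, and $ref resolution is assumed to have happened already.

```go
package main

import "fmt"

const maxSchemaDepth = 8 // illustrative cap; the project's actual value is not stated

// Schema is a tiny hypothetical model with refs already resolved.
type Schema struct {
	Type       string
	Const      any
	Properties map[string]*Schema
	Items      *Schema
	OneOf      []*Schema
	AnyOf      []*Schema
	AllOf      []*Schema
	Not        *Schema
}

// flatten converts a schema into a plain map, keeping scalar keywords
// (including const) and recursing into properties, items, and compounds.
func flatten(s *Schema, depth int) map[string]any {
	out := map[string]any{}
	if s == nil || depth > maxSchemaDepth {
		return out // the depth cap stops runaway recursion
	}
	if s.Type != "" {
		out["type"] = s.Type
	}
	if s.Const != nil {
		out["const"] = s.Const
	}
	if len(s.Properties) > 0 {
		props := map[string]any{}
		for name, p := range s.Properties {
			props[name] = flatten(p, depth+1)
		}
		out["properties"] = props
	}
	if s.Items != nil {
		out["items"] = flatten(s.Items, depth+1)
	}
	compounds := map[string][]*Schema{"oneOf": s.OneOf, "anyOf": s.AnyOf, "allOf": s.AllOf}
	for key, branches := range compounds {
		if len(branches) > 0 {
			out[key] = flattenRefs(branches, depth+1)
		}
	}
	if s.Not != nil {
		out["not"] = flatten(s.Not, depth+1)
	}
	return out
}

// flattenRefs flattens every branch, so a {type: "null"} branch survives.
func flattenRefs(branches []*Schema, depth int) []any {
	out := make([]any, 0, len(branches))
	for _, b := range branches {
		out = append(out, flatten(b, depth))
	}
	return out
}

func main() {
	show := &Schema{Type: "object", Properties: map[string]*Schema{
		"type": {Type: "string", Const: "show"},
	}}
	nullable := &Schema{OneOf: []*Schema{show, {Type: "null"}}}
	fmt.Println(flatten(nullable, 0))
}
```

In this toy version, the nullable schema keeps both oneOf branches, and the discriminator property retains its const value instead of degrading to an unconstrained string.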
Operator impact
Existing installs
A rolling restart will start exposing /metrics on :9090. If you do not want that, set OTEL_METRICS_ENABLED=false in the deployment env. If you do want it but the port is not yet allowed in your NetworkPolicy, scrapers will fail until you allow it. The endpoint is unauthenticated by design (isolated listener); restrict with a NetworkPolicy.
Recommended manifest tuning
The existing terminationGracePeriodSeconds: 30 is tight for the new shutdown chain. The default budget is roughly 2s pre-delay + 25s HTTP grace + 10s lifecycle stop (about 37s) plus close overhead, which already exceeds 30s. Suggested:
    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 60

For deployments expecting higher traffic, see docs/reference/tuning-and-scaling.md for GOMEMLIMIT / GOMAXPROCS / GOGC guidance. Without them the Go runtime is not cgroup-aware: GOMAXPROCS defaults to the node's CPU count (often 32-64) inside a 500m cgroup quota, causing scheduler thrash; an unset GOMEMLIMIT means the GC will not pace against the memory limit.
Compatibility
- Tool inputs, tool names, configuration schema, migrations, and wire-format identifiers are unchanged.
- api_get_endpoint_schema output is additive: the new headers field and the oneOf / anyOf / allOf / not / const keys appear only for specs that declare them. Clients that ignore unknown JSON fields are unaffected.
- The metrics-default flip is the only behavior change at deploy time. Pin OTEL_METRICS_ENABLED=false in your env if you need the previous posture.
New documentation
docs/reference/tuning-and-scaling.md walks through resource sizing by traffic tier, Go runtime env vars, the HA-safety matrix for multi-replica deployments, PostgreSQL pool sizing per replica count, a four-stage graceful-shutdown budget tied to terminationGracePeriodSeconds, and an HPA example.
Changelog
Features
Bug Fixes
- fc38bb7: fix(apigateway): surface oneOf/anyOf/allOf, response headers, and const in api_get_endpoint_schema (#450) (@cjimti)
Installation
Homebrew (macOS)
brew install txn2/tap/mcp-data-platform
Claude Code CLI
claude mcp add mcp-data-platform -- mcp-data-platform
Docker
docker pull ghcr.io/txn2/mcp-data-platform:v1.66.0
Verification
All release artifacts are signed with Cosign. Verify with:
cosign verify-blob --bundle mcp-data-platform_1.66.0_linux_amd64.tar.gz.sigstore.json \
mcp-data-platform_1.66.0_linux_amd64.tar.gz