mcp-data-platform v1.77.0
This release hardens the API gateway against memory exhaustion, fixes incorrect HTTP status codes and metric labels on the gateway, bounds the indexing job history, and stops the observability dashboards from polluting the audit trail.
API gateway memory safety (#536, #538)
The api-gateway is a single shared process serving every connection and toolkit. Size caps existed, but only per request, so a burst of large responses (each individually under its cap) could collectively exhaust the heap and get the container OOMKilled, taking down the pod and every other in-flight request on it.
- Global in-flight memory budget. A process-wide admission controller (MemBudget) now accounts for the body bytes held in flight across the whole process, not per request. Both buffering tools reserve their worst-case buffer before allocating it: a refused reservation is rejected before allocation with a structured gateway_memory_budget_exhausted error (HTTP 429, retryable, with Retry-After) rather than letting the allocation push the process into an OOM. The reservation uses the upstream Content-Length when present and the per-request cap otherwise.
- Memory-bounded raw passthrough. A new REST route POST /api/v1/gateway/{connection}/invoke-raw streams the upstream body straight to the client with the upstream status and content headers, still injecting the held upstream credential, never buffering it. It routes through the same in-memory MCP session as /invoke, so authentication, persona authorization, route policy, and audit all apply identically. This closes a capability gap: there was previously no memory-bounded way to retrieve a large or binary body through the gateway. An over-cap raw body returns a structured 413 (permanent, non-retryable).
- api_export streams directly to S3. api_export previously read the entire upstream response into a []byte before uploading (forced by the old mcp-s3 client, which only accepted Body []byte). It now streams the upstream response straight to S3 via mcp-s3 v1.2.0's PutObjectStream, so memory stays roughly constant regardless of export size, and api_export is no longer an independent OOM vector. The per-export cap is still enforced all-or-nothing: an over-cap response is rejected up front on declared Content-Length, and a chunked/undeclared stream that crosses the cap aborts the multipart upload, leaving no partial asset and no orphaned parts.
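The admission step described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: only the MemBudget name and the gateway_memory_budget_exhausted error string come from the notes, while the method names, the ReservationSize helper, and the lock-free design are assumptions.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// ErrBudgetExhausted is a hypothetical sentinel. The notes only name the
// wire-level error gateway_memory_budget_exhausted (HTTP 429, retryable).
var ErrBudgetExhausted = errors.New("gateway_memory_budget_exhausted")

// MemBudget accounts for bytes reserved across the whole process.
// A limit of 0 disables admission control, mirroring max_in_flight_bytes: 0.
type MemBudget struct {
	limit int64
	used  atomic.Int64
}

// NewMemBudget returns a budget that admits at most limit bytes in flight.
func NewMemBudget(limit int64) *MemBudget {
	return &MemBudget{limit: limit}
}

// Reserve admits n bytes, or fails without changing the accounted total.
// Callers reserve before allocating the buffer, never after.
func (b *MemBudget) Reserve(n int64) error {
	if b.limit <= 0 {
		return nil // budget disabled
	}
	for {
		cur := b.used.Load()
		if cur+n > b.limit {
			return ErrBudgetExhausted
		}
		if b.used.CompareAndSwap(cur, cur+n) {
			return nil
		}
	}
}

// Release returns n previously reserved bytes to the budget.
func (b *MemBudget) Release(n int64) {
	if b.limit > 0 {
		b.used.Add(-n)
	}
}

// InUse reports the bytes currently reserved.
func (b *MemBudget) InUse() int64 {
	return b.used.Load()
}

// ReservationSize picks the worst-case buffer size: the declared
// Content-Length when known, the per-request cap otherwise. Go's
// http.Response reports an unknown length as -1.
func ReservationSize(contentLength, perRequestCap int64) int64 {
	if contentLength >= 0 {
		return contentLength
	}
	return perRequestCap
}
```

In a handler, the reservation would typically be paired with a deferred Release so the bytes are returned on every exit path, and a failed Reserve would be translated into the 429 response with Retry-After.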
Configuration
apigateway:
  memory:
    max_in_flight_bytes: 314572800 # 300 MiB; 0 = disabled
    raw_max_bytes: 1073741824 # 1 GiB; 0 = no raw cap

Sizing guidance: budget roughly 3x the raw body per concurrent large request (raw body, decoded copy, JSON-escaped envelope copy), around (container_memory_limit * 0.6) / 3, not the whole limit.
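As a worked example of the sizing rule, the arithmetic can be written as a small helper. The function name is hypothetical and the helper is not part of the platform; it only restates (container_memory_limit * 0.6) / 3 in integer math.

```go
package main

// SuggestedInFlightBudget applies the rule of thumb from the sizing
// guidance: (container_memory_limit * 0.6) / 3. Integer arithmetic is
// used to avoid float rounding. The name is illustrative only.
func SuggestedInFlightBudget(containerLimitBytes int64) int64 {
	return containerLimitBytes * 6 / 10 / 3
}
```

Under this rule a container limited to 1500 MiB works out to 314572800 bytes (300 MiB), which matches the value shown in the example configuration.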
Note: trino_export is unchanged. It is blocked upstream because mcp-trino materializes the full result set into memory before formatting, so a streaming upload there would be cosmetic. Genuine streaming requires an upstream streaming query API, tracked in txn2/mcp-trino#69.
Gateway returns correct HTTP status for auth denials (#534)
The REST gateway shim returned HTTP 500 for authorization and authentication denials that should be 403 and 401. Because 5xx is retryable under standard HTTP client semantics and 4xx is not, automation clients retry-looped on a permission denial that can never succeed, inflating latency and upstream call volume. The shim now normalizes both the JSON-envelope and bare-string error shapes before classifying, so a persona denial maps to 403 and an authentication failure maps to 401. Unknown messages default to 400 (not retryable) rather than 500, since every genuine transient or server-side condition is matched explicitly. Integration tests now wire a real authorization denial through the in-memory MCP session to the REST status code.
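The normalize-then-classify approach might look like the sketch below. The JSON field names (error, message) and the matched substrings are assumptions made for illustration; the notes do not list the shim's actual patterns, and the single transient case stands in for whatever conditions the shim matches explicitly.

```go
package main

import (
	"encoding/json"
	"net/http"
	"strings"
)

// normalizeError extracts a message from either a JSON envelope or a bare
// string. The "error" and "message" field names are assumed, not confirmed.
func normalizeError(raw string) string {
	var env struct {
		Error   string `json:"error"`
		Message string `json:"message"`
	}
	if err := json.Unmarshal([]byte(raw), &env); err == nil {
		if env.Error != "" {
			return env.Error
		}
		if env.Message != "" {
			return env.Message
		}
	}
	return raw // not an envelope: treat the input as the message itself
}

// statusFor maps a normalized message to an HTTP status. The substrings
// are hypothetical placeholders for the shim's real match rules.
func statusFor(raw string) int {
	msg := strings.ToLower(normalizeError(raw))
	switch {
	case strings.Contains(msg, "authentication"):
		return http.StatusUnauthorized // 401
	case strings.Contains(msg, "permission denied"):
		return http.StatusForbidden // 403
	case strings.Contains(msg, "upstream unavailable"):
		return http.StatusServiceUnavailable // explicit transient case
	default:
		// Unknown messages are not retryable, so default to 400, not 500.
		return http.StatusBadRequest
	}
}
```

Defaulting to 400 only works because transient conditions are matched explicitly first; a rule added after the default would never fire.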
Gateway traffic metrics stop resolving list calls to "unknown" (#520)
In the admin Traffic Flow view and anywhere apigateway_inbound_requests_total is grouped by operation_id, a large share of inbound traffic resolved to the literal operation unknown even though the requests targeted operations that exist in the catalog. The resolver matched the runtime path without stripping the query string first, so collection/list endpoints invoked with query parameters (/v1/orders?limit=100&offset=0) failed to match their route and fell through to unknown. The resolver now strips the query string and fragment before matching, and synthesizes the catalog's id (<METHOD> <rawPath>) when an operation declares no operationId, so the metric label agrees with what api_list_endpoints advertises. This is purely an observability/labeling fix; the upstream call itself was never affected.
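A simplified sketch of the corrected resolution logic follows. The operation type, the function names, and exact-path matching are assumptions; a real resolver would presumably also handle path templates such as /v1/orders/{id}.

```go
package main

import "strings"

// operation is a hypothetical catalog entry.
type operation struct {
	Method      string
	RawPath     string
	OperationID string // empty when the spec declares no operationId
}

// stripQueryAndFragment drops everything from the first '?' or '#'.
func stripQueryAndFragment(p string) string {
	if i := strings.IndexAny(p, "?#"); i >= 0 {
		return p[:i]
	}
	return p
}

// operationLabel returns the declared operationId, or synthesizes the
// catalog id "<METHOD> <rawPath>" when none is declared.
func operationLabel(op operation) string {
	if op.OperationID != "" {
		return op.OperationID
	}
	return op.Method + " " + op.RawPath
}

// resolve returns the metric label for a runtime request, falling back to
// "unknown" only when no catalog entry matches the stripped path.
func resolve(catalog []operation, method, runtimePath string) string {
	path := stripQueryAndFragment(runtimePath)
	for _, op := range catalog {
		if op.Method == method && op.RawPath == path {
			return operationLabel(op)
		}
	}
	return "unknown"
}
```

With a catalog entry for GET /v1/orders that declares no operationId, a request to /v1/orders?limit=100&offset=0 is labeled GET /v1/orders instead of unknown.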
Indexing job history is now bounded (#524)
The admin Indexing dashboard's job list grew without bound: nothing purged finished index_jobs rows, and the table fetched up to 500 rows into one flat DOM block.
- Retention. A new hourly sweep deletes only rows safe to forget: succeeded rows, and failed rows that have been resolved. It never matches an open failure or an in-flight (pending/running) row, so the failure-triage surface and the active queue are unaffected. The DELETE drains in 5000-row batches, oldest-first, on a new partial index, so a first sweep against a large backlog makes incremental progress instead of timing out on one multi-million-row statement. Configured via apigateway.embed_jobs.retention_days (0 = 14-day default, negative = disabled, positive = that many days). Retention keys only on status and timestamps, never on kind, so it applies uniformly across api_catalog, tools, and memory.
- Pagination. The jobs table now renders 25 rows per page with Prev/Next controls and a range label, resets to the first page on filter change, and clamps to a valid page so a live-poll refresh that shrinks the set never strands you on a blank page.
This release includes migration 000057_index_jobs_retention, applied automatically on startup. It adds a partial index whose predicate matches the purge query exactly; no data is rewritten.
Observability dashboards stop auditing themselves (#518)
The observability dashboards were auditing themselves. The PromQL proxy wrote an audit row for every query it forwarded, and the dashboards poll the proxy on a refresh interval, so the Events table filled with observability.query rows and Top Tools / Total Calls / My Activity counted dashboard reads as MCP tool calls. These queries are admin-only, rate-limited, read-only queries against the platform's own metrics, not agent tool calls, so the audit emission has been removed from the proxy. Auth gating, the per-persona rate limit, and pass-through forwarding are unchanged. This stops new rows going forward; existing observability.query rows age out with the configured audit retention.
Mermaid node labels render in markdown assets (#522)
Mermaid diagrams in text/markdown portal assets rendered with invisible node labels: boxes and edges drew correctly but the text inside nodes did not, while subgraph titles stayed visible. The SVG-only DOMPurify sanitizer was stripping the HTML <foreignObject> content that Mermaid v11 uses for node labels. A single shared sanitizeMermaidSvg() now allows foreignObject HTML through while still stripping <script>, <iframe>, and inline event-handler attributes, including when smuggled inside a foreignObject.
Dependency and CI updates
- mcp-s3 bumped to v1.2.0 (adds PutObjectStream; see #538).
- go.opentelemetry.io/otel/exporters/prometheus bumped (#528).
- vitest bumped in the UI (#532).
- docker/setup-buildx-action 4.0.0 to 4.1.0 (#527) and docker/setup-qemu-action 4.0.0 to 4.1.0 (#526).
Changelog
Features
- 3360ec3: feat(apigateway): add global in-flight memory budget and raw streaming passthrough (#535) (#536) (@cjimti)
- eafc3d0: feat(apigateway): stream api_export directly to S3 instead of buffering the full body (#537) (#538) (@cjimti)
- 864b420: feat(indexing): bound index_jobs history with retention + paginate the jobs table (#524) (@cjimti)
Bug Fixes
- f285682: fix(apigateway): strip query string before operation match so list traffic stops resolving to "unknown" (#519) (#520) (@cjimti)
- 6b48af2: fix(gatewayhttp): map auth/authz denials to 403/401 instead of 500 (#533) (#534) (@cjimti)
- 4417fe6: fix(observability): stop auditing proxy queries that flooded the audit trail (#518) (@cjimti)
- 19943fe: fix(ui): preserve mermaid node labels in markdown assets (#521) (#522) (@cjimti)
Others
- 4e9057f: ci: bump docker/setup-buildx-action from 4.0.0 to 4.1.0 (#527) (@dependabot[bot])
- c2f9d27: ci: bump docker/setup-qemu-action from 4.0.0 to 4.1.0 (#526) (@dependabot[bot])
- 9e97d35: ci: bump vitest in /ui in the npm_and_yarn group across 1 directory (#532) (@dependabot[bot])
- 10a3686: deps: bump go.opentelemetry.io/otel/exporters/prometheus (#528) (@dependabot[bot])
Installation
Homebrew (macOS)
brew install txn2/tap/mcp-data-platform
Claude Code CLI
claude mcp add mcp-data-platform -- mcp-data-platform
Docker
docker pull ghcr.io/txn2/mcp-data-platform:v1.77.0
Verification
All release artifacts are signed with Cosign. Verify with:
cosign verify-blob --bundle mcp-data-platform_1.77.0_linux_amd64.tar.gz.sigstore.json \
mcp-data-platform_1.77.0_linux_amd64.tar.gz