mcp-data-platform-v1.69.0

@github-actions github-actions released this 29 May 00:45
98d5047

mcp-data-platform v1.69.0

This release delivers the observability epic (#459) end to end: inbound API gateway metrics, an authenticated PromQL query proxy to read them, audit event classification to separate MCP from gateway traffic, and a ground-up redesign of the admin dashboard into a dense, operator-grade observability console. It also hardens connection management so a single malformed connection config can no longer crash-loop the server on startup.

Highlights

  • Redesigned admin dashboard into a four-tab observability console (MCP, API Gateway, Health, Events) with d3 visualizations.
  • Authenticated PromQL query proxy so the portal reads Prometheus through the platform's own auth and persona model, keeping Prometheus on the internal network.
  • Inbound API gateway metrics with per-endpoint OpenAPI operation_id labels and cardinality-safe bounded label sets.
  • Audit events tagged with event_kind (mcp_tool_call vs apigateway_invoke) so gateway traffic no longer drowns the human MCP signal.
  • Connection config validation on save plus skip-and-warn on bad instances at startup.

Features

Redesigned admin observability dashboard (#464, #495)

Turns the /admin Dashboard into a four-tab observability console, replacing the previous thin stat-card layout. The standalone "Audit Log" sidebar item is removed; /admin/audit remains as a backward-compatible alias. The user-facing "My Activity" page is untouched; observability stays admin-only via the observability:read capability gate.

  • MCP: tool-call activity from the audit database, scoped to event_kind=mcp_tool_call. KPI tiles with d3 sparklines and period-over-period deltas, a weekday-by-hour usage heatmap, a log-scale latency percentile panel (p50 / avg / p95 / p99 / max), and breakdown bars by tool, user, persona, and toolkit.
  • API Gateway: inbound and outbound HTTP metrics via the PromQL proxy. Status-class (2xx / 4xx / 5xx) stacked area over time, a connection-to-operation Sankey (volume-sorted), and connection-to-endpoint drilldown with per-connection totals, error rate, and p50/p95/p99 latency.
  • Health (new): one full-width status band per node, designed for the typical 1-3 node install. Per-node uptime, CPU, memory (RSS), heap, goroutines, and in-flight counts (per node, not fleet aggregates), a recent-restart badge, and per-node CPU and memory trend charts. Node identity is the Kubernetes pod name, with host:port fallback off-cluster.
  • Events: the existing faceted, sortable audit-log table.

Visualizations use modular d3 (d3-scale, d3-shape, d3-array, d3-sankey, d3-scale-chromatic) for the heatmap, sankey, latency track, and sparklines where recharts falls short; new chart components live in ui/src/components/charts/ (FlowSankey, LatencyPanel, MetricTile, Sparkline, StatusStackChart, TrendArea, UsageHeatmap). The API Gateway and Health tabs degrade to a clean "metrics unavailable" empty state when Prometheus is not configured; MCP and Events read the audit database and work with no Prometheus at all.

Authenticated PromQL query proxy (#462, #494)

Adds a thin authenticated proxy (pkg/observability/proxy/) so the portal can query Prometheus without exposing an internal service to the browser:

Endpoint                                 Forwards to
GET /api/v1/observability/query          Prometheus /api/v1/query
GET /api/v1/observability/query_range    Prometheus /api/v1/query_range

The upstream response body is returned unchanged. Each request must be authenticated, and the caller's persona must grant observability:read (default-deny; admin personas with allow: ["*"] get it automatically). A per-persona token-bucket rate limit (default 10 requests per second) returns 429 when exceeded, and every query is audited as observability.query with the PromQL truncated to 1 KB. When Prometheus is unconfigured, the endpoints return 503 so the portal renders an empty state. The forward request is built from a pre-parsed, validated base URL with a fixed path; only encoded query-string values are request-controlled. The proxy is configured under observability.prometheus in platform.yaml.

Inbound API gateway metrics with per-endpoint labels (#460, #493)

The REST shim POST /api/v1/gateway/{connection}/invoke previously had no instrumentation; only outbound calls to upstreams were measured. This adds two OTel instruments recorded by HTTP middleware on the inbound path (pkg/observability/metrics.go, pkg/gatewayhttp/metrics.go):

  • apigateway_inbound_requests_total{connection, operation_id, method, status_class, identity}
  • apigateway_inbound_duration_seconds{connection, operation_id, method, status_class} (no identity on the histogram, by cardinality design)

operation_id is the OpenAPI operationId resolved from the connection's catalog by path-template matching (new lazily-built per-connection gorillamux router, pkg/toolkits/apigateway/operation_resolver.go), falling back to unknown. Label sets are bounded for cardinality safety: identity is recorded on the counter only, connection is clamped to the registered-connection set, and method is clamped to a supported-method allowlist, so arbitrary request input cannot mint unbounded label values.
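
A minimal sketch of the label clamping described above, assuming a fixed method allowlist and a set of registered connection names. The helper names, the allowlist contents, and the "other" and "unknown" fallback values for method and connection are assumptions for illustration; the release only states that operation_id falls back to unknown.

```go
package main

import (
	"fmt"
	"net/http"
)

// allowedMethods is an assumed allowlist; the real supported set may differ.
var allowedMethods = map[string]bool{
	http.MethodGet:    true,
	http.MethodPost:   true,
	http.MethodPut:    true,
	http.MethodPatch:  true,
	http.MethodDelete: true,
}

// clampMethod maps any method outside the allowlist to one bucket value.
func clampMethod(method string) string {
	if allowedMethods[method] {
		return method
	}
	return "other" // assumed fallback value
}

// clampConnection only lets registered connection names through as labels.
func clampConnection(name string, registered map[string]bool) string {
	if registered[name] {
		return name
	}
	return "unknown" // assumed fallback value
}

// statusClass reduces an HTTP status code to its class, e.g. 404 to "4xx".
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

func main() {
	registered := map[string]bool{"billing": true}
	fmt.Println(
		clampConnection("billing", registered),
		clampConnection("made-up-by-client", registered),
		clampMethod("BREW"),
		statusClass(503),
	)
}
```

With every label drawn from a finite set, the number of time series per instrument is bounded by the product of the set sizes, regardless of what clients send.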

Audit event classification by event_kind (#465, #491)

Adds an event_kind column to audit_logs, populated at write time, that categorizes every audited tool call as mcp_tool_call or apigateway_invoke. The kind is derived from the toolkit kind (not tool-name matching), so it is stable against tool renames. Migration 000048_audit_event_kind adds the nullable column and backfills existing rows. The filter is wired through every admin endpoint that powers dashboard data: event list, stats, the filter dropdown, and all six metrics endpoints (timeseries, breakdown, overview, performance, enrichment, discovery). An integration test drives the real middleware-to-adapter-to-store path and asserts that each kind is derived and persisted correctly.
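
Deriving the kind from the toolkit kind rather than the tool name might look like the sketch below. The function name eventKindFor is hypothetical, and the assumption that the toolkit kind string "apigateway" is the one mapped to apigateway_invoke is not confirmed by these notes.

```go
package main

import "fmt"

const (
	eventKindMCPToolCall      = "mcp_tool_call"
	eventKindAPIGatewayInvoke = "apigateway_invoke"
)

// eventKindFor classifies an audited call by the kind of toolkit that served
// it. Tool names are deliberately ignored, so renaming a tool cannot change
// how its audit rows are classified.
func eventKindFor(toolkitKind string) string {
	if toolkitKind == "apigateway" { // assumed toolkit kind identifier
		return eventKindAPIGatewayInvoke
	}
	return eventKindMCPToolCall
}

func main() {
	for _, kind := range []string{"trino", "apigateway"} {
		fmt.Println(kind, "->", eventKindFor(kind))
	}
}
```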

Fixes

Connection config validation and resilient startup (#467, #481, #490)

A malformed connection_instances row could crash-loop the server, because the aggregate toolkit loader aborted the entire toolkit kind on the first parse failure, and the admin save endpoint persisted rows without per-kind validation.

  • Write-time: ValidateConnectionConfig (pkg/registry/factories.go) dispatches to the per-kind ParseConfig before the store write; the admin save endpoint returns 400 on invalid config.
  • Read-time: ParseMultiConfig in the trino, gateway, and apigateway toolkits and the registry loader now log a warning and skip bad instances instead of failing, so one bad row never blocks startup.
  • Portal: required-field indicators on Host / Endpoint / Base URL, the Save button disabled until required fields are filled, and human-readable field labels in the connection detail view (ui/src/pages/settings/ConnectionsPanel.tsx).
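
The read-time behaviour can be sketched as a loader that logs and skips rows that fail to parse instead of returning the first error. The connConfig type, its single required host field, and the function names are hypothetical stand-ins, not the actual ParseConfig or ParseMultiConfig signatures.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
)

// connConfig is a hypothetical per-kind config with one required field.
type connConfig struct {
	Host string `json:"host"`
}

// parseConfig decodes and validates a single stored row.
func parseConfig(raw []byte) (connConfig, error) {
	var c connConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		return connConfig{}, fmt.Errorf("decode config: %w", err)
	}
	if c.Host == "" {
		return connConfig{}, errors.New("host is required")
	}
	return c, nil
}

// parseAll keeps every row that parses and warns about the rest, so one bad
// row cannot prevent the remaining connections from loading.
func parseAll(rows map[string][]byte) map[string]connConfig {
	out := make(map[string]connConfig, len(rows))
	for name, raw := range rows {
		c, err := parseConfig(raw)
		if err != nil {
			log.Printf("warning: skipping connection %q: %v", name, err)
			continue
		}
		out[name] = c
	}
	return out
}

func main() {
	rows := map[string][]byte{
		"warehouse": []byte(`{"host":"trino.internal"}`),
		"broken":    []byte(`{"host":`), // malformed JSON
	}
	fmt.Println("loaded connections:", len(parseAll(rows)))
}
```

Running the same per-row parser before the store write is what lets a save endpoint reject the row up front with a 400, so bad rows should rarely reach the loader at all.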

Documentation

docs/server/observability.md, docs/server/audit.md, the LLM-readable docs/llms.txt and docs/llms-full.txt, and the generated Swagger (internal/apidocs/) were updated for the new metrics, proxy endpoints, audit field, and filters.

Full Changelog: v1.68.0...v1.69.0