mcp-data-platform-v1.64.0
Highlights
- API gateway embed-job queue is observable and parallelizable. A long-running spec embed no longer sits silent at
0/N indexedfor minutes. The worker publishes a chunk-progress counter on every batch boundary so the catalog badge ticks upward live, and a new config knob lets multiple worker goroutines drain the queue in parallel. - Ollama batch embedding is now actually batched.
EmbedBatchcalls Ollama'sPOST /api/embed(singular path, plural input field) instead of N sequentialPOST /api/embeddings. Modern servers return one HTTP round-trip per batch; older servers are detected on the first 404 and the path transparently falls back. A 164-operation spec on CPU-only Ollama drops from ~10 minutes to model-inference floor. - Embedding provider misconfiguration is visible and fail-safe. When
memory.embedding.provideris unset the platform substitutes the noop placeholder and now: logs one actionable WARN at startup, refuses to wire the apigateway embed-job queue (no more[0,0,...,0]rows inapi_catalog_operation_embeddingsmasquerading as indexed), persists memory writes withEmbedding: nil, exposes the state viaGET /api/v1/admin/embedding/status, and surfaces an amber banner in the API Catalogs and Memory panels. - SRE-grade observability at the chokepoints. Prometheus metrics on every
tools/call(latency P50/P95/P99, error rate, in-flight counts) and on every outboundapi_invoke_endpoint(HTTP status class, transport-error category, per-connection latency). Cardinality-bounded by design: a closed allow-list of attributes, user-identifying fields kept off labels. - Correct gateway semantics for
api_invoke_endpoint. A 4xx/5xx response from the upstream is no longer reported as a gateway failure. Wire-level HTTP from the REST shim stays 200 with the upstream code embedded in the payload; only true transport failures and timeouts surface as 502 / 504. - Auth chain output is quieter and easier to correlate. Successful fallthroughs between OIDC, API-key, and admin-key auth no longer log routine fallthroughs as warnings. Every auth decision carries a request-correlation ID so multi-step flows can be reconstructed from a single grep.
Operator-visible changes
New configuration
apigateway:
embed_jobs:
workers: 1 # default; raise to 2 or 4 if you have many specs and a fast embedderCPU-only embedders typically saturate at 1 because the bottleneck is the model. Increase only when you have evidence the worker is the bottleneck.
New admin endpoints
| Method | Path | Purpose |
|---|---|---|
| GET | /api/v1/admin/embedding/status |
Returns {kind, model, dimension, status}. status="unconfigured" indicates the noop placeholder is in use; the portal renders an amber banner in this state and the apigateway embed-job queue refuses to start. |
| GET | /metrics |
Prometheus scrape endpoint. Histograms + counters on tool-call latency, outbound apigateway latency by HTTP status class, and audit-write rate. |
Wire-format additions (backward compatible, omitempty)
GET /api/v1/admin/api-catalogs/{id}/embedding-status response gains:
{
"embedded_so_far": 12
}embedded_so_far is the worker's in-flight chunk-progress counter. While job_status == "running" the portal renders this against operation_count so a long embed pass shows incremental progress instead of staying at 0/N until the final atomic upsert commits embedding_count.
Database migration
000046_api_catalog_embedding_jobs_progress adds embedded_so_far INTEGER NOT NULL DEFAULT 0 to api_catalog_embedding_jobs. On PostgreSQL 16 this is metadata-only for INTEGER columns; no table rewrite, no extended lock. Existing deployments roll forward automatically on platform start.
UI
- API Catalogs panel: per-spec badge ticks
indexing N/Mupward during a running embed, distinct fromqueued(amber, 0/M) andN/N indexed(green). - API Catalogs + Memory tab on My Knowledge: amber banner when
memory.embedding.provideris unset, naming the config key and clarifying that semantic features are disabled while lexical / keyword / entity-URN lookup still works.
Detailed changes
Features
- #428 / #431. Prometheus observability at platform chokepoints. Histograms + counters on the
MCPToolCallMiddlewareand onapi_invoke_endpointoutbound HTTP. Two instrumentation sites cover every toolkit at once: every tool call goes through the middleware, every model-driven API call goes through the gateway. Cardinality allowlist:tool,connection,persona,status_category,http_status_class,toolkit_kind. Forbidden labels: anything user-identifying (request IDs go on trace spans, not Prometheus labels). Pure Prometheus exporter; no OTel agent required. - #430 / #437. Incremental embed progress + configurable worker concurrency. Worker publishes
embedded_so_farat every chunk boundary via a newStore.UpdateProgresscall; the column lives inapi_catalog_embedding_jobsand is read alongside the existing job-status fields. Newapigateway.embed_jobs.workersconfig (default 1) spawns N goroutines that share the queue; existing FOR UPDATE SKIP LOCKED + lease guarantee keeps them from racing on the same job. After each successful Claim the worker notifies a sibling so a backlog drains in parallel. Worker exposesConcurrency()for assertion in wiring tests.
Bug fixes
- #429 / #436. Silent noop embedding provider stored
[0,0,...,0]vectors. Five defenses now refuse to persist zero vectors: apigateway wiring skips the queue when the embedder is noop; memoryhandleRememberandhandleUpdatepersistEmbedding: nilinstead of zero vectors; knowledgecapture_insightskips the memory-store update under noop;Providerinterface gainsKind() string+ a package-levelIsConfigured(p) boolhelper for one-line guards; startup logs one WARN naming the config key. Existing recall-side guard atmemory/recall.go:127already refused all-zero query embeddings; the write side now mirrors that. - #435. Ollama
EmbedBatchwas named "batch" but didn't batch. Sent N sequentialPOST /api/embeddingscalls (one per text). Now sends onePOST /api/embedper batch with the fullinputarray. Detects servers that lack the batch endpoint (HTTP 404) on the first call, logs one WARN naming the URL and model, transparently falls back to the singular path. Probe is paid at most once per provider lifetime. Eliminates 31 of every 32 HTTP round-trips on modern Ollama for the default batch size of 32. - #432 / #433.
api_invoke_endpointconflated upstream and transport failures. A 4xx/5xx response from the upstream was wrapped as a tool error (IsError=true) and the REST shim translated it to a 502, hiding the real upstream status. Now the wrappingCallToolResultreportsIsError=falsefor any upstream response with a status code;Statuscarries the upstream HTTP code; the REST shim returns 200 with the upstream code embedded in the body. Only true transport failures (DNS, dial, TLS, mid-stream EOF) and request-timeouts surface asIsError=trueand map to wire 502 / 504 respectively. Outcome classification (OutcomeOK,OutcomeUpstream4xx,OutcomeUpstream5xx,OutcomeTransportErr,OutcomeUpstreamTimeout) flows through to audit logs via_meta.audit_outcomeso operators can grep by category. - #434. Auth chain fallthrough noise and missing request correlation. OIDC followed by API-key followed by admin-key authentication logged every successful fallthrough as a WARN; an admin-key request produced two warnings for every call. Successful fallthroughs are now silent (Debug level); only auth failures or unexpected errors log at WARN. Every auth decision carries a request-correlation ID propagated from the inbound request through every chain step, so a multi-step auth flow can be reconstructed by grepping a single ID.
Upgrade notes
- Migration is automatic.
migrate.Runon startup applies000046. No operator action required. - No breaking config changes.
apigateway.embed_jobs.workersis new and defaults to 1, which preserves prior single-goroutine behavior. - No breaking wire-format changes. New JSON fields are
omitempty; consumers that ignore unknown fields are unaffected. - The
Kind() stringaddition to the publicembedding.Providerinterface is a source-incompatible change for external implementations. Both in-tree implementations (Ollama, noop) are updated. Anyone vendoring this module and implementingembedding.Providerexternally needs to addKind() stringreturning a stable identifier. - First scrape of
/metricsexposes zero values. Prometheus alerting rules that fire on absent series should be reviewed; on a freshly-deployed v1.64.0 some histograms only appear after the first relevant call.
Installation
Homebrew (macOS)
brew install txn2/tap/mcp-data-platformClaude Code CLI
claude mcp add mcp-data-platform -- mcp-data-platformDocker
docker pull ghcr.io/txn2/mcp-data-platform:v1.64.0Verification
All release artifacts are signed with Cosign. Verify with:
cosign verify-blob --bundle mcp-data-platform_1.64.0_linux_amd64.tar.gz.sigstore.json \
mcp-data-platform_1.64.0_linux_amd64.tar.gz