mcp-data-platform-v1.62.1
Hotfix: api-gateway embedding workers now actually start
v1.62.0 shipped the Postgres-backed embedding job queue (#421) but the worker, reaper, reconciler, and listener never ran on live deployments. The store handle was created, the wire function returned cleanly, no error logged, and no goroutine processed the queue. The API Catalogs UI rendered 0/N specs indexed forever; the apigateway embed jobs: started log never appeared.
This release fixes the lifecycle ordering bug responsible.
PR #422.
Root cause
Three pieces had to be true at once. They were:
Lifecycle.OnStartdid not checkstarted. Any callback registered afterStart()was appended to a slice that was never iterated again. Silent drop.platform.Startruns insideserver.Newatinternal/server/server.go:36, settinglifecycle.started = truebefore control returns to the caller.- The api-gateway embed-jobs wire runs in
startHTTPServeratcmd/mcp-data-platform/main.go:290, which is called afterserver.Newreturns. That wire function registers the worker / reaper / reconciler boot viap.lifecycle.OnStart(...). Lifecycle was already started by this point, so the OnStart callback was appended to a slice that would never be read.
Verified against the live v1.62.0 pod: only the database / config / toolkit / HTTP-server boot logs appeared. No apigateway embed jobs: started. No apigateway embed jobs: skipped (...) either.
Fix
Lifecycle.OnStart now branches on started. If false, it queues the callback as before. If true, it invokes the callback immediately with context.Background() and logs any error at warn level:
func (l *Lifecycle) OnStart(callback func(context.Context) error) {
l.mu.Lock()
if !l.started {
l.startCallbacks = append(l.startCallbacks, callback)
l.mu.Unlock()
return
}
l.mu.Unlock()
if err := callback(context.Background()); err != nil {
slog.Warn("lifecycle: late-registered OnStart callback failed", "error", err)
}
}This closes the entire bug class. Embed-jobs is the obvious victim today, but any future toolkit wired from startHTTPServer would have hit the same trap. The lock is released before invoking the callback so the callback can register OnStop hooks without deadlocking.
Tests added
Three regression tests in pkg/platform/lifecycle_test.go:
TestLifecycle_OnStartAfterStarted_FiresImmediately. the canonical regression: construct a lifecycle, callStart, thenOnStart, assert the callback fired. This is the test whose absence let v1.62.0 ship broken.TestLifecycle_OnStartBeforeStarted_DeferredUntilStart. confirms the pre-Start path still defers.TestLifecycle_OnStartAfterStarted_ErrorIsLogged. confirms a failing late-registered callback does not panic or deadlock.
Upgrade notes
- Drop-in upgrade from v1.62.0. No schema change, no config change, no API change.
- Existing v1.62.0 deployments stay non-functional for api-gateway embedding until pods are rolled. Roll to v1.62.1, then the worker / reaper / reconciler / listener start as the wire calls run. The reconciler runs on boot and converges any spec where
operation_count <> embedding_count. - Watch for the
apigateway embed jobs: startedlog on pod boot to confirm the fix took effect.
Why this slipped through
Honest accounting:
- Every existing test called
worker.Start(ctx)directly. None constructed aLifecycle, calledStart, then calledOnStart, and verified the callback fired. The lifecycle path was unit-tested in pieces but never end-to-end in the production order. - The pre-commit adversarial review reads the diff. It caught the SQL and dedup bugs in #421 because those were visible in the code. It did not catch lifecycle ordering because that requires simulating the boot sequence.
- No live verification before tagging v1.62.0. A single hit against
/api/v1/admin/api-catalogs/{id}/embedding-healthafter deploy would have shown0/N indexedimmediately.
The structural fix is the new regression test, which is the test that would have caught this.
Installation
Homebrew (macOS)
brew install txn2/tap/mcp-data-platformClaude Code CLI
claude mcp add mcp-data-platform -- mcp-data-platformDocker
docker pull ghcr.io/txn2/mcp-data-platform:v1.62.1Verification
All release artifacts are signed with Cosign. Verify with:
cosign verify-blob --bundle mcp-data-platform_1.62.1_linux_amd64.tar.gz.sigstore.json \
mcp-data-platform_1.62.1_linux_amd64.tar.gz