Skip to content

mcp-data-platform-v1.62.1

Choose a tag to compare

@github-actions github-actions released this 17 May 00:37
· 107 commits to main since this release
1731da1

Hotfix: api-gateway embedding workers now actually start

v1.62.0 shipped the Postgres-backed embedding job queue (#421) but the worker, reaper, reconciler, and listener never ran on live deployments. The store handle was created, the wire function returned cleanly, no error logged, and no goroutine processed the queue. The API Catalogs UI rendered 0/N specs indexed forever; the apigateway embed jobs: started log never appeared.

This release fixes the lifecycle ordering bug responsible.

PR #422.

Root cause

Three pieces had to be true at once. They were:

  1. Lifecycle.OnStart did not check started. Any callback registered after Start() was appended to a slice that was never iterated again. Silent drop.
  2. platform.Start runs inside server.New at internal/server/server.go:36, setting lifecycle.started = true before control returns to the caller.
  3. The api-gateway embed-jobs wire runs in startHTTPServer at cmd/mcp-data-platform/main.go:290, which is called after server.New returns. That wire function registers the worker / reaper / reconciler boot via p.lifecycle.OnStart(...). Lifecycle was already started by this point, so the OnStart callback was appended to a slice that would never be read.

Verified against the live v1.62.0 pod: only the database / config / toolkit / HTTP-server boot logs appeared. No apigateway embed jobs: started. No apigateway embed jobs: skipped (...) either.

Fix

Lifecycle.OnStart now branches on started. If false, it queues the callback as before. If true, it invokes the callback immediately with context.Background() and logs any error at warn level:

func (l *Lifecycle) OnStart(callback func(context.Context) error) {
    l.mu.Lock()
    if !l.started {
        l.startCallbacks = append(l.startCallbacks, callback)
        l.mu.Unlock()
        return
    }
    l.mu.Unlock()
    if err := callback(context.Background()); err != nil {
        slog.Warn("lifecycle: late-registered OnStart callback failed", "error", err)
    }
}

This closes the entire bug class. Embed-jobs is the obvious victim today, but any future toolkit wired from startHTTPServer would have hit the same trap. The lock is released before invoking the callback so the callback can register OnStop hooks without deadlocking.

Tests added

Three regression tests in pkg/platform/lifecycle_test.go:

  • TestLifecycle_OnStartAfterStarted_FiresImmediately. the canonical regression: construct a lifecycle, call Start, then OnStart, assert the callback fired. This is the test whose absence let v1.62.0 ship broken.
  • TestLifecycle_OnStartBeforeStarted_DeferredUntilStart. confirms the pre-Start path still defers.
  • TestLifecycle_OnStartAfterStarted_ErrorIsLogged. confirms a failing late-registered callback does not panic or deadlock.

Upgrade notes

  • Drop-in upgrade from v1.62.0. No schema change, no config change, no API change.
  • Existing v1.62.0 deployments stay non-functional for api-gateway embedding until pods are rolled. Roll to v1.62.1, then the worker / reaper / reconciler / listener start as the wire calls run. The reconciler runs on boot and converges any spec where operation_count <> embedding_count.
  • Watch for the apigateway embed jobs: started log on pod boot to confirm the fix took effect.

Why this slipped through

Honest accounting:

  • Every existing test called worker.Start(ctx) directly. None constructed a Lifecycle, called Start, then called OnStart, and verified the callback fired. The lifecycle path was unit-tested in pieces but never end-to-end in the production order.
  • The pre-commit adversarial review reads the diff. It caught the SQL and dedup bugs in #421 because those were visible in the code. It did not catch lifecycle ordering because that requires simulating the boot sequence.
  • No live verification before tagging v1.62.0. A single hit against /api/v1/admin/api-catalogs/{id}/embedding-health after deploy would have shown 0/N indexed immediately.

The structural fix is the new regression test, which is the test that would have caught this.

Installation

Homebrew (macOS)

brew install txn2/tap/mcp-data-platform

Claude Code CLI

claude mcp add mcp-data-platform -- mcp-data-platform

Docker

docker pull ghcr.io/txn2/mcp-data-platform:v1.62.1

Verification

All release artifacts are signed with Cosign. Verify with:

cosign verify-blob --bundle mcp-data-platform_1.62.1_linux_amd64.tar.gz.sigstore.json \
  mcp-data-platform_1.62.1_linux_amd64.tar.gz