mcp-data-platform-v1.57.5

@github-actions github-actions released this 01 May 18:43
· 152 commits to main since this release
1e6abfd

Highlights

This release fixes the gateway's OAuth-authorized Connect flow against MCP servers fronted by buffering reverse proxies (Cloudflare, etc.) and removes a toolkit-wide mutex bottleneck that caused the admin connections page to hang whenever any single connection was misbehaving.

Bug fixes

Gateway Connect flow no longer hangs ~125s on Cloudflare-fronted upstreams

When an upstream MCP server was fronted by a proxy that buffers responses (Cloudflare being the most common case), clicking Connect in the admin UI took ~125 seconds to redirect after a successful OAuth flow.

The MCP go-sdk's Client.Connect() opens an optional standalone SSE GET / stream synchronously after initialize succeeds (MCP spec §2.2.3). Buffering proxies hold the SSE response open waiting for the upstream to produce events; in steady state there are none to push, so the proxy times out at ~100s with HTTP 524. The synchronous Connect therefore took 100s+ to return, and the dial context expired before the SDK could send notifications/initialized. Operators saw this as a long-spinning "Loading..." after Connect and the misleading log message round trip: context deadline exceeded on notifications/initialized.

Fix: the gateway client now sets StreamableClientTransport.DisableStandaloneSSE = true. Gateway forwarding is request/response only — every tool call is a discrete tools/call from the platform with a synchronous response. We don't currently consume server-pushed notifications, so disabling the standalone SSE stream eliminates the proxy hang without losing functionality.

If streaming-tool support is added in the future, this can be revisited (either re-enable the standalone SSE and disable proxy buffering for the relevant path, or open the GET stream lazily on demand).

Toolkit mutex no longer blocks during network I/O

Toolkit.mu was held during three network-I/O operations:

Method               Network I/O held under lock
addParsedConnection  discover() — full MCP handshake + ListTools
RemoveConnection     client.close() — DELETE the MCP session against the upstream
Close                every connection's client.close() in sequence

Result: any slow connection serialized all toolkit operations. Status polls (the admin UI poll loop), ListConnections (the connections page itself), Tools, even RemoveConnection of an unrelated connection — every one of them blocked behind a single slow handshake.

Fix: mirrors the existing SetTokenStore pattern (snapshot under lock, do I/O outside the lock, install under lock):

  • addParsedConnection is split into claimConnectionSlot (briefly under lock, inserts a sentinel into the connections map), discover() (no lock — the actual network I/O), and installDialResult (briefly under lock, replaces the sentinel with the live connection / placeholder / drop).
  • RemoveConnection pops the entry from the map under the lock; client.close() runs outside the lock.
  • Close snapshots all live client pointers under the lock; closes them outside.

Status, ListConnections, Tools, HasConnection, and RegisterTools are unchanged — they were already correct in not doing network I/O, but were transitively blocked by the writers above.
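
The claim / dial / install split can be sketched as follows. This is a minimal illustration, not the toolkit's actual code: the toolkit and conn types, the add and list methods, and the dial callback are hypothetical stand-ins for Toolkit, addParsedConnection, ListConnections, and discover().

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// conn is a hypothetical stand-in for a connection entry.
type conn struct {
	healthy bool
}

// toolkit is a simplified, hypothetical model of the gateway toolkit.
type toolkit struct {
	mu          sync.RWMutex
	connections map[string]*conn
}

func newToolkit() *toolkit {
	return &toolkit{connections: make(map[string]*conn)}
}

// add claims a slot under the lock, dials with no lock held,
// then installs the result under the lock again.
func (t *toolkit) add(name string, dial func() error) error {
	// 1. Claim: brief critical section, no I/O.
	t.mu.Lock()
	if _, exists := t.connections[name]; exists {
		t.mu.Unlock()
		return fmt.Errorf("connection %q already exists", name)
	}
	claim := &conn{} // sentinel, visible to readers as not healthy
	t.connections[name] = claim
	t.mu.Unlock()

	// 2. Network I/O: readers and other writers are free to proceed.
	err := dial()

	// 3. Install: brief critical section, only if the slot is still ours.
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.connections[name] != claim {
		return fmt.Errorf("connection %q was removed during dial", name)
	}
	if err != nil {
		delete(t.connections, name)
		return err
	}
	t.connections[name] = &conn{healthy: true}
	return nil
}

// list never performs I/O, so it only waits for the short critical sections.
func (t *toolkit) list() []string {
	t.mu.RLock()
	defer t.mu.RUnlock()
	out := make([]string, 0, len(t.connections))
	for name, c := range t.connections {
		out = append(out, fmt.Sprintf("%s healthy=%t", name, c.healthy))
	}
	sort.Strings(out)
	return out
}

func main() {
	t := newToolkit()
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error, 1)
	go func() {
		done <- t.add("slow", func() error {
			close(started)
			<-release // simulate a slow handshake
			return nil
		})
	}()
	<-started
	fmt.Println(t.list()) // returns while the dial is still in flight
	close(release)
	fmt.Println(<-done, t.list())
}
```

In this sketch the reader sees the sentinel entry while the dial is blocked, which is also why an in-flight connection becomes visible to list-style calls before the dial completes.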

Concurrent-claim race in installDialResult

The new claim-sentinel mechanism initially used a claiming boolean to identify an in-flight slot. Under a Remove + Add cycle during a slow dial, this could allow the first dial's stale result to overwrite the second caller's freshly-claimed slot:

T1 AddConnection (slow upstream) → claims sentinel S1 → dialing
RemoveConnection                  → deletes S1
T2 AddConnection (different upstream) → claims sentinel S2 → dialing
T1's dial completes WHILE T2 still claiming
  → existing.claiming == true → would install T1's result over S2
T2's dial completes
  → existing.claiming == false → discards
Net: T2's connection is silently replaced by T1's stale dial.

Fix: claimConnectionSlot returns the inserted *upstream, and installDialResult checks pointer identity before installing: if t.connections[name] != claim, the dial result is discarded. A Remove + Add cycle replaces the entry with a different pointer, so T1's stale dial is correctly discarded regardless of how many cycles overlap a slow dial.
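
A minimal sketch of the pointer-identity gate, replaying the T1/T2 sequence above deterministically. The registry type and the claimSlot, remove, and install names are hypothetical simplifications of claimConnectionSlot, RemoveConnection, and installDialResult; only the comparison against the claimed pointer is taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// upstream is a hypothetical, reduced version of the per-connection entry.
type upstream struct {
	endpoint string
	live     bool
}

type registry struct {
	mu          sync.Mutex
	connections map[string]*upstream
}

// claimSlot inserts a sentinel and returns its pointer so the caller
// can later prove it still owns the slot.
func (r *registry) claimSlot(name, endpoint string) (*upstream, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, taken := r.connections[name]; taken {
		return nil, false
	}
	claim := &upstream{endpoint: endpoint}
	r.connections[name] = claim
	return claim, true
}

func (r *registry) remove(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.connections, name)
}

// install replaces the sentinel with the dial result only when the slot
// still holds the exact pointer that this caller claimed.
func (r *registry) install(name string, claim, result *upstream) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.connections[name] != claim {
		return false // slot was removed or re-claimed: stale dial, discard
	}
	r.connections[name] = result
	return true
}

func main() {
	r := &registry{connections: make(map[string]*upstream)}

	s1, _ := r.claimSlot("db", "https://slow.example") // T1 claims S1
	r.remove("db")                                     // RemoveConnection deletes S1
	s2, _ := r.claimSlot("db", "https://fast.example") // T2 claims S2

	// T1's dial completes while T2 is still dialing.
	fmt.Println(r.install("db", s1, &upstream{endpoint: s1.endpoint, live: true})) // false
	// T2's dial completes.
	fmt.Println(r.install("db", s2, &upstream{endpoint: s2.endpoint, live: true})) // true
	fmt.Println(r.connections["db"].endpoint)                                      // https://fast.example
}
```

A boolean flag cannot distinguish S1 from S2 because both sentinels look identical while in flight; the pointer is unique per claim, so no extra generation counter is needed in this sketch.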

Tests

Six new concurrency tests in pkg/toolkits/gateway/toolkit_test.go:

  • Mutex-contention proofs — start an AddConnection against a slow upstream, then assert a parallel toolkit operation returns within 200ms:
    • TestStatus_DoesNotBlockDuringSlowAddConnection
    • TestListConnections_DoesNotBlockDuringSlowAddConnection
    • TestRemoveConnection_OfDifferentName_DoesNotBlockDuringSlowAddConnection
  • Claim-state contract — exercise the new claim sentinel under contention:
    • TestAddConnection_SameName_OnlyOneWins
    • TestRemoveConnection_DuringSlowAdd_DiscardsResult
    • TestAddConnection_RemoveAndReAdd_DuringSlowDial_DoesNotCorruptSlot (verified to fail without the pointer-identity check; passes with it)

All six are stable across 10× race-detector runs.

Behavior change visible to admin-UI users

When AddConnection is in flight, the connection appears immediately in ListConnections / Status with the description Connecting to <endpoint> and Healthy=false. Previously the connection didn't show up until the dial completed. The new transient state is brief (typically sub-second against a healthy upstream). On dial failure, the entry either transitions to an Awaiting OAuth authorization placeholder (for authorization_code grants) or is removed from the map (other failure modes).

Files changed

  • pkg/toolkits/gateway/client.go — DisableStandaloneSSE: true on the SDK transport
  • pkg/toolkits/gateway/toolkit.go — mutex refactor; claiming field + pointer-identity check on upstream; helpers extracted from addParsedConnection
  • pkg/toolkits/gateway/toolkit_test.go — six new concurrency tests

Installation

Homebrew (macOS)

brew install txn2/tap/mcp-data-platform

Claude Code CLI

claude mcp add mcp-data-platform -- mcp-data-platform

Docker

docker pull ghcr.io/txn2/mcp-data-platform:v1.57.5

Verification

All release artifacts are signed with Cosign. Verify with:

cosign verify-blob --bundle mcp-data-platform_1.57.5_linux_amd64.tar.gz.sigstore.json \
  mcp-data-platform_1.57.5_linux_amd64.tar.gz

Full changelog

v1.57.4...v1.57.5