mcp-data-platform-v1.57.5
Highlights
This release fixes the gateway's OAuth-authorized Connect flow against MCP servers fronted by buffering reverse proxies (Cloudflare, etc.) and removes a toolkit-wide mutex bottleneck that caused the admin connections page to hang whenever any single connection was misbehaving.
Bug fixes
Gateway Connect flow no longer hangs ~125s on Cloudflare-fronted upstreams
When an upstream MCP server was fronted by a proxy that buffers responses (Cloudflare being the most common case), clicking Connect in the admin UI took ~125 seconds to redirect after a successful OAuth flow.
The MCP go-sdk's Client.Connect() opens an optional standalone SSE GET / stream synchronously after initialize succeeds (MCP spec §2.2.3). Buffering proxies hold the SSE response open waiting for the upstream to produce events; in steady state there are none to push, so the proxy times out at ~100s with HTTP 524. The synchronous Connect therefore took 100s+ to return, and the dial context expired before the SDK could send notifications/initialized. Operators saw this as a long-spinning "Loading..." after Connect and the misleading log message round trip: context deadline exceeded on notifications/initialized.
Fix: the gateway client now sets StreamableClientTransport.DisableStandaloneSSE = true. Gateway forwarding is request/response only — every tool call is a discrete tools/call from the platform with a synchronous response. We don't currently consume server-pushed notifications, so disabling the standalone SSE stream eliminates the proxy hang without losing functionality.
If streaming-tool support is added in the future, this can be revisited (either re-enable the standalone SSE and disable proxy buffering for the relevant path, or open the GET stream lazily on demand).
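The failure mode can be reproduced with a toy model using only the standard library. This is a sketch, not the go-sdk's code: `newBufferingUpstream`, `connect`, and `dial` are hypothetical names, the 300ms deadline stands in for the real dial timeout, and the `disableStandaloneSSE` parameter merely mirrors the role of the transport field named above.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// newBufferingUpstream models an upstream behind a buffering proxy: POSTs are
// answered immediately, while the standalone GET stream is held open with no
// bytes written until the caller gives up (or a 2s safety cap elapses).
func newBufferingUpstream() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodGet {
			select {
			case <-r.Context().Done():
			case <-time.After(2 * time.Second):
			}
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

// do sends one request bound to ctx and discards the response body.
func do(ctx context.Context, method, url string) error {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// connect is a hypothetical stand-in for a synchronous connect sequence:
// initialize, optionally open the standalone stream, then send the
// initialized notification, all under a single dial context.
func connect(ctx context.Context, url string, disableStandaloneSSE bool) error {
	if err := do(ctx, http.MethodPost, url); err != nil {
		return fmt.Errorf("initialize: %w", err)
	}
	if !disableStandaloneSSE {
		// Blocks until response headers arrive or ctx expires. The stream is
		// optional, so its error is ignored here.
		_ = do(ctx, http.MethodGet, url)
	}
	if err := do(ctx, http.MethodPost, url); err != nil {
		return fmt.Errorf("notifications/initialized: %w", err)
	}
	return nil
}

// dial runs connect under a short deadline.
func dial(url string, disableStandaloneSSE bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()
	return connect(ctx, url, disableStandaloneSSE)
}

func main() {
	srv := newBufferingUpstream()
	defer srv.Close()
	fmt.Println("stream enabled: ", dial(srv.URL, false))
	fmt.Println("stream disabled:", dial(srv.URL, true))
}
```

In this model the held-open GET consumes the whole dial deadline, so the final step fails with a deadline error; skipping the GET lets the sequence finish well inside the deadline.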
Toolkit mutex no longer blocks during network I/O
Toolkit.mu was held during three network-I/O operations:
| Method | Network I/O held under lock |
|---|---|
| addParsedConnection | discover(): full MCP handshake + ListTools |
| RemoveConnection | client.close(): DELETE the MCP session against the upstream |
| Close | every connection's client.close() in sequence |
Result: any slow connection serialized all toolkit operations. Status polls (the admin UI poll loop), ListConnections (the connections page itself), Tools, even RemoveConnection of an unrelated connection — every one of them blocked behind a single slow handshake.
Fix: mirrors the existing SetTokenStore pattern (snapshot under lock, do I/O outside the lock, install under lock):
- addParsedConnection is split into claimConnectionSlot (briefly under lock, inserts a sentinel into the connections map), discover() (no lock; the actual network I/O), and installDialResult (briefly under lock, replaces the sentinel with the live connection / placeholder / drop).
- RemoveConnection pops the entry from the map under the lock; client.close() runs outside the lock.
- Close snapshots all live client pointers under the lock and closes them outside it.
Status, ListConnections, Tools, HasConnection, and RegisterTools are unchanged — they were already correct in not doing network I/O, but were transitively blocked by the writers above.
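The pop-under-lock / close-outside-lock shape used by RemoveConnection and Close can be sketched as follows. This is a minimal illustration, not the gateway's code: the `client` type with its `entered` and `release` channels is a hypothetical stand-in that simulates a slow upstream, and the `Toolkit` here carries only the fields the sketch needs.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// client is a hypothetical stand-in for an upstream MCP client whose close()
// performs slow network I/O.
type client struct {
	entered chan struct{} // closed when close() begins
	release chan struct{} // close() returns once this is closed
}

func newClient() *client {
	return &client{entered: make(chan struct{}), release: make(chan struct{})}
}

func (c *client) close() {
	close(c.entered)
	<-c.release // simulates waiting on the upstream
}

type Toolkit struct {
	mu      sync.Mutex
	clients map[string]*client
}

// RemoveConnection pops the entry under the lock and closes it outside.
func (t *Toolkit) RemoveConnection(name string) {
	t.mu.Lock()
	c, ok := t.clients[name]
	delete(t.clients, name)
	t.mu.Unlock()
	if ok {
		c.close() // no lock held during I/O
	}
}

// Close snapshots every client under the lock, then closes them outside it.
func (t *Toolkit) Close() {
	t.mu.Lock()
	snapshot := make([]*client, 0, len(t.clients))
	for _, c := range t.clients {
		snapshot = append(snapshot, c)
	}
	t.clients = map[string]*client{}
	t.mu.Unlock()
	for _, c := range snapshot {
		c.close()
	}
}

// ListConnections only reads the map, so it holds the lock briefly.
func (t *Toolkit) ListConnections() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	names := make([]string, 0, len(t.clients))
	for name := range t.clients {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	slow := newClient()
	tk := &Toolkit{clients: map[string]*client{"slow": slow, "fast": newClient()}}

	done := make(chan struct{})
	go func() { defer close(done); tk.RemoveConnection("slow") }()

	<-slow.entered                    // the slow close is now in progress
	fmt.Println(tk.ListConnections()) // returns at once: the lock is free
	close(slow.release)
	<-done
}
```

Had close() run while the mutex was held, the ListConnections call in main would wait for the slow close to finish, which is the serialization described above.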
Concurrent-claim race in installDialResult
The new claim-sentinel mechanism initially used a claiming boolean to identify an in-flight slot. Under a Remove + Add cycle during a slow dial, this could allow the first dial's stale result to overwrite the second caller's freshly-claimed slot:
T1 AddConnection (slow upstream) → claims sentinel S1 → dialing
RemoveConnection → deletes S1
T2 AddConnection (different upstream) → claims sentinel S2 → dialing
T1's dial completes WHILE T2 still claiming
→ existing.claiming == true → would install T1's result over S2
T2's dial completes
→ existing.claiming == false → discards
Net: T2's connection is silently replaced by T1's stale dial.
Fix: claimConnectionSlot returns the inserted *upstream; installDialResult gates on pointer identity (t.connections[name] != claim) before installing. A Remove + Add cycle replaces the entry with a different pointer, so T1's stale dial is correctly discarded regardless of how many cycles overlap a slow dial.
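A minimal sketch of the pointer-identity gate, replaying the interleaving above sequentially so it stays deterministic. The function names come from these notes, but the signatures, the fields on `upstream`, and the endpoints are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// upstream is a simplified stand-in for a connection entry; a freshly claimed
// sentinel has live == false.
type upstream struct {
	endpoint string
	live     bool
}

type Toolkit struct {
	mu          sync.Mutex
	connections map[string]*upstream
}

// claimConnectionSlot inserts a sentinel for name and returns it, or returns
// nil if the name is already taken.
func (t *Toolkit) claimConnectionSlot(name, endpoint string) *upstream {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, taken := t.connections[name]; taken {
		return nil
	}
	claim := &upstream{endpoint: endpoint}
	t.connections[name] = claim
	return claim
}

// installDialResult installs result only if the slot still holds the exact
// sentinel this caller claimed; otherwise the result is stale and dropped.
func (t *Toolkit) installDialResult(name string, claim, result *upstream) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if claim == nil || t.connections[name] != claim {
		return false
	}
	t.connections[name] = result
	return true
}

func (t *Toolkit) RemoveConnection(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.connections, name)
}

func main() {
	tk := &Toolkit{connections: map[string]*upstream{}}

	s1 := tk.claimConnectionSlot("warehouse", "https://slow.example") // T1 claims
	tk.RemoveConnection("warehouse")                                  // S1 deleted
	s2 := tk.claimConnectionSlot("warehouse", "https://new.example")  // T2 claims

	// T1's dial finishes first, while T2's sentinel occupies the slot.
	stale := tk.installDialResult("warehouse", s1, &upstream{endpoint: s1.endpoint, live: true})
	// T2's dial finishes afterwards.
	fresh := tk.installDialResult("warehouse", s2, &upstream{endpoint: s2.endpoint, live: true})

	fmt.Println(stale, fresh, tk.connections["warehouse"].endpoint)
}
```

Because S1 and S2 are distinct allocations, the comparison fails for T1 no matter what state S2 is in, which a boolean flag on the entry cannot express.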
Tests
Six new concurrency tests in pkg/toolkits/gateway/toolkit_test.go:
- Mutex-contention proofs: start an AddConnection against a slow upstream, then assert a parallel toolkit operation returns within 200ms:
  - TestStatus_DoesNotBlockDuringSlowAddConnection
  - TestListConnections_DoesNotBlockDuringSlowAddConnection
  - TestRemoveConnection_OfDifferentName_DoesNotBlockDuringSlowAddConnection
- Claim-state contract: exercise the new claim sentinel under contention:
  - TestAddConnection_SameName_OnlyOneWins
  - TestRemoveConnection_DuringSlowAdd_DiscardsResult
  - TestAddConnection_RemoveAndReAdd_DuringSlowDial_DoesNotCorruptSlot (verified to fail without the pointer-identity check; passes with it)
All six are stable across 10× race-detector runs.
Behavior change visible to admin-UI users
While AddConnection is in flight, the connection appears immediately in ListConnections / Status with the description Connecting to <endpoint> and Healthy=false. Previously the connection didn't show up until the dial completed. The new transient state is brief (typically sub-second against a healthy upstream). On dial failure, the entry either becomes an Awaiting OAuth authorization placeholder (for authorization_code grants) or is removed from the map (other failure modes).
Files changed
- pkg/toolkits/gateway/client.go: sets DisableStandaloneSSE = true on the SDK transport
- pkg/toolkits/gateway/toolkit.go: mutex refactor; claiming field + pointer-identity check on upstream; helpers extracted from addParsedConnection
- pkg/toolkits/gateway/toolkit_test.go: six new concurrency tests
Installation
Homebrew (macOS)
brew install txn2/tap/mcp-data-platform
Claude Code CLI
claude mcp add mcp-data-platform -- mcp-data-platform
Docker
docker pull ghcr.io/txn2/mcp-data-platform:v1.57.5
Verification
All release artifacts are signed with Cosign. Verify with:
cosign verify-blob --bundle mcp-data-platform_1.57.5_linux_amd64.tar.gz.sigstore.json \
mcp-data-platform_1.57.5_linux_amd64.tar.gz