Skip to content

fix(mcp-sse): decouple /health liveness from upstream readiness#151

Merged
jack-arturo merged 2 commits intoverygoodplugins:mainfrom
George-RD:fix/sse-health-liveness-probe
Apr 24, 2026
Merged

fix(mcp-sse): decouple /health liveness from upstream readiness#151
jack-arturo merged 2 commits intoverygoodplugins:mainfrom
George-RD:fix/sse-health-liveness-probe

Conversation

@George-RD
Copy link
Copy Markdown
Contributor

Summary

PR #114's /health endpoint returns HTTP 503 whenever upstream AutoMem reports anything other than "healthy" — including any non-zero FalkorDB/Qdrant sync drift. Railway treats 503 as unavailable and blocks the deploy, so the SSE bridge becomes undeployable during brief upstream degradation even though the Node process itself is perfectly able to serve traffic.

In an eventually-consistent system, drift between FalkorDB writes and Qdrant embedding is normal under write load. Gating container liveness on zero drift causes the bridge to fail to deploy whenever writes are happening — defeating the point of a bridge (which should be able to serve errors upstream when its backend is unavailable).

This PR:

  • GET /health now always returns 200 when the Node process is able to serve HTTP. The response body still carries upstream status (status, upstream, upstream_error, upstream_details) for observability, so the richer info the PR fix: harden MCP bridge resilience, adopt stateless transport, and update cross-client docs #114 change wanted to expose is still there — just not as HTTP status.
  • GET /ready (new) keeps the old strict behavior: 200 iff upstream is healthy, 503 otherwise. For orchestrators that explicitly want to gate traffic on upstream availability (e.g. Kubernetes readiness probes), this is the right endpoint.
  • Railway's healthcheckPath should remain /health.

Test plan

  • Existing health tests updated — the "upstream unreachable" test now asserts 200 + degraded body (was 503)
  • New tests for /ready: 200 when upstream healthy, 503 when unreachable
  • All 13 node tests pass locally (node --test test/server.test.js)

Context

Surfaced while trying to ship the release 0.15.1 deployment on Railway — build succeeded, but every deploy was killed by healthcheck failing with 503, even though the Node process was up and the upstream was only showing minor sync drift (17 in-flight memories out of 29,012 — 0.06%, and self-healing). The same failure mode will happen on any deployment where upstream is temporarily degraded for any reason, which this PR addresses.

🤖 Generated with Claude Code

The /health endpoint introduced in verygoodplugins#114 returned HTTP 503 whenever the
upstream AutoMem service reported anything other than "healthy" (e.g.
FalkorDB/Qdrant disconnected, sync drift). Railway treats 503 as
unavailable and blocks the deploy — so the SSE bridge became un-deployable
any time its upstream was even partially degraded, even though the bridge
itself was perfectly able to serve traffic and return useful errors.

Split liveness from readiness:

- GET /health now always returns 200 when the Node process is able to
  serve HTTP. Upstream status is still reported in the response body
  (status, upstream, upstream_error, upstream_details) for observability.
- GET /ready keeps the strict behavior: 200 iff upstream is healthy,
  503 otherwise. Use this for orchestrators that should gate traffic on
  upstream availability.

Railway's healthcheckPath should remain /health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 13:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts the MCP SSE bridge’s health semantics so container liveness is not blocked by transient upstream AutoMem degradation, while still providing a strict upstream-gated readiness signal for orchestrators.

Changes:

  • Changed GET /health to always return HTTP 200 while still reporting upstream status in the JSON body.
  • Added GET /ready to return 200 only when upstream is healthy, otherwise 503.
  • Updated and expanded Node tests to cover the new /health and /ready behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
mcp-sse-server/server.js Makes /health always 200 and introduces /ready with upstream-gated status codes.
mcp-sse-server/test/server.test.js Updates existing /health test expectations and adds /ready tests for healthy/unreachable upstream.

Comment thread mcp-sse-server/server.js
Comment thread mcp-sse-server/server.js Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0429796502

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread mcp-sse-server/server.js Outdated
Address Copilot review on PR verygoodplugins#151:

- /health: replace await with void so the first request never blocks on
  an upstream probe. The body still reports upstream/checked_at for
  observability; status is always 200 while the process can serve HTTP.
- /ready: drop the checked_at guard so readiness always reflects current
  upstream state. healthProbePromise dedups concurrent hits, so upstream
  load is bounded by orchestrator cadence rather than per-request fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jack-arturo
Copy link
Copy Markdown
Member

Good catch @George-RD , thanks!!

@jack-arturo jack-arturo merged commit 5bcfb8b into verygoodplugins:main Apr 24, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants