
fix remote MCP servers stuck in starting when upstream returns errors#4844

Open
slyt3 wants to merge 3 commits into stacklok:main from slyt3:fix/remote-server-unhealhty-on-errors

Conversation

@slyt3
Contributor

@slyt3 slyt3 commented Apr 15, 2026

When a remote server keeps returning HTTP 500, toolhive just leaves the workload stuck in
starting forever and never tells the user anything is wrong.

Turns out there were two bugs causing this:

  1. Health checks for remote workloads were off by default — you had to set
    TOOLHIVE_REMOTE_HEALTHCHECKS=true manually, which most people would never do.
  2. Even with that set, monitorHealth was skipping checks until the server finished
    initializing — but a server returning 500 never initializes, so it just skips forever.

So the workload sits in starting indefinitely with no way to know something is broken
unless you dig into debug logs.

What I changed:

  • Removed the initialization guard so health checks always run
  • Enabled health checks for remote workloads unconditionally, removed the env var gate
  • After enough consecutive failures the workload goes to unhealthy and shows the reason in thv list
  • Unhealthy remote workloads now show in thv list without needing -a

Fixes #4459

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Manual testing (describe below)

Added TestTransparentProxy_RemoteServerFailure_HTTP500 — runs a proxy against a server
returning 500 without ever initializing it, confirms health check fires and proxy shuts down.
Passes.

Changes

| File | Change |
| --- | --- |
| pkg/transport/proxy/transparent/transparent_proxy.go | Remove serverInitialized() guard in monitorHealth |
| pkg/transport/http.go | Always enable health checks; remove dead shouldEnableHealthCheck helper |
| pkg/workloads/manager.go | Propagate StatusContext; show unhealthy without -a |
| cmd/thv/app/list.go | Print failure context in thv list status column |
| pkg/transport/proxy/transparent/transparent_test.go | Add regression test for HTTP 500 before initialization |
| pkg/transport/http_test.go | Update test to reflect always-enabled behavior |
| test/e2e/health_check_zombie_test.go | Remove no-op TOOLHIVE_REMOTE_HEALTHCHECKS=true |
| test/e2e/stateless_proxy_test.go | Remove no-op TOOLHIVE_REMOTE_HEALTHCHECKS=true |
| docs/arch/03-transport-architecture.md | Remove outdated env var docs |

Does this introduce a user-facing change?

Yes. Remote servers that keep failing now show as unhealthy in thv list with the reason,
instead of staying stuck in starting with nothing shown to the user.

Large PR Justification

The line count is inflated by line-ending normalization in transparent_test.go. Running gci on Windows wrote LF endings to a file that previously had CRLF endings, causing git to report ~1800 changed lines when the actual code changes are fewer than 50 lines.

The real content changes in this PR are:

  • Proxy stays alive after health check failures on remote servers (instead of shutting down)
  • Auto-recovery: status resets to running when the upstream server comes back
  • TOOLHIVE_REMOTE_HEALTHCHECKS env var removed (health checks always enabled)
  • Tests updated to assert stay-alive and recovery behavior

These changes cannot be split further without breaking the transport interface, the mock, and the tests simultaneously.

Contributor

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Apr 15, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6181e8215d


Comment thread: pkg/transport/proxy/transparent/transparent_proxy.go
@slyt3 slyt3 force-pushed the fix/remote-server-unhealhty-on-errors branch from 6181e82 to dd528ec Compare April 15, 2026 11:34
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 15, 2026
@amirejaz
Contributor

amirejaz commented Apr 15, 2026

Review: Regression risk + proposed approach

The fix for the stuck-in-`starting` bug is correct and the removal of `TOOLHIVE_REMOTE_HEALTHCHECKS` is the right call. However, the current implementation introduces a regression for remote MCP servers with a sleep/awake lifecycle (e.g. scale-to-zero serverless backends): after ~45s of downtime (3 failures × 10s interval + 5s retry), the proxy self-terminates.

Root cause of the regression

When `handleHealthCheckFailure` exhausts retries, it calls both `onHealthCheckFailed()` (sets status → `unhealthy`) and `p.Stop()` (kills the listener, exits `monitorHealth`). For a temporarily unavailable remote server there is nothing left to observe recovery — the proxy is dead.

Proposed fix: soft-unhealthy with auto-recovery for remote workloads

The key insight: ToolHive does not control the lifecycle of a remote server the way it controls a container. Stopping the proxy because a remote server is temporarily unavailable is too aggressive.

`handleHealthCheckFailure` — branch on `isRemote`:

```go
if p.onHealthCheckFailed != nil {
	p.onHealthCheckFailed()
}
if p.isRemote {
	// Stay in degraded mode, keep monitoring for recovery
	return consecutiveFailures, true // shouldContinue = true, no Stop()
}
// Local container: we own the lifecycle, stop it
if err := p.Stop(ctx); err != nil { ... }
return consecutiveFailures, false
```

`monitorHealth` — fire a recovery callback when health returns:

```go
if consecutiveFailures >= healthCheckRetryCount && p.onHealthCheckRecovered != nil {
	p.onHealthCheckRecovered()
}
consecutiveFailures = 0
```

`runner.go` — wire both callbacks:

```go
transportHandler.SetOnHealthCheckFailed(func() {
	r.statusManager.SetWorkloadStatus(ctx, r.Config.BaseName, rt.WorkloadStatusUnhealthy, "Server stopped responding")
})
transportHandler.SetOnHealthCheckRecovered(func() {
	r.statusManager.SetWorkloadStatus(ctx, r.Config.BaseName, rt.WorkloadStatusRunning, "")
})
```

`runner.go` — extend the existing unauthenticated status guard to also preserve `unhealthy`, preventing `waitForInitializeSuccess` from racing with the health check loop:

```go
if currentWorkload.Status == rt.WorkloadStatusUnauthenticated ||
	currentWorkload.Status == rt.WorkloadStatusUnhealthy {
	// health checks drive this state, don't overwrite
} else {
	SetWorkloadStatus(running)
}
```

Resulting state machine for remote workloads

| Scenario | Behaviour |
| --- | --- |
| Server returns 500 from start (original bug) | Marked `unhealthy` in ~45s, stays there, visible in `thv list` |
| Server temporarily unavailable (e.g. scale-to-zero cold start) | Marked `unhealthy` during downtime, auto-recovers when it comes back, no manual restart |
| Permanently broken server | Stays `unhealthy` indefinitely; user sees it in `thv list` and runs `thv stop` manually |
| Local container workload | Unchanged: proxy stops on failure (ToolHive controls the container lifecycle) |

Why `TOOLHIVE_REMOTE_HEALTHCHECKS` stays removed

The env var existed solely to prevent health checks from killing the proxy for temporarily unavailable servers. With auto-recovery, that reason is gone. Health checks are now safe to enable unconditionally for remote workloads: they catch broken servers fast and survive temporary downtime transparently.

Contributor

@amirejaz amirejaz left a comment


added a comment

@slyt3 slyt3 force-pushed the fix/remote-server-unhealhty-on-errors branch from dd528ec to 49d6979 Compare April 16, 2026 10:30
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 16, 2026
@slyt3
Contributor Author

slyt3 commented Apr 16, 2026

@amirejaz Yeah you're right, stopping the proxy was too aggressive. Fixed it so remote proxies stay alive after failures and auto-recover when the server comes back, instead of dying and requiring a manual restart. Also wired the recovery callback in runner.go, extended the status guard to cover the unhealthy state, and removed `TOOLHIVE_REMOTE_HEALTHCHECKS`; the only reason it existed was to avoid the proxy dying on temporary outages, which is now handled properly.

When a remote server consistently returns HTTP 500, toolhive was leaving the workload in a starting state indefinitely with no feedback to the user.

Two bugs combined to cause this:

1. Health checks were disabled for remote workloads by default (gated behind TOOLHIVE_REMOTE_HEALTHCHECKS=true)

2. Even when enabled, monitorHealth skipped checks until the server had been initialized, but a server returning 500 never triggers initialization, so checks were skipped forever

Removed the initialization guard so health checks run unconditionally, and health checks are now always enabled for remote workloads. After a threshold of consecutive failures the workload transitions to unhealthy and the error surfaces in thv list output.

Also show unhealthy remote workloads in thv list without -a, since the whole point is letting the user know something is wrong.
@codecov

codecov bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 26.47059% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.12%. Comparing base (e54e6ae) to head (cb4b63e).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/runner/runner.go | 0.00% | 12 Missing ⚠️ |
| ...g/transport/proxy/transparent/transparent_proxy.go | 53.84% | 4 Missing and 2 partials ⚠️ |
| pkg/workloads/manager.go | 0.00% | 4 Missing ⚠️ |
| pkg/transport/http.go | 50.00% | 2 Missing ⚠️ |
| pkg/transport/stdio.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4844      +/-   ##
==========================================
- Coverage   69.14%   69.12%   -0.03%     
==========================================
  Files         531      531              
  Lines       55183    55202      +19     
==========================================
+ Hits        38156    38157       +1     
- Misses      14106    14121      +15     
- Partials     2921     2924       +3     

☔ View full report in Codecov by Sentry.

@slyt3 slyt3 force-pushed the fix/remote-server-unhealhty-on-errors branch from 49d6979 to e4f6532 Compare April 16, 2026 10:36
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 16, 2026
@github-actions github-actions bot removed the size/M Medium PR: 300-599 lines changed label Apr 16, 2026
Contributor

@github-actions github-actions bot left a comment


Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 16, 2026
@github-actions github-actions bot dismissed their stale review April 16, 2026 11:08

Large PR justification has been provided. Thank you!

@github-actions
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.


Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remote MCP servers stuck in "starting" state when upstream returns errors

2 participants