
fix remote MCP servers stuck in starting when upstream returns errors#4844

Open
slyt3 wants to merge 3 commits into stacklok:main from slyt3:fix/remote-server-unhealhty-on-errors

Conversation

@slyt3
Contributor

@slyt3 slyt3 commented Apr 15, 2026

When a remote server keeps returning HTTP 500, toolhive just leaves the workload stuck in
starting forever and never tells the user anything is wrong.

Turns out there were two bugs causing this:

  1. Health checks for remote workloads were off by default — you had to set
    TOOLHIVE_REMOTE_HEALTHCHECKS=true manually, which most people would never do.
  2. Even with that set, monitorHealth was skipping checks until the server finished
    initializing — but a server returning 500 never initializes, so it just skips forever.

So the workload sits in starting indefinitely with no way to know something is broken
unless you dig into debug logs.

What I changed:

  • Removed the initialization guard so health checks always run
  • Enabled health checks for remote workloads unconditionally, removed the env var gate
  • After enough consecutive failures the workload goes to unhealthy and shows the reason in thv list
  • Unhealthy remote workloads now show in thv list without needing -a

Fixes #4459

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Manual testing (describe below)

Added TestTransparentProxy_RemoteServerFailure_HTTP500 — runs a proxy against a server
returning 500 without ever initializing it, confirms health check fires and proxy shuts down.
Passes.

Changes

| File | Change |
| --- | --- |
| pkg/transport/proxy/transparent/transparent_proxy.go | Remove serverInitialized() guard in monitorHealth |
| pkg/transport/http.go | Always enable health checks; remove dead shouldEnableHealthCheck helper |
| pkg/workloads/manager.go | Propagate StatusContext; show unhealthy without -a |
| cmd/thv/app/list.go | Print failure context in thv list status column |
| pkg/transport/proxy/transparent/transparent_test.go | Add regression test for HTTP 500 before initialization |
| pkg/transport/http_test.go | Update test to reflect always-enabled behavior |
| test/e2e/health_check_zombie_test.go | Remove no-op TOOLHIVE_REMOTE_HEALTHCHECKS=true |
| test/e2e/stateless_proxy_test.go | Remove no-op TOOLHIVE_REMOTE_HEALTHCHECKS=true |
| docs/arch/03-transport-architecture.md | Remove outdated env var docs |

Does this introduce a user-facing change?

Yes. Remote servers that keep failing now show as unhealthy in thv list with the reason,
instead of staying stuck in starting with nothing shown to the user.

Large PR Justification

The line count is inflated by line-ending normalization in transparent_test.go. Running gci on Windows wrote LF endings to a file that previously had CRLF endings, causing git to report ~1800 changed lines when the actual code changes are fewer than 50 lines.

The real content changes in this PR are:

  • Proxy stays alive after health check failures on remote servers (instead of shutting down)
  • Auto-recovery: status resets to running when the upstream server comes back
  • TOOLHIVE_REMOTE_HEALTHCHECKS env var removed (health checks always enabled)
  • Tests updated to assert stay-alive and recovery behavior

These changes cannot be split further without breaking the transport interface, the mock, and the tests simultaneously.

Contributor

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Apr 15, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6181e8215d


Comment thread: pkg/transport/proxy/transparent/transparent_proxy.go
@slyt3 slyt3 force-pushed the fix/remote-server-unhealhty-on-errors branch from 6181e82 to dd528ec Compare April 15, 2026 11:34
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 15, 2026
@amirejaz
Contributor

amirejaz commented Apr 15, 2026

Review: Regression risk + proposed approach

The fix for the stuck-in-`starting` bug is correct and the removal of `TOOLHIVE_REMOTE_HEALTHCHECKS` is the right call. However, the current implementation introduces a regression for remote MCP servers with a sleep/awake lifecycle (e.g. scale-to-zero serverless backends): after ~45s of downtime (3 failures × 10s interval + 5s retry), the proxy self-terminates.

Root cause of the regression

When `handleHealthCheckFailure` exhausts retries, it calls both `onHealthCheckFailed()` (sets status → `unhealthy`) and `p.Stop()` (kills the listener, exits `monitorHealth`). For a temporarily unavailable remote server there is nothing left to observe recovery — the proxy is dead.

Proposed fix: soft-unhealthy with auto-recovery for remote workloads

The key insight: ToolHive does not control the lifecycle of a remote server the way it controls a container. Stopping the proxy because a remote server is temporarily unavailable is too aggressive.

`handleHealthCheckFailure` — branch on `isRemote`:

```go
if p.onHealthCheckFailed != nil {
	p.onHealthCheckFailed()
}
if p.isRemote {
	// Stay in degraded mode, keep monitoring for recovery
	return consecutiveFailures, true // shouldContinue = true, no Stop()
}
// Local container: we own the lifecycle, stop it
if err := p.Stop(ctx); err != nil { ... }
return consecutiveFailures, false
```

`monitorHealth` — fire a recovery callback when health returns:

```go
if consecutiveFailures >= healthCheckRetryCount && p.onHealthCheckRecovered != nil {
	p.onHealthCheckRecovered()
}
consecutiveFailures = 0
```

`runner.go` — wire both callbacks:

```go
transportHandler.SetOnHealthCheckFailed(func() {
	r.statusManager.SetWorkloadStatus(ctx, r.Config.BaseName, rt.WorkloadStatusUnhealthy, "Server stopped responding")
})
transportHandler.SetOnHealthCheckRecovered(func() {
	r.statusManager.SetWorkloadStatus(ctx, r.Config.BaseName, rt.WorkloadStatusRunning, "")
})
```

`runner.go` — extend the existing unauthenticated status guard to also preserve `unhealthy`, preventing `waitForInitializeSuccess` from racing with the health check loop:

```go
if currentWorkload.Status == rt.WorkloadStatusUnauthenticated ||
	currentWorkload.Status == rt.WorkloadStatusUnhealthy {
	// health checks drive this state, don't overwrite
} else {
	SetWorkloadStatus(running)
}
```

Resulting state machine for remote workloads

| Scenario | Behaviour |
| --- | --- |
| Server returns 500 from start (original bug) | Marked `unhealthy` in ~45s, stays there, visible in `thv list` |
| Server temporarily unavailable (e.g. scale-to-zero cold start) | Marked `unhealthy` during downtime, auto-recovers when it comes back, no manual restart |
| Permanently broken server | Stays `unhealthy` indefinitely; user sees it in `thv list` and runs `thv stop` manually |
| Local container workload | Unchanged: proxy stops on failure (ToolHive controls the container lifecycle) |

Why `TOOLHIVE_REMOTE_HEALTHCHECKS` stays removed

The env var existed solely to prevent health checks from killing the proxy for temporarily unavailable servers. With auto-recovery, that reason is gone. Health checks are now safe to enable unconditionally for remote workloads: they catch broken servers fast and survive temporary downtime transparently.

Contributor

@amirejaz amirejaz left a comment


added a comment

@slyt3 slyt3 force-pushed the fix/remote-server-unhealhty-on-errors branch from dd528ec to 49d6979 Compare April 16, 2026 10:30
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 16, 2026
@slyt3
Contributor Author

slyt3 commented Apr 16, 2026

@amirejaz Yeah you're right, stopping the proxy was too aggressive. Fixed it so remote proxies stay alive after failures and auto-recover when the server comes back, instead of dying and requiring a manual restart. Also wired the recovery callback in runner.go, extended the status guard to cover the unhealthy state, and removed `TOOLHIVE_REMOTE_HEALTHCHECKS`; the only reason it existed was to avoid the proxy dying on temporary outages, which is now handled properly.

When a remote server consistently returns HTTP 500, toolhive was leaving the workload in a starting state indefinitely with no feedback to the user.

Two bugs combined to cause this:

1. Health checks were disabled for remote workloads by default (gated behind TOOLHIVE_REMOTE_HEALTHCHECKS=true)

2. Even when enabled, monitorHealth skipped checks until the server had been initialized, but a server returning 500 never triggers initialization, so checks were skipped forever

Removed the initialization guard so health checks run unconditionally, and health checks are now always enabled for remote workloads. After a threshold of consecutive failures the workload transitions to unhealthy and the error surfaces in thv list output.

Also show unhealthy remote workloads in thv list without -a, since the whole point is letting the user know something is wrong.
@codecov

codecov bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 26.47059% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.12%. Comparing base (e54e6ae) to head (cb4b63e).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/runner/runner.go | 0.00% | 12 Missing ⚠️ |
| ...g/transport/proxy/transparent/transparent_proxy.go | 53.84% | 4 Missing and 2 partials ⚠️ |
| pkg/workloads/manager.go | 0.00% | 4 Missing ⚠️ |
| pkg/transport/http.go | 50.00% | 2 Missing ⚠️ |
| pkg/transport/stdio.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4844      +/-   ##
==========================================
- Coverage   69.14%   69.12%   -0.03%     
==========================================
  Files         531      531              
  Lines       55183    55202      +19     
==========================================
+ Hits        38156    38157       +1     
- Misses      14106    14121      +15     
- Partials     2921     2924       +3     

☔ View full report in Codecov by Sentry.

@slyt3 slyt3 force-pushed the fix/remote-server-unhealhty-on-errors branch from 49d6979 to e4f6532 Compare April 16, 2026 10:36
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 16, 2026
@github-actions github-actions bot removed the size/M Medium PR: 300-599 lines changed label Apr 16, 2026
Contributor

@github-actions github-actions bot left a comment


Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 16, 2026
@github-actions github-actions bot dismissed their stale review April 16, 2026 11:08

Large PR justification has been provided. Thank you!

@github-actions
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.


Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remote MCP servers stuck in "starting" state when upstream returns errors

2 participants