fix remote MCP servers stuck in starting when upstream returns errors#4844
slyt3 wants to merge 3 commits into stacklok:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6181e8215d
Force-pushed 6181e82 to dd528ec
**Review: Regression risk + proposed approach**

The fix for the stuck-in-`starting` bug is correct, and the removal of `TOOLHIVE_REMOTE_HEALTHCHECKS` is the right call. However, the current implementation introduces a regression for remote MCP servers with a sleep/awake lifecycle (e.g. scale-to-zero serverless backends): after ~45s of downtime (3 failures × 10s interval + 5s retry), the proxy self-terminates.

**Root cause of the regression**

When `handleHealthCheckFailure` exhausts its retries, it calls both `onHealthCheckFailed()` (sets status to `unhealthy`) and `p.Stop()` (kills the listener and exits `monitorHealth`). For a temporarily unavailable remote server there is nothing left to observe recovery; the proxy is dead.

**Proposed fix: soft-unhealthy with auto-recovery for remote workloads**

The key insight: ToolHive does not control the lifecycle of a remote server the way it controls a container. Stopping the proxy because a remote server is temporarily unavailable is too aggressive.

- `handleHealthCheckFailure`: branch on `isRemote`, so remote workloads are marked `unhealthy` without stopping the proxy
- `monitorHealth`: fire a recovery callback when health returns
- `runner.go`: wire both callbacks
- `runner.go`: extend the existing unauthenticated status guard to also preserve `unhealthy`, preventing `waitForInitializeSuccess` from racing with the health check loop

**Resulting state machine for remote workloads**

`starting` → `running` on successful initialization; `running` → `unhealthy` after the failure threshold (the proxy stays alive); `unhealthy` → `running` when the upstream recovers.
**Why `TOOLHIVE_REMOTE_HEALTHCHECKS` stays removed**

The env var existed solely to prevent health checks from killing the proxy for temporarily unavailable servers. With auto-recovery, that reason is gone. Health checks are now safe to enable unconditionally for remote workloads: they catch broken servers fast and survive temporary downtime transparently.
Force-pushed dd528ec to 49d6979
@amirejaz Yeah, you're right, stopping the proxy was too aggressive. Fixed it so remote proxies stay alive after failures and auto-recover when the server comes back, instead of dying and requiring a manual restart. Also wired the recovery callback in runner.go and extended the status guard to cover `unhealthy`.
When a remote server consistently returns HTTP 500, toolhive was leaving the workload in a `starting` state indefinitely with no feedback to the user. Two bugs combined to cause this:

1. Health checks were disabled for remote workloads by default (gated behind `TOOLHIVE_REMOTE_HEALTHCHECKS=true`).
2. Even when enabled, `monitorHealth` skipped checks until the server had been initialized, but a server returning 500 never triggers initialization, so checks were skipped forever.

Removed the initialization guard so health checks run unconditionally, and always enable health checks for remote workloads. After a threshold of consecutive failures, the workload transitions to `unhealthy` and the error surfaces in `thv list` output. Also show unhealthy remote workloads in `thv list` without `-a`, since the whole point is letting the user know something is wrong.
**Codecov Report**

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             main    #4844      +/- ##
==========================================
- Coverage   69.14%   69.12%   -0.03%
==========================================
  Files         531      531
  Lines       55183    55202      +19
==========================================
+ Hits        38156    38157       +1
- Misses      14106    14121      +15
- Partials     2921     2924       +3

☔ View full report in Codecov by Sentry.
Force-pushed 49d6979 to e4f6532
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Large PR justification has been provided. Thank you!
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.
When a remote server keeps returning HTTP 500, toolhive just leaves the workload stuck in `starting` forever and never tells the user anything is wrong. Turns out there were two bugs causing this:

- Health checks for remote workloads were disabled unless you set `TOOLHIVE_REMOTE_HEALTHCHECKS=true` manually, which most people would never do.
- `monitorHealth` was skipping checks until the server finished initializing, but a server returning 500 never initializes, so it just skips forever.

So the workload sits in `starting` indefinitely with no way to know something is broken unless you dig into debug logs.

What I changed:

- After the failure threshold, the workload transitions to `unhealthy` and shows the reason in `thv list`
- Unhealthy remote workloads appear in `thv list` without needing `-a`

Fixes #4459
Type of change
Test plan
Tests pass (`task test`).

Added `TestTransparentProxy_RemoteServerFailure_HTTP500`: runs a proxy against a server returning 500 without ever initializing it, and confirms the health check fires and the proxy shuts down. Passes.
Changes
- `pkg/transport/proxy/transparent/transparent_proxy.go`: removed the `serverInitialized()` guard in `monitorHealth`
- `pkg/transport/http.go`: `shouldEnableHealthCheck` helper
- `pkg/workloads/manager.go`: `StatusContext`; show unhealthy without `-a`
- `cmd/thv/app/list.go`: `thv list` status column
- `pkg/transport/proxy/transparent/transparent_test.go`
- `pkg/transport/http_test.go`
- `test/e2e/health_check_zombie_test.go`: removed `TOOLHIVE_REMOTE_HEALTHCHECKS=true`
- `test/e2e/stateless_proxy_test.go`: removed `TOOLHIVE_REMOTE_HEALTHCHECKS=true`
- `docs/arch/03-transport-architecture.md`

Does this introduce a user-facing change?
Yes. Remote servers that keep failing now show as `unhealthy` in `thv list` with the reason, instead of staying stuck in `starting` with nothing shown to the user.

Large PR Justification
The line count is inflated by line-ending normalization in `transparent_test.go`. Running `gci` on Windows wrote LF endings to a file that previously had CRLF endings, causing git to report ~1800 changed lines when the actual code changes are fewer than 50 lines.
The real content changes in this PR are:
- Proxy stays alive after health check failures on remote servers (instead of shutting down)
- Auto-recovery: status resets to `running` when the upstream server comes back
- `TOOLHIVE_REMOTE_HEALTHCHECKS` env var removed (health checks always enabled)
- Tests updated to assert the stay-alive and recovery behavior

These changes cannot be split further without breaking the transport interface, the mock, and the tests simultaneously.