Bug description
When the OAuth token refresh endpoint returns any 4xx response, MonitoredTokenSource immediately marks the remote MCP workload as unauthenticated and stops monitoring. Recovery requires manual re-authentication. No further refresh attempts are made.
This is too aggressive. Many real-world 4xx responses are transient infrastructure-level errors that resolve on their own:
WAF / firewall blocks: a Cloudflare or similar WAF returns 403 with an HTML body when the refresh request comes from a non-allowlisted IP. A momentary VPN flap that routes the request through the user's home IP triggers a block; the next refresh would succeed once the tunnel reconnects.
Rate limiting: some OAuth servers return 429 (or 403) for rate limiting. The condition clears after a short cooldown.
Transient bad-config deploys: an OAuth server briefly returns 400 for a few minutes before rolling back. ToolHive marks the workload permanently dead for a blip.
The retry infrastructure already exists in MonitoredTokenSource (added in #4513 for 5xx errors). The root cause is that isTransientNetworkError() in pkg/auth/monitored_token_source.go treats every non-5xx *oauth2.RetrieveError as permanent, regardless of whether the response actually contained a structured RFC 6749 error code.
Steps to reproduce
Run a remote MCP server with OAuth authentication: thv run <remote-url> --remote-auth ....
Arrange for the next token refresh to hit a 4xx response without a parseable RFC 6749 error field. The simplest controlled reproduction is a mock token endpoint returning 403 with Content-Type: text/html and an HTML body. In practice this happens organically when a VPN drops momentarily and the request egresses through a WAF.
Inspect the workload status file (~/Library/Application Support/toolhive/statuses/<server>.json on macOS). Observe status: unauthenticated and status_context containing the raw 4xx body, with no preceding retry attempts in the logs.
The bug manifests during real-world transient outages and infrastructure events. The code path can be verified by reading isTransientNetworkError() and Token() in pkg/auth/monitored_token_source.go.
Expected behavior
A 4xx response without a populated RFC 6749 error field (e.g. an HTML page from a WAF) should be classified as transient and enter the existing retry loop. Only 4xx responses where the OAuth server returned a structured error code (invalid_grant, invalid_client, etc.) should be treated as permanent — those are the cases where the protocol explicitly says the credentials are bad.
Actual behavior
All 4xx responses are treated as permanent. The workload is immediately marked unauthenticated and the monitor stops.
Example status produced by a Cloudflare WAF block:

{
  "status": "unauthenticated",
  "status_context": "Token retrieval failed: oauth2: cannot fetch token: 403 Forbidden\nResponse: <html>...Cloudflare Firewall Block...</html>"
}
Subsequent requests through the proxy fail until the workload is manually re-authenticated, even after the underlying issue has cleared.
Environment (if relevant)
ToolHive v0.26.1 (the affected code path is unchanged on main at the time of filing)
Additional context
Classification rule that fixes the misclassification
The golang.org/x/oauth2 library populates RetrieveError.ErrorCode only when the response body is parseable JSON containing an RFC 6749 error field. An empty ErrorCode therefore signals an infrastructure-level response (HTML page from a WAF, CDN, or reverse proxy), not an OAuth protocol failure. This gives a clean structural signal for the classification:
4xx with an empty ErrorCode: transient (infrastructure error)
4xx with a populated ErrorCode: permanent (OAuth told us specifically what's wrong)
Scope
This issue is specifically about misclassification of transient infrastructure errors as permanent. A more architectural change — keeping the monitor alive across token expiry intervals so workloads recover from outages that span longer than a single refresh attempt — would require a new "transiently failing" workload state, status file representation, and monitor lifecycle changes. That's a separable conversation. The classification fix is independently valuable for the common short-blip cases.
Relevant prior work
Wire authserver DCR resolver and add structured logs #5044 (in flight) emits a DCR remediation hint on permanent 4xx. Sharpening the classification removes false triggers — the hint should fire when DCR credentials are actually stale, not when a WAF blocks the request.