Bug description
When the OAuth token refresh endpoint returns any 4xx response, MonitoredTokenSource immediately marks the remote MCP workload as unauthenticated and stops monitoring. Recovery requires manual re-authentication. No further refresh attempts are made.
This is too aggressive. Many real-world 4xx responses are transient infrastructure-level errors that resolve on their own:
WAF / firewall blocks: a Cloudflare or similar WAF returns 403 with an HTML body when the refresh request comes from a non-allowlisted IP. A momentary VPN flap that routes the request through the user's home IP triggers a block; the next refresh would succeed once the tunnel reconnects.
Rate limiting: some OAuth servers return 429 (or 403) for rate limiting. The condition clears after a short cooldown.
Transient bad-config deploys: an OAuth server briefly returns 400 for a few minutes before rolling back. ToolHive marks the workload permanently dead for a blip.
The retry infrastructure already exists in MonitoredTokenSource (added in #4513 for 5xx errors). The root cause is that isTransientNetworkError() in pkg/auth/monitored_token_source.go treats every non-5xx *oauth2.RetrieveError as permanent, regardless of whether the response actually contained a structured RFC 6749 error code.
Steps to reproduce
Run a remote MCP server with OAuth authentication: thv run <remote-url> --remote-auth ....
Arrange for the next token refresh to hit a 4xx response without a parseable RFC 6749 error field. The simplest controlled reproduction is a mock token endpoint returning 403 with Content-Type: text/html and an HTML body. In practice this happens organically when a VPN drops momentarily and the request egresses through a WAF.
Inspect the workload status file (~/Library/Application Support/toolhive/statuses/<server>.json on macOS). Observe status: unauthenticated and status_context containing the raw 4xx body, with no preceding retry attempts in the logs.
The bug manifests during real-world transient outages and infrastructure events. The code path can be verified by reading isTransientNetworkError() and Token() in pkg/auth/monitored_token_source.go.
Expected behavior
A 4xx response without a populated RFC 6749 error field (e.g. an HTML page from a WAF) should be classified as transient and enter the existing retry loop. Only 4xx responses where the OAuth server returned a structured error code (invalid_grant, invalid_client, etc.) should be treated as permanent — those are the cases where the protocol explicitly says the credentials are bad.
Actual behavior
All 4xx responses are treated as permanent. The workload is immediately marked unauthenticated and the monitor stops.
Example status produced by a Cloudflare WAF block:

{
  "status": "unauthenticated",
  "status_context": "Token retrieval failed: oauth2: cannot fetch token: 403 Forbidden\nResponse: <html>...Cloudflare Firewall Block...</html>"
}
Subsequent requests through the proxy fail until the workload is manually re-authenticated, even after the underlying issue has cleared.
Environment (if relevant)
ToolHive v0.26.1 (the affected code path is unchanged on main at the time of filing)
Additional context
Classification rule that fixes the misclassification
The golang.org/x/oauth2 library populates RetrieveError.ErrorCode only when the response body is parseable JSON containing an RFC 6749 error field. An empty ErrorCode therefore signals an infrastructure-level response (HTML page from a WAF, CDN, or reverse proxy), not an OAuth protocol failure. This gives a clean structural signal for the classification:
4xx with an empty ErrorCode: transient (infrastructure error)
4xx with a populated ErrorCode: permanent (OAuth told us specifically what's wrong)
Scope
This issue is specifically about misclassification of transient infrastructure errors as permanent. A more architectural change — keeping the monitor alive across token expiry intervals so workloads recover from outages that span longer than a single refresh attempt — would require a new "transiently failing" workload state, status file representation, and monitor lifecycle changes. That's a separable conversation. The classification fix is independently valuable for the common short-blip cases.
Relevant prior work
Wire authserver DCR resolver and add structured logs #5044 (in flight) emits a DCR remediation hint on permanent 4xx. Sharpening the classification removes false triggers — the hint should fire when DCR credentials are actually stale, not when a WAF blocks the request.