Skip to content

Retry OAuth token refresh on infrastructure 4xx#5170

Open
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix/retry-4xx-token-refresh-5169
Open

Retry OAuth token refresh on infrastructure 4xx#5170
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix/retry-4xx-token-refresh-5169

Conversation

@gkatz2
Copy link
Copy Markdown
Contributor

@gkatz2 gkatz2 commented May 3, 2026

Summary

  • The OAuth token refresh endpoint returns 4xx for infrastructure events (WAF/firewall blocks, rate limits, transient bad-config deploys) as well as for OAuth protocol failures. Today every 4xx is treated as a permanent auth failure, killing the workload until manual re-authentication. A momentary VPN flap that routes a refresh request through a non-allowlisted IP is enough to leave the workload dead.
  • Treat 4xx responses that lack a structured RFC 6749 error code as transient, so they enter the existing retry loop added in Retry OAuth token refresh on server errors #4513. Only 4xx with a populated error code (invalid_grant, invalid_client, etc.) remains permanent. 429 is always transient per HTTP standard.

Fixes #5169

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Added TestIsTransientNetworkError_AgainstRealOAuth2Library, which exercises the classification function end-to-end through golang.org/x/oauth2.Config.TokenSource against an httptest.Server returning each response shape (HTML 403, HTML 401, JSON invalid_grant, JSON invalid_client, 429, 503). This pins the assumption that RetrieveError.ErrorCode is populated only for parseable RFC 6749 error responses, so a future oauth2 upgrade that changes that behavior would surface here, not in production.

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Does this introduce a user-facing change?

Remote MCP servers with OAuth authentication now survive transient 4xx token-endpoint events (WAF blocks, rate limits, brief upstream config errors) instead of permanently going dark on the first occurrence.

Special notes for reviewers

Classification rule

golang.org/x/oauth2 populates RetrieveError.ErrorCode only when the response body is parseable JSON containing an RFC 6749 error field. An empty ErrorCode therefore signals an infrastructure-level response (HTML page from a WAF, CDN, or reverse proxy), not an OAuth protocol failure. The new classifyOAuthRetrieveError helper applies this rule:

  • 5xx: transient (existing behavior, from Retry OAuth token refresh on server errors #4513)
  • 429: transient regardless of body (HTTP standard)
  • 4xx with empty ErrorCode: transient (infrastructure error)
  • 4xx with populated ErrorCode: permanent (OAuth protocol failure)

Scope

Classification only — no changes to retry parameters, monitor lifecycle, or workload state machine. Outages that span longer than a single refresh attempt (e.g. a multi-hour VPN outage that crosses a token expiry boundary) still result in the workload being marked unauthenticated; that's a separable architectural change requiring a new "transiently failing" state and is not part of this fix.

Interaction with #5044

PR #5044 (in flight) emits a DCR remediation hint when classification returns permanent 4xx. This change reduces false triggers of that hint: the hint will fire for invalid_grant/invalid_client (where DCR creds genuinely may be stale) but no longer for WAF blocks (where the hint would have been misleading). Pure mechanical merge conflict on the same file, no semantic conflict.

Generated with Claude Code

The OAuth token refresh endpoint returns 4xx for infrastructure
events (WAF/firewall blocks, rate limits, transient bad-config
deploys) as well as for OAuth protocol failures. Today every 4xx
is treated as a permanent auth failure, killing the workload
until manual re-authentication. A momentary VPN flap that routes
a refresh request through a non-allowlisted IP is enough to
leave the workload dead.

Treat 4xx responses that lack a structured RFC 6749 error code
as transient; only 4xx with a populated error code
(invalid_grant, invalid_client, etc.) remains permanent. 429 is
always transient per HTTP standard.

Fixes stacklok#5169

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label May 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.53%. Comparing base (8c90184) to head (9de8e4f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5170      +/-   ##
==========================================
- Coverage   67.53%   67.53%   -0.01%     
==========================================
  Files         601      601              
  Lines       61093    61109      +16     
==========================================
+ Hits        41262    41267       +5     
- Misses      16714    16725      +11     
  Partials     3117     3117              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Token refresh treats transient 4xx as permanent auth failures

1 participant