Skip to content

Carry forward upstream refresh token on re-authorization#5132

Merged
jhrozek merged 10 commits into
mainfrom
issue-5047-callback-preserve-rt
May 1, 2026
Merged

Carry forward upstream refresh token on re-authorization#5132
jhrozek merged 10 commits into
mainfrom
issue-5047-callback-preserve-rt

Conversation

@jhrozek
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek commented Apr 29, 2026

Summary

This PR addresses Root Cause B of #5047: when a user re-authorizes through the auth server, the OAuth callback writes the IdP-supplied RefreshToken verbatim into a new storage.UpstreamTokens row. Real IdPs commonly omit refresh_token on a silent re-auth (Google without prompt=consent, Microsoft Entra when consent was already granted, GitHub/Okta/Auth0/Keycloak similarly). In every case the previously-issued RT remains valid at the IdP — but today the callback overwrites it with "", orphaning the prior good RT and making the next refresh attempt fail. That forces another re-auth, perpetuating the loop.

The fix mirrors the preservation pattern already in place in upstreamTokenRefresher.RefreshAndStore: if the IdP returns no RT, look up the most recent prior row for (userID, providerID) and carry the prior RT into the new row. A defensive prior.UpstreamSubject == providerSubject guard prevents carrying credentials across distinct upstream identities. Storage failures during the lookup are non-fatal — they log a warning and proceed with the empty RT, which is no worse than the current status quo.

  • New storage method UpstreamTokenStorage.GetLatestUpstreamTokensForUser(ctx, userID, providerID) keyed on user+provider rather than the primary (sessionID, providerName) key. Implemented on both Memory and Redis backends; both backends pick the highest ExpiresAt so the carry-forward decision is consistent across deployment shapes. Expired rows are intentionally returned (callers want the RT regardless of access-token expiry).
  • Callback wiring in pkg/authserver/server/handlers/callback.go extracted into maybeCarryForwardRefreshToken (gocyclo budget) and called between storageTokens construction and StoreUpstreamTokens.
  • The Redis backend uses the existing KeyTypeUserUpstream:{userID} reverse-index set (already maintained atomically by storeUpstreamTokensScript), no new Lua script.
  • End-to-end integration test against mockoidc + a thin RT-stripping reverse proxy (mockoidc unconditionally issues RTs for authorization_code and offers no API to suppress them).

Out of scope, deferred to a follow-up PR: Root Cause A — the storage rekey from (sessionID, providerID) to (userID, providerID) plus a one-time Redis migration. Trey's comment on #5047 describes the split; this PR is the smaller, surgical fix and is valuable on its own because it ends the orphaning loop even before Root Cause A lands.

Closes #5047 partially (Root Cause B; Root Cause A tracked separately).

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Specifically verified:

  • go test -race -count=1 ./pkg/authserver/... — all 13 packages pass.
  • go test -race -count=1 -v -run TestIntegration_ ./pkg/authserver/ — all 23 integration tests pass, including the new TestIntegration_Callback_PreservesRefreshTokenOnReauth and the existing TestIntegration_FullFlow_NonExpiringUpstreamToken[_Redis] and TestIntegration_MultiUpstreamChain_MixedExpiryOrderings[_Redis] from Fix non-expiring upstream token handling and storage TTL bugs #5092.
  • task test — pre-existing failure in pkg/workloads/TestDefaultManager_ListWorkloadsInGroup/workloads_with_empty_group_names is unrelated to this branch.

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

The new method on storage.UpstreamTokenStorage is an internal Go interface, not a CRD field. No CRD schema changes.

Changes

File Change
pkg/authserver/storage/types.go Adds GetLatestUpstreamTokensForUser to the UpstreamTokenStorage interface with explicit godoc on liveness contract (expired rows are returned) and cross-identity caller responsibility.
pkg/authserver/storage/memory.go Implements the new method on MemoryStorage — RLock'd iteration, deep-copy on hit, wrapped ErrNotFound on miss.
pkg/authserver/storage/redis.go Implements the new method on RedisStorage using SMembers over the existing user-upstream reverse index, single MGet, deserialized-ProviderID filtering. Tolerates dangling SMEMBERs and the null deletion tombstone.
pkg/authserver/storage/mocks/mock_storage.go Regenerated.
pkg/authserver/server/handlers/callback.go New maybeCarryForwardRefreshToken helper invoked between struct construction and StoreUpstreamTokens. UpstreamSubject guard, non-fatal on storage error.
pkg/authserver/storage/memory_test.go Table-driven test for the Memory backend (9 subtests including deep-copy assertion, latest-by-ExpiresAt, empty-input validation).
pkg/authserver/storage/redis_test.go Table-driven test for the Redis backend (9 subtests including the dangling-SMEMBER stale-index case via miniredis SAdd, two-zero-ExpiresAt corner case).
pkg/authserver/server/handlers/callback_test.go Table-driven TestCallbackHandler_RefreshTokenCarryForward consolidating the 5 callback scenarios (preserve, subject mismatch, fresh wins, no prior row, storage error).
pkg/authserver/server/handlers/helpers_test.go New withGetLatestUpstreamTokensError option mirroring the existing withStorePendingError pattern; mock checks it first.
pkg/authserver/integration_test.go TestIntegration_Callback_PreservesRefreshTokenOnReauth — full callback flow against mockoidc with an RT-stripping reverse proxy.

Does this introduce a user-facing change?

Yes. Users who previously hit a refresh-token-orphaning loop on re-authorization (most commonly with Google, Entra, Okta, GitHub, Auth0, Keycloak — all of which omit refresh_token on silent re-auth) will see refreshes succeed instead of failing into another forced re-auth. The fix is silent: no configuration changes are required.

Implementation plan

This PR was planned with Claude Code via a per-step go-expert-developer + MoE (code-reviewer + go-security-reviewer / oauth-expert / redis-valkey-advisor) workflow. Each step landed as a separate commit with independent review.

Approved plan (5 steps)
  1. Add GetLatestUpstreamTokensForUser to storage interface + stub implementations + mock regen. Keeps the diff tiny so the interface surface is reviewed independently.
  2. Implement Memory backend + 9-case table-driven test. Filters by (UserID, ProviderID), picks highest ExpiresAt, returns deep copy.
  3. Implement Redis backend + 9-case table-driven test. Uses the existing KeyTypeUserUpstream reverse set; SMembers → MGet → unmarshal → filter; tolerates stale index entries and null tombstones.
  4. Wire the OAuth callback handler + 5-case table-driven test. Helper maybeCarryForwardRefreshToken between storageTokens construction and StoreUpstreamTokens. Non-fatal on storage error. UpstreamSubject guard.
  5. End-to-end integration test against mockoidc + RT-stripping reverse proxy. Three legs: initial RT issued → strip flag on, second auth carries forward → third auth as different user gets ErrNotFound and stays empty.

Cross-project research (via MCP-mounted source): Hydra delegates upstream-token storage to its consent app and avoids the bug class. oauth2-proxy has the same blind-overwrite shape but sidesteps it via opaque-ticket session keys (orphaned rows TTL out instead of being clobbered) — their IDToken preservation pattern at pkg/providers/oidc.go:140-180 is the explicit code-level analog we mirror for RefreshToken. infratographer/identity-api is RFC-8693 only, no broker.

Special notes for reviewers

  • This is Root Cause B only. Root Cause A — the full storage rekey from (sessionID, providerID) to (userID, providerID) with a one-time Redis migration — is intentionally deferred to a follow-up PR. See Trey's plan in Upstream refresh tokens orphaned on re-authorization #5047. This fix is valuable on its own: even with Root Cause A unresolved, carrying the RT forward to the new sessionID lets the refresher succeed when the new access token expires, ending the re-auth cycle.
  • The UpstreamSubject == providerSubject guard is defense-in-depth. Given the current identity model (ResolveUser keys identity by (providerID, providerSubject)), the same internal UserID cannot legitimately have two different upstream subjects on the same provider — same-provider account-linking is a future TODO in storage/types.go. The guard exists so that future code can't accidentally violate that.
  • The integration test does not request offline_access because mockoidc's ScopesSupported rejects it as invalid_scope. mockoidc unconditionally issues RTs regardless of scope, so the regression assertion (leg 2 RT matches leg 1 RT) is unaffected — but reviewers should know the test does not exercise the realistic precondition that real IdPs gate RT issuance on offline_access.

🤖 Generated with Claude Code

Large PR Justification

  • the new integration test is the culprit

@jhrozek jhrozek requested a review from tgrunnagle April 29, 2026 15:39
@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label Apr 29, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
@github-actions github-actions Bot dismissed their stale review April 29, 2026 15:42

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 82.44275% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.40%. Comparing base (b094f75) to head (f564b0f).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
pkg/authserver/storage/redis.go 70.83% 14 Missing and 7 partials ⚠️
pkg/authserver/storage/memory.go 95.23% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5132      +/-   ##
==========================================
+ Coverage   67.31%   67.40%   +0.08%     
==========================================
  Files         598      598              
  Lines       60566    60656      +90     
==========================================
+ Hits        40768    40883     +115     
+ Misses      16706    16673      -33     
- Partials     3092     3100       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread pkg/authserver/storage/redis.go Outdated
Comment thread pkg/authserver/storage/redis.go Outdated
Copy link
Copy Markdown
Contributor

@tgrunnagle tgrunnagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Agent Consensus Review (MEDIUM findings only)

Agents consulted: oauth-flow, security, redis-storage, test-coverage, go-style, general-quality (6 specialists, parallel review)

A full consensus review surfaced 18 actionable findings. Posting only the MEDIUM-severity items here; the LOW findings are polish and didn't feel worth the inline noise on a focused fix PR.

MEDIUM Findings

# Finding Consensus Action
F1 SessionExpiresAt dropped from returned record (Memory + Redis) 10/10 Fix — see inline suggestions
F7 Missing handler test — prior row with empty RefreshToken 9/10 Add table case

Overall

The carry-forward design is sound and the implementation is correct for the documented use case. Mirroring RefreshAndStore's preservation pattern at the cross-row level (with the UpstreamSubject defense-in-depth guard) is a sensible surgical fix for Root Cause B. Scope is one logical change, the integration-test harness (rtStrippingProxy) is genuine novel infrastructure, and the explicit deferral of Root Cause A is appropriate.

The latent bug worth fixing in this PR is F1: both backends silently omit SessionExpiresAt from the returned struct in GetLatestUpstreamTokensForUser. No production caller reads it today, but it's a contract-shaped trap for any future caller that uses this method as a substitute for GetUpstreamTokens — exactly the kind of expansion the deferred Root Cause A rekey will encourage. One-line fix per backend; suggestion blocks attached.

F7 is a test gap: the production guard prior.RefreshToken != "" (callback.go:187) is currently untested. The case matters in real chains because the carry-forward chain itself can leave rows with empty RTs, so the guard is load-bearing.


Generated with Claude Code

Comment thread pkg/authserver/storage/memory.go Outdated
Comment thread pkg/authserver/storage/redis.go Outdated
Comment thread pkg/authserver/server/handlers/callback_test.go
jhrozek added 10 commits April 30, 2026 09:16
Introduces a secondary upstream-token lookup keyed on (userID, providerID)
on the UpstreamTokenStorage interface. The OAuth callback handler will use
this in a follow-on commit to preserve a previously-issued refresh token
when the upstream IdP omits refresh_token on re-authorization (e.g. Google
without prompt=consent), which today silently clobbers the prior good RT
and forces a refresh-token-orphaning loop.

This commit only adds the interface surface plus stub implementations on
both backends that return ErrNotFound. The real Memory and Redis logic, the
callback wiring, and tests land in subsequent commits so each diff stays
small and independently reviewable. The godoc explicitly calls out the
liveness contract (expired rows are returned, callers handle it) and the
cross-identity caller responsibility (verify UpstreamSubject before reusing
the prior RT) so future callers cannot misuse the lookup.

Refs #5047
Replaces the Step 1 stub on MemoryStorage with the real lookup. Iterates
the upstream-tokens map under RLock, filters by (UserID, ProviderID), and
picks the row with the highest ExpiresAt. Expired access tokens are
intentionally returned (matching the interface contract) so the OAuth
callback can carry forward a still-valid refresh token even when the
access token is dead.

Returns a defensive deep copy on hit and the same wrapped ErrNotFound
shape that GetUpstreamTokens uses on miss, so callers cannot accidentally
mutate stored state through the returned pointer.

Adds a 7-case table-driven test covering: no match, one match (with a
deep-copy mutation assertion), latest-by-ExpiresAt across three rows,
different-user/different-provider misses, expired-row tolerance, and
zero-ExpiresAt losing to non-zero.

Refs #5047
Replaces the Step 1 stub on RedisStorage with the real lookup, using the
existing KeyTypeUserUpstream:{userID} reverse-index set already maintained
atomically by storeUpstreamTokensScript. The read path is plain Go (no Lua
script) since atomicity buys nothing on a read-only path: the worst case
is reading a slightly-stale row, which is acceptable for the carry-forward
use case.

Mirrors the SMembers -> MGet -> unmarshal pattern from GetAllUpstreamTokens
and tolerates dangling SMEMBERs (per-provider key TTL-evicted, set member
lingers), nullMarker deletion tombstones, and corrupt JSON. Filters on
the deserialized ProviderID rather than parsing the key string so the
implementation survives any future key-format change. Picks the winner by
the stored unix-timestamp ExpiresAt; zero (the no-expiry sentinel) loses
to any real timestamp.

Adds input validation for empty userID/providerID matching the rest of
the type's public API. The Memory backend has the same gap and should be
fixed as a follow-up; it's left untouched here to keep this commit
scoped to the Redis backend.

Adds a 9-case table-driven unit test covering all the cases from the
Memory backend plus a stale-index-entry case (dangling SMEMBER injected
via miniredis SAdd) and a two-zero-ExpiresAt case (verifies non-strict
ordering without locking down iteration order).

Refs #5047
The OAuth callback handler now preserves a previously-issued upstream
refresh token when the IdP omits refresh_token on re-authorization. This
is common across major IdPs: Google omits unless prompt=consent and
access_type=offline are forced; Microsoft Entra omits when consent was
already granted; GitHub, Okta, Auth0, and Keycloak behave similarly. In
every case the prior RT remains valid at the IdP — but today the
callback writes the empty value verbatim, clobbering the stored RT and
forcing the next refresh attempt to fail, which puts the user in a
re-auth-orphans-RT loop.

The carry-forward logic mirrors the existing pattern in
upstreamTokenRefresher.RefreshAndStore. It uses the new storage method
GetLatestUpstreamTokensForUser to find the most recent prior row for
(user, provider) and copies the prior RefreshToken into the new row.
The prior row's UpstreamSubject must equal the current providerSubject
as defense-in-depth; this guard is dormant given the current identity
model (ResolveUser keys on (providerID, providerSubject) and same-
provider account-linking is not implemented) but protects future flows.

Storage failures during the lookup log a Warn and continue with the
empty RT — failing the callback would be a worse user experience than
the (already broken) status quo, and the next refresh attempt will
produce a clean error path the user can retry from.

The implementation lives in a private helper to keep CallbackHandler
under the gocyclo threshold; the call site is between storageTokens
construction and StoreUpstreamTokens so the in-place mutation is
visible at the point of use.

Adds 5 handler tests covering: positive carry-forward, UpstreamSubject
mismatch (no carry), fresh-RT-from-IdP wins (carry-forward
short-circuited), no-prior-row (status quo for first auth), and
storage-error-non-fatal (callback completes with empty RT instead of
returning 500).

Refs #5047
Drives the full authorize -> callback flow against mockoidc and a real
MemoryStorage so the carry-forward behavior is validated end-to-end, not
just at the gomock-handler boundary.

mockoidc unconditionally issues refresh_token on the authorization_code
grant with no API to suppress the field, so the test wires a small
reverse-proxy (rtStrippingProxy) between the auth server and mockoidc.
When the test toggles the strip flag (atomic.Bool — required, otherwise
the race detector flags the read on httptest's request goroutines), the
proxy unmarshals mockoidc's token response, deletes the refresh_token
key, and re-marshals before forwarding. The key is then absent from the
wire JSON exactly as Google, Entra, GitHub, Okta and Auth0 behave on a
silent re-authorization, which is what the underlying bug actually
reproduces in production.

Three legs:
  1. Initial authorize -> callback with RT preserved verbatim. Captures
     the RT for later comparison.
  2. Strip flag flipped on, second authorize -> callback for the same
     user. Asserts the new (different sessionID) row carries the leg-1
     RT forward.
  3. Different upstream subject -> ResolveUser mints a new internal
     user; the carry-forward lookup returns ErrNotFound and the new row
     stays empty. Confirms the carry-forward fires only when warranted.

The UpstreamSubject == providerSubject guard inside the callback is
covered by the handler-level unit test added in the previous commit;
exercising it end-to-end here would require either a contrived storage
hack or same-provider account-linking (not implemented today).

Note on offline_access: real IdPs require this scope before issuing a
refresh token, but mockoidc's ScopesSupported allowlist rejects it as
invalid_scope. The test does not request the scope as a result. This is
a fixture limitation; the carry-forward regression assertion is
unaffected because mockoidc returns the RT regardless.

Refs #5047
Both backend implementations of GetLatestUpstreamTokensForUser dropped
SessionExpiresAt from the returned struct, while the sibling readers
GetUpstreamTokens (memory) and unmarshalUpstreamTokens (redis) populate
it. The interface return type makes no "partial copy" caveat, so any
future caller doing Get-then-Store would lose the storage TTL fallback
bound that SessionExpiresAt provides on rows whose upstream IdP did not
issue an expires_in.

Today's only caller (maybeCarryForwardRefreshToken) reads only
RefreshToken and UpstreamSubject, so this is latent. Fixing now to
close the contract trap before the deferred Root Cause A rekey
introduces a caller that does treat the result as round-trippable.
The storage package had four sites doing field-by-field construction of
UpstreamTokens — three in the memory backend (defensive deep copies) and
two in the redis backend (storedUpstreamTokens int64→time.Time
conversion). Each site is a place where a future field addition can be
silently missed; the SessionExpiresAt drop in the previous commit is an
example of that class of bug.

Extract a free function cloneUpstreamTokens for the memory backend (used
at 3 sites) and a (*storedUpstreamTokens).toUpstreamTokens method for
the redis backend (used at 2 sites). No behavior change.
The existing tests for GetLatestUpstreamTokensForUser asserted only on
RefreshToken — the previous SessionExpiresAt drop slipped through
because no test exercised whole-struct round-tripping. Add a focused
subtest in each backend that builds a fully-populated UpstreamTokens
fixture, stores it, retrieves it, and asserts whole-struct equality.

Catches the SessionExpiresAt regression and any future field that fails
to round-trip through the storage layer.
The prior.RefreshToken != "" guard in maybeCarryForwardRefreshToken was
untested: every existing case either carried (non-empty prior RT and
matching subject), failed earlier (subject mismatch / fresh IdP RT /
nil prior / storage error), or never reached the guard. The state is
reachable in production — e.g. Google can omit refresh_token even on
first auth without prompt=consent, leaving a leg-1 row with empty RT
that a leg-2 lookup will return.

Add a case where the prior row exists with matching subject but empty
RefreshToken, and assert no carry-forward happens.

Strengthen the shared assertion block with sanity checks on
AccessToken, IDToken, UpstreamSubject and a non-zero ExpiresAt across
all cases. Without them, a regression that early-returns from the
callback before StoreUpstreamTokens would still pass an assertion that
only checks RefreshToken.
The previous tie-breaker used "highest ExpiresAt wins" with zero (the
no-expiry sentinel) treated as the smallest int64 — so non-expiring
rows always lost to any finite expiry. That was inconsistent with the
rest of the package, which treats zero ExpiresAt as "alive forever":
IsExpired returns false for zero, cleanupExpired skips zero entries,
and marshalUpstreamTokensWithTTL only writes a Redis TTL when ExpiresAt
is non-zero. PR #5092 (a98c991) reshaped the whole storage stack on
that premise — Slack user tokens and GitHub OAuth App tokens
deliberately omit expires_in because they are genuinely non-expiring.

Flip the rule to align with the rest of the package: a non-expiring row
beats any finite expiry; among finite expiries, the latest still wins.
Both backends use a small predicate helper (candidateBeatsWinner for
time.Time, candidateBeatsWinnerInt64 for the Redis epoch) to keep the
comparison readable.

The carry-forward caller is largely unaffected today because Slack /
GitHub OAuth App rows store an empty RefreshToken (non-expiring access
tokens have no RT) and the prior.RefreshToken != "" guard rejects them
either way. The flip matters for future callers that read the access
token from the returned row, and for the rare provider that emits both
zero ExpiresAt and a non-empty refresh token.

Update the interface godoc and rename the existing
zero_expires_at_loses_to_nonzero subtest to zero_expires_at_wins_over_nonzero
in both backends. Add a memory-side two_zero_expires_at_rows case to
match the existing Redis coverage.
@jhrozek jhrozek force-pushed the issue-5047-callback-preserve-rt branch from 51d6927 to f564b0f Compare April 30, 2026 12:27
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 30, 2026
@jhrozek
Copy link
Copy Markdown
Contributor Author

jhrozek commented Apr 30, 2026

Thanks for the review. Both MEDIUM findings and both nits addressed — replied on each thread with specifics.

@jhrozek jhrozek merged commit 31e0307 into main May 1, 2026
65 of 66 checks passed
@jhrozek jhrozek deleted the issue-5047-callback-preserve-rt branch May 1, 2026 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Upstream refresh tokens orphaned on re-authorization

2 participants