Carry forward upstream refresh token on re-authorization#5132
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5132 +/- ##
==========================================
+ Coverage 67.31% 67.40% +0.08%
==========================================
Files 598 598
Lines 60566 60656 +90
==========================================
+ Hits 40768 40883 +115
+ Misses 16706 16673 -33
- Partials 3092 3100 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
tgrunnagle
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review (MEDIUM findings only)
Agents consulted: oauth-flow, security, redis-storage, test-coverage, go-style, general-quality (6 specialists, parallel review)
A full consensus review surfaced 18 actionable findings. Posting only the MEDIUM-severity items here; the LOW findings are polish and didn't feel worth the inline noise on a focused fix PR.
MEDIUM Findings
| # | Finding | Consensus | Action |
|---|---|---|---|
| F1 | SessionExpiresAt dropped from returned record (Memory + Redis) |
10/10 | Fix — see inline suggestions |
| F7 | Missing handler test — prior row with empty RefreshToken |
9/10 | Add table case |
Overall
The carry-forward design is sound and the implementation is correct for the documented use case. Mirroring RefreshAndStore's preservation pattern at the cross-row level (with the UpstreamSubject defense-in-depth guard) is a sensible surgical fix for Root Cause B. Scope is one logical change, the integration-test harness (rtStrippingProxy) is genuine novel infrastructure, and the explicit deferral of Root Cause A is appropriate.
The latent bug worth fixing in this PR is F1: both backends silently omit SessionExpiresAt from the returned struct in GetLatestUpstreamTokensForUser. No production caller reads it today, but it's a contract-shaped trap for any future caller that uses this method as a substitute for GetUpstreamTokens — exactly the kind of expansion the deferred Root Cause A rekey will encourage. One-line fix per backend; suggestion blocks attached.
F7 is a test gap: the production guard prior.RefreshToken != "" (callback.go:187) is currently untested. The case matters in real chains because the carry-forward chain itself can leave rows with empty RTs, so the guard is load-bearing.
Generated with Claude Code
Introduces a secondary upstream-token lookup keyed on (userID, providerID) on the UpstreamTokenStorage interface. The OAuth callback handler will use this in a follow-on commit to preserve a previously-issued refresh token when the upstream IdP omits refresh_token on re-authorization (e.g. Google without prompt=consent), which today silently clobbers the prior good RT and forces a refresh-token-orphaning loop. This commit only adds the interface surface plus stub implementations on both backends that return ErrNotFound. The real Memory and Redis logic, the callback wiring, and tests land in subsequent commits so each diff stays small and independently reviewable. The godoc explicitly calls out the liveness contract (expired rows are returned, callers handle it) and the cross-identity caller responsibility (verify UpstreamSubject before reusing the prior RT) so future callers cannot misuse the lookup. Refs #5047
Replaces the Step 1 stub on MemoryStorage with the real lookup. Iterates the upstream-tokens map under RLock, filters by (UserID, ProviderID), and picks the row with the highest ExpiresAt. Expired access tokens are intentionally returned (matching the interface contract) so the OAuth callback can carry forward a still-valid refresh token even when the access token is dead. Returns a defensive deep copy on hit and the same wrapped ErrNotFound shape that GetUpstreamTokens uses on miss, so callers cannot accidentally mutate stored state through the returned pointer. Adds a 7-case table-driven test covering: no match, one match (with a deep-copy mutation assertion), latest-by-ExpiresAt across three rows, different-user/different-provider misses, expired-row tolerance, and zero-ExpiresAt losing to non-zero. Refs #5047
Replaces the Step 1 stub on RedisStorage with the real lookup, using the
existing KeyTypeUserUpstream:{userID} reverse-index set already maintained
atomically by storeUpstreamTokensScript. The read path is plain Go (no Lua
script) since atomicity buys nothing on a read-only path: the worst case
is reading a slightly-stale row, which is acceptable for the carry-forward
use case.
Mirrors the SMembers -> MGet -> unmarshal pattern from GetAllUpstreamTokens
and tolerates dangling SMEMBERs (per-provider key TTL-evicted, set member
lingers), nullMarker deletion tombstones, and corrupt JSON. Filters on
the deserialized ProviderID rather than parsing the key string so the
implementation survives any future key-format change. Picks the winner by
the stored unix-timestamp ExpiresAt; zero (the no-expiry sentinel) loses
to any real timestamp.
Adds input validation for empty userID/providerID matching the rest of
the type's public API. The Memory backend has the same gap and should be
fixed as a follow-up; it's left untouched here to keep this commit
scoped to the Redis backend.
Adds a 9-case table-driven unit test covering all the cases from the
Memory backend plus a stale-index-entry case (dangling SMEMBER injected
via miniredis SAdd) and a two-zero-ExpiresAt case (verifies non-strict
ordering without locking down iteration order).
Refs #5047
The OAuth callback handler now preserves a previously-issued upstream refresh token when the IdP omits refresh_token on re-authorization. This is common across major IdPs: Google omits unless prompt=consent and access_type=offline are forced; Microsoft Entra omits when consent was already granted; GitHub, Okta, Auth0, and Keycloak behave similarly. In every case the prior RT remains valid at the IdP — but today the callback writes the empty value verbatim, clobbering the stored RT and forcing the next refresh attempt to fail, which puts the user in a re-auth-orphans-RT loop. The carry-forward logic mirrors the existing pattern in upstreamTokenRefresher.RefreshAndStore. It uses the new storage method GetLatestUpstreamTokensForUser to find the most recent prior row for (user, provider) and copies the prior RefreshToken into the new row. The prior row's UpstreamSubject must equal the current providerSubject as defense-in-depth; this guard is dormant given the current identity model (ResolveUser keys on (providerID, providerSubject) and same- provider account-linking is not implemented) but protects future flows. Storage failures during the lookup log a Warn and continue with the empty RT — failing the callback would be a worse user experience than the (already broken) status quo, and the next refresh attempt will produce a clean error path the user can retry from. The implementation lives in a private helper to keep CallbackHandler under the gocyclo threshold; the call site is between storageTokens construction and StoreUpstreamTokens so the in-place mutation is visible at the point of use. Adds 5 handler tests covering: positive carry-forward, UpstreamSubject mismatch (no carry), fresh-RT-from-IdP wins (carry-forward short-circuited), no-prior-row (status quo for first auth), and storage-error-non-fatal (callback completes with empty RT instead of returning 500). Refs #5047
Drives the full authorize -> callback flow against mockoidc and a real
MemoryStorage so the carry-forward behavior is validated end-to-end, not
just at the gomock-handler boundary.
mockoidc unconditionally issues refresh_token on the authorization_code
grant with no API to suppress the field, so the test wires a small
reverse-proxy (rtStrippingProxy) between the auth server and mockoidc.
When the test toggles the strip flag (atomic.Bool — required, otherwise
the race detector flags the read on httptest's request goroutines), the
proxy unmarshals mockoidc's token response, deletes the refresh_token
key, and re-marshals before forwarding. The key is then absent from the
wire JSON exactly as Google, Entra, GitHub, Okta and Auth0 behave on a
silent re-authorization, which is what the underlying bug actually
reproduces in production.
Three legs:
1. Initial authorize -> callback with RT preserved verbatim. Captures
the RT for later comparison.
2. Strip flag flipped on, second authorize -> callback for the same
user. Asserts the new (different sessionID) row carries the leg-1
RT forward.
3. Different upstream subject -> ResolveUser mints a new internal
user; the carry-forward lookup returns ErrNotFound and the new row
stays empty. Confirms the carry-forward fires only when warranted.
The UpstreamSubject == providerSubject guard inside the callback is
covered by the handler-level unit test added in the previous commit;
exercising it end-to-end here would require either a contrived storage
hack or same-provider account-linking (not implemented today).
Note on offline_access: real IdPs require this scope before issuing a
refresh token, but mockoidc's ScopesSupported allowlist rejects it as
invalid_scope. The test does not request the scope as a result. This is
a fixture limitation; the carry-forward regression assertion is
unaffected because mockoidc returns the RT regardless.
Refs #5047
Both backend implementations of GetLatestUpstreamTokensForUser dropped SessionExpiresAt from the returned struct, while the sibling readers GetUpstreamTokens (memory) and unmarshalUpstreamTokens (redis) populate it. The interface return type makes no "partial copy" caveat, so any future caller doing Get-then-Store would lose the storage TTL fallback bound that SessionExpiresAt provides on rows whose upstream IdP did not issue an expires_in. Today's only caller (maybeCarryForwardRefreshToken) reads only RefreshToken and UpstreamSubject, so this is latent. Fixing now to close the contract trap before the deferred Root Cause A rekey introduces a caller that does treat the result as round-trippable.
The storage package had four sites doing field-by-field construction of UpstreamTokens — three in the memory backend (defensive deep copies) and two in the redis backend (storedUpstreamTokens int64→time.Time conversion). Each site is a place where a future field addition can be silently missed; the SessionExpiresAt drop in the previous commit is an example of that class of bug. Extract a free function cloneUpstreamTokens for the memory backend (used at 3 sites) and a (*storedUpstreamTokens).toUpstreamTokens method for the redis backend (used at 2 sites). No behavior change.
The existing tests for GetLatestUpstreamTokensForUser asserted only on RefreshToken — the previous SessionExpiresAt drop slipped through because no test exercised whole-struct round-tripping. Add a focused subtest in each backend that builds a fully-populated UpstreamTokens fixture, stores it, retrieves it, and asserts whole-struct equality. Catches the SessionExpiresAt regression and any future field that fails to round-trip through the storage layer.
The prior.RefreshToken != "" guard in maybeCarryForwardRefreshToken was untested: every existing case either carried (non-empty prior RT and matching subject), failed earlier (subject mismatch / fresh IdP RT / nil prior / storage error), or never reached the guard. The state is reachable in production — e.g. Google can omit refresh_token even on first auth without prompt=consent, leaving a leg-1 row with empty RT that a leg-2 lookup will return. Add a case where the prior row exists with matching subject but empty RefreshToken, and assert no carry-forward happens. Strengthen the shared assertion block with sanity checks on AccessToken, IDToken, UpstreamSubject and a non-zero ExpiresAt across all cases. Without them, a regression that early-returns from the callback before StoreUpstreamTokens would still pass an assertion that only checks RefreshToken.
The previous tie-breaker used "highest ExpiresAt wins" with zero (the no-expiry sentinel) treated as the smallest int64 — so non-expiring rows always lost to any finite expiry. That was inconsistent with the rest of the package, which treats zero ExpiresAt as "alive forever": IsExpired returns false for zero, cleanupExpired skips zero entries, and marshalUpstreamTokensWithTTL only writes a Redis TTL when ExpiresAt is non-zero. PR #5092 (a98c991) reshaped the whole storage stack on that premise — Slack user tokens and GitHub OAuth App tokens deliberately omit expires_in because they are genuinely non-expiring. Flip the rule to align with the rest of the package: a non-expiring row beats any finite expiry; among finite expiries, the latest still wins. Both backends use a small predicate helper (candidateBeatsWinner for time.Time, candidateBeatsWinnerInt64 for the Redis epoch) to keep the comparison readable. The carry-forward caller is largely unaffected today because Slack / GitHub OAuth App rows store an empty RefreshToken (non-expiring access tokens have no RT) and the prior.RefreshToken != "" guard rejects them either way. The flip matters for future callers that read the access token from the returned row, and for the rare provider that emits both zero ExpiresAt and a non-empty refresh token. Update the interface godoc and rename the existing zero_expires_at_loses_to_nonzero subtest to zero_expires_at_wins_over_nonzero in both backends. Add a memory-side two_zero_expires_at_rows case to match the existing Redis coverage.
51d6927 to
f564b0f
Compare
|
Thanks for the review. Both MEDIUM findings and both nits addressed — replied on each thread with specifics. |
Summary
This PR addresses Root Cause B of #5047: when a user re-authorizes through the auth server, the OAuth callback writes the IdP-supplied
RefreshTokenverbatim into a newstorage.UpstreamTokensrow. Real IdPs commonly omitrefresh_tokenon a silent re-auth (Google withoutprompt=consent, Microsoft Entra when consent was already granted, GitHub/Okta/Auth0/Keycloak similarly). In every case the previously-issued RT remains valid at the IdP — but today the callback overwrites it with"", orphaning the prior good RT and making the next refresh attempt fail. That forces another re-auth, perpetuating the loop.The fix mirrors the preservation pattern already in place in
upstreamTokenRefresher.RefreshAndStore: if the IdP returns no RT, look up the most recent prior row for(userID, providerID)and carry the prior RT into the new row. A defensiveprior.UpstreamSubject == providerSubjectguard prevents carrying credentials across distinct upstream identities. Storage failures during the lookup are non-fatal — they log a warning and proceed with the empty RT, which is no worse than the current status quo.UpstreamTokenStorage.GetLatestUpstreamTokensForUser(ctx, userID, providerID)keyed on user+provider rather than the primary(sessionID, providerName)key. Implemented on both Memory and Redis backends; both backends pick the highestExpiresAtso the carry-forward decision is consistent across deployment shapes. Expired rows are intentionally returned (callers want the RT regardless of access-token expiry).pkg/authserver/server/handlers/callback.goextracted intomaybeCarryForwardRefreshToken(gocyclo budget) and called betweenstorageTokensconstruction andStoreUpstreamTokens.KeyTypeUserUpstream:{userID}reverse-index set (already maintained atomically bystoreUpstreamTokensScript), no new Lua script.mockoidc+ a thin RT-stripping reverse proxy (mockoidc unconditionally issues RTs forauthorization_codeand offers no API to suppress them).Out of scope, deferred to a follow-up PR: Root Cause A — the storage rekey from
(sessionID, providerID)to(userID, providerID)plus a one-time Redis migration. Trey's comment on #5047 describes the split; this PR is the smaller, surgical fix and is valuable on its own because it ends the orphaning loop even before Root Cause A lands.Closes #5047 partially (Root Cause B; Root Cause A tracked separately).
Type of change
Test plan
task test)task test-e2e)task lint-fix)Specifically verified:
go test -race -count=1 ./pkg/authserver/...— all 13 packages pass.go test -race -count=1 -v -run TestIntegration_ ./pkg/authserver/— all 23 integration tests pass, including the newTestIntegration_Callback_PreservesRefreshTokenOnReauthand the existingTestIntegration_FullFlow_NonExpiringUpstreamToken[_Redis]andTestIntegration_MultiUpstreamChain_MixedExpiryOrderings[_Redis]from Fix non-expiring upstream token handling and storage TTL bugs #5092.task test— pre-existing failure inpkg/workloads/TestDefaultManager_ListWorkloadsInGroup/workloads_with_empty_group_namesis unrelated to this branch.API Compatibility
v1beta1API, OR theapi-break-allowedlabel is applied and the migration guidance is described above.The new method on
storage.UpstreamTokenStorageis an internal Go interface, not a CRD field. No CRD schema changes.Changes
pkg/authserver/storage/types.goGetLatestUpstreamTokensForUserto theUpstreamTokenStorageinterface with explicit godoc on liveness contract (expired rows are returned) and cross-identity caller responsibility.pkg/authserver/storage/memory.goMemoryStorage— RLock'd iteration, deep-copy on hit, wrappedErrNotFoundon miss.pkg/authserver/storage/redis.goRedisStorageusingSMembersover the existing user-upstream reverse index, singleMGet, deserialized-ProviderIDfiltering. Tolerates dangling SMEMBERs and thenulldeletion tombstone.pkg/authserver/storage/mocks/mock_storage.gopkg/authserver/server/handlers/callback.gomaybeCarryForwardRefreshTokenhelper invoked between struct construction andStoreUpstreamTokens.UpstreamSubjectguard, non-fatal on storage error.pkg/authserver/storage/memory_test.goExpiresAt, empty-input validation).pkg/authserver/storage/redis_test.goSAdd, two-zero-ExpiresAtcorner case).pkg/authserver/server/handlers/callback_test.goTestCallbackHandler_RefreshTokenCarryForwardconsolidating the 5 callback scenarios (preserve, subject mismatch, fresh wins, no prior row, storage error).pkg/authserver/server/handlers/helpers_test.gowithGetLatestUpstreamTokensErroroption mirroring the existingwithStorePendingErrorpattern; mock checks it first.pkg/authserver/integration_test.goTestIntegration_Callback_PreservesRefreshTokenOnReauth— full callback flow against mockoidc with an RT-stripping reverse proxy.Does this introduce a user-facing change?
Yes. Users who previously hit a refresh-token-orphaning loop on re-authorization (most commonly with Google, Entra, Okta, GitHub, Auth0, Keycloak — all of which omit
refresh_tokenon silent re-auth) will see refreshes succeed instead of failing into another forced re-auth. The fix is silent: no configuration changes are required.Implementation plan
This PR was planned with Claude Code via a per-step
go-expert-developer+ MoE (code-reviewer+go-security-reviewer/oauth-expert/redis-valkey-advisor) workflow. Each step landed as a separate commit with independent review.Approved plan (5 steps)
GetLatestUpstreamTokensForUserto storage interface + stub implementations + mock regen. Keeps the diff tiny so the interface surface is reviewed independently.(UserID, ProviderID), picks highestExpiresAt, returns deep copy.KeyTypeUserUpstreamreverse set; SMembers → MGet → unmarshal → filter; tolerates stale index entries andnulltombstones.maybeCarryForwardRefreshTokenbetweenstorageTokensconstruction andStoreUpstreamTokens. Non-fatal on storage error.UpstreamSubjectguard.ErrNotFoundand stays empty.Cross-project research (via MCP-mounted source): Hydra delegates upstream-token storage to its consent app and avoids the bug class. oauth2-proxy has the same blind-overwrite shape but sidesteps it via opaque-ticket session keys (orphaned rows TTL out instead of being clobbered) — their
IDTokenpreservation pattern atpkg/providers/oidc.go:140-180is the explicit code-level analog we mirror forRefreshToken. infratographer/identity-api is RFC-8693 only, no broker.Special notes for reviewers
(sessionID, providerID)to(userID, providerID)with a one-time Redis migration — is intentionally deferred to a follow-up PR. See Trey's plan in Upstream refresh tokens orphaned on re-authorization #5047. This fix is valuable on its own: even with Root Cause A unresolved, carrying the RT forward to the new sessionID lets the refresher succeed when the new access token expires, ending the re-auth cycle.UpstreamSubject == providerSubjectguard is defense-in-depth. Given the current identity model (ResolveUserkeys identity by(providerID, providerSubject)), the same internalUserIDcannot legitimately have two different upstream subjects on the same provider — same-provider account-linking is a future TODO instorage/types.go. The guard exists so that future code can't accidentally violate that.offline_accessbecause mockoidc'sScopesSupportedrejects it asinvalid_scope. mockoidc unconditionally issues RTs regardless of scope, so the regression assertion (leg 2 RT matches leg 1 RT) is unaffected — but reviewers should know the test does not exercise the realistic precondition that real IdPs gate RT issuance onoffline_access.🤖 Generated with Claude Code
Large PR Justification