Add Redis backend for DCRCredentialStore#5195
Draft
tgrunnagle wants to merge 10 commits intodcr-3a_issue_5183from
Draft
Add Redis backend for DCRCredentialStore#5195tgrunnagle wants to merge 10 commits intodcr-3a_issue_5183from
tgrunnagle wants to merge 10 commits intodcr-3a_issue_5183from
Conversation
Implements Phase 2 steps 2d/2g of the DCR story (#5039): - EmbeddedAuthServer now owns an in-memory DCRCredentialStore and calls resolveDCRCredentials for any OAuth2 upstream with DCRConfig. The resolved ClientSecret is overlaid on the built upstream.OAuth2Config after buildPureOAuth2Config (whose signature and body remain intentionally unchanged) so that RFC 7591-obtained credentials flow through the same execution path as file/env-resolved secrets. - Each UpstreamRunConfig element is shallow-copied and its OAuth2 sub-config is deep-copied before DCR resolution, preserving the caller's RunConfig.Upstreams slice per .claude/rules/go-style.md "Copy Before Mutating Caller Input". - resolveDCRCredentials emits structured logs: Debug on cache hit with dcr_age_days, an additional Warn when the cached registration exceeds dcrStaleAgeThreshold (90 days), and Error with a "step" attribute identifying which phase failed on every error path. - The /oauth/register handler upgrades its success log to Info with upstream, issuer, client_id, software_id, token_endpoint_auth_method, and scopes. SoftwareID is threaded through DCRRequest validation so incoming "software_id" is captured. A small helper guards against a nil embedded *fosite.Config (a legitimate test-only condition). - isTransientNetworkError's permanent-4xx branch now emits a Warn with a DCR remediation hint before returning false unchanged. The MonitoredTokenSource gains an optional SetUpstreamContext setter so the upstream and client_id fields can be threaded into the log without breaking the existing NewMonitoredTokenSource contract. - Integration tests exercise the full DCR boot path against a mock AS, verify the cache-hit short-circuit issues zero additional HTTP requests, and assert the caller's original RunConfig.Upstreams slice element is unchanged across both calls. Address authserver DCR review feedback Fixes from the CODE_REVIEW_ISSUES.md review of commit 71c4f43: Critical - Wire upstream/client_id into MonitoredTokenSource so the DCR remediation warning carries meaningful correlation fields. Promote the fields to constructor parameters (replacing SetUpstreamContext) to remove the unsynchronized writer and force callers to supply them at construction. - Runner populates the new fields from the RemoteAuthConfig, preferring the DCR-cached client_id over the statically configured one. High - /oauth/register handler drops the redundant upstream attribute that mirrored issuer, and omits issuer when empty rather than emitting a bare issuer="". - Resolver no longer logs each error branch and then returns; it now wraps failures in a DCRStepError and the boundary caller (buildUpstreamConfigs) emits a single slog.Error via LogDCRStepError. - The DCR-resolved ClientSecret is applied through a new applyResolutionToOAuth2Config helper paired with applyResolution, so the DCR application sites live side-by-side and future call sites cannot silently drop the secret. Medium - Remove the Type==OAuth2 guard that duplicated needsDCR's nil check. - Cap software_id to 256 characters and require printable ASCII in ValidateDCRRequest; expose MaxSoftwareIDLength. - Add TestNewEmbeddedAuthServer_DCRBoot to drive the full constructor and assert EmbeddedAuthServer.dcrStore is populated after boot. - Remove the nil-guard in Handler.issuer() and add TestHandler_issuer so a real wiring bug fails loudly instead of logging issuer="". - Sanitize error strings before logging to strip URL query parameters that could plausibly carry tokens in a future refactor. Preserve URL-trailing punctuation in DCR log sanitiser The query-stripping regex in sanitizeErrorForLog matches `[^\s"']+`, so a URL ending with sentence punctuation (`.`, `,`, `)`, `]`, etc.) pulls that punctuation into the URL match. url.Parse then absorbs it into the raw query, and the Strip + Reassemble step drops it along with the rest of the query — mangling the surrounding prose. Split the trailing run of terminal punctuation off the match before parsing, and re-append it verbatim after the query is stripped. URL matches without a query are returned untouched so the pass is idempotent for URLs that are already clean. New test cases cover commas, periods, closing parens, mixed runs, and a Go http.Client-style quoted URL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A misbehaving or malicious authorization server could echo arbitrary content into the upstream's /register error response. handleHTTPResponse read that body via io.ReadAll with no LimitReader, then embedded it verbatim in the returned error — which downstream callers log. Cap the read at 8 KiB (far larger than any conformant RFC 7591 error response) so operator log volume cannot be inflated by a non-conformant upstream. Addresses #5044 review finding F2 (HIGH). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The doc comment claimed remoteAuthLogContext mirrored the precedence of remote.Handler.resolveClientCredentials, but the implementation skipped the CachedCIMDClientID check entirely. For CIMD-authenticated workloads the new DCR remediation Warn would have reported a stale or empty client_id rather than the CIMD URL actually being sent on token refresh, defeating the operator-correlation the field exists for. Restore the documented precedence (CachedCIMDClientID > CachedClientID > ClientID) and add a TestRemoteAuthLogContext case covering the CIMD-wins path. Addresses #5044 review findings F3 (HIGH) and F26 (MEDIUM, closed by the new test case). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
isTransientNetworkError previously emitted the cached-DCR-client
remediation Warn from inside its classifier on every permanent 4xx.
A tight Token() loop hitting the same condition would spam the same
record on every call before the workload's Unauthenticated status
propagated. The same branch also fired for non-DCR workloads, which
saw a remediation telling them to "delete cached credentials" they
never had.
Strip the side effect from the classifier and emit the Warn from
markAsUnauthenticated, which already gates the close-monitoring
transition through stopOnce. The Warn now fires at most once per
state transition, is suppressed when no client_id context is
available, and reads honestly about the variability ("if this
workload uses cached DCR or CIMD credentials they may be stale").
Addresses #5044 review finding F5 (MEDIUM).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related polish items in the authserver handlers package: NewHandler did not validate that AuthorizationServerConfig (or its embedded *fosite.Config) was non-nil. issuer() reached into the config at request time, so a misconfigured caller would panic deep inside an HTTP handler instead of failing at startup. Add the nil check to the constructor and simplify issuer() to rely on the constructor invariant. Pin the new invariant with TestNewHandler_ErrorsOnNilConfig. The /oauth/register success log was promoted to Info even though the operation is neither long-running nor exceptional. Demote back to Debug; an audit-log path is the right home for the audit signal if that becomes a requirement. Addresses #5044 review findings F6 and F7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles ten review-feedback items in pkg/authserver/runner; each addresses internal-API hygiene or doc-comment drift in the DCR resolver, no behaviour change for callers. Sanitizer hardening (F1, F19, F18, F23): sanitizeErrorForLog now strips userinfo and fragment in addition to the query, since either can carry credentials or tokens (https://user:pass@host, implicit-flow #access_token=...). queryStrippingPattern matches http/https case-insensitively per RFC 3986 §3.1 so HTTPS://... cannot escape sanitisation. trimURLTrailingPunctuation switches to strings.IndexByte to match the ASCII-only terminator set without the rune-decoding overhead. Test cases added for each. Resolver error API (F4, F13, F20): the dcrStepRegister panic-recovery branch no longer emits a duplicate slog.Error; the captured stack travels with the wrapped *dcrStepError to the boundary log so a single panic produces a single record. DCRStepError / LogDCRStepError lowercased to dcrStepError / logDCRStepError since no caller lives outside the package. logDCRStepError now no-ops on nil err so the unknown-step branch cannot fire on a missing failure. Resolution helpers (F11, F12, F25): applyResolution renamed to consumeResolution to communicate that it is a one-shot state transition (clearing DCRConfig is unconditional), not an idempotent defaulting step. The applyResolutionToOAuth2Config doc now states the paired-call invariant explicitly without referencing a specific test. Lifecycle docs (F21, F22): the per-instance dcrStore vs. process-wide dcrFlight asymmetry is now stated on both sides, and EmbeddedAuthServer.Close documents the future-Close hook for a backend with handles. Inline rules-file rationale (F24): production comments no longer cite .claude/rules/... by path; the principle is inlined. Addresses #5044 review findings F1, F4, F11, F12, F13, F18, F19, F20, F21, F22, F23, F24, F25. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements changes for issue #5040 (Phase 2 DCR CRD surface): - Add DCRUpstreamConfig CRD type (discoveryUrl, registrationEndpoint, initialAccessTokenRef, softwareId, softwareStatement) and a new dcrConfig field on OAuth2UpstreamConfig so Kubernetes users can configure RFC 7591 Dynamic Client Registration on upstream providers. - Make OAuth2UpstreamConfig.clientId optional and add CEL validation requiring exactly one of clientId or dcrConfig, and exactly one of discoveryUrl or registrationEndpoint inside dcrConfig. Mirror the checks at runtime via validateOAuth2DCRConfig for defense-in-depth. - Wire the conversion in controllerutil/authserver.go so DCRConfig is mapped onto authserver.DCRUpstreamConfig. InitialAccessTokenRef is resolved to an env var (TOOLHIVE_UPSTREAM_DCR_INITIAL_ACCESS_TOKEN_*) populated from the referenced Secret, mirroring the ClientSecretRef pattern. Extract small helpers for env-var generation to keep cyclomatic complexity within lint limits. - Regenerate zz_generated.deepcopy.go, CRD YAML manifests, and CRD API reference docs. - Add table-driven validation tests covering DCR+ClientID conflict, both endpoints set, neither endpoint set, valid single-endpoint cases, and neither-auth configuration. Add conversion tests covering DCR discoveryUrl/registrationEndpoint paths and initial-access-token env var wiring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Address code review feedback Fixed issues from code review of the DCR CRD surface commit: - CRITICAL: CEL markers contained a Unicode smart quote (U+201D) that gofmt's doc-comment formatter reintroduced on every lint-fix. Rewrote both markers to use CEL's size(...) > 0 idiom instead of `!= ''`, which sidesteps the typographic normalization entirely and keeps regeneration idempotent. Verified no U+2018-U+201F characters remain in source or CRDs. - HIGH: buildUpstreamRunConfig now calls the exported mcpv1beta1.ValidateOAuth2DCRConfig before producing a RunConfig, so malformed ClientID/DCRConfig pairs that bypass admission fail at reconcile time rather than at authserver startup. Error propagation threaded through BuildAuthServerRunConfig; split OIDC and OAuth2 branches into helpers to stay under the gocyclo limit. - HIGH: Added table case exercising validateUpstreamProvider rejection of an OIDC-typed provider whose OAuth2Config carries a DCRConfig. - MEDIUM: Added kubebuilder CEL XValidation on UpstreamProviderConfig enforcing oidcConfig/oauth2Config mutual exclusivity paired to the declared type, closing the silent-pod-failure YAML-apply gap. - MEDIUM: Added MaxLength=255 to SoftwareID and MaxLength=4096 to SoftwareStatement to prevent unbounded input from inflating CRs beyond etcd object limits. - MEDIUM: Pinned the "neither ClientID nor DCRConfig" error assertion to the scoped `oauth2Config:` prefix; added a regression case exercising the non-DCR OAuth2 path (ClientID only, DCRConfig nil); added a new TestBuildAuthServerRunConfig_InvalidDCR suite covering all four invalid DCR/ClientID pairings at the conversion layer. - MEDIUM: Renamed UpstreamDCRInitialAccessTokenEnvVar to UpstreamDCRInitialAccessTokenEnvVarPrefix and expanded the godoc on both prefix constants to show the resolved <prefix>_<PROVIDER> form. All task lint/lint-fix/license-check pass; regenerated CRDs and deepcopy are idempotent; affected unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Address iteration-2 review feedback Polish items raised in the second review pass: - MEDIUM: Trim duplicate upstream name from reconcile-time DCR validation errors. Added scopedFieldPath() helper in cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go so ValidateOAuth2DCRConfig prepends a dotted prefix only when one is given, and the conversion call site now passes an empty prefix so BuildAuthServerRunConfig's outer "upstream %q: %w" wrap is the only mention of the upstream name. Strengthened TestBuildAuthServerRunConfig_InvalidDCR to assert the upstream name appears exactly once in the error string. - MEDIUM: Make the UpstreamProviderConfig CEL rule fail closed for unrecognized future provider types. Restructured the rule from a binary discriminator into a chain of equality checks ending in an explicit `false`, and updated the message to "type must be 'oidc' or 'oauth2'; ...". Added a contributor-facing doc comment reminding future authors to extend both the rule and validateUpstreamProvider when adding a new UpstreamProviderType. - MEDIUM: Refresh the godoc on extractUpstreamSecretRefs to describe the actual invariants that hold post-CEL: OIDC providers can only return a clientSecretRef; OAuth2 providers can return both independently; other (currently unreachable) types return nil/nil. Cross-linked to the CEL rule and noted that BuildAuthServerRunConfig is the reconcile-time backstop callers should not rely on this helper to enforce. Regenerated CRD YAMLs and crd-api.md prose. task lint, lint-fix, license-check, and the affected unit tests pass; regeneration is idempotent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 sub-issue 1 of #5183. Define the persisted DCRCredentials value type and the storage-level DCRCredentialStore interface in pkg/authserver/storage/, and ship the in-process memory implementation that single-replica deployments and unit tests use. The Redis backend (sub-issue 2) and the wiring change (sub-issue 3) build on this. DCRKey consolidation: chose option (a) from the issue — DCRKey and its ScopesHash constructor move to pkg/authserver/storage/ so any future backend hashes keys identically. The runner package keeps a package-local type alias (type DCRKey = storage.DCRKey) and a var binding for scopesHash so existing call sites compile unchanged; the canonical form has a single source of truth. DCRCredentials carries ClientSecretExpiresAt so the Redis backend can drive a SetEX TTL without re-touching the value type or regenerating mocks. The interface contract documents this as SHOULD honor when backend-supported; MemoryStorage retains entries verbatim for the process lifetime. StoreDCRCredentials rejects nil creds and zero-valued Key.Issuer or Key.RedirectURI with fosite.ErrInvalidRequest, matching the StoreUpstreamTokens validation pattern. Stats reports dcrCredentials count for parity with the other in-memory maps. The runner-side DCRCredentialStore (Get/Put *DCRResolution) is left in place as the thin adapter sub-issue 3 will swap. This sub-issue lands the new storage-level interface, MemoryStorage implementation, and regenerated mock without touching the wire-up. DCR credentials are intentionally excluded from cleanupExpired: RFC 7591 client registrations are long-lived and the authoritative expiry signal is client_secret_expires_at, which the Redis backend will honor as a SetEX TTL.
Implements the persistent Redis backend for DCR credentials behind the DCRCredentialStore interface, so an authserver backed by Redis Sentinel shares dynamic-client-registration state across replicas and survives restarts. The wire format honors RFC 7591 client_secret_expires_at as a Redis TTL when non-zero, falling back to a long-lived row when the upstream did not assert an expiry. Implements changes for issue #5184: - Add KeyTypeDCR const and length-prefixed redisDCRKey helper that handles colons in RedirectURI without ambiguity - Add RedisStorage.StoreDCRCredentials/GetDCRCredentials with JSON serialization, defensive copy via decode, and TTL derived from ClientSecretExpiresAt - Add unit tests for redisDCRKey distinctness/determinism plus miniredis-backed tests for round-trip, overwrite, validation, defensive copy, TTL semantics, and concurrent access - Add Redis Sentinel integration tests covering round-trip, distinct keys, overwrite, real-Redis TTL (-1 for never-expires), and concurrent access
Fixed issues from code review: - MEDIUM: Drop input validation from RedisStorage.GetDCRCredentials so it matches MemoryStorage and the DCRCredentialStore interface contract (an unpopulated key is now a normal ErrNotFound miss). Fold the empty-key cases into TestRedisStorage_DCRCredentials_NotFound. - MEDIUM: Apply pastExpiryDCRTTL = 1s when ClientSecretExpiresAt is in the past so already-expired DCR rows self-evict instead of living forever. Update docstring and rewrite the corresponding TTL subtest. - MEDIUM: Assert no Redis row remains after each rejected StoreDCRCredentials call, mirroring the memory backend's Stats().DCRCredentials == 0 guard. - LOW: Loosen the integration TTL upper-bound assertion by 1s so truncation/Redis second-granularity cannot flake the assertion. - LOW: Drop the inaccurate SetEX rationale from StoreDCRCredentials' docstring (resolved as part of the past-expiry fix). - LOW: Remove dcrIntegrationFixtureKey duplicate; integration tests now share dcrFixtureKey from redis_test.go (no build tag) so the canonical fixture has a single source of truth.
c0fed52 to
cc92472
Compare
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
DRAFT - not ready for review
Summary
An authserver replica that registers itself as a DCR client against an upstream
authorization server currently keeps the resulting
(client_id, client_secret)in the in-process
MemoryStoragefrom sub-issue 1. Restarts and horizontalscale-outs lose the registration, forcing every replica to re-register on cold
start and breaking RFC 7592 management URLs. This PR adds the persistent half
of
DCRCredentialStoreso a Redis-Sentinel-backed authserver shares DCRcredentials across replicas and survives restarts.
KeyTypeDCRand a length-prefixedredisDCRKey(prefix, DCRKey)helperthat handles colons in
RedirectURIandIssuerwithout ambiguity, mirroringthe existing
redisProviderKeyshape.RedisStorage.StoreDCRCredentials/GetDCRCredentialswith JSONserialisation (acting as a defensive copy on read) and TTL semantics derived
from
client_secret_expires_at: zero means no Redis TTL (RFC 7591 "never"),a future expiry uses
time.Until(expiry), and an already-past expiry isbounded to a 1s eviction window so already-dead secrets do not linger.
_ DCRCredentialStore = (*RedisStorage)(nil)assertionalongside the existing interface checks at the bottom of
redis.go.round-trip, overwrite, validation, defensive copy via decode, all three TTL
branches,
ErrNotFoundsemantics, and concurrentStore/Get.//go:build integration)pinning the wire-level TTL contract against real Redis (
-1for never-expires)plus round-trip, distinct-keys, overwrite, and concurrent access — extending
redis_integration_test.gorather than introducing a second harness.Closes #5184
Type of change
Test plan
task test)task lint-fix)Ran the integration suite locally against a Docker-backed Redis Sentinel
cluster:
go test -tags=integration ./pkg/authserver/storage/...(TTL,round-trip, distinct-keys, overwrite, and concurrent-access cases all pass,
including the wire-level
TTL == -1assertion for never-expires rows).API Compatibility
v1beta1API, OR theapi-break-allowedlabel is applied and the migration guidance is described above.Changes
pkg/authserver/storage/redis_keys.goKeyTypeDCRconst andredisDCRKeylength-prefixed key helper.pkg/authserver/storage/redis.gostoredDCRCredentialswire type,StoreDCRCredentials/GetDCRCredentials,pastExpiryDCRTTLbound, andDCRCredentialStoreinterface assertion.pkg/authserver/storage/redis_test.goErrNotFound, concurrent access.pkg/authserver/storage/redis_integration_test.gowithIntegrationStorage: round-trip, distinct keys, overwrite, real-Redis TTL, concurrent access.Does this introduce a user-facing change?
No. This is internal storage plumbing behind the
DCRCredentialStoreinterfaceintroduced in sub-issue 1. Sub-issue 3 will wire
EmbeddedAuthServerto selectthe Redis backend via the existing
storage_type: redisconfig toggle.Special notes for reviewers
CRITICAL / HIGH / MEDIUM findings. One LOW finding remains: a stale docstring
on a unit-test helper in
redis_test.gothat still references thepre-
pastExpiryDCRTTLbehaviour. Happy to fix in this PR or as a follow-upif reviewers prefer to keep this PR focused on the storage primitive.
1srather than rejecting thewrite or storing long-lived. Rationale is in the
pastExpiryDCRTTLconstantdoc comment and the
StoreDCRCredentialsdocstring: caller's expirytimestamp still round-trips so a downstream reader can observe it and trigger
re-registration, while the row self-evicts almost immediately. Worth a look
during review to confirm the policy matches how sub-issue 3's resolver will
consume it.
miniredis(already ingo.modfrom the surroundingredis_test.gosuite). Integration tests use the existing testcontainersRedis Sentinel harness — both layers are required: the unit layer pins
in-process semantics for
task test, and the integration layer pins thewire contract (
TTLreturns-1for "no TTL") that miniredis cannotfaithfully reproduce.