
Add Redis backend for DCRCredentialStore#5195

Draft
tgrunnagle wants to merge 10 commits into dcr-3a_issue_5183 from dcr-3b_issue_5184


@tgrunnagle tgrunnagle commented May 5, 2026

DRAFT - not ready for review

Summary

An authserver replica that registers itself as a DCR client against an upstream
authorization server currently keeps the resulting (client_id, client_secret)
in the in-process MemoryStorage from sub-issue 1. Restarts and horizontal
scale-outs lose the registration, forcing every replica to re-register on cold
start and breaking RFC 7592 management URLs. This PR adds the persistent half
of DCRCredentialStore so a Redis-Sentinel-backed authserver shares DCR
credentials across replicas and survives restarts.

  • Add KeyTypeDCR and a length-prefixed redisDCRKey(prefix, DCRKey) helper
    that handles colons in RedirectURI and Issuer without ambiguity, mirroring
    the existing redisProviderKey shape.
  • Add RedisStorage.StoreDCRCredentials / GetDCRCredentials with JSON
    serialisation (acting as a defensive copy on read) and TTL semantics derived
    from client_secret_expires_at: zero means no Redis TTL (RFC 7591 "never"),
    a future expiry uses time.Until(expiry), and an already-past expiry is
    bounded to a 1s eviction window so already-dead secrets do not linger.
  • Wire a compile-time _ DCRCredentialStore = (*RedisStorage)(nil) assertion
    alongside the existing interface checks at the bottom of redis.go.
  • Add miniredis-backed unit tests covering key distinctness/determinism, full
    round-trip, overwrite, validation, defensive copy via decode, all three TTL
    branches, ErrNotFound semantics, and concurrent Store / Get.
  • Add Redis Sentinel integration tests (testcontainers, //go:build integration)
    pinning the wire-level TTL contract against real Redis (-1 for never-expires)
    plus round-trip, distinct-keys, overwrite, and concurrent access — extending
    redis_integration_test.go rather than introducing a second harness.
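
The length-prefixed key idea from the first bullet can be sketched as follows. This is a minimal illustration, not the PR's redisDCRKey: the dcrKey struct, its field set, and the "p:" prefix are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// dcrKey is a hypothetical stand-in for storage.DCRKey; the real field set may differ.
type dcrKey struct {
	Issuer      string
	RedirectURI string
	ScopesHash  string
}

// lengthPrefixedKey writes each part as ":<byte length>:<value>", so a colon
// inside a value can never be read as a separator.
func lengthPrefixedKey(prefix, keyType string, k dcrKey) string {
	var b strings.Builder
	b.WriteString(prefix)
	b.WriteString(keyType)
	for _, part := range []string{k.Issuer, k.RedirectURI, k.ScopesHash} {
		b.WriteByte(':')
		b.WriteString(strconv.Itoa(len(part)))
		b.WriteByte(':')
		b.WriteString(part)
	}
	return b.String()
}

func main() {
	// A plain colon join would map both keys to the same string;
	// the length prefixes keep them apart.
	k1 := dcrKey{Issuer: "a:b", RedirectURI: "c", ScopesHash: "h"}
	k2 := dcrKey{Issuer: "a", RedirectURI: "b:c", ScopesHash: "h"}
	fmt.Println(lengthPrefixedKey("p:", "dcr", k1))
	fmt.Println(lengthPrefixedKey("p:", "dcr", k2))
}
```

The two keys in main would collide under a naive strings.Join with ":", which is the ambiguity the length prefix is meant to rule out.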

Closes #5184

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Ran the integration suite locally against a Docker-backed Redis Sentinel
cluster: go test -tags=integration ./pkg/authserver/storage/... (TTL,
round-trip, distinct-keys, overwrite, and concurrent-access cases all pass,
including the wire-level TTL == -1 assertion for never-expires rows).

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Changes

| File | Change |
| --- | --- |
| pkg/authserver/storage/redis_keys.go | Add KeyTypeDCR const and redisDCRKey length-prefixed key helper. |
| pkg/authserver/storage/redis.go | Add storedDCRCredentials wire type, StoreDCRCredentials / GetDCRCredentials, pastExpiryDCRTTL bound, and DCRCredentialStore interface assertion. |
| pkg/authserver/storage/redis_test.go | Add miniredis unit coverage: key encoding, round-trip, overwrite, validation, defensive copy, TTL (never / future / past), ErrNotFound, concurrent access. |
| pkg/authserver/storage/redis_integration_test.go | Add Redis Sentinel integration coverage extending withIntegrationStorage: round-trip, distinct keys, overwrite, real-Redis TTL, concurrent access. |

Does this introduce a user-facing change?

No. This is internal storage plumbing behind the DCRCredentialStore interface
introduced in sub-issue 1. Sub-issue 3 will wire EmbeddedAuthServer to select
the Redis backend via the existing storage_type: redis config toggle.

Special notes for reviewers

  • This branch went through two review iterations. The final state has zero
    CRITICAL / HIGH / MEDIUM findings. One LOW finding remains: a stale docstring
    on a unit-test helper in redis_test.go that still references the
    pre-pastExpiryDCRTTL behaviour. Happy to fix in this PR or as a follow-up
    if reviewers prefer to keep this PR focused on the storage primitive.
  • Past-expiry TTL behaviour is deliberately 1s rather than rejecting the
    write or storing a long-lived row. The rationale is in the pastExpiryDCRTTL
    constant doc comment and the StoreDCRCredentials docstring: the caller's
    expiry timestamp still round-trips, so a downstream reader can observe it
    and trigger re-registration, while the row self-evicts almost immediately.
    Worth a look during review to confirm the policy matches how sub-issue 3's
    resolver will consume it.
  • Unit tests use miniredis (already in go.mod from the surrounding
    redis_test.go suite). Integration tests use the existing testcontainers
    Redis Sentinel harness — both layers are required: the unit layer pins
    in-process semantics for task test, and the integration layer pins the
    wire contract (TTL returns -1 for "no TTL") that miniredis cannot
    faithfully reproduce.
  • No new dependencies were introduced.
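
The three TTL branches discussed above (never, future, past) can be sketched as a pure function. This is an illustration under assumptions: the function name dcrTTL and the explicit "now" parameter are introduced here for testability, whereas the PR describes using time.Until(expiry). A zero duration stands for "write without a Redis TTL".

```go
package main

import (
	"fmt"
	"time"
)

// pastExpiryTTL mirrors the 1s bound described in the PR; the name is illustrative.
const pastExpiryTTL = time.Second

// dcrTTL maps client_secret_expires_at (Unix seconds, 0 = never expires per
// RFC 7591) onto a TTL. A zero result means "store without expiry".
func dcrTTL(expiresAtUnix, nowUnix int64) time.Duration {
	switch {
	case expiresAtUnix == 0:
		return 0 // never expires: no TTL
	case expiresAtUnix > nowUnix:
		return time.Duration(expiresAtUnix-nowUnix) * time.Second
	default:
		return pastExpiryTTL // already expired: evict almost immediately
	}
}

func main() {
	now := time.Now().Unix()
	fmt.Println(dcrTTL(0, now), dcrTTL(now+3600, now), dcrTTL(now-10, now))
}
```

Keeping the mapping separate from the Redis call makes each branch easy to pin in a unit test without a running server.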

tgrunnagle and others added 10 commits May 1, 2026 06:46
Implements Phase 2 steps 2d/2g of the DCR story (#5039):

- EmbeddedAuthServer now owns an in-memory DCRCredentialStore and calls
  resolveDCRCredentials for any OAuth2 upstream with DCRConfig. The
  resolved ClientSecret is overlaid on the built upstream.OAuth2Config
  after buildPureOAuth2Config (whose signature and body remain
  intentionally unchanged) so that RFC 7591-obtained credentials flow
  through the same execution path as file/env-resolved secrets.
- Each UpstreamRunConfig element is shallow-copied and its OAuth2
  sub-config is deep-copied before DCR resolution, preserving the
  caller's RunConfig.Upstreams slice per .claude/rules/go-style.md
  "Copy Before Mutating Caller Input".
- resolveDCRCredentials emits structured logs: Debug on cache hit with
  dcr_age_days, an additional Warn when the cached registration exceeds
  dcrStaleAgeThreshold (90 days), and Error with a "step" attribute
  identifying which phase failed on every error path.
- The /oauth/register handler upgrades its success log to Info with
  upstream, issuer, client_id, software_id, token_endpoint_auth_method,
  and scopes. SoftwareID is threaded through DCRRequest validation so
  incoming "software_id" is captured. A small helper guards against a
  nil embedded *fosite.Config (a legitimate test-only condition).
- isTransientNetworkError's permanent-4xx branch now emits a Warn with
  a DCR remediation hint before returning false unchanged. The
  MonitoredTokenSource gains an optional SetUpstreamContext setter so
  the upstream and client_id fields can be threaded into the log
  without breaking the existing NewMonitoredTokenSource contract.
- Integration tests exercise the full DCR boot path against a mock AS,
  verify the cache-hit short-circuit issues zero additional HTTP
  requests, and assert the caller's original RunConfig.Upstreams slice
  element is unchanged across both calls.
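
A minimal sketch of the copy-before-mutating pattern described in the second bullet. The upstream and oauth2Config types are simplified stand-ins invented for this example, not the real UpstreamRunConfig shape.

```go
package main

import "fmt"

// oauth2Config and upstream are simplified, hypothetical stand-ins.
type oauth2Config struct {
	ClientID     string
	ClientSecret string
}

type upstream struct {
	Name   string
	OAuth2 *oauth2Config
}

// withResolvedSecret returns a new slice whose elements carry the secret,
// leaving the caller's slice and the structs it points at untouched.
func withResolvedSecret(in []upstream, secret string) []upstream {
	out := make([]upstream, len(in))
	for i, u := range in {
		out[i] = u // shallow copy of the element
		if u.OAuth2 != nil {
			cfg := *u.OAuth2 // copy the pointed-to struct before mutating it
			cfg.ClientSecret = secret
			out[i].OAuth2 = &cfg
		}
	}
	return out
}

func main() {
	orig := []upstream{{Name: "idp", OAuth2: &oauth2Config{ClientID: "abc"}}}
	resolved := withResolvedSecret(orig, "s3cret")
	fmt.Println(orig[0].OAuth2.ClientSecret == "", resolved[0].OAuth2.ClientSecret)
}
```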

Address authserver DCR review feedback

Fixes from the CODE_REVIEW_ISSUES.md review of commit 71c4f43:

Critical
- Wire upstream/client_id into MonitoredTokenSource so the DCR remediation
  warning carries meaningful correlation fields. Promote the fields to
  constructor parameters (replacing SetUpstreamContext) to remove the
  unsynchronized writer and force callers to supply them at construction.
- Runner populates the new fields from the RemoteAuthConfig, preferring
  the DCR-cached client_id over the statically configured one.

High
- /oauth/register handler drops the redundant upstream attribute that
  mirrored issuer, and omits issuer when empty rather than emitting a
  bare issuer="".
- Resolver no longer logs each error branch and then returns; it now
  wraps failures in a DCRStepError and the boundary caller
  (buildUpstreamConfigs) emits a single slog.Error via LogDCRStepError.
- The DCR-resolved ClientSecret is applied through a new
  applyResolutionToOAuth2Config helper paired with applyResolution, so
  the DCR application sites live side-by-side and future call sites
  cannot silently drop the secret.

Medium
- Remove the Type==OAuth2 guard that duplicated needsDCR's nil check.
- Cap software_id to 256 characters and require printable ASCII in
  ValidateDCRRequest; expose MaxSoftwareIDLength.
- Add TestNewEmbeddedAuthServer_DCRBoot to drive the full constructor
  and assert EmbeddedAuthServer.dcrStore is populated after boot.
- Remove the nil-guard in Handler.issuer() and add TestHandler_issuer
  so a real wiring bug fails loudly instead of logging issuer="".
- Sanitize error strings before logging to strip URL query parameters
  that could plausibly carry tokens in a future refactor.
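
The software_id bound from the Medium list could look roughly like this. It is a sketch, assuming "printable ASCII" means bytes 0x20 through 0x7E; the function name is hypothetical and the real ValidateDCRRequest may be structured differently.

```go
package main

import "fmt"

// maxSoftwareIDLength matches the 256-character cap named in the commit message.
const maxSoftwareIDLength = 256

// validateSoftwareID rejects over-long values and any byte outside
// printable ASCII (assumed here to be 0x20..0x7E).
func validateSoftwareID(id string) error {
	if len(id) > maxSoftwareIDLength {
		return fmt.Errorf("software_id exceeds %d characters", maxSoftwareIDLength)
	}
	for i := 0; i < len(id); i++ {
		if id[i] < 0x20 || id[i] > 0x7e {
			return fmt.Errorf("software_id has a non-printable or non-ASCII byte at offset %d", i)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateSoftwareID("toolhive-1.0"))
	fmt.Println(validateSoftwareID("bad\nid"))
}
```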

Preserve URL-trailing punctuation in DCR log sanitiser

The query-stripping regex in sanitizeErrorForLog matches `[^\s"']+`, so a
URL ending with sentence punctuation (`.`, `,`, `)`, `]`, etc.) pulls
that punctuation into the URL match. url.Parse then absorbs it into the
raw query, and the Strip + Reassemble step drops it along with the rest
of the query — mangling the surrounding prose.

Split the trailing run of terminal punctuation off the match before
parsing, and re-append it verbatim after the query is stripped. URL
matches without a query are returned untouched so the pass is
idempotent for URLs that are already clean. New test cases cover commas,
periods, closing parens, mixed runs, and a Go http.Client-style quoted
URL.
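
The split-then-reappend step can be sketched like this. The terminator set and the function names are assumptions for illustration; the real sanitizeErrorForLog may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// urlPattern follows the `[^\s"']+` shape quoted in the commit message.
var urlPattern = regexp.MustCompile(`https?://[^\s"']+`)

// splitTrailingPunct peels a trailing run of sentence punctuation off a match.
// The terminator set below is an assumption.
func splitTrailingPunct(s string) (core, tail string) {
	end := len(s)
	for end > 0 && strings.IndexByte(".,;:!?)]}", s[end-1]) >= 0 {
		end--
	}
	return s[:end], s[end:]
}

// stripQueries removes the query from every URL in msg and re-appends the
// punctuation verbatim. URLs without a query are returned untouched.
func stripQueries(msg string) string {
	return urlPattern.ReplaceAllStringFunc(msg, func(m string) string {
		core, tail := splitTrailingPunct(m)
		u, err := url.Parse(core)
		if err != nil || u.RawQuery == "" {
			return m
		}
		u.RawQuery = ""
		return u.String() + tail
	})
}

func main() {
	fmt.Println(stripQueries("see https://as.example/register?token=abc."))
}
```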

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A misbehaving or malicious authorization server could echo arbitrary
content into the upstream's /register error response. handleHTTPResponse
read that body via io.ReadAll with no LimitReader, then embedded it
verbatim in the returned error — which downstream callers log. Cap the
read at 8 KiB (far larger than any conformant RFC 7591 error response)
so operator log volume cannot be inflated by a non-conformant upstream.
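
A minimal sketch of the capped read, using io.LimitReader from the standard library. The function names are illustrative and the surrounding handleHTTPResponse logic is not reproduced.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// maxErrorBodyBytes is the 8 KiB cap named in the commit message.
const maxErrorBodyBytes = 8 << 10

// readCapped reads at most maxErrorBodyBytes from r and discards the rest.
func readCapped(r io.Reader) (string, error) {
	b, err := io.ReadAll(io.LimitReader(r, maxErrorBodyBytes))
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// cappedString is a demo helper that feeds a string through readCapped.
func cappedString(s string) string {
	out, _ := readCapped(strings.NewReader(s))
	return out
}

func main() {
	huge := strings.Repeat("x", 1<<20)
	fmt.Println(len(cappedString(huge)))
}
```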

Addresses #5044 review finding F2 (HIGH).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The doc comment claimed remoteAuthLogContext mirrored the precedence of
remote.Handler.resolveClientCredentials, but the implementation skipped
the CachedCIMDClientID check entirely. For CIMD-authenticated workloads
the new DCR remediation Warn would have reported a stale or empty
client_id rather than the CIMD URL actually being sent on token refresh,
defeating the operator-correlation the field exists for.

Restore the documented precedence (CachedCIMDClientID >
CachedClientID > ClientID) and add a TestRemoteAuthLogContext case
covering the CIMD-wins path.
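
The documented precedence amounts to a first-non-empty selection. A sketch, with a hypothetical struct standing in for the relevant RemoteAuthConfig fields:

```go
package main

import "fmt"

// remoteAuthConfig is a hypothetical subset of the real config fields.
type remoteAuthConfig struct {
	CachedCIMDClientID string
	CachedClientID     string
	ClientID           string
}

// logClientID returns the first non-empty identifier in the order
// CachedCIMDClientID > CachedClientID > ClientID.
func logClientID(c remoteAuthConfig) string {
	for _, id := range []string{c.CachedCIMDClientID, c.CachedClientID, c.ClientID} {
		if id != "" {
			return id
		}
	}
	return ""
}

func main() {
	fmt.Println(logClientID(remoteAuthConfig{
		CachedCIMDClientID: "https://client.example/cimd.json",
		CachedClientID:     "dcr-123",
		ClientID:           "static",
	}))
}
```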

Addresses #5044 review findings F3 (HIGH) and F26
(MEDIUM, closed by the new test case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

isTransientNetworkError previously emitted the cached-DCR-client
remediation Warn from inside its classifier on every permanent 4xx.
A tight Token() loop hitting the same condition would spam the same
record on every call before the workload's Unauthenticated status
propagated. The same branch also fired for non-DCR workloads, which
saw a remediation telling them to "delete cached credentials" they
never had.

Strip the side effect from the classifier and emit the Warn from
markAsUnauthenticated, which already gates the close-monitoring
transition through stopOnce. The Warn now fires at most once per
state transition, is suppressed when no client_id context is
available, and reads honestly about the variability ("if this
workload uses cached DCR or CIMD credentials they may be stale").
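
The once-per-transition behaviour can be sketched with sync.Once. The monitor type and the injected warn callback are assumptions for the example; the real MonitoredTokenSource logs through slog and does more in this transition.

```go
package main

import (
	"fmt"
	"sync"
)

// monitor is a hypothetical stand-in for the monitored token source.
type monitor struct {
	stopOnce sync.Once
	clientID string
	warn     func(msg string) // stands in for a slog Warn call
}

// markAsUnauthenticated runs its body at most once, so a tight Token() loop
// cannot repeat the warning. It stays silent without client_id context.
func (m *monitor) markAsUnauthenticated() {
	m.stopOnce.Do(func() {
		if m.clientID == "" {
			return
		}
		m.warn("token refresh failed permanently; if this workload uses cached DCR or CIMD credentials they may be stale")
	})
}

func main() {
	m := &monitor{clientID: "dcr-123", warn: func(msg string) { fmt.Println("WARN:", msg) }}
	for i := 0; i < 3; i++ {
		m.markAsUnauthenticated()
	}
}
```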

Addresses #5044 review finding F5 (MEDIUM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related polish items in the authserver handlers package:

NewHandler did not validate that AuthorizationServerConfig (or its
embedded *fosite.Config) was non-nil. issuer() reached into the
config at request time, so a misconfigured caller would panic deep
inside an HTTP handler instead of failing at startup. Add the nil
check to the constructor and simplify issuer() to rely on the
constructor invariant. Pin the new invariant with
TestNewHandler_ErrorsOnNilConfig.
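
A sketch of moving the nil check into the constructor so that issuer() can rely on the invariant. All types here are simplified stand-ins; fositeConfig is not the real fosite.Config.

```go
package main

import (
	"errors"
	"fmt"
)

// fositeConfig and authServerConfig are hypothetical stand-ins.
type fositeConfig struct{ AccessTokenIssuer string }

type authServerConfig struct{ *fositeConfig }

type handler struct{ cfg *authServerConfig }

// newHandler fails at startup instead of letting a request-time call panic.
func newHandler(cfg *authServerConfig) (*handler, error) {
	if cfg == nil || cfg.fositeConfig == nil {
		return nil, errors.New("authorization server config must not be nil")
	}
	return &handler{cfg: cfg}, nil
}

// issuer relies on the constructor invariant and needs no nil guard.
func (h *handler) issuer() string { return h.cfg.AccessTokenIssuer }

func main() {
	_, err := newHandler(&authServerConfig{})
	fmt.Println(err)
	h, _ := newHandler(&authServerConfig{&fositeConfig{AccessTokenIssuer: "https://as.example"}})
	fmt.Println(h.issuer())
}
```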

The /oauth/register success log was promoted to Info even though the
operation is neither long-running nor exceptional. Demote back to
Debug; an audit-log path is the right home for the audit signal if
that becomes a requirement.

Addresses #5044 review findings F6 and F7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bundles ten review-feedback items in pkg/authserver/runner. Each
addresses internal-API hygiene or doc-comment drift in the DCR
resolver, with no behaviour change for callers.

Sanitizer hardening (F1, F19, F18, F23): sanitizeErrorForLog now
strips userinfo and fragment in addition to the query, since either
can carry credentials or tokens (https://user:pass@host,
implicit-flow #access_token=...). queryStrippingPattern matches
http/https case-insensitively per RFC 3986 §3.1 so HTTPS://...
cannot escape sanitisation. trimURLTrailingPunctuation switches to
strings.IndexByte to match the ASCII-only terminator set without
the rune-decoding overhead. Test cases added for each.
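
Stripping userinfo, query, and fragment from a single URL can be sketched with net/url. The function name redactURL is hypothetical and this covers only the per-URL step, not the pattern matching around it.

```go
package main

import (
	"fmt"
	"net/url"
)

// redactURL drops the components that can carry credentials or tokens:
// userinfo, query, and fragment. url.Parse lowercases the scheme itself.
func redactURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable url>"
	}
	u.User = nil
	u.RawQuery = ""
	u.ForceQuery = false
	u.Fragment = ""
	u.RawFragment = ""
	return u.String()
}

func main() {
	fmt.Println(redactURL("https://user:pass@host.example/cb?code=1#access_token=x"))
}
```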

Resolver error API (F4, F13, F20): the dcrStepRegister panic-recovery
branch no longer emits a duplicate slog.Error; the captured stack
travels with the wrapped *dcrStepError to the boundary log so a
single panic produces a single record. DCRStepError /
LogDCRStepError lowercased to dcrStepError / logDCRStepError since
no caller lives outside the package. logDCRStepError now no-ops on
nil err so the unknown-step branch cannot fire on a missing failure.
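
A sketch of the wrap-once, log-at-the-boundary shape. The field names, the wrapStep and stepOf helpers, and the sample error are assumptions for illustration; only the dcrStepError and logDCRStepError names come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
)

// errUpstream is a sample failure used only by this demo.
var errUpstream = errors.New("upstream returned 400")

// dcrStepError records which resolver phase failed.
type dcrStepError struct {
	Step string
	Err  error
}

func (e *dcrStepError) Error() string { return "dcr " + e.Step + ": " + e.Err.Error() }
func (e *dcrStepError) Unwrap() error { return e.Err }

// wrapStep tags err with a step; a nil err stays nil.
func wrapStep(step string, err error) error {
	if err == nil {
		return nil
	}
	return &dcrStepError{Step: step, Err: err}
}

// stepOf finds the step anywhere in the wrap chain.
func stepOf(err error) string {
	var se *dcrStepError
	if errors.As(err, &se) {
		return se.Step
	}
	return "unknown"
}

// logDCRStepError emits one record per failure and is a no-op on nil.
func logDCRStepError(logger *slog.Logger, err error) {
	if err == nil {
		return
	}
	logger.Error("dcr resolution failed", "step", stepOf(err), "error", err.Error())
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
	err := fmt.Errorf("upstream %q: %w", "github", wrapStep("register", errUpstream))
	logDCRStepError(logger, err) // one record, step=register
	logDCRStepError(logger, nil) // no record
}
```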

Resolution helpers (F11, F12, F25): applyResolution renamed to
consumeResolution to communicate that it is a one-shot state
transition (clearing DCRConfig is unconditional), not an idempotent
defaulting step. The applyResolutionToOAuth2Config doc now states
the paired-call invariant explicitly without referencing a specific
test.

Lifecycle docs (F21, F22): the per-instance dcrStore vs.
process-wide dcrFlight asymmetry is now stated on both sides, and
EmbeddedAuthServer.Close documents the future-Close hook for a
backend with handles.

Inline rules-file rationale (F24): production comments no longer
cite .claude/rules/... by path; the principle is inlined.

Addresses #5044 review findings F1, F4, F11, F12,
F13, F18, F19, F20, F21, F22, F23, F24, F25.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implements changes for issue #5040 (Phase 2 DCR CRD surface):

- Add DCRUpstreamConfig CRD type (discoveryUrl, registrationEndpoint,
  initialAccessTokenRef, softwareId, softwareStatement) and a new
  dcrConfig field on OAuth2UpstreamConfig so Kubernetes users can
  configure RFC 7591 Dynamic Client Registration on upstream providers.
- Make OAuth2UpstreamConfig.clientId optional and add CEL validation
  requiring exactly one of clientId or dcrConfig, and exactly one of
  discoveryUrl or registrationEndpoint inside dcrConfig. Mirror the
  checks at runtime via validateOAuth2DCRConfig for defense-in-depth.
- Wire the conversion in controllerutil/authserver.go so DCRConfig is
  mapped onto authserver.DCRUpstreamConfig. InitialAccessTokenRef is
  resolved to an env var (TOOLHIVE_UPSTREAM_DCR_INITIAL_ACCESS_TOKEN_*)
  populated from the referenced Secret, mirroring the ClientSecretRef
  pattern. Extract small helpers for env-var generation to keep
  cyclomatic complexity within lint limits.
- Regenerate zz_generated.deepcopy.go, CRD YAML manifests, and CRD API
  reference docs.
- Add table-driven validation tests covering DCR+ClientID conflict,
  both endpoints set, neither endpoint set, valid single-endpoint
  cases, and neither-auth configuration. Add conversion tests covering
  DCR discoveryUrl/registrationEndpoint paths and initial-access-token
  env var wiring.
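
The two exactly-one-of rules can be sketched as a runtime check. The struct shapes and messages are simplified assumptions; the real validateOAuth2DCRConfig and the CEL markers are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// dcrConfig and oauth2Upstream are hypothetical, trimmed-down shapes.
type dcrConfig struct {
	DiscoveryURL         string
	RegistrationEndpoint string
}

type oauth2Upstream struct {
	ClientID string
	DCR      *dcrConfig
}

// validateOAuth2DCR enforces both rules: exactly one of clientId or dcrConfig,
// and exactly one endpoint field inside dcrConfig.
func validateOAuth2DCR(c oauth2Upstream) error {
	if (c.ClientID != "") == (c.DCR != nil) {
		return errors.New("oauth2Config: exactly one of clientId or dcrConfig must be set")
	}
	if c.DCR != nil && (c.DCR.DiscoveryURL != "") == (c.DCR.RegistrationEndpoint != "") {
		return errors.New("oauth2Config.dcrConfig: exactly one of discoveryUrl or registrationEndpoint must be set")
	}
	return nil
}

func main() {
	fmt.Println(validateOAuth2DCR(oauth2Upstream{ClientID: "abc", DCR: &dcrConfig{DiscoveryURL: "https://as.example"}}))
	fmt.Println(validateOAuth2DCR(oauth2Upstream{DCR: &dcrConfig{DiscoveryURL: "https://as.example"}}))
}
```

Comparing the two booleans for equality rejects both the "both set" and the "neither set" case with a single condition.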

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address code review feedback

Fixed issues from code review of the DCR CRD surface commit:

- CRITICAL: CEL markers contained a Unicode smart quote (U+201D) that
  gofmt's doc-comment formatter reintroduced on every lint-fix. Rewrote
  both markers to use CEL's size(...) > 0 idiom instead of `!= ''`, which
  sidesteps the typographic normalization entirely and keeps regeneration
  idempotent. Verified no U+2018-U+201F characters remain in source or CRDs.
- HIGH: buildUpstreamRunConfig now calls the exported
  mcpv1beta1.ValidateOAuth2DCRConfig before producing a RunConfig, so
  malformed ClientID/DCRConfig pairs that bypass admission fail at
  reconcile time rather than at authserver startup. Error propagation
  threaded through BuildAuthServerRunConfig; split OIDC and OAuth2
  branches into helpers to stay under the gocyclo limit.
- HIGH: Added table case exercising validateUpstreamProvider rejection
  of an OIDC-typed provider whose OAuth2Config carries a DCRConfig.
- MEDIUM: Added kubebuilder CEL XValidation on UpstreamProviderConfig
  enforcing oidcConfig/oauth2Config mutual exclusivity paired to the
  declared type, closing the silent-pod-failure YAML-apply gap.
- MEDIUM: Added MaxLength=255 to SoftwareID and MaxLength=4096 to
  SoftwareStatement to prevent unbounded input from inflating CRs
  beyond etcd object limits.
- MEDIUM: Pinned the "neither ClientID nor DCRConfig" error assertion to
  the scoped `oauth2Config:` prefix; added a regression case exercising
  the non-DCR OAuth2 path (ClientID only, DCRConfig nil); added a new
  TestBuildAuthServerRunConfig_InvalidDCR suite covering all four
  invalid DCR/ClientID pairings at the conversion layer.
- MEDIUM: Renamed UpstreamDCRInitialAccessTokenEnvVar to
  UpstreamDCRInitialAccessTokenEnvVarPrefix and expanded the godoc on
  both prefix constants to show the resolved <prefix>_<PROVIDER> form.

All task lint/lint-fix/license-check pass; regenerated CRDs and
deepcopy are idempotent; affected unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address iteration-2 review feedback

Polish items raised in the second review pass:

- MEDIUM: Trim duplicate upstream name from reconcile-time DCR validation
  errors. Added scopedFieldPath() helper in
  cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go so
  ValidateOAuth2DCRConfig prepends a dotted prefix only when one is
  given, and the conversion call site now passes an empty prefix so
  BuildAuthServerRunConfig's outer "upstream %q: %w" wrap is the only
  mention of the upstream name. Strengthened
  TestBuildAuthServerRunConfig_InvalidDCR to assert the upstream name
  appears exactly once in the error string.
- MEDIUM: Make the UpstreamProviderConfig CEL rule fail closed for
  unrecognized future provider types. Restructured the rule from a
  binary discriminator into a chain of equality checks ending in an
  explicit `false`, and updated the message to "type must be 'oidc'
  or 'oauth2'; ...". Added a contributor-facing doc comment reminding
  future authors to extend both the rule and validateUpstreamProvider
  when adding a new UpstreamProviderType.
- MEDIUM: Refresh the godoc on extractUpstreamSecretRefs to describe
  the actual invariants that hold post-CEL: OIDC providers can only
  return a clientSecretRef; OAuth2 providers can return both
  independently; other (currently unreachable) types return nil/nil.
  Cross-linked to the CEL rule and noted that BuildAuthServerRunConfig
  is the reconcile-time backstop; callers should not rely on this
  helper to enforce those invariants.

Regenerated CRD YAMLs and crd-api.md prose. task lint, lint-fix,
license-check, and the affected unit tests pass; regeneration is
idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 3 sub-issue 1 of #5183. Define the persisted DCRCredentials value
type and the storage-level DCRCredentialStore interface in
pkg/authserver/storage/, and ship the in-process memory implementation
that single-replica deployments and unit tests use. The Redis backend
(sub-issue 2) and the wiring change (sub-issue 3) build on this.

DCRKey consolidation: chose option (a) from the issue — DCRKey and its
ScopesHash constructor move to pkg/authserver/storage/ so any future
backend hashes keys identically. The runner package keeps a
package-local type alias (type DCRKey = storage.DCRKey) and a var
binding for scopesHash so existing call sites compile unchanged; the
canonical form has a single source of truth.
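
The alias-plus-binding arrangement can be sketched in a single file, with storage-prefixed names standing in for the other package. The hashing scheme shown (sort, join with a space, SHA-256, hex) is purely an assumption; the commit message does not say how ScopesHash is computed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// storageDCRKey plays the role of storage.DCRKey in this one-file sketch.
type storageDCRKey struct {
	Issuer      string
	RedirectURI string
	ScopesHash  string
}

// storageScopesHash is an assumed canonical form: order-insensitive and
// computed on a copy so the caller's slice is not reordered.
func storageScopesHash(scopes []string) string {
	sorted := append([]string(nil), scopes...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, " ")))
	return hex.EncodeToString(sum[:])
}

// Runner side: an alias (not a new type) and a var binding, so existing call
// sites keep compiling while the definition lives in one place.
type DCRKey = storageDCRKey

var scopesHash = storageScopesHash

func main() {
	var k DCRKey = storageDCRKey{
		Issuer:     "https://as.example",
		ScopesHash: scopesHash([]string{"openid", "email"}),
	}
	fmt.Println(k.ScopesHash == storageScopesHash([]string{"email", "openid"}))
}
```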

DCRCredentials carries ClientSecretExpiresAt so the Redis backend can
drive a SetEX TTL without re-touching the value type or regenerating
mocks. The interface contract documents this as SHOULD honor when
backend-supported; MemoryStorage retains entries verbatim for the
process lifetime.

StoreDCRCredentials rejects nil creds and zero-valued Key.Issuer or
Key.RedirectURI with fosite.ErrInvalidRequest, matching the
StoreUpstreamTokens validation pattern. Stats reports
dcrCredentials count for parity with the other in-memory maps.
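
A compact sketch of the contract described so far: validation on store, a plain not-found miss on get, a copy on read, and a compile-time interface assertion. Method signatures are simplified (no context parameter) and the error values are stand-ins for fosite.ErrInvalidRequest and the package's ErrNotFound.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errInvalidRequest = errors.New("invalid request") // stand-in for fosite.ErrInvalidRequest
	errNotFound       = errors.New("not found")       // stand-in for ErrNotFound
)

type dcrKey struct{ Issuer, RedirectURI, ScopesHash string }

type dcrCredentials struct {
	Key          dcrKey
	ClientID     string
	ClientSecret string
}

// dcrCredentialStore is a simplified, assumed shape of the interface.
type dcrCredentialStore interface {
	StoreDCRCredentials(creds *dcrCredentials) error
	GetDCRCredentials(key dcrKey) (*dcrCredentials, error)
}

type memoryStore struct {
	mu sync.RWMutex
	m  map[dcrKey]dcrCredentials
}

// Compile-time check that memoryStore satisfies the interface.
var _ dcrCredentialStore = (*memoryStore)(nil)

func newMemoryStore() *memoryStore { return &memoryStore{m: make(map[dcrKey]dcrCredentials)} }

func (s *memoryStore) StoreDCRCredentials(creds *dcrCredentials) error {
	if creds == nil || creds.Key.Issuer == "" || creds.Key.RedirectURI == "" {
		return errInvalidRequest
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[creds.Key] = *creds // store a copy, not the caller's pointer
	return nil
}

func (s *memoryStore) GetDCRCredentials(key dcrKey) (*dcrCredentials, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	c, ok := s.m[key]
	if !ok {
		return nil, errNotFound
	}
	return &c, nil // pointer to a copy
}

func main() {
	s := newMemoryStore()
	fmt.Println(s.StoreDCRCredentials(nil))
	_, err := s.GetDCRCredentials(dcrKey{})
	fmt.Println(err)
}
```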

The runner-side DCRCredentialStore (Get/Put *DCRResolution) is left in
place as the thin adapter sub-issue 3 will swap. This sub-issue lands
the new storage-level interface, MemoryStorage implementation, and
regenerated mock without touching the wire-up.

DCR credentials are intentionally excluded from cleanupExpired:
RFC 7591 client registrations are long-lived and the authoritative
expiry signal is client_secret_expires_at, which the Redis backend
will honor as a SetEX TTL.

Implements the persistent Redis backend for DCR credentials behind the
DCRCredentialStore interface, so an authserver backed by Redis Sentinel
shares dynamic-client-registration state across replicas and survives
restarts. The wire format honors RFC 7591 client_secret_expires_at as a
Redis TTL when non-zero, falling back to a long-lived row when the
upstream did not assert an expiry.

Implements changes for issue #5184:
- Add KeyTypeDCR const and length-prefixed redisDCRKey helper that
  handles colons in RedirectURI without ambiguity
- Add RedisStorage.StoreDCRCredentials/GetDCRCredentials with JSON
  serialization, defensive copy via decode, and TTL derived from
  ClientSecretExpiresAt
- Add unit tests for redisDCRKey distinctness/determinism plus
  miniredis-backed tests for round-trip, overwrite, validation, defensive
  copy, TTL semantics, and concurrent access
- Add Redis Sentinel integration tests covering round-trip, distinct
  keys, overwrite, real-Redis TTL (-1 for never-expires), and concurrent
  access

Fixed issues from code review:
- MEDIUM: Drop input validation from RedisStorage.GetDCRCredentials so
  it matches MemoryStorage and the DCRCredentialStore interface contract
  (an unpopulated key is now a normal ErrNotFound miss). Fold the
  empty-key cases into TestRedisStorage_DCRCredentials_NotFound.
- MEDIUM: Apply pastExpiryDCRTTL = 1s when ClientSecretExpiresAt is in
  the past so already-expired DCR rows self-evict instead of living
  forever. Update docstring and rewrite the corresponding TTL subtest.
- MEDIUM: Assert no Redis row remains after each rejected
  StoreDCRCredentials call, mirroring the memory backend's
  Stats().DCRCredentials == 0 guard.
- LOW: Loosen the integration TTL upper-bound assertion by 1s so
  truncation/Redis second-granularity cannot flake the assertion.
- LOW: Drop the inaccurate SetEX rationale from StoreDCRCredentials'
  docstring (resolved as part of the past-expiry fix).
- LOW: Remove dcrIntegrationFixtureKey duplicate; integration tests
  now share dcrFixtureKey from redis_test.go (no build tag) so the
  canonical fixture has a single source of truth.
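
The "defensive copy via decode" property mentioned in the commits above follows from JSON serialisation: every read decodes into fresh memory. A sketch, with a hypothetical wire struct whose fields are assumptions rather than the PR's storedDCRCredentials layout:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storedCreds is a hypothetical wire type; the real field set may differ.
type storedCreds struct {
	ClientID              string   `json:"client_id"`
	ClientSecret          string   `json:"client_secret"`
	RedirectURIs          []string `json:"redirect_uris,omitempty"`
	ClientSecretExpiresAt int64    `json:"client_secret_expires_at,omitempty"`
}

// roundTrip encodes and decodes, as a store followed by a get would.
// The result shares no memory with the input.
func roundTrip(in *storedCreds) (*storedCreds, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return nil, err
	}
	var out storedCreds
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return &out, nil
}

func main() {
	in := &storedCreds{ClientID: "abc", RedirectURIs: []string{"https://app.example/cb"}}
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	out.RedirectURIs[0] = "tampered"
	fmt.Println(in.RedirectURIs[0])
}
```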
@tgrunnagle tgrunnagle changed the base branch from main to dcr-3a_issue_5183 May 5, 2026 14:35
@github-actions github-actions Bot added the size/L Large PR: 600-999 lines changed label May 5, 2026
@tgrunnagle tgrunnagle force-pushed the dcr-3a_issue_5183 branch from c0fed52 to cc92472 Compare May 5, 2026 15:50

Development

Successfully merging this pull request may close these issues.

Persistent DCRCredentialStore: Redis backend

1 participant