Skip to content

Add Redis Cluster mode support to auth server storage#5153

Merged
reyortiz3 merged 11 commits intomainfrom
worktree-redis-cluster-support-to-auth-server-storage
May 6, 2026
Merged

Add Redis Cluster mode support to auth server storage#5153
reyortiz3 merged 11 commits intomainfrom
worktree-redis-cluster-support-to-auth-server-storage

Conversation

@reyortiz3
Copy link
Copy Markdown
Contributor

@reyortiz3 reyortiz3 commented May 1, 2026

Summary

Managed Redis services like GCP Memorystore Cluster and AWS ElastiCache cluster mode enabled expose a single discovery endpoint — not a list of shard nodes. The operator needs a way to tell go-redis to use the Cluster protocol for that endpoint, rather than treating it as a standalone server.

This PR adds clusterMode: true as a flag on the existing addr field instead of introducing a separate clusterConfig struct with an addrs slice. When clusterMode is set, the runner calls redis.NewClusterClient({Addrs: []string{addr}}), which auto-discovers the full cluster topology from the single discovery endpoint.

Changes:

  • Storage layer (pkg/authserver/storage): replace ClusterConfig *ClusterConfig with ClusterMode bool on RedisConfig / RedisRunConfig; NewRedisStorage branches on cfg.ClusterMode and wraps cfg.Addr in the slice that ClusterOptions expects
  • Runner layer (pkg/authserver/runner): convertRedisRunConfig maps ClusterMode straight through; validation updated to: addr XOR sentinel, plus clusterMode requires addr
  • Operator CRD (cmd/thv-operator/api/v1beta1): RedisStorageConfig gains clusterMode bool; CEL rules revert to a simple XOR (addr vs sentinelConfig) plus a second rule for clusterMode requiring addr; RedisClusterConfig struct removed
  • Generated artifacts: deepcopy, CRD YAML, and Helm chart templates regenerated

Closes #5010

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

New unit tests cover: cluster mode + sentinel conflict, cluster mode without addr, cluster mode happy-path at all three layers (storage, runner, operator).

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

clusterMode is a new optional bool field (defaults to false). Adding it is a backward-compatible change. The CEL XOR rule for addr vs sentinelConfig is equivalent to the previous rule for the existing combinations.

Changes

File Change
pkg/authserver/storage/redis.go Replace ClusterConfig struct with ClusterMode bool; update validateConfig and NewRedisStorage
pkg/authserver/storage/config.go Replace ClusterRunConfig / ClusterConfig field with ClusterMode bool on RedisRunConfig
pkg/authserver/runner/embeddedauthserver.go Map ClusterMode in convertRedisRunConfig; simplified validation
cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go Add ClusterMode bool; remove RedisClusterConfig struct; update CEL rules
cmd/thv-operator/pkg/controllerutil/authserver.go Map ClusterMode; remove validateRedisConnectionMode helper
zz_generated.deepcopy.go, CRD YAML × 4 Regenerated artifacts

Does this introduce a user-facing change?

Yes. Kubernetes operators can now enable Redis Cluster protocol by setting spec.authServer.storage.redis.clusterMode: true alongside the existing addr field. Set addr to the single discovery endpoint of the managed Redis Cluster service (e.g., GCP Memorystore Cluster discovery IP, AWS ElastiCache configuration endpoint).

Generated with Claude Code

Managed Redis services like GCP Memorystore Cluster and AWS ElastiCache
Serverless use the Redis Cluster protocol rather than standalone or Sentinel
connections, making it impossible to use them without this third mode.

- Add ClusterConfig/ClusterRunConfig types to storage and runner layers
- Wire cluster client creation via redis.NewClusterClient in NewRedisStorage
- Update validateConfig, convertRedisRunConfig, and buildStorageRunConfig to
  enforce three-way mutual exclusion (addr / sentinel / cluster)
- Add RedisClusterConfig CRD type with MinItems=1 validation; update CEL rule
  to require exactly one of the three modes
- Regenerate deepcopy, CRD YAML, and Helm chart templates
- Add unit tests for all new validation and happy-path code paths

Closes #5010

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the size/L Large PR: 600-999 lines changed label May 1, 2026
@reyortiz3 reyortiz3 marked this pull request as draft May 1, 2026 15:08
@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

❌ Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.70%. Comparing base (59e86a5) to head (133c2e2).

Files with missing lines Patch % Lines
pkg/authserver/storage/redis.go 90.90% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5153      +/-   ##
==========================================
- Coverage   67.76%   67.70%   -0.06%     
==========================================
  Files         607      607              
  Lines       62132    62153      +21     
==========================================
- Hits        42104    42083      -21     
- Misses      16861    16897      +36     
- Partials     3167     3173       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ClusterConfig{Addrs []string} was a misleading abstraction: GCP Memorystore
Cluster and AWS ElastiCache cluster mode both expose a single discovery endpoint,
so the slice always held exactly one address and was redundant with the existing
Addr field.

Replace ClusterConfig with a ClusterMode bool flag that is set alongside the
existing Addr field. go-redis NewClusterClient still receives the single endpoint
as []string{addr} and auto-discovers the full cluster topology from there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/L Large PR: 600-999 lines changed size/M Medium PR: 300-599 lines changed labels May 1, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 1, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 1, 2026
@reyortiz3 reyortiz3 marked this pull request as ready for review May 1, 2026 19:19
@reyortiz3 reyortiz3 requested a review from amirejaz as a code owner May 1, 2026 19:19
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 1, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 2, 2026
@github-actions github-actions Bot removed the size/M Medium PR: 300-599 lines changed label May 4, 2026
@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label May 5, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 5, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 6, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 6, 2026
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes — none are blockers. Two of them are about pre-existing code that this PR now exposes via cluster mode, and one is doc drift the PR introduced.

1. Lua script declares 2 KEYS but writes to undeclared user-set keys (pre-existing)

pkg/authserver/storage/redis.gostoreUpstreamTokensScript (lines 894-982) and its call site at line 1067-1073.

The script declares KEYS[1] (per-provider token key) and KEYS[2] (session index set), but the body at lines 974/978 does redis.call('SREM'/'SADD', setPrefix .. userID, ...) against keys built from ARGV[4]. On cluster this only works because setPrefix is s.keyPrefix and embeds the {ns:name} hash tag, so the user-set keys happen to land on the same slot — that's a load-bearing invariant the script's contract doesn't state.

Could you either pass the resolved user-set keys as KEYS[3]/KEYS[4] (empty string when no old/new user), or add a comment at the top of the script asserting that every dynamically-built key MUST inherit the {ns:name} hash tag baked into setPrefix? Otherwise this is the kind of thing that silently regresses on a future refactor while continuing to pass on standalone Redis. Happy for it to be a follow-up rather than blocking this PR.

2. MGet/Del over SMEMBERS results assumes keyset is well-formed (pre-existing)

pkg/authserver/storage/redis.go at lines 1117 (GetAllUpstreamTokens), 1195 (DeleteUpstreamTokens), and 1234 (GetLatestUpstreamTokensForUser).

providerKeys / members come straight out of SMEMBERS and are then fed into MGet / Del with many keys. On cluster these succeed only when all keys share a slot. Today everything written into those sets reuses s.keyPrefix (which has the {ns:name} hash tag), so it works — but there's no defensive check.

A stray un-prefixed member (legacy data, an external admin op, a test fixture) would surface as CROSSSLOT on cluster while passing on standalone. Worth filtering members to entries starting with s.keyPrefix before the multi-key call and warn-logging anything dropped, just as belt-and-braces. Also fine as a follow-up.

3. CRD tls field doc drift (PR-introduced)

cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go:715.

The storage-layer comments on RedisConfig.TLS (pkg/authserver/storage/redis.go:77) and RedisRunConfig.TLS (pkg/authserver/storage/config.go:102) were updated to "Redis/Valkey master or cluster nodes", but this CRD field still reads "master" only. Since this is the comment that ends up in crd-api.md and kubectl explain, a cluster-mode user will read it and reasonably wonder where to configure TLS for cluster-node connections — when in fact this same field already does it.

Could you bring this in line ("master or cluster nodes") and regenerate the CRD YAML / docs?

Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the AI comments were nits, approving

@reyortiz3 reyortiz3 merged commit 20ced20 into main May 6, 2026
46 checks passed
@reyortiz3 reyortiz3 deleted the worktree-redis-cluster-support-to-auth-server-storage branch May 6, 2026 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Redis Cluster mode support to auth server storage

2 participants