Add Redis Cluster mode support to auth server storage #5153
Managed Redis services like GCP Memorystore Cluster and AWS ElastiCache Serverless use the Redis Cluster protocol rather than standalone or Sentinel connections, making it impossible to use them without this third mode.

- Add ClusterConfig/ClusterRunConfig types to storage and runner layers
- Wire cluster client creation via redis.NewClusterClient in NewRedisStorage
- Update validateConfig, convertRedisRunConfig, and buildStorageRunConfig to enforce three-way mutual exclusion (addr / sentinel / cluster)
- Add RedisClusterConfig CRD type with MinItems=1 validation; update CEL rule to require exactly one of the three modes
- Regenerate deepcopy, CRD YAML, and Helm chart templates
- Add unit tests for all new validation and happy-path code paths

Closes #5010

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5153      +/-   ##
==========================================
- Coverage   67.76%   67.70%   -0.06%
==========================================
  Files         607      607
  Lines       62132    62153      +21
==========================================
- Hits        42104    42083      -21
- Misses      16861    16897      +36
- Partials     3167     3173       +6
ClusterConfig{Addrs []string} was a misleading abstraction: GCP Memorystore
Cluster and AWS ElastiCache cluster mode both expose a single discovery endpoint,
so the slice always held exactly one address and was redundant with the existing
Addr field.
Replace ClusterConfig with a ClusterMode bool flag that is set alongside the
existing Addr field. go-redis NewClusterClient still receives the single endpoint
as []string{addr} and auto-discovers the full cluster topology from there.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jhrozek
left a comment
A few notes — none are blockers. Two of them are about pre-existing code that this PR now exposes via cluster mode, and one is doc drift the PR introduced.
1. Lua script declares 2 KEYS but writes to undeclared user-set keys (pre-existing)
pkg/authserver/storage/redis.go: storeUpstreamTokensScript (lines 894-982) and its call site at lines 1067-1073.
The script declares KEYS[1] (per-provider token key) and KEYS[2] (session index set), but the body at lines 974/978 does redis.call('SREM'/'SADD', setPrefix .. userID, ...) against keys built from ARGV[4]. On cluster this only works because setPrefix is s.keyPrefix and embeds the {ns:name} hash tag, so the user-set keys happen to land on the same slot — that's a load-bearing invariant the script's contract doesn't state.
Could you either pass the resolved user-set keys as KEYS[3]/KEYS[4] (empty string when no old/new user), or add a comment at the top of the script asserting that every dynamically-built key MUST inherit the {ns:name} hash tag baked into setPrefix? Otherwise this is the kind of thing that silently regresses on a future refactor while continuing to pass on standalone Redis. Happy for it to be a follow-up rather than blocking this PR.
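To make the slot invariant concrete, here is a minimal sketch of how Redis Cluster maps a key to a slot: CRC16 (XMODEM variant) modulo 16384, hashing only the content of the first non-empty {...} hash tag when one is present. The key names and the "thv:auth:" prefix layout in main are hypothetical and are not taken from redis.go; only the slot algorithm follows the Redis Cluster specification.

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 implements CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0.
// This is the variant the Redis Cluster specification uses for key hashing.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashSlot returns the cluster slot (0..16383) for key. If the key contains
// a '{' followed later by a '}' with at least one character in between,
// only that substring (the hash tag) is hashed.
func hashSlot(key string) uint16 {
	if open := strings.IndexByte(key, '{'); open >= 0 {
		if end := strings.IndexByte(key[open+1:], '}'); end > 0 {
			key = key[open+1 : open+1+end]
		}
	}
	return crc16([]byte(key)) % 16384
}

func main() {
	// Hypothetical prefix carrying a {ns:name} hash tag.
	prefix := "thv:auth:{default:demo}:"
	keys := []string{
		prefix + "upstream:session-1:github", // stands in for KEYS[1]
		prefix + "upstream:idx:session-1",    // stands in for KEYS[2]
		prefix + "user:upstream:alice",       // built inside the script from ARGV
		"user:upstream:alice",                // same key without the tag
	}
	for _, k := range keys {
		fmt.Printf("%-45s slot %d\n", k, hashSlot(k))
	}
}
```

The first three keys hash only "default:demo" and therefore share a slot; the untagged key is hashed in full and is not guaranteed to land on the same one, which is the regression the comment warns about.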
2. MGet/Del over SMEMBERS results assumes keyset is well-formed (pre-existing)
pkg/authserver/storage/redis.go at lines 1117 (GetAllUpstreamTokens), 1195 (DeleteUpstreamTokens), and 1234 (GetLatestUpstreamTokensForUser).
providerKeys / members come straight out of SMEMBERS and are then fed into MGet / Del with many keys. On cluster these succeed only when all keys share a slot. Today everything written into those sets reuses s.keyPrefix (which has the {ns:name} hash tag), so it works — but there's no defensive check.
A stray un-prefixed member (legacy data, an external admin op, a test fixture) would surface as CROSSSLOT on cluster while passing on standalone. Worth filtering members to entries starting with s.keyPrefix before the multi-key call and warn-logging anything dropped, just as belt-and-braces. Also fine as a follow-up.
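The suggested filter could look roughly like the sketch below. It is an illustration under assumptions, not the code in redis.go: splitByPrefix is a hypothetical helper, the prefix and member values are made up, and the real storage layer would pass only the kept keys to its MGet/Del call.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// splitByPrefix partitions set members into keys that start with prefix
// (and so carry its hash tag) and everything else. Only the first group is
// safe to pass to a single multi-key command on a cluster.
func splitByPrefix(members []string, prefix string) (kept, dropped []string) {
	for _, m := range members {
		if strings.HasPrefix(m, prefix) {
			kept = append(kept, m)
		} else {
			dropped = append(dropped, m)
		}
	}
	return kept, dropped
}

func main() {
	// Hypothetical key prefix with a {ns:name} hash tag.
	prefix := "thv:auth:{default:demo}:"
	// Stand-in for the result of SMEMBERS on an index set.
	members := []string{
		prefix + "upstream:s1:github",
		"legacy:upstream:s1:gitlab", // stray member without the prefix
		prefix + "upstream:s1:google",
	}

	kept, dropped := splitByPrefix(members, prefix)
	if len(dropped) > 0 {
		log.Printf("ignoring %d set member(s) without expected prefix: %q", len(dropped), dropped)
	}
	fmt.Println("keys for multi-key call:", kept)
}
```

Dropping rather than failing keeps reads working when a set contains legacy entries, at the cost of silently skipping them, which is why the warn-log matters.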
3. CRD tls field doc drift (PR-introduced)
cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go:715.
The storage-layer comments on RedisConfig.TLS (pkg/authserver/storage/redis.go:77) and RedisRunConfig.TLS (pkg/authserver/storage/config.go:102) were updated to "Redis/Valkey master or cluster nodes", but this CRD field still reads "master" only. Since this is the comment that ends up in crd-api.md and kubectl explain, a cluster-mode user will read it and reasonably wonder where to configure TLS for cluster-node connections — when in fact this same field already does it.
Could you bring this in line ("master or cluster nodes") and regenerate the CRD YAML / docs?
jhrozek
left a comment
There was a problem hiding this comment.
the AI comments were nits, approving
Summary
Managed Redis services like GCP Memorystore Cluster and AWS ElastiCache cluster mode enabled expose a single discovery endpoint — not a list of shard nodes. The operator needs a way to tell go-redis to use the Cluster protocol for that endpoint, rather than treating it as a standalone server.
This PR adds clusterMode: true as a flag on the existing addr field instead of introducing a separate clusterConfig struct with an addrs slice. When clusterMode is set, the runner calls redis.NewClusterClient(&redis.ClusterOptions{Addrs: []string{addr}}), which auto-discovers the full cluster topology from the single discovery endpoint.

Changes:
- pkg/authserver/storage: replace ClusterConfig *ClusterConfig with ClusterMode bool on RedisConfig/RedisRunConfig; NewRedisStorage branches on cfg.ClusterMode and wraps cfg.Addr in the slice that ClusterOptions expects
- pkg/authserver/runner: convertRedisRunConfig maps ClusterMode straight through; validation updated to: addr XOR sentinel, plus clusterMode requires addr
- cmd/thv-operator/api/v1beta1: RedisStorageConfig gains clusterMode bool; CEL rules revert to a simple XOR (addr vs sentinelConfig) plus a second rule for clusterMode requiring addr; RedisClusterConfig struct removed

Closes #5010
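The validation described above (addr XOR sentinel, plus clusterMode requires addr) can be sketched as follows. This is a simplified stand-in rather than the PR's validateConfig: the redisConfig type, its field names, and the error values are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// redisConfig is a simplified, hypothetical stand-in for the real config type.
type redisConfig struct {
	Addr          string
	SentinelAddrs []string // stands in for the sentinel configuration
	ClusterMode   bool
}

// Hypothetical error values; the real messages may differ.
var (
	errClusterWithSentinel = errors.New("clusterMode cannot be combined with sentinel")
	errClusterNeedsAddr    = errors.New("clusterMode requires addr")
	errExactlyOne          = errors.New("exactly one of addr or sentinel must be set")
)

// validateRedisConfig enforces: addr XOR sentinel, and clusterMode only
// together with addr. Cluster-specific checks run first so the caller gets
// the most specific error.
func validateRedisConfig(cfg redisConfig) error {
	hasAddr := cfg.Addr != ""
	hasSentinel := len(cfg.SentinelAddrs) > 0
	switch {
	case cfg.ClusterMode && hasSentinel:
		return errClusterWithSentinel
	case cfg.ClusterMode && !hasAddr:
		return errClusterNeedsAddr
	case hasAddr == hasSentinel:
		return errExactlyOne
	}
	return nil
}

func main() {
	cases := []redisConfig{
		{Addr: "10.0.0.5:6379", ClusterMode: true},         // cluster via discovery endpoint
		{Addr: "10.0.0.5:6379"},                            // standalone
		{SentinelAddrs: []string{"s1:26379"}},              // sentinel
		{ClusterMode: true},                                // cluster without addr
		{SentinelAddrs: []string{"s1:26379"}, ClusterMode: true}, // cluster + sentinel
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, validateRedisConfig(c))
	}
}
```

Keeping clusterMode as a modifier on addr means the existing two-way XOR stays intact and the cluster rule is a single extra condition, which matches the simplification the PR describes for the CEL rules.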
Type of change
Test plan
- task test
- task lint-fix

New unit tests cover: cluster mode + sentinel conflict, cluster mode without addr, and the cluster mode happy path at all three layers (storage, runner, operator).
API Compatibility
v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

clusterMode is a new optional bool field (defaults to false). Adding it is a backward-compatible change. The CEL XOR rule for addr vs sentinelConfig is equivalent to the previous rule for the existing combinations.

Changes
- pkg/authserver/storage/redis.go: replace ClusterConfig struct with ClusterMode bool; update validateConfig and NewRedisStorage
- pkg/authserver/storage/config.go: replace ClusterRunConfig/ClusterConfig field with ClusterMode bool on RedisRunConfig
- pkg/authserver/runner/embeddedauthserver.go: map ClusterMode in convertRedisRunConfig; simplified validation
- cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go: add ClusterMode bool; remove RedisClusterConfig struct; update CEL rules
- cmd/thv-operator/pkg/controllerutil/authserver.go: ClusterMode; remove validateRedisConnectionMode helper
- zz_generated.deepcopy.go, CRD YAML × 4

Does this introduce a user-facing change?
Yes. Kubernetes operators can now enable Redis Cluster protocol by setting spec.authServer.storage.redis.clusterMode: true alongside the existing addr field. Set addr to the single discovery endpoint of the managed Redis Cluster service (e.g., GCP Memorystore Cluster discovery IP, AWS ElastiCache configuration endpoint).

Generated with Claude Code