Commit f0f0c50

Merge 8d7c97e into 02c003c
2 parents 02c003c + 8d7c97e commit f0f0c50

1 file changed

+384
-0
lines changed

@@ -0,0 +1,384 @@
# Valkey Integration for ToolHive Operator

## Overview

This proposal outlines a secure-by-default approach for integrating Valkey (Redis-compatible) session storage into the ToolHive operator. The design prioritizes ease of use with automatic security configuration, enabling distributed session storage for horizontal scaling of MCP servers.

## Design Philosophy

1. **Secure by default** - All security features enabled automatically
2. **Zero configuration** - Works out of the box with sensible defaults
3. **Simple sizing** - Just pick small/medium/large
4. **Automatic management** - Operator handles all complexity

## Proposed Architecture: SessionStorage CRD

A minimal CRD that automatically configures everything:

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: SessionStorage
metadata:
  name: shared-storage
  namespace: toolhive-system
spec:
  # Just pick a size - everything else is automatic
  size: medium  # small, medium, or large

  # Optional: Use external Redis instead of managed Valkey
  external:
    url: redis://external-redis:6379  # Optional
    secretRef: external-redis-auth    # Optional

status:
  phase: Ready
  connectionSecret: shared-storage-connection  # Auto-generated
  message: "Storage is ready and secured"
```
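
To make the shape of the spec concrete, the following is a minimal sketch of how the fields above might be modelled and validated in Go. The type names `SessionStorageSpec` and `ExternalStorage` and the function `validateSpec` are assumptions for illustration, not the operator's actual API. The sketch assumes that `external`, when present, takes precedence over `size`, which matches how the controller later in this proposal branches on `Spec.External`.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// ExternalStorage and SessionStorageSpec are hypothetical Go shapes for the
// YAML above; the real CRD types may differ.
type ExternalStorage struct {
	URL       string `json:"url"`
	SecretRef string `json:"secretRef,omitempty"`
}

type SessionStorageSpec struct {
	Size     string           `json:"size,omitempty"`
	External *ExternalStorage `json:"external,omitempty"`
}

// validateSpec checks the external URL when external mode is used and
// otherwise requires one of the three named sizes.
func validateSpec(spec SessionStorageSpec) error {
	if spec.External != nil {
		u, err := url.Parse(spec.External.URL)
		if err != nil {
			return fmt.Errorf("external.url: %w", err)
		}
		if u.Scheme != "redis" && u.Scheme != "rediss" {
			return fmt.Errorf("external.url: unsupported scheme %q", u.Scheme)
		}
		if u.Host == "" {
			return errors.New("external.url: missing host")
		}
		return nil
	}
	switch spec.Size {
	case "small", "medium", "large":
		return nil
	default:
		return fmt.Errorf("size: must be small, medium, or large, got %q", spec.Size)
	}
}
```

Calling `validateSpec(SessionStorageSpec{Size: "medium"})` returns `nil`, while an unknown size or a non-Redis URL scheme yields a descriptive error that could be surfaced in `status.message`.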

## Automatic Security Implementation

### How It Works

When you create a SessionStorage resource, the operator automatically:

1. **Generates strong authentication** - Creates a random 32-character password
2. **Configures network isolation** - Only ToolHive proxies can connect
3. **Enables persistence** - Data survives pod restarts (except "small" size)
4. **Sets up monitoring** - Exports metrics for observability
5. **Creates connection secret** - MCPServers just reference this
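
Step 1 in the list above could be implemented along these lines. This is a sketch only: the function name `generatePassword` and the alphanumeric alphabet are assumptions, chosen so the password needs no quoting in a Valkey config file or escaping in a URL. It draws from `crypto/rand` rather than `math/rand` so the output is suitable as a credential.

```go
package main

import (
	"crypto/rand"
	"math/big"
)

// passwordAlphabet is an assumed character set; the proposal only specifies
// "random 32-character password".
const passwordAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

// generatePassword returns n characters drawn uniformly from passwordAlphabet
// using the operating system's cryptographic random source.
func generatePassword(n int) (string, error) {
	out := make([]byte, n)
	limit := big.NewInt(int64(len(passwordAlphabet)))
	for i := range out {
		// rand.Int returns a uniform value in [0, limit), avoiding modulo bias.
		idx, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		out[i] = passwordAlphabet[idx.Int64()]
	}
	return string(out), nil
}
```

A call such as `generatePassword(32)` would produce the value stored in the auth secret.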

### Size Presets

```go
// Simple presets that configure everything automatically
var sizePresets = map[string]Config{
	"small": { // Development
		Replicas:   1,
		Memory:     "256Mi",
		CPU:        "100m",
		Disk:       "1Gi",
		Persistent: false, // Ephemeral for dev
	},
	"medium": { // Staging/Small Production
		Replicas:   1,
		Memory:     "512Mi",
		CPU:        "250m",
		Disk:       "5Gi",
		Persistent: true,
	},
	"large": { // Production HA
		Replicas:   3, // Automatic HA cluster
		Memory:     "1Gi",
		CPU:        "500m",
		Disk:       "10Gi",
		Persistent: true,
	},
}
```
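
The preset map references a `Config` type that this proposal does not define. A minimal sketch of that type and a lookup helper might look as follows; the struct layout and the name `resolvePreset` are assumptions inferred from the fields used above. The helper rejects unknown sizes explicitly, because a plain map lookup would silently return a zero-value `Config` with zero replicas.

```go
package main

import "fmt"

// Config mirrors the fields used by the presets; the struct itself is an
// assumption, since the proposal does not define it.
type Config struct {
	Replicas   int
	Memory     string
	CPU        string
	Disk       string
	Persistent bool
}

// presets repeats the values from the proposal in compact form.
var presets = map[string]Config{
	"small":  {Replicas: 1, Memory: "256Mi", CPU: "100m", Disk: "1Gi", Persistent: false},
	"medium": {Replicas: 1, Memory: "512Mi", CPU: "250m", Disk: "5Gi", Persistent: true},
	"large":  {Replicas: 3, Memory: "1Gi", CPU: "500m", Disk: "10Gi", Persistent: true},
}

// resolvePreset returns the preset for size, rejecting unknown names instead
// of falling back to a zero-value Config.
func resolvePreset(size string) (Config, error) {
	cfg, ok := presets[size]
	if !ok {
		return Config{}, fmt.Errorf("unknown size %q (want small, medium, or large)", size)
	}
	return cfg, nil
}
```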

### Automatic Security Features

The operator automatically configures:

#### 1. Authentication
```go
func (r *SessionStorageReconciler) ensureSecurity(ctx context.Context,
	storage *mcpv1alpha1.SessionStorage) error {

	// Generate auth secret if it doesn't exist
	authSecret := r.generateAuthSecret(storage)

	// Configure Valkey with auth. Comments sit on their own lines because
	// the Redis/Valkey config format does not support inline comments.
	valkeyConfig := fmt.Sprintf(`
requirepass %s
maxmemory-policy allkeys-lru
# Disable RDB snapshots, use AOF only
save ""
appendonly yes
appendfsync everysec
`, authSecret.Password)

	// Create ConfigMap with Valkey config.
	// NOTE: the rendered config contains the password in plaintext, so a
	// Secret is a safer home for it than a ConfigMap.
	return r.createValkeyConfig(ctx, storage, valkeyConfig)
}
```
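
As a variation on the function above, the config text could be rendered by a small pure function that is easy to unit test. The sketch below is an assumption-laden illustration: `renderValkeyConfig` is a hypothetical name, and the `persistent` flag (turning AOF off for the ephemeral "small" preset) is one possible reading of the presets rather than something this proposal specifies. It also rejects passwords containing whitespace or quotes, which would otherwise change how the `requirepass` line is parsed.

```go
package main

import (
	"fmt"
	"strings"
)

// renderValkeyConfig builds the config file body. Comments are emitted on
// their own lines because the config format has no inline comments.
func renderValkeyConfig(password string, persistent bool) (string, error) {
	if password == "" || strings.ContainsAny(password, " \t\r\n\"'") {
		return "", fmt.Errorf("password must be non-empty and free of whitespace and quotes")
	}
	var b strings.Builder
	fmt.Fprintf(&b, "requirepass %s\n", password)
	b.WriteString("maxmemory-policy allkeys-lru\n")
	b.WriteString("# Disable RDB snapshots\n")
	b.WriteString("save \"\"\n")
	if persistent {
		// Append-only file, flushed to disk once per second.
		b.WriteString("appendonly yes\n")
		b.WriteString("appendfsync everysec\n")
	} else {
		// Assumed behavior for ephemeral presets: no on-disk persistence.
		b.WriteString("appendonly no\n")
	}
	return b.String(), nil
}
```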

#### 2. Network Isolation
```yaml
# Automatically created NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: {storage-name}-valkey
spec:
  podSelector:
    matchLabels:
      toolhive.stacklok.dev/storage: {storage-name}
  ingress:
  - from:
    - podSelector:
        matchLabels:
          toolhive.stacklok.dev/component: proxy
    ports:
    - port: 6379
```
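
The effect of the policy above can be illustrated with a small model of `matchLabels` semantics. This is not how Kubernetes evaluates policies internally; `matchesSelector` and `ingressAllowed` are hypothetical helpers that only show which source pods the rule admits. Note that a `podSelector` without a `namespaceSelector` matches only pods in the NetworkPolicy's own namespace, so the model assumes the proxy and Valkey pods share a namespace.

```go
package main

// matchesSelector reports whether podLabels satisfy a matchLabels selector:
// every selector key must be present with an equal value. An empty selector
// matches every pod.
func matchesSelector(selector, podLabels map[string]string) bool {
	for k, want := range selector {
		if got, ok := podLabels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// ingressAllowed models the policy for pods in the same namespace: traffic
// is admitted only from pods carrying the proxy label, on port 6379.
func ingressAllowed(sourceLabels map[string]string, port int) bool {
	proxySelector := map[string]string{"toolhive.stacklok.dev/component": "proxy"}
	return port == 6379 && matchesSelector(proxySelector, sourceLabels)
}
```

Under this model, a pod labelled `toolhive.stacklok.dev/component: proxy` reaches port 6379, while any other pod, or the same pod on a different port, is refused.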

#### 3. Connection Secret
```yaml
# Automatically created for MCPServers to use
apiVersion: v1
kind: Secret
metadata:
  name: {storage-name}-connection
data:
  REDIS_URL: base64(redis://:password@service:6379)
  REDIS_PASSWORD: base64(generated-password)
  SESSION_STORAGE_TYPE: base64(redis)
```
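
The `base64(...)` notation above is shorthand for the encoded values a manifest would contain. The following sketch shows one way to assemble them; `buildRedisURL` and `encodeSecretData` are hypothetical helpers. Using `net/url` ensures that any reserved characters in the password are percent-encoded. When the Secret is created through the Go client instead of a YAML manifest, `Secret.Data` takes raw bytes and the base64 step happens during serialization, so `encodeSecretData` would not be needed there.

```go
package main

import (
	"encoding/base64"
	"net"
	"net/url"
	"strconv"
)

// buildRedisURL returns redis://:<password>@<host>:<port> with the password
// percent-encoded where necessary. The username is left empty.
func buildRedisURL(password, host string, port int) string {
	u := url.URL{
		Scheme: "redis",
		User:   url.UserPassword("", password),
		Host:   net.JoinHostPort(host, strconv.Itoa(port)),
	}
	return u.String()
}

// encodeSecretData base64-encodes each value, as required under the "data"
// key of a Secret manifest.
func encodeSecretData(values map[string]string) map[string]string {
	out := make(map[string]string, len(values))
	for k, v := range values {
		out[k] = base64.StdEncoding.EncodeToString([]byte(v))
	}
	return out
}
```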

## Usage Examples

### 1. Simple Development Setup

```yaml
# Just this - operator handles everything else
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: SessionStorage
metadata:
  name: dev-storage
spec:
  size: small
---
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: my-server
spec:
  image: my-mcp:latest
  sessionStorageRef: dev-storage  # That's it!
```

### 2. Production Setup

```yaml
# Still simple - just pick "large" for HA
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: SessionStorage
metadata:
  name: prod-storage
  namespace: toolhive-system
spec:
  size: large  # Automatic 3-node HA cluster
---
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: production-server
spec:
  image: my-mcp:latest
  sessionStorageRef:
    name: prod-storage
    namespace: toolhive-system
```

### 3. Using External Redis

```yaml
# For when you have existing Redis infrastructure
apiVersion: v1
kind: Secret
metadata:
  name: external-redis-auth
data:
  password: base64(your-redis-password)
---
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: SessionStorage
metadata:
  name: external-storage
spec:
  external:
    url: redis://my-redis.example.com:6379
    secretRef: external-redis-auth
---
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: my-server
spec:
  image: my-mcp:latest
  sessionStorageRef: external-storage
```

## Implementation Details

### SessionStorage Controller

```go
func (r *SessionStorageReconciler) Reconcile(ctx context.Context,
	req ctrl.Request) (ctrl.Result, error) {

	storage := &mcpv1alpha1.SessionStorage{}
	if err := r.Get(ctx, req.NamespacedName, storage); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Handle external storage differently
	if storage.Spec.External != nil {
		return r.reconcileExternal(ctx, storage)
	}

	// For managed Valkey, do everything automatically
	steps := []func(context.Context, *mcpv1alpha1.SessionStorage) error{
		r.ensureAuthSecret,       // Generate password
		r.ensureNetworkPolicy,    // Create network isolation
		r.ensureValkeyConfig,     // Create config with auth
		r.ensureValkeyDeployment, // Deploy Valkey
		r.ensureValkeyService,    // Create service
		r.ensureConnectionSecret, // Create connection info
		r.updateStatus,           // Update CRD status
	}

	for _, step := range steps {
		if err := step(ctx, storage); err != nil {
			return ctrl.Result{}, err
		}
	}

	return ctrl.Result{}, nil
}
```
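
One possible refinement of the step loop above is to give each step a name, so a failure reports which stage broke. The sketch below uses only the standard library and stands apart from controller-runtime; the `step` type, `runSteps`, and the demo helper `runDemo` are all hypothetical names introduced for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// step pairs a reconcile stage with a name for error reporting.
type step struct {
	name string
	run  func(context.Context) error
}

// runSteps executes steps in order and stops at the first failure, wrapping
// the error with the name of the step that produced it.
func runSteps(ctx context.Context, steps []step) error {
	for _, s := range steps {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("before step %s: %w", s.name, err)
		}
		if err := s.run(ctx); err != nil {
			return fmt.Errorf("step %s: %w", s.name, err)
		}
	}
	return nil
}

// runDemo wires three stand-in steps and fails the one named failAt. It
// returns the names of the steps that ran and the resulting error.
func runDemo(failAt string) ([]string, error) {
	var ran []string
	mk := func(name string) step {
		return step{name: name, run: func(context.Context) error {
			ran = append(ran, name)
			if name == failAt {
				return errors.New("simulated failure")
			}
			return nil
		}}
	}
	err := runSteps(context.Background(), []step{
		mk("auth-secret"), mk("network-policy"), mk("valkey-config"),
	})
	return ran, err
}
```

With `runDemo("network-policy")`, the first two steps run and the third is skipped. Because every step is re-run on the next reconcile, each one needs to be idempotent.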
251+
252+
### MCPServer Integration
253+
254+
MCPServers automatically get configured with the connection:
255+
256+
```go
257+
func (r *MCPServerReconciler) injectSessionStorage(ctx context.Context,
258+
mcpServer *mcpv1alpha1.MCPServer, deployment *appsv1.Deployment) error {
259+
260+
if mcpServer.Spec.SessionStorageRef == nil {
261+
return nil // No session storage configured
262+
}
263+
264+
// Get the connection secret created by SessionStorage controller
265+
secretName := fmt.Sprintf("%s-connection", mcpServer.Spec.SessionStorageRef)
266+
267+
// Add environment variables from secret
268+
container := &deployment.Spec.Template.Spec.Containers[0]
269+
container.EnvFrom = append(container.EnvFrom, corev1.EnvFromSource{
270+
SecretRef: &corev1.SecretEnvSource{
271+
LocalObjectReference: corev1.LocalObjectReference{
272+
Name: secretName,
273+
},
274+
},
275+
})
276+
277+
return nil
278+
}
279+
```
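
If the Deployment passed to the function above is fetched from the cluster rather than built fresh on each reconcile, an unconditional `append` would add the same secret reference again every time. A minimal sketch of an idempotent variant follows. It models the `envFrom` list as a slice of secret names to stay free of Kubernetes dependencies; `connectionSecretName` and `ensureSecretRef` are hypothetical helpers.

```go
package main

// connectionSecretName derives the Secret name from the storage name,
// following the "{storage-name}-connection" convention in this proposal.
func connectionSecretName(storageName string) string {
	return storageName + "-connection"
}

// ensureSecretRef appends name to refs only if it is not already present,
// so repeated reconciles leave the list unchanged.
func ensureSecretRef(refs []string, name string) []string {
	for _, r := range refs {
		if r == name {
			return refs
		}
	}
	return append(refs, name)
}
```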

## Benefits

1. **Zero Learning Curve** - Just set size: small/medium/large
2. **Production Ready** - Secure by default, no manual configuration
3. **Automatic Updates** - Operator handles version upgrades
4. **Cost Efficient** - Share storage across multiple MCPServers
5. **Flexible** - Support both managed and external storage
289+
## Resilience and Scaling Benefits
290+
291+
By externalizing session storage from the proxy pods, we enable:
292+
293+
### Proxy Resilience
294+
- **Stateless Proxies**: Proxy pods can be terminated, restarted, or rescheduled without losing sessions
295+
- **Rolling Updates**: Deploy new proxy versions with zero downtime - sessions persist in Valkey
296+
- **Crash Recovery**: If a proxy crashes, users reconnect to any other proxy and continue their session
297+
- **Horizontal Pod Autoscaling**: Scale proxy replicas up/down based on load without session disruption

### Example Scaling Scenario
```yaml
# HPA for proxy deployment - sessions remain intact during scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-proxy-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: github-server  # MCPServer deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

When load increases:
1. HPA scales proxy pods from 2 → 10 replicas
2. New pods connect to the same Valkey instance
3. Sessions are immediately available to all pods
4. Load balancer distributes traffic across all proxies
5. Users experience no interruption

When load decreases:
1. HPA scales down from 10 → 2 replicas
2. Pods are gracefully terminated
3. Sessions remain in Valkey
4. Remaining pods continue serving all sessions
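
The replica counts in the scenario above follow from the ratio the Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the configured bounds. The sketch below reproduces only that arithmetic. The function name `desiredReplicas` is illustrative, and the controller's tolerance band and stabilization window are deliberately left out.

```go
package main

import "math"

// desiredReplicas applies ceil(current * currentUtil / targetUtil) and clamps
// the result to [minReplicas, maxReplicas].
func desiredReplicas(current int, currentUtil, targetUtil float64, minReplicas, maxReplicas int) int {
	n := int(math.Ceil(float64(current) * currentUtil / targetUtil))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}
```

With the 70% CPU target above, 2 replicas running at 140% average utilization scale to 4, and at 350% they scale to the maximum of 10. Ten replicas idling at 7% would compute to 1 and are clamped to the minimum of 2.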

## Security Features (All Automatic)

- ✅ Strong password authentication (32 characters, random)
- ✅ Network isolation (NetworkPolicy)
- ✅ Secure defaults (no default user, restricted commands)
- ✅ Automatic secret rotation (handled by the operator; scheduled for Phase 2)
- ✅ Least privilege (Valkey only accessible by proxies)
- ✅ Persistence encryption (when StorageClass supports it)

## Migration Path

### Phase 1: MVP
- Basic SessionStorage CRD
- Automatic auth and network policies
- Support for small/medium/large sizes

### Phase 2: Production Features
- Automatic backups
- Metrics and monitoring
- Secret rotation

### Phase 3: Advanced Features
- Multi-region support
- Automatic scaling based on load
- Integration with cloud Redis services

## FAQ

**Q: What if I need custom Valkey configuration?**
A: Use external mode with your own Redis/Valkey instance.

**Q: How secure is the automatic setup?**
A: The managed setup uses a randomly generated 32-character password and a NetworkPolicy that admits only ToolHive proxy pods, and it follows Redis security best practices.

**Q: Can I see the generated password?**
A: Yes, it's in the secret `{storage-name}-auth`, but you don't need it - MCPServers use it automatically.

**Q: What happens if a Valkey pod crashes?**
A: For medium/large sizes, data is persisted and will be restored. For small (dev), data is ephemeral.

**Q: Can multiple MCPServers share one SessionStorage?**
A: Yes! That's the recommended pattern for production.

## Conclusion

This design makes distributed session storage as easy as setting `size: medium` while maintaining production-grade security. The operator handles all the complexity automatically, letting developers focus on their MCP servers instead of infrastructure configuration.

Most importantly, by decoupling session state from proxy pods, we transform the ToolHive proxy layer into a truly stateless, resilient system that can scale elastically in response to load. Proxies can crash, restart, or scale from 1 to 100 replicas without any impact on user sessions. This architecture enables cloud-native deployment patterns like rolling updates, auto-scaling, and multi-region deployments while maintaining session continuity.

By providing secure defaults and automatic management, we enable both development simplicity and production readiness without requiring deep Redis/Valkey expertise.
