@JAORMX
Collaborator

@JAORMX JAORMX commented Sep 8, 2025

This PR adds a design proposal for integrating Valkey (Redis-compatible) distributed session storage into the ToolHive operator.

The proposal introduces a new SessionStorage CRD that enables automatic deployment and management of Valkey instances for distributed session storage. This allows ToolHive proxy pods to scale horizontally while maintaining session state.

Key features:

  • Separate SessionStorage CRD to keep concerns separated
  • Simple configuration with size presets (small/medium/large)
  • Automatic security configuration
  • Support for both operator-managed and external Valkey instances

The design enables proxy pods to become truly stateless, supporting elastic scaling, rolling updates, and improved resilience.

This proposal introduces a design for integrating Valkey (Redis-compatible)
distributed session storage into the ToolHive operator. The design focuses
on simplicity and security by providing automatic, secure-by-default
configuration.

Key features:
- Separate SessionStorage CRD to keep MCPServer CRD focused
- Zero-configuration deployment with simple size presets (small/medium/large)
- Automatic security configuration (auth, network policies, TLS)
- Seamless integration with existing MCPServer resources
- Support for both operator-managed and external Valkey instances

The design enables horizontal scaling and resilience by externalizing
session state from proxy pods, allowing them to scale elastically and
restart without losing user sessions. This transforms the ToolHive proxy
layer into a truly stateless, cloud-native system.

Implementation follows a phased approach starting with core CRD and
controller, then adding MCPServer integration, and finally production
features like monitoring and backups.
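
To make the shape of the proposed resource more concrete, here is a minimal Go sketch of what a SessionStorage spec and its validation could look like. Everything in it is an assumption for illustration: the proposal text above names only the SessionStorage CRD, the small/medium/large presets, and the operator-managed versus external modes. The field names (`Size`, `External`, `Address`), the `Validate` method, and the rule that a size preset applies only to operator-managed instances are hypothetical and not taken from the actual design.

```go
package main

import (
	"errors"
	"fmt"
)

// Size is a hypothetical type for the presets named in the proposal.
type Size string

const (
	SizeSmall  Size = "small"
	SizeMedium Size = "medium"
	SizeLarge  Size = "large"
)

// ExternalValkey is a hypothetical reference to a Valkey instance
// that the operator does not manage.
type ExternalValkey struct {
	Address string // host:port of the existing instance
}

// SessionStorageSpec is a hypothetical spec, not the real CRD schema.
// A nil External means the operator deploys and manages Valkey itself.
type SessionStorageSpec struct {
	Size     Size
	External *ExternalValkey
}

// Validate checks the assumed rules: external instances need an address
// and take no size preset; managed instances need a known preset.
func (s SessionStorageSpec) Validate() error {
	if s.External != nil {
		if s.External.Address == "" {
			return errors.New("external.address must be set")
		}
		if s.Size != "" {
			return errors.New("size applies only to operator-managed instances")
		}
		return nil
	}
	switch s.Size {
	case SizeSmall, SizeMedium, SizeLarge:
		return nil
	default:
		return fmt.Errorf("unknown size preset %q", s.Size)
	}
}

func main() {
	managed := SessionStorageSpec{Size: SizeMedium}
	fmt.Println("managed:", managed.Validate())

	invalid := SessionStorageSpec{Size: "huge"}
	fmt.Println("invalid:", invalid.Validate())
}
```

Using a nil pointer to mean "operator-managed" keeps the common case down to a single field, in the spirit of the zero-configuration goal. A real CRD might instead use an explicit mode field.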
@codecov

codecov bot commented Sep 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.60%. Comparing base (02c003c) to head (8d7c97e).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1770      +/-   ##
==========================================
+ Coverage   40.56%   40.60%   +0.04%     
==========================================
  Files         184      184              
  Lines       21380    21380              
==========================================
+ Hits         8672     8682      +10     
+ Misses      12056    12040      -16     
- Partials      652      658       +6     


@coveralls
Collaborator

Coverage Status

coverage: 38.377% (+0.06%) from 38.322%
when pulling 8d7c97e on proposal/operator-valkey-integration
into 02c003c on main.

A: Yes, it's in the secret `{storage-name}-auth` but you don't need it - MCPServers use it automatically.

**Q: What happens if a Valkey pod crashes?**
A: For medium/large sizes, data is persisted and will be restored. For small (dev), data is ephemeral.
Member

What changes are needed in the application to handle the possibility that state can potentially be destroyed?

Collaborator Author

As things stand today, we always have this possibility: if something happens to the proxy runner pod, the session is lost and has to be recreated. This proposal gives us a way to survive such scenarios.
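
As a rough illustration of the application-side handling asked about here, the sketch below looks a session up in a store and recreates it when it is missing, which is the same recovery path needed today when a proxy runner pod restarts. All names in it (`SessionStore`, `ErrSessionNotFound`, `loadOrRecreate`, the in-memory store) are hypothetical and are not ToolHive's actual API. A Valkey-backed store would implement the same interface.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSessionNotFound is a hypothetical sentinel for a missing session.
var ErrSessionNotFound = errors.New("session not found")

// SessionStore is a hypothetical abstraction over session storage.
type SessionStore interface {
	Get(id string) (string, error)
	Put(id, state string) error
}

// memoryStore is an in-memory stand-in used only for this sketch.
type memoryStore struct{ data map[string]string }

func newMemoryStore() *memoryStore {
	return &memoryStore{data: map[string]string{}}
}

func (m *memoryStore) Get(id string) (string, error) {
	state, ok := m.data[id]
	if !ok {
		return "", ErrSessionNotFound
	}
	return state, nil
}

func (m *memoryStore) Put(id, state string) error {
	m.data[id] = state
	return nil
}

// loadOrRecreate returns the stored session state. If the session is
// missing (for example because ephemeral storage was wiped), it calls
// recreate, stores the result, and reports recreated == true.
// Any other store error is returned to the caller unchanged.
func loadOrRecreate(store SessionStore, id string, recreate func() string) (string, bool, error) {
	state, err := store.Get(id)
	if err == nil {
		return state, false, nil
	}
	if !errors.Is(err, ErrSessionNotFound) {
		return "", false, err
	}
	state = recreate()
	if putErr := store.Put(id, state); putErr != nil {
		return "", false, putErr
	}
	return state, true, nil
}

func main() {
	store := newMemoryStore()
	fresh := func() string { return "initialized" }

	_, recreated, _ := loadOrRecreate(store, "abc", fresh)
	fmt.Println(recreated) // true: nothing stored yet, so the session is recreated

	_, recreated, _ = loadOrRecreate(store, "abc", fresh)
	fmt.Println(recreated) // false: the stored session is reused
}
```

Treating "not found" separately from other errors matters here: a connection problem to the store should surface as an error rather than silently replace a session that may still exist.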

Collaborator Author

By the way, these are the changes that build up to that: #1771. They're split across several commits, so we don't need to merge it as one big chunk; it can be split up.

Contributor

@yrobla yrobla left a comment

I have concerns about adding an external dependency there, mostly related to maintenance, deployment size, and what happens to the data if there is a connection problem. Is it really necessary to add Valkey, or could we use a simpler approach?

@JAORMX
Collaborator Author

JAORMX commented Sep 8, 2025

@yrobla this is the lightest option I could think of for session persistence across scale-ups and restarts. The alternative is to build it in-process, but that would be more complicated than this tbh.

@JAORMX JAORMX closed this Sep 9, 2025
5 participants