## Summary
When a registry server needs to discover MCP resources across multiple namespaces (multi-tenant deployments), it requires ClusterRole/ClusterRoleBinding rather than namespace-scoped Role/RoleBinding. The operator currently cannot dynamically create these cluster-scoped RBAC resources because its own ClusterRole only has permissions for roles and rolebindings, not clusterroles and clusterrolebindings.
This creates a gap between the two installation methods:
- Registry server Helm chart (from toolhive-registry-server): Installs its own ClusterRoles and ClusterRoleBindings directly, so cross-namespace discovery works out of the box.
- Operator Helm chart: The operator dynamically creates only namespace-scoped Roles via `ensureRBACResources()` in `cmd/thv-operator/pkg/registryapi/rbac.go`. The static ClusterRole template (`registry-api-clusterrole.yaml`) only grants read access to `mcpservers`; it doesn't cover `mcpremoteproxies`, `virtualmcpservers`, Services, or Gateway API resources needed for full cross-namespace registry functionality.
Related: stacklok/toolhive-registry-server#287 (multi-tenancy & multiple registry server installations)
## Current State

### What the operator creates dynamically (namespace-scoped)
- `ServiceAccount`, `Role`, and `RoleBinding` per MCPRegistry, limited to the MCPRegistry's own namespace
- Created by `cmd/thv-operator/pkg/registryapi/rbac.go` using the RBAC utility library (`cmd/thv-operator/pkg/kubernetes/rbac/rbac.go`), which only supports namespace-scoped resources
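For illustration, the per-MCPRegistry objects look roughly like the following sketch (the registry name `my-registry`, namespace `tenant-a`, and the exact rule set are assumptions; the authoritative rules live in `registryAPIRBACRules`):

```yaml
# Hypothetical rendering of the namespace-scoped RBAC the operator creates
# for an MCPRegistry named "my-registry" in namespace "tenant-a".
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-registry-registry-api   # pattern from GetServiceAccountName()
  namespace: tenant-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-registry-registry-api
  namespace: tenant-a
rules:
  # API group and resource list are illustrative; see registryAPIRBACRules
  - apiGroups: ["toolhive.stacklok.dev"]
    resources: ["mcpservers"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-registry-registry-api
  namespace: tenant-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: my-registry-registry-api
subjects:
  - kind: ServiceAccount
    name: my-registry-registry-api
    namespace: tenant-a
```

Because both the Role and the RoleBinding are namespaced, these grants can never reach resources in other tenants' namespaces.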
### What the Helm chart provides statically (cluster-scoped)
- A single `toolhive-registry-api-role` ClusterRole with read access to `mcpservers` only
- A single `toolhive-registry-api-rolebinding` ClusterRoleBinding bound to the `toolhive-registry-api` ServiceAccount in the release namespace
- A `toolhive-registry-api` ServiceAccount
- These are not sufficient for multi-namespace scenarios and don't cover all required resource types
### Static Helm templates are orphaned

The static templates (`registry-api-clusterrole.yaml`, `registry-api-clusterrolebinding.yaml`, `registry-api-serviceaccount.yaml`) are effectively unused:
- The static ClusterRoleBinding binds to the SA name `toolhive-registry-api` (from Helm values)
- The operator dynamically creates a different SA named `{registry-name}-registry-api` (via `GetServiceAccountName()` in `cmd/thv-operator/pkg/registryapi/types.go`)
- The operator's deployment code assigns the dynamic SA to registry pods (`deployment.go:195`)
- No Go code in the operator references `toolhive-registry-api`
Result: the static ClusterRole's permissions never reach any registry API pod because the ClusterRoleBinding targets a ServiceAccount that no pod uses.
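A minimal sketch of the mismatch, using the names described above (the release namespace `toolhive-system` is an example):

```yaml
# Static chart template: binds the ClusterRole to a SA that no pod uses.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: toolhive-registry-api-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: toolhive-registry-api-role
subjects:
  - kind: ServiceAccount
    name: toolhive-registry-api     # static SA: never attached to any pod
    namespace: toolhive-system      # release namespace (example)
---
# Meanwhile, registry API pods actually run as the dynamically created SA:
#   serviceAccountName: {registry-name}-registry-api
# so the static ClusterRole's permissions are never in effect.
```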
### What the operator's own ClusterRole permits
- `roles` and `rolebindings`: full CRUD (can create namespace-scoped RBAC)
- `clusterroles` and `clusterrolebindings`: not listed (cannot create cluster-scoped RBAC)
## Problem
For multi-namespace/tenant registry deployments via the operator, the registry API needs to watch MCP resources, services, and Gateway API resources across namespaces. This requires ClusterRoles, but:
- The RBAC utility library (`cmd/thv-operator/pkg/kubernetes/rbac/rbac.go`) has no functions for creating `ClusterRole` or `ClusterRoleBinding` resources
- The operator's own ClusterRole doesn't include permissions to manage `clusterroles`/`clusterrolebindings`
- Even if the code were added, the operator would need escalated permissions in its own ClusterRole, creating a privilege-spreading risk (Kubernetes RBAC escalation prevention limits the operator to creating ClusterRoles with permissions it already holds, but it could bind those permissions to arbitrary identities across the cluster)
- The existing static Helm ClusterRole/ClusterRoleBinding are orphaned and don't reach the registry pods (see above)
## Proposed Solutions

### Option A: Expand the static Helm ClusterRole (recommended)
Instead of granting the operator runtime ClusterRole management capabilities, expand the existing static Helm ClusterRole template to include all permissions the registry API needs for cross-namespace discovery:
- Expand `registry-api-clusterrole.yaml` to include `mcpremoteproxies`, `virtualmcpservers`, Services, and Gateway API resources (matching the rules in `registryAPIRBACRules`)
- Fix the orphaned binding: either have the operator use the static SA name, or have the operator create a ClusterRoleBinding from the dynamic SA to the static ClusterRole
- Make this opt-in via a Helm value (e.g., `registryAPI.clusterScoped: false`) that controls whether the expanded ClusterRole/ClusterRoleBinding are rendered
- When cluster-scoped, the operator can skip creating the namespace-scoped Role for permissions already covered by the ClusterRole
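A hypothetical sketch of the expanded template with the opt-in gate; the API groups, resource names, and Gateway API resource list are assumptions that would need to be checked against `registryAPIRBACRules`:

```yaml
# registry-api-clusterrole.yaml (hypothetical expanded version)
{{- if .Values.registryAPI.clusterScoped }}
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: toolhive-registry-api-role
rules:
  # ToolHive CRDs needed for cross-namespace discovery
  - apiGroups: ["toolhive.stacklok.dev"]
    resources: ["mcpservers", "mcpremoteproxies", "virtualmcpservers"]
    verbs: ["get", "list", "watch"]
  # Core resources
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "list", "watch"]
  # Gateway API resources (exact list should mirror registryAPIRBACRules)
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["gateways", "httproutes"]
    verbs: ["get", "list", "watch"]
{{- end }}
```

Because the grant is read-only (`get`/`list`/`watch`), the Helm installer needs no unusual privileges beyond the ability to create ClusterRoles it already has at install time.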
Tradeoffs: The operator pod itself never needs `clusterroles`/`clusterrolebindings` permissions; the ClusterRole is provisioned at Helm install time (which already requires elevated access). This is strictly more secure than granting runtime ClusterRole creation. The tradeoff is that the static ClusterRole is fixed at install time and can't adapt per-MCPRegistry.
### Option B: Feature-toggled runtime ClusterRole creation

Add a Helm value (e.g., `operator.rbac.enableClusterScopedWorkloadRBAC: false`) that, when enabled:
- Adds `clusterroles` and `clusterrolebindings` to the operator's own ClusterRole
- Extends the RBAC utility library to support `ClusterRole`/`ClusterRoleBinding` creation
- Allows the MCPRegistry controller to create cluster-scoped RBAC when the registry needs cross-namespace access (possibly driven by a field on the MCPRegistry spec)
Tradeoffs: More flexible than Option A (per-MCPRegistry ClusterRoles), but increases the operator's attack surface. An attacker who compromises the operator pod could create ClusterRoles (limited to the operator's own permission level by Kubernetes RBAC escalation prevention) and bind them to arbitrary identities across the cluster, enabling privilege spreading. Opt-in preserves least-privilege by default.
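If this route were taken, the addition to the operator's own ClusterRole might look roughly like this, gated on the flag (a sketch; the value path matches the example above but is not settled):

```yaml
# Hypothetical addition to deploy/charts/operator/templates/clusterrole/role.yaml
{{- if .Values.operator.rbac.enableClusterScopedWorkloadRBAC }}
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles", "clusterrolebindings"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
{{- end }}
```

With the flag off (the default), the rendered ClusterRole is identical to today's, preserving least privilege.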
### Option C: Grant ClusterRole permissions by default

Always include `clusterroles`/`clusterrolebindings` in the operator's ClusterRole and create ClusterRoles for registry API deployments.
Tradeoffs: Simpler implementation, but increases the operator's default attack surface. Conflicts with the least-privilege approach that motivated #4030.
### Option D: Extend the `disableWorkloadRBAC` pattern from #4030
Lean into the external RBAC management model — document the required ClusterRole/ClusterRoleBinding for multi-namespace registries and let platform teams provision them via GitOps, separate Helm charts, or Kustomize overlays.
Tradeoffs: No operator code changes needed. Puts the burden on platform teams. Works well in GitOps-heavy environments but is a poor experience for simpler deployments.
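Under this option, the documentation would carry a manifest for platform teams to apply themselves, for example (assuming a registry named `my-registry` in namespace `tenant-a`, and a team-provisioned ClusterRole carrying the rules from `registryAPIRBACRules`):

```yaml
# Externally managed (GitOps/Kustomize) binding for a specific registry.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-registry-registry-api-cluster
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: my-registry-registry-api     # ClusterRole provisioned by the platform team
subjects:
  - kind: ServiceAccount
    name: my-registry-registry-api   # the SA the operator creates dynamically
    namespace: tenant-a
```

Note the subject must target the operator's dynamic SA name, not the orphaned static `toolhive-registry-api` SA.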
## Additional Considerations
- Multiple installation methods: Users may deploy registries via the operator OR via the standalone registry Helm chart. RBAC solutions must account for both paths and avoid conflicts when both are present.
- Edge cases with PR #4030 (Add `disableWorkloadRBAC` flag to skip per-workload RBAC creation): The `disableWorkloadRBAC` flag (still open) conditionally removes the static ClusterRole/ClusterRoleBinding templates and the operator's `roles`/`rolebindings` permissions. Any solution here needs to compose correctly with that flag; e.g., if workload RBAC is disabled, cluster-scoped RBAC should also be disabled.
- Kubernetes RBAC escalation prevention: Kubernetes prevents a controller from creating or modifying Role/ClusterRole resources containing permissions the controller doesn't itself hold (unless the `escalate` verb is granted). For ClusterRoleBindings, a separate `bind` verb restriction applies (enforced in Kubernetes 1.25+). The operator's ClusterRole must be a superset of any ClusterRole it creates for the registry API.
- Testing complexity: This is finicky to test because the behavior differs across installation methods, RBAC scopes, and feature-flag combinations. E2E tests should cover: single-namespace registry via operator, multi-namespace registry via operator, registry with `disableWorkloadRBAC`, and coexistence with the standalone registry chart.
- Orphaned static templates: The existing static SA/ClusterRole/ClusterRoleBinding should be fixed regardless of which solution is chosen, either by removing them, connecting them to the dynamic SA, or making the operator use the static SA.
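For reference, the standard escape hatches around escalation prevention look like the rules below. Granting them broadly would defeat the point of the restriction; they are shown only to illustrate the mechanism (the `resourceNames` entry reuses the static ClusterRole name as an example):

```yaml
# Verbs that bypass RBAC escalation prevention (use sparingly, if at all).
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["clusterroles"]
  verbs: ["escalate"]   # permits creating/updating roles with permissions the creator lacks
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["clusterroles"]
  verbs: ["bind"]       # permits creating bindings to the named role regardless of held permissions
  resourceNames: ["toolhive-registry-api-role"]
```

Scoping `bind` with `resourceNames` is the narrower of the two and keeps the operator from binding arbitrary ClusterRoles.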
## References
- stacklok/toolhive-registry-server#287: Support for multi-tenancy & multiple registry server installations
- #4030: Add `disableWorkloadRBAC` flag to skip per-workload RBAC creation
- `cmd/thv-operator/pkg/registryapi/rbac.go`: Current namespace-scoped RBAC creation
- `cmd/thv-operator/pkg/kubernetes/rbac/rbac.go`: RBAC utility library (namespace-scoped only)
- `cmd/thv-operator/pkg/registryapi/types.go:109`: `GetServiceAccountName()` generates `{name}-registry-api`
- `deploy/charts/operator/templates/clusterrole/role.yaml`: Operator's own ClusterRole
- `deploy/charts/operator/templates/registry-api-clusterrole.yaml`: Static registry API ClusterRole (orphaned)
- `deploy/charts/operator/templates/registry-api-serviceaccount.yaml`: Static registry API ServiceAccount (orphaned)