Skip to content

Kubernetes: Permissions for Multi-Namespace MCPRegistry Deployments #4245

@ChrisJBurns

Description

@ChrisJBurns

Summary

When a registry server needs to discover MCP resources across multiple namespaces (multi-tenant deployments), it requires ClusterRole/ClusterRoleBinding rather than namespace-scoped Role/RoleBinding. The operator currently cannot dynamically create these cluster-scoped RBAC resources because its own ClusterRole only has permissions for roles and rolebindings, not clusterroles and clusterrolebindings.

This creates a gap between the two installation methods:

  • Registry server Helm chart (from toolhive-registry-server): Installs its own ClusterRoles and ClusterRoleBindings directly, so cross-namespace discovery works out of the box.
  • Operator Helm chart: The operator dynamically creates only namespace-scoped Roles via ensureRBACResources() in cmd/thv-operator/pkg/registryapi/rbac.go. The static ClusterRole template (registry-api-clusterrole.yaml) only grants read access to mcpservers — it doesn't cover mcpremoteproxies, virtualmcpservers, services, or Gateway API resources needed for full cross-namespace registry functionality.

Related: stacklok/toolhive-registry-server#287 (multi-tenancy & multiple registry server installations)

Current State

What the operator creates dynamically (namespace-scoped)

  • ServiceAccount, Role, RoleBinding per MCPRegistry — limited to the MCPRegistry's own namespace
  • Created by cmd/thv-operator/pkg/registryapi/rbac.go using the RBAC utility library (cmd/thv-operator/pkg/kubernetes/rbac/rbac.go), which only supports namespace-scoped resources

What the Helm chart provides statically (cluster-scoped)

  • A single toolhive-registry-api-role ClusterRole with read access to mcpservers only
  • A single toolhive-registry-api-rolebinding ClusterRoleBinding bound to the toolhive-registry-api ServiceAccount in the release namespace
  • A toolhive-registry-api ServiceAccount
  • These are not sufficient for multi-namespace scenarios and don't cover all required resource types

Static Helm templates are orphaned

The static templates (registry-api-clusterrole.yaml, registry-api-clusterrolebinding.yaml, registry-api-serviceaccount.yaml) are effectively unused:

  1. The static ClusterRoleBinding binds to SA name toolhive-registry-api (from Helm values)
  2. The operator dynamically creates a different SA named {registry-name}-registry-api (via GetServiceAccountName() in cmd/thv-operator/pkg/registryapi/types.go)
  3. The operator's deployment code assigns the dynamic SA to registry pods (deployment.go:195)
  4. No Go code in the operator references toolhive-registry-api

Result: the static ClusterRole's permissions never reach any registry API pod because the ClusterRoleBinding targets a ServiceAccount that no pod uses.

What the operator's own ClusterRole permits

  • roles and rolebindings: full CRUD (can create namespace-scoped RBAC)
  • clusterroles and clusterrolebindings: not listed (cannot create cluster-scoped RBAC)

Problem

For multi-namespace/tenant registry deployments via the operator, the registry API needs to watch MCP resources, services, and Gateway API resources across namespaces. This requires ClusterRoles, but:

  1. The RBAC utility library (cmd/thv-operator/pkg/kubernetes/rbac/rbac.go) has no functions for creating ClusterRole or ClusterRoleBinding resources
  2. The operator's own ClusterRole doesn't include permissions to manage clusterroles/clusterrolebindings
  3. Even if the code were added, the operator would need escalated permissions in its own ClusterRole — creating a privilege spreading risk (Kubernetes RBAC escalation prevention limits the operator to creating ClusterRoles with permissions it already holds, but it could bind those permissions to arbitrary identities across the cluster)
  4. The existing static Helm ClusterRole/ClusterRoleBinding are orphaned and don't reach the registry pods (see above)

Proposed Solutions

Option A: Expand the static Helm ClusterRole (recommended)

Instead of granting the operator runtime ClusterRole management capabilities, expand the existing static Helm ClusterRole template to include all permissions the registry API needs for cross-namespace discovery:

  1. Expand registry-api-clusterrole.yaml to include mcpremoteproxies, virtualmcpservers, services, and Gateway API resources (matching the rules in registryAPIRBACRules)
  2. Fix the orphaned binding — either have the operator use the static SA name, or have the operator create a ClusterRoleBinding from the dynamic SA to the static ClusterRole
  3. Make this opt-in via a Helm value (e.g., registryAPI.clusterScoped: false) that controls whether the expanded ClusterRole/ClusterRoleBinding are rendered
  4. When cluster-scoped, the operator can skip creating the namespace-scoped Role for permissions already covered by the ClusterRole

Tradeoffs: The operator pod itself never needs clusterroles/clusterrolebindings permissions — the ClusterRole is provisioned at Helm install time (which already requires elevated access). This is strictly more secure than granting runtime ClusterRole creation. The tradeoff is that the static ClusterRole is fixed at install time and can't adapt per-MCPRegistry.

Option B: Feature-toggled runtime ClusterRole creation

Add a Helm value (e.g., operator.rbac.enableClusterScopedWorkloadRBAC: false) that, when enabled:

  1. Adds clusterroles and clusterrolebindings to the operator's own ClusterRole
  2. Extends the RBAC utility library to support ClusterRole/ClusterRoleBinding creation
  3. Allows the MCPRegistry controller to create cluster-scoped RBAC when the registry needs cross-namespace access (possibly driven by a field on the MCPRegistry spec)

Tradeoffs: More flexible than Option A (per-MCPRegistry ClusterRoles), but increases the operator's attack surface. An attacker who compromises the operator pod could create ClusterRoles (limited to the operator's own permission level by Kubernetes RBAC escalation prevention) and bind them to arbitrary identities across the cluster, enabling privilege spreading. Opt-in preserves least-privilege by default.

Option C: Grant ClusterRole permissions by default

Always include clusterroles/clusterrolebindings in the operator's ClusterRole and create ClusterRoles for registry API deployments.

Tradeoffs: Simpler implementation, but increases the operator's default attack surface. Conflicts with the least-privilege approach that motivated #4030.

Option D: Extend the disableWorkloadRBAC pattern from #4030

Lean into the external RBAC management model — document the required ClusterRole/ClusterRoleBinding for multi-namespace registries and let platform teams provision them via GitOps, separate Helm charts, or Kustomize overlays.

Tradeoffs: No operator code changes needed. Puts the burden on platform teams. Works well in GitOps-heavy environments but is a poor experience for simpler deployments.

Additional Considerations

  • Multiple installation methods: Users may deploy registries via the operator OR via the standalone registry Helm chart. RBAC solutions must account for both paths and avoid conflicts when both are present.
  • Edge cases with PR Add disableWorkloadRBAC flag to skip per-workload RBAC creation #4030: The disableWorkloadRBAC flag (Add disableWorkloadRBAC flag to skip per-workload RBAC creation #4030, still open) conditionally removes the static ClusterRole/ClusterRoleBinding templates and the operator's roles/rolebindings permissions. Any solution here needs to compose correctly with that flag — e.g., if workload RBAC is disabled, cluster-scoped RBAC should also be disabled.
  • Kubernetes RBAC escalation prevention: Kubernetes prevents a controller from creating or modifying Role/ClusterRole resources containing permissions the controller doesn't itself hold (unless the escalate verb is granted). For ClusterRoleBindings, a separate bind verb restriction applies (enforced in Kubernetes 1.25+). The operator's ClusterRole must be a superset of any ClusterRole it creates for the registry API.
  • Testing complexity: This is finicky to test because the behavior differs across installation methods, RBAC scopes, and feature flag combinations. E2E tests should cover: single-namespace registry via operator, multi-namespace registry via operator, registry with disableWorkloadRBAC, and coexistence with standalone registry chart.
  • Orphaned static templates: The existing static SA/ClusterRole/ClusterRoleBinding should be fixed regardless of which solution is chosen — either by removing them, connecting them to the dynamic SA, or making the operator use the static SA.

References

  • Support for multi-tenancy & multiple registry server installations toolhive-registry-server#287 — Multi-tenancy & multiple registry server installations
  • Add disableWorkloadRBAC flag to skip per-workload RBAC creation #4030 — Add disableWorkloadRBAC flag to skip per-workload RBAC creation
  • cmd/thv-operator/pkg/registryapi/rbac.go — Current namespace-scoped RBAC creation
  • cmd/thv-operator/pkg/kubernetes/rbac/rbac.go — RBAC utility library (namespace-scoped only)
  • cmd/thv-operator/pkg/registryapi/types.go:109GetServiceAccountName() generates {name}-registry-api
  • deploy/charts/operator/templates/clusterrole/role.yaml — Operator's own ClusterRole
  • deploy/charts/operator/templates/registry-api-clusterrole.yaml — Static registry API ClusterRole (orphaned)
  • deploy/charts/operator/templates/registry-api-serviceaccount.yaml — Static registry API ServiceAccount (orphaned)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions