UX: DX: VolumeClaimTemplate overrides without a spec cause permanent reconciliation failures #2023

@jmealo

Describe the bug

  • If the user mistakes override.statefulSet.spec.volumeClaimTemplates for a merge operation rather than a replace, the resulting cluster can never reconcile.
  • If the user omits spec.resources.requests.storage, the operator interprets it as 0.
  • The error logged by the operator is: shrinking persistent volumes is not supported
  • The error doesn't aid in debugging this configuration error, and troubleshooting isn't straightforward if you only inspect the StatefulSet and PVC: you'd need to check the Helm output and/or the RabbitmqCluster CR.
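
To illustrate the failure mode described above, here is a minimal Go sketch. It is not the operator's code: the volumeClaimTemplate type, the checkCapacity function, and the "persistence" name are hypothetical stand-ins, and capacities are plain byte counts rather than Kubernetes quantities. It only shows how a missing storage request, read as 0, can look like a shrink request when compared against an existing volume.

```go
package main

import (
	"errors"
	"fmt"
)

// volumeClaimTemplate is a hypothetical, simplified stand-in for a PVC template.
// StorageRequestBytes stays 0 when spec.resources.requests.storage is omitted.
type volumeClaimTemplate struct {
	Name                string
	StorageRequestBytes int64
}

var errShrink = errors.New("shrinking persistent volumes is not supported")

// checkCapacity models a scaler that compares the desired capacity
// against the capacity of the PVC that already exists.
func checkCapacity(existingBytes int64, desired volumeClaimTemplate) error {
	if desired.StorageRequestBytes < existingBytes {
		return errShrink
	}
	return nil
}

func main() {
	const tenGiB = int64(10) << 30

	// An override that only sets a name: the zero value stands in
	// for the missing storage request.
	incomplete := volumeClaimTemplate{Name: "persistence"}
	// An override that restates the full storage request.
	complete := volumeClaimTemplate{Name: "persistence", StorageRequestBytes: tenGiB}

	fmt.Println(checkCapacity(tenGiB, incomplete)) // shrink error
	fmt.Println(checkCapacity(tenGiB, complete))   // <nil>
}
```

Because the existing PVC never shrinks to match, a comparison like this fails the same way on every reconciliation until the override is corrected.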

Symptoms:

  • The operator reconciliation loop fails continuously (every ~15 minutes, based on the logs below)
  • Any changes to the RabbitmqCluster CR won't be applied (operator can't reconcile)
  • Scaling (adding/removing nodes) would likely fail or behave unexpectedly
  • Helm upgrades might appear successful but some changes won't take effect

Fixes suggested:

  • Implement validation at the CRD level to prevent incomplete VolumeClaimTemplate overrides
  • Make the documentation explicit that override is a replace rather than a merge (yes, this is implied by the name, but LLMs are gonna LLM, and devs are going to use them 🙃)
  • Add helpful error messages to the operator logs to aid in troubleshooting configuration errors.
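
As a rough sketch of the first suggestion, the check could look something like the Go below. This is an assumption about what such validation might do, not an existing operator or CRD feature: pvcTemplate and validateOverride are hypothetical names, and the storage request is kept as the raw string from the manifest.

```go
package main

import (
	"fmt"
	"strings"
)

// pvcTemplate is a hypothetical, simplified view of one volumeClaimTemplates entry.
// StorageRequest holds spec.resources.requests.storage as written, or "" if omitted.
type pvcTemplate struct {
	Name           string
	StorageRequest string
}

// validateOverride returns one message per template that lacks a storage request.
// An empty result means the override is complete as far as this check goes.
func validateOverride(templates []pvcTemplate) []string {
	var problems []string
	for i, t := range templates {
		if strings.TrimSpace(t.StorageRequest) == "" {
			problems = append(problems, fmt.Sprintf(
				"override.statefulSet.spec.volumeClaimTemplates[%d] (%q): "+
					"spec.resources.requests.storage is required because the override replaces the default template",
				i, t.Name))
		}
	}
	return problems
}

func main() {
	templates := []pvcTemplate{
		{Name: "persistence"},                         // incomplete: no storage request
		{Name: "persistence", StorageRequest: "10Gi"}, // complete
	}
	for _, p := range validateOverride(templates) {
		fmt.Println(p)
	}
}
```

Running the same rule when the spec is admitted, rather than during reconciliation, would reject the incomplete override before any StatefulSet or PVC is touched.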

Fixes applied:

Logs

{
    "container": "operator",
    "controller": "rabbitmqcluster",
    "controllerGroup": "rabbitmq.com",
    "controllerKind": "RabbitmqCluster",
    "error": "shrinking persistent volumes is not supported",
    "level": "error",
    "msg": "Reconciler error",
    "name": "rabbitmq",
    "namespace": "rabbitmq-system",
    "pod": "rabbitmq-cluster-operator-5f8dc96c76-855k6",
    "reconcileID": "aaa60dae-fb09-4ea9-a10a-9924c4e7da15",
    "service_name": "rabbitmq-cluster-operator",
    "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:353\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202",
    "stream": "stderr",
    "ts": "2025-12-09T21:08:26Z"
}
{
    "container": "operator",
    "controller": "rabbitmqcluster",
    "controllerGroup": "rabbitmq.com",
    "controllerKind": "RabbitmqCluster",
    "error": "hit an error while scaling PVC capacity: shrinking persistent volumes is not supported",
    "level": "error",
    "msg": "Failed to scale PVCs: shrinking persistent volumes is not supported",
    "name": "rabbitmq",
    "namespace": "rabbitmq-system",
    "pod": "rabbitmq-cluster-operator-5f8dc96c76-855k6",
    "reconcileID": "aaa60dae-fb09-4ea9-a10a-9924c4e7da15",
    "service_name": "rabbitmq-cluster-operator",
    "stacktrace": "github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).reconcilePVC\n\t/workspace/controllers/reconcile_persistence.go:21\ngithub.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile\n\t/workspace/controllers/rabbitmqcluster_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202",
    "stream": "stderr",
    "ts": "2025-12-09T21:08:26Z"
}
{
    "container": "operator",
    "controller": "rabbitmqcluster",
    "controllerGroup": "rabbitmq.com",
    "controllerKind": "RabbitmqCluster",
    "error": "unsupported operation",
    "level": "error",
    "msg": "shrinking persistent volumes is not supported",
    "name": "rabbitmq",
    "namespace": "rabbitmq-system",
    "pod": "rabbitmq-cluster-operator-5f8dc96c76-855k6",
    "reconcileID": "aaa60dae-fb09-4ea9-a10a-9924c4e7da15",
    "service_name": "rabbitmq-cluster-operator",
    "stacktrace": "github.com/rabbitmq/cluster-operator/v2/internal/scaling.PersistenceScaler.Scale\n\t/workspace/internal/scaling/scaling.go:52\ngithub.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).reconcilePVC\n\t/workspace/controllers/reconcile_persistence.go:18\ngithub.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile\n\t/workspace/controllers/rabbitmqcluster_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202",
    "stream": "stderr",
    "ts": "2025-12-09T21:08:26Z"
}

Expected behavior

  • Refuse invalid cluster specs at deploy time, rather than logging errors during reconciliation.
  • Helpful error messages in the case of misconfiguration not caught by CRDs.
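
For the second point, a more helpful message might carry the PVC name, both capacities, and a hint when the desired capacity is 0. The Go sketch below is only an illustration under those assumptions: explainShrink, errShrinkUnsupported, and the PVC name are hypothetical, and sizes are plain byte counts rather than Kubernetes quantities.

```go
package main

import (
	"errors"
	"fmt"
)

var errShrinkUnsupported = errors.New("shrinking persistent volumes is not supported")

// explainShrink wraps the shrink error with both capacities and, when the
// desired capacity is 0, a hint that points at the likely misconfiguration.
func explainShrink(pvcName string, existingBytes, desiredBytes int64) error {
	if desiredBytes >= existingBytes {
		return nil
	}
	if desiredBytes == 0 {
		return fmt.Errorf(
			"%w: PVC %q has %d bytes but the desired capacity is 0; "+
				"check that override.statefulSet.spec.volumeClaimTemplates sets spec.resources.requests.storage",
			errShrinkUnsupported, pvcName, existingBytes)
	}
	return fmt.Errorf("%w: PVC %q has %d bytes, desired %d bytes",
		errShrinkUnsupported, pvcName, existingBytes, desiredBytes)
}

func main() {
	// Hypothetical PVC name, used only for illustration.
	err := explainShrink("persistence-rabbitmq-server-0", 10<<30, 0)
	fmt.Println(err)
	// The original error is still detectable by callers.
	fmt.Println(errors.Is(err, errShrinkUnsupported))
}
```

Wrapping with %w keeps the original error matchable via errors.Is, so existing handling of the shrink case would not need to change.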

Version and environment information

  • RabbitMQ: 4.1.3
  • RabbitMQ Cluster Operator: 2.16.1
  • Kubernetes: 1.33.5
  • Cloud provider or hardware configuration: Azure AKS

Labels: bug
