
Cluster doesn't recover if all rabbitmq-server pods deleted from cluster #609

Closed

sheiks opened this issue Feb 17, 2021 · 4 comments

Labels: bug (Something isn't working)

sheiks commented Feb 17, 2021

Describe the bug

The RabbitMQ cluster cannot recover if someone deletes all of the pods in the cluster using the kubectl CLI.

To Reproduce

Steps to reproduce the behavior:

  1. Create a RabbitmqCluster with 3 replicas.
  2. Once the cluster is healthy, delete all 3 pods with the kubectl CLI: kubectl delete pods rabbitmq-server-0 rabbitmq-server-1 rabbitmq-server-2
  3. Verify the RabbitMQ pod status with kubectl get pods:
     NAME                READY   STATUS    RESTARTS   AGE
     rabbitmq-server-0   0/1     Running   5          69m

kubectl logs -f rabbitmq-server-0

2021-02-17 14:29:41.001 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2021-02-17 14:30:11.002 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:30:11.002 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2021-02-17 14:30:41.002 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:30:41.003 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2021-02-17 14:31:11.004 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:31:11.004 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2021-02-17 14:31:41.005 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:31:41.005 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2021-02-17 14:32:11.006 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:32:11.006 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2021-02-17 14:32:41.007 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:32:41.007 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2021-02-17 14:33:11.008 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:33:11.008 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2021-02-17 14:33:41.009 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:33:41.009 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 1 retries left

The values.yaml below was used with https://github.com/rabbitmq/cluster-operator/tree/main/charts/rabbitmq:

labels:
  label1: foo
  label2: bar

annotations:
  annotation1: foo
  annotation2: bar

replicas: 3

imagePullSecrets:
  - name: foo

service:
  type: LoadBalancer

resources:
  requests:
    cpu: 100m
    memory: 1Gi
  limits:
    cpu: 100m
    memory: 1Gi

tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "rabbitmq"
    effect: "NoSchedule"

rabbitmq:
  additionalPlugins:
    - rabbitmq_shovel
    - rabbitmq_shovel_management
  additionalConfig: |
    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
  envConfig: |
    PLUGINS_DIR=/opt/rabbitmq/plugins:/opt/rabbitmq/community-plugins
  advancedConfig: |
    [
      {ra, [
        {wal_data_dir, '/var/lib/rabbitmq/quorum-wal'}
      ]}
    ].

terminationGracePeriodSeconds: 42

skipPostDeploySteps: true

override:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: rabbitmq
              ports:
                - containerPort: 12345 # opens an additional port on the rabbitmq server container
                  name: additional-port
                  protocol: TCP
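
For reference, a minimal install sketch, assuming the cluster operator itself is already running and the chart is used from a local checkout of the repository linked above (the release name is arbitrary and the namespace is guessed from the node names in the logs):

# Hypothetical install commands - chart path from the repo layout linked above,
# release name and namespace are placeholders.
git clone https://github.com/rabbitmq/cluster-operator.git
helm install rabbitmq ./cluster-operator/charts/rabbitmq \
  --namespace rabbitmq-cl-op-poc \
  --create-namespace \
  -f values.yaml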

Expected behavior

We had seen this problem before when we were using Bitnami images; the solution for it is documented here: https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq#recovering-the-cluster-from-complete-shutdown

Maybe it's good to document the same for the cluster-operator as well.

Version and environment information

  • RabbitMQ: 3.8.11
  • RabbitMQ Cluster Operator: 1.1.0
  • Kubernetes: v1.17.8
  • VMware (PKS)

Additional context

https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq#recovering-the-cluster-from-complete-shutdown

sheiks added the bug label on Feb 17, 2021

mkuratczyk commented Feb 17, 2021

Hi,

Thanks for this report. This is a known issue that we are planning on addressing soon (check #578).

If you are deploying a new cluster, you can set PodManagementPolicy to Parallel to avoid this problem.
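
For example, a minimal sketch of what that could look like on the RabbitmqCluster resource, assuming the policy is set through the StatefulSet override (the resource name is a placeholder):

# Hypothetical RabbitmqCluster manifest - only the override is relevant here,
# everything else is a placeholder.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  replicas: 3
  override:
    statefulSet:
      spec:
        podManagementPolicy: Parallel  # start all pods at once instead of one by one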

If you want to recover an existing cluster, you need to perform the following steps (PodManagementPolicy cannot be changed on an existing StatefulSet, so the process involves deleting the old StatefulSet):

  1. Set PodManagementPolicy: Parallel on the RabbitmqCluster resource (you will see errors in the operator's logs because a StatefulSet cannot have this policy updated - just ignore them)
  2. Run kubectl delete statefulsets.apps rabbitmq-server --cascade=orphan to delete the existing StatefulSet without deleting the pods and other child resources
  3. The Operator will now recreate the StatefulSet with the new policy. If needed, delete pod 0 to force a new start attempt (see the sketch after this list).
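
A rough sketch of those steps as kubectl commands, assuming the policy is applied via the StatefulSet override shown above with kubectl patch, and that the resource is named rabbitmq (adjust names and namespace to your deployment):

# 1. Hypothetical patch setting the Parallel policy via the StatefulSet override
#    (the operator will log errors about the immutable StatefulSet field - ignore them)
kubectl patch rabbitmqcluster rabbitmq --type merge \
  -p '{"spec":{"override":{"statefulSet":{"spec":{"podManagementPolicy":"Parallel"}}}}}'

# 2. Delete the StatefulSet without deleting the pods and other child resources
kubectl delete statefulsets.apps rabbitmq-server --cascade=orphan

# 3. The operator recreates the StatefulSet with the new policy;
#    if pod 0 is still stuck, delete it to force a new start attempt
kubectl delete pod rabbitmq-server-0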

Now, if you delete all your pods, they will all get started in parallel, which solves the problem.

The main disadvantage of Parallel is that data can easily be lost if you scale down your cluster (e.g. change replicas from 3 to 1). That's what we want to address before merging the PR and changing the policy to Parallel by default. If you don't plan to scale down your clusters, you can just leave it as Parallel. Alternatively, follow the same steps to change it back to OrderedReady.


sheiks commented Feb 18, 2021

Thank you Michal.

ChunyiLyu (Contributor) commented

This is fixed by PR #621.

The PR is on the main branch, not in a released version just yet. Will close after it's released.

ChunyiLyu (Contributor) commented

@sheiks The fix is released in v1.6.0: https://github.com/rabbitmq/cluster-operator/releases/tag/v1.6.0 :)
