
[BUG] Unexpected rabbitmqCluster replica number when operator misses particular scale events #758

Closed · srteam2020 opened this issue on Jul 3, 2021 · 2 comments · Fixed by #828
Labels: bug (Something isn't working)

@srteam2020 (Contributor)

Describe the bug

The rabbitmq operator has the following check to prevent users from scaling down the statefulset (and PVCs) of the cluster:

if r.scaleDown(ctx, rabbitmqCluster, current, sts) {
	// return when cluster scale down detected; unsupported operation
	return ctrl.Result{}, nil
}

If the user tries to change the replica count from 3 to 2, the scale down should not happen and the replica count should remain 3.
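
For reference, the guard boils down to comparing the desired replica count from the rabbitmqCluster spec against the StatefulSet's current replica count. A minimal sketch of that comparison (names are illustrative, not the operator's actual scaleDown implementation):

// Minimal sketch of the scale-down guard; the function and parameter names
// are illustrative and not taken from the operator's code.
func scaleDownDetected(currentReplicas, desiredReplicas int32) bool {
	// Scaling down is unsupported, so the reconcile loop returns early
	// and leaves the StatefulSet at its current size.
	return desiredReplicas < currentReplicas
}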

We ran a workload that changes the replica count 1 -> 3 -> 2. Ideally, the replica count should remain 3 since scaling down is not allowed. However, in some cases we observed 2 replicas at the end.

The reason is that the goroutine performing reconciliation (g1) and the goroutine updating the locally cached rabbitmqCluster object (g2) run concurrently, and certain interleavings of the two goroutines can lead to the unexpected scaling behavior.
Ideally, this interleaving leads to the correct result:

  1. g2: set the replica count in the cached rabbitmqCluster from 1 to 3
  2. g1: read the rabbitmqCluster object and scale the sts from 1 to 3
  3. g2: set the replica count in the cached rabbitmqCluster from 3 to 2
  4. g1: read the rabbitmqCluster object and refuse to scale from 3 to 2

And we found that the following interleaving (when the operator's reconcile goroutine runs relatively slowly) leads to the unexpected scaling behavior mentioned above (a minimal sketch of this race follows the list):

  1. g2: set the replica count in the cached rabbitmqCluster from 1 to 3
  2. g2: set the replica count in the cached rabbitmqCluster from 3 to 2
  3. g1: read the rabbitmqCluster object and scale the sts from 1 to 2
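
The following is a minimal, self-contained Go sketch (not operator code) of that second interleaving: g2 applies both user edits to a shared cached value before the slow reconciler g1 reads it, so g1 never observes the intermediate value 3 and happily scales from 1 to 2.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var cachedDesired int32 = 1 // locally cached desired replica count
	current := int32(1)         // current StatefulSet replica count

	// g2: the cache-update goroutine applies both user edits back to back.
	go func() {
		atomic.StoreInt32(&cachedDesired, 3) // user scales 1 -> 3
		atomic.StoreInt32(&cachedDesired, 2) // user scales 3 -> 2
	}()

	// g1: the reconciler runs late (the sleep stands in for a slow reconcile
	// goroutine) and only sees the latest cached value, which is 2.
	time.Sleep(100 * time.Millisecond)
	desired := atomic.LoadInt32(&cachedDesired)
	if desired < current {
		fmt.Println("scale down detected; refusing") // not reached: 2 > 1
	} else {
		current = desired
		fmt.Printf("scaled sts from 1 to %d replicas\n", current) // prints 2, not 3
	}
}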

To Reproduce

As mentioned above, we change the replica count 1 -> 3 -> 2. When the operator misses the intermediate view of "replica = 3" and directly sees "replica = 2", the rabbitmq statefulset ends up with 2 replicas, not 3.

Expected behavior
Since scale down is not allowed, we should have 3 replicas eventually.

Version and environment information

  • RabbitMQ: 3.8.12
  • RabbitMQ Cluster Operator: 4f13b9a (main branch)
  • Kubernetes: 1.18.9

Additional context

This issue is probably hard to fix from the operator side: how the operator reads events from users (e.g., scaling up/down) and how the different goroutines interleave with each other are decided by the controller-runtime package, not by the operator code. We did not find a good way to solve this issue with the existing controller-runtime package.
Instead, we can at least make future debugging easier for other users who hit similar problems by adding more logs. For example, we can log currentReplicas and desiredReplicas in scaleDown so users can see the exact sequence of replica counts observed by the operator (a rough sketch of such a log line follows).
We are willing to send a PR to add one more log here.
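
As a rough illustration, the extra log could look something like this (a sketch assuming the logr-style key/value logger used with controller-runtime; the variable names are illustrative):

// Sketch only: log both values inside scaleDown so the sequence of replica
// counts actually observed by the operator can be reconstructed from the logs.
logger.Info("cluster scale down detected; this operation is not supported",
	"currentReplicas", currentReplicas,
	"desiredReplicas", desiredReplicas)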

@srteam2020 added the bug (Something isn't working) label on Jul 3, 2021
@srteam2020 (Contributor, Author)

This issue might be hard to reproduce, as it only manifests when the particular interleaving happens (e.g., when the reconcile goroutine runs slowly). We have recently been building a tool, https://github.com/sieve-project/sieve, to reliably detect and reproduce bugs like the one above in various k8s controllers. This bug can be reproduced automatically with just a few commands using our tool. Please give it a try if you want to reproduce the bug.

To use the tool, please first run the environment check:

python3 check-env.py

If the checker passes, please build the Kubernetes and controller images using the commands below as our tool needs to set up a kind cluster to reproduce the bug:

python3 build.py -p kubernetes -m obs-gap -d DOCKER_REPO_NAME
python3 build.py -p rabbitmq-operator -m obs-gap -d DOCKER_REPO_NAME -s 4f13b9a942ad34fece0171d2174aa0264b10e947

-d DOCKER_REPO_NAME should be a docker repo that you have write access to, and -s 4f13b9a942ad34fece0171d2174aa0264b10e947 is the commit where the bug was found.

Finally, run

python3 sieve.py -p rabbitmq-operator -t scaleup-scaledown -d DOCKER_REPO_NAME

It will try to scale the rabbitmq cluster 1 -> 3 -> 2 and manipulate the operator's goroutine interleaving to trigger the corner-case interleaving described above.
The bug is successfully reproduced if you see the messages below:

[ERROR] persistentvolumeclaim SIZE inconsistency: 3 seen after learning run, but 2 seen after testing run
[ERROR] pod SIZE inconsistency: 4 seen after learning run, but 3 seen after testing run

Ideally there should eventually be 4 pods (1 operator pod and 3 replicas), but after the test run we observe 3 pods (1 operator pod and 2 replicas).

For more information please refer to https://github.com/sieve-project/sieve#sieve-testing-datacenter-infrastructures-using-partial-histories and https://github.com/sieve-project/sieve/blob/main/docs/reprod.md#rabbitmq-cluster-operator-758

Please let me know if you encounter any problems when reproducing the bug with the tool.

@ChunyiLyu added the sync-up (Issue to discuss during sync-up) label on Jul 5, 2021
@Zerpet (Collaborator) commented on Jul 28, 2021

Hello @srteam2020, thank you for reporting this issue. We talked about it in our internal sync-up, and we acknowledge this is a flaw in the Operator. However, we think this situation is an edge case, and given that we haven't received further reports of this behaviour, we won't be able to get to it any time soon.

If you would like to contribute a fix for this issue, we will very gladly work with you to validate the changes and merge a potential fix.

Once again, thank you for reporting this issue, and thank you for testing this Operator.
