
[BUG] Unexpected rabbitmqCluster replica number when operator misses particular scale events #758

Closed · srteam2020 opened this issue on Jul 3, 2021 · 2 comments · Fixed by #828
Labels: bug (Something isn't working)

@srteam2020 (Contributor)

Describe the bug

The rabbitmq operator has the following check to prevent users from scaling down the statefulset (and PVCs) of the cluster:

if r.scaleDown(ctx, rabbitmqCluster, current, sts) {
	// return when cluster scale down detected; unsupported operation
	return ctrl.Result{}, nil
}

If the user tries to change the replica count from 3 to 2, the scale down should not happen and the replica count should remain 3.
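
For reference, the guard boils down to comparing the desired replica count from the rabbitmqCluster spec against the StatefulSet's current replica count. A minimal sketch of that comparison (names are illustrative, not the operator's actual scaleDown implementation):

// Minimal sketch of the scale-down guard; the function and parameter names
// are illustrative and not taken from the operator's code.
func scaleDownDetected(currentReplicas, desiredReplicas int32) bool {
	// Scaling down is unsupported, so the reconcile loop returns early
	// and leaves the StatefulSet at its current size.
	return desiredReplicas < currentReplicas
}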

We ran a workload that changes the replica count 1 -> 3 -> 2. Ideally, the replica count should remain 3 since scaling down is not allowed. However, in some cases we observed 2 replicas at the end.

The reason is that the goroutine performing reconciliation (g1) and the goroutine updating the locally cached rabbitmqCluster object (g2) run concurrently, and certain interleavings of the two goroutines can lead to the unexpected scaling behavior.
Ideally, this interleaving leads to the correct result:

  1. g2: set the replica count in the cached rabbitmqCluster from 1 to 3
  2. g1: read the rabbitmqCluster object and scale the sts from 1 to 3
  3. g2: set the replica count in the cached rabbitmqCluster from 3 to 2
  4. g1: read the rabbitmqCluster object and refuse to scale from 3 to 2

And we found that the following interleaving (when the operator's reconcile goroutine runs relatively slowly) leads to the unexpected scaling behavior mentioned above (a minimal sketch of this race follows the list):

  1. g2: set the replica count in the cached rabbitmqCluster from 1 to 3
  2. g2: set the replica count in the cached rabbitmqCluster from 3 to 2
  3. g1: read the rabbitmqCluster object and scale the sts from 1 to 2
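
The following is a minimal, self-contained Go sketch (not operator code) of that second interleaving: g2 applies both user edits to a shared cached value before the slow reconciler g1 reads it, so g1 never observes the intermediate value 3 and happily scales from 1 to 2.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var cachedDesired int32 = 1 // locally cached desired replica count
	current := int32(1)         // current StatefulSet replica count

	// g2: the cache-update goroutine applies both user edits back to back.
	go func() {
		atomic.StoreInt32(&cachedDesired, 3) // user scales 1 -> 3
		atomic.StoreInt32(&cachedDesired, 2) // user scales 3 -> 2
	}()

	// g1: the reconciler runs late (the sleep stands in for a slow reconcile
	// goroutine) and only sees the latest cached value, which is 2.
	time.Sleep(100 * time.Millisecond)
	desired := atomic.LoadInt32(&cachedDesired)
	if desired < current {
		fmt.Println("scale down detected; refusing") // not reached: 2 > 1
	} else {
		current = desired
		fmt.Printf("scaled sts from 1 to %d replicas\n", current) // prints 2, not 3
	}
}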

To Reproduce

As mentioned above, we change the replica count 1 -> 3 -> 2. When the operator misses the intermediate view of "replica = 3" and directly sees "replica = 2", the rabbitmq statefulset ends up with 2 replicas, not 3.

Expected behavior
Since scale down is not allowed, we should have 3 replicas eventually.

Version and environment information

  • RabbitMQ: 3.8.12
  • RabbitMQ Cluster Operator: 4f13b9a (main branch)
  • Kubernetes: 1.18.9

Additional context

This issue is probably hard to fix from the operator side: how the operator reads events from users (e.g., scaling up/down) and how the different goroutines interleave with each other are decided by the controller-runtime package, not by the operator code. We did not find a good way to solve this issue with the existing controller-runtime package.
Instead, we can at least make future debugging easier for other users who hit similar problems by adding more logs. For example, we can log currentReplicas and desiredReplicas in scaleDown so users can see the exact sequence of replica counts observed by the operator (a rough sketch of such a log line follows).
We are willing to send a PR to add one more log here.
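
As a rough illustration, the extra log could look something like this (a sketch assuming the logr-style key/value logger used with controller-runtime; the variable names are illustrative):

// Sketch only: log both values inside scaleDown so the sequence of replica
// counts actually observed by the operator can be reconstructed from the logs.
logger.Info("cluster scale down detected; this operation is not supported",
	"currentReplicas", currentReplicas,
	"desiredReplicas", desiredReplicas)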

@srteam2020 added the bug (Something isn't working) label on Jul 3, 2021
@srteam2020 (Contributor, Author)

This issue might be hard to reproduce, as it only manifests when the particular interleaving happens (e.g., when the reconcile goroutine runs slowly). We have recently been building a tool, https://github.com/sieve-project/sieve, to reliably detect and reproduce bugs like the one above in various k8s controllers. This bug can be reproduced automatically with just a few commands using our tool. Please give it a try if you want to reproduce the bug.

To use the tool, please first run the environment check:

python3 check-env.py

If the checker passes, please build the Kubernetes and controller images using the commands below as our tool needs to set up a kind cluster to reproduce the bug:

python3 build.py -p kubernetes -m obs-gap -d DOCKER_REPO_NAME
python3 build.py -p rabbitmq-operator -m obs-gap -d DOCKER_REPO_NAME -s 4f13b9a942ad34fece0171d2174aa0264b10e947

-d DOCKER_REPO_NAME should be a docker repo that you have write access to, and -s 4f13b9a942ad34fece0171d2174aa0264b10e947 is the commit where the bug was found.

Finally, run

python3 sieve.py -p rabbitmq-operator -t scaleup-scaledown -d DOCKER_REPO_NAME

It will try to scale the rabbitmq cluster 1 -> 3 -> 2 and manipulate the operator's goroutine interleaving to trigger the corner-case interleaving described above.
The bug is successfully reproduced if you see the messages below:

[ERROR] persistentvolumeclaim SIZE inconsistency: 3 seen after learning run, but 2 seen after testing run
[ERROR] pod SIZE inconsistency: 4 seen after learning run, but 3 seen after testing run

Ideally there should eventually be 4 pods (1 operator pod and 3 replicas), but after the test run we observe 3 pods (1 operator pod and 2 replicas).

For more information please refer to https://github.com/sieve-project/sieve#sieve-testing-datacenter-infrastructures-using-partial-histories and https://github.com/sieve-project/sieve/blob/main/docs/reprod.md#rabbitmq-cluster-operator-758

Please let me know if you encounter any problems when reproducing the bug with the tool.

@ChunyiLyu added the sync-up (Issue to discuss during sync-up) label on Jul 5, 2021
@Zerpet (Collaborator) commented on Jul 28, 2021

Hello @srteam2020, thank you for reporting this issue. We talked about it in our internal sync-up, and we acknowledge this is a flaw in the Operator. However, we think this situation is an edge case, and given that we haven't received further reports of this behaviour, we won't be able to get to it any time soon.

If you would like to contribute a fix for this issue, we will very gladly work with you to validate the changes and merge a potential fix.

Once again, thank you for reporting this issue, and thank you for testing this Operator.
