
[BUG] Reading stale RabbitmqCluster information can lead to undesired statefulset deletion #648

Closed
srteam2020 opened this issue Mar 30, 2021 · 3 comments · Fixed by #651
Labels: bug Something isn't working

@srteam2020 (Contributor)

Describe the bug

We find that rabbitmq-cluster-operator can accidentally delete a statefulset when the controller restarts and talks to a stale apiserver in an HA k8s cluster. After some inspection, we find that the root cause is stale data served by that apiserver, which makes the controller believe the RabbitmqCluster is about to be deleted (it carries a non-zero DeletionTimestamp) when it actually is not. One potential approach to handle this issue is to label each statefulset with the UID of its RabbitmqCluster, and check the label before deleting the statefulset (see the sketch below).
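
A minimal sketch of the labeling idea, assuming a controller-runtime based reconciler; the label key `rabbitmq.com/cluster-uid`, the helper name, and the import path for the API types are our assumptions, not the operator's actual code:

```go
package controllers

import (
	appsv1 "k8s.io/api/apps/v1"

	rabbitmqv1beta1 "github.com/rabbitmq/cluster-operator/api/v1beta1"
)

// clusterUIDLabel is a hypothetical label key recording which RabbitmqCluster
// instance a StatefulSet was created for.
const clusterUIDLabel = "rabbitmq.com/cluster-uid"

// labelStatefulSet stamps the StatefulSet with the owning RabbitmqCluster's
// UID at creation time, so later reconciles can verify ownership before
// deleting anything.
func labelStatefulSet(sts *appsv1.StatefulSet, cluster *rabbitmqv1beta1.RabbitmqCluster) {
	if sts.Labels == nil {
		sts.Labels = map[string]string{}
	}
	sts.Labels[clusterUIDLabel] = string(cluster.UID)
}
```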

To Reproduce

Steps to reproduce the behavior:

  1. Create a RabbitmqCluster named rabbitmq-cluster1 in an HA k8s cluster. The controller talks to apiserver1, and reconcile() creates a statefulset for rabbitmq-cluster1.
  2. Delete rabbitmq-cluster1. Apiserver1 sends update events with a non-zero DeletionTimestamp to the controller, and the controller deletes the statefulset of rabbitmq-cluster1 in prepareForDeletion. Meanwhile, apiserver2 is partitioned, so its watch cache stops at the moment rabbitmq-cluster1 is tagged with a non-zero DeletionTimestamp.
  3. Create a RabbitmqCluster with the same name again. rabbitmq-cluster1 and its statefulset come back, but apiserver2 still holds the stale view that rabbitmq-cluster1 has a non-zero DeletionTimestamp and is about to be deleted.
  4. The controller crashes due to a node failure and restarts, this time talking to the stale apiserver2. The restarted controller reads the stale rabbitmq-cluster1 from apiserver2, which still carries the non-zero DeletionTimestamp. Since the controller identifies the statefulset only by the name of the RabbitmqCluster, the statefulset belonging to the newly created RabbitmqCluster is deleted in prepareForDeletion.

Expected behavior
The controller should be able to differentiate between statefulsets belonging to different RabbitmqCluster instances that have shared the same name over time. When it reads stale information about a (previously deleted) RabbitmqCluster, the controller should check whether the existing statefulset really belongs to that RabbitmqCluster (e.g., by comparing UIDs) before performing any deletion, as in the sketch below.
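
A hedged sketch of that check, reusing the hypothetical `clusterUIDLabel` and imports from the sketch above. Because a recreated RabbitmqCluster gets a fresh UID, the stale object read from apiserver2 fails this comparison and the deletion is skipped:

```go
// statefulSetBelongsTo reports whether sts was created for exactly this
// RabbitmqCluster instance. A recreated cluster with the same name carries a
// different UID, so a stale read of the old (deleted) object fails the check.
func statefulSetBelongsTo(sts *appsv1.StatefulSet, cluster *rabbitmqv1beta1.RabbitmqCluster) bool {
	return sts.Labels[clusterUIDLabel] == string(cluster.UID)
}
```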

Version and environment information

  • RabbitMQ: 3.8.12
  • RabbitMQ Cluster Operator: 4f13b9a (main branch)
  • Kubernetes: 1.18.9

Additional context

We are willing to issue a PR to help fix this issue.
As mentioned above, we can label each statefulset with the UID of the RabbitmqCluster, and check the label before deleting the statefulset to ensure we are deleting the correct statefulset.

srteam2020 added the bug label on Mar 30, 2021
embano1 commented Mar 30, 2021

I think using Preconditions in DeleteOptions would be a good way to address this:

```go
type DeleteOptions struct {
	unversioned.TypeMeta `json:",inline"`

	// Optional duration in seconds before the object should be deleted. Value must be non-negative integer.
	// The value zero indicates delete immediately. If this value is nil, the default grace period for the
	// specified type will be used.
	GracePeriodSeconds *int64 `json:"gracePeriodSeconds,omitempty"`

	// Must be fulfilled before a deletion is carried out. If not possible, a 409 Conflict status will be
	// returned.
	Preconditions *Preconditions `json:"preconditions,omitempty"`

	// Should the dependent objects be orphaned. If true/false, the "orphan"
	// finalizer will be added to/removed from the object's finalizers list.
	OrphanDependents *bool `json:"orphanDependents,omitempty"`
}
```
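
With controller-runtime this could look roughly like the sketch below; the helper name is ours, and it assumes the caller already knows the UID of the StatefulSet it intends to delete (e.g. recorded when the StatefulSet was created):

```go
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteStatefulSetIfUIDMatches asks the apiserver to delete sts only if the
// stored object still has expectedUID. If the name now points at a different
// object (e.g. after a delete-and-recreate), the precondition fails with a
// 409 Conflict and nothing is deleted.
func deleteStatefulSetIfUIDMatches(ctx context.Context, c client.Client, sts *appsv1.StatefulSet, expectedUID types.UID) error {
	err := c.Delete(ctx, sts, client.Preconditions{UID: &expectedUID})
	if apierrors.IsConflict(err) {
		// UID mismatch: this is not the StatefulSet we meant to delete.
		return nil
	}
	return client.IgnoreNotFound(err)
}
```

Since the precondition is evaluated server-side against the stored object, it stays correct even when the controller's own view of the cluster is stale.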

@srteam2020 (Contributor, Author)

@embano1 Thanks for the pointer!
Yes, using Preconditions to specify the UID for deletion is a good way to address this. We can label each RabbitmqCluster with the UID of its statefulset, and specify the UID of the statefulset in the precondition when performing deletion.
We will issue a PR for this.

Zerpet (Collaborator) commented Apr 7, 2021

Thank you @srteam2020 for reporting this and for submitting a PR to fix it. We discussed this issue yesterday in our internal sync up and we will have a look at your PR, likely tomorrow.

srteam2020 added a commit to srteam2020/cluster-operator that referenced this issue Apr 7, 2021
srteam2020 added a commit to srteam2020/cluster-operator that referenced this issue Apr 7, 2021
ChunyiLyu added a commit that referenced this issue Apr 9, 2021
fix #648: delete the statefulset with precondition set to the correct…