-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] PVC expansion fails if the operator crashes in the middle of reconciliation #782
Labels
bug
Something isn't working
Comments
coro
added a commit
that referenced
this issue
Sep 8, 2021
This allows for this section of the code to have its own test suite, without relying on the restrictions placed by envtest. This allows us to test more scenarios, such as where a controller has previously crashed and left the existing API objects in a bad state (e.g. statefulsets missing, PVCs being the wrong size, etc.). This also fixes #782, as the process of scaling PVCs should now be more idempotent than before: - StatefulSets don't need to be present on the cluster in order to scale the PVCs (as long as the PVCs exist) - PVC scaling is done on a PVC-by-PVC case, rather than all at once, so that if one or more replicas have already scaled their PVCs it does not block the others - The current size of PVCs is determined by the live reading of the K8s API, rather than what's reported in the RabbitmqCluster object.
coro
added a commit
that referenced
this issue
Sep 23, 2021
* Extract PVC scaling to its own package This allows for this section of the code to have its own test suite, without relying on the restrictions placed by envtest. This allows us to test more scenarios, such as where a controller has previously crashed and left the existing API objects in a bad state (e.g. statefulsets missing, PVCs being the wrong size, etc.). This also fixes #782, as the process of scaling PVCs should now be more idempotent than before: - StatefulSets don't need to be present on the cluster in order to scale the PVCs (as long as the PVCs exist) - PVC scaling is done on a PVC-by-PVC case, rather than all at once, so that if one or more replicas have already scaled their PVCs it does not block the others - The current size of PVCs is determined by the live reading of the K8s API, rather than what's reported in the RabbitmqCluster object. * Fix integration tests * Move scaling test BeforeEach out of the test suite * Fix nil pointer exception when not defining replicas in Go struct Before this commit, when replicas was not defined as in: Spec: rabbitmqv1beta1.RabbitmqClusterSpec{ Replicas: &one, }, unit tests were failing. Replicas is an optional field. If not set, it should default to 1 instead of causing a panic. * Do not pollute info logs Before this commit huge "Found exising PVCs" logs were output after every reconcilication even though PVCs didn't get resized. "By default logr's V(0) is zap's InfoLevel and V(1) is zap's DebugLevel (which is numerically -1)." Co-authored-by: David Ansari <david.ansari@gmx.de>
@srteam2020 This should now be fixed, thanks again for raising the Issue! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the bug
We find that if the operator crashes at some point in the middle of performing PVC expansion, the actual PVC size will never meet user's spec even after the operator restarts.
Specifically, The rabbitmq operator performs the following steps to expand the capacity of statefulset's PVC,
volumeClaimTemplates
to determine whether to resize PVCvolumeClaimTemplates
)We ran a workload that changes PVC capacity from 10Gi to 15Gi, and observed that PVC was never resized to 15Gi if the operator crashed at some point between deleting cluster statefulset (ii) and updating PVC capacity (iii).
In that case, the cluster statefulset was actually deleted ((ii) is executed before crash) while the actual PVC capacity is not successfully resized ((iii) is not executed before crash). After the operator restarts from the crash, it will try to get the cluster statefulset. Since the statefulset doesn't exist, the operator will skip (i),(ii),(iii), and directly create the cluster statefulset with the same PVC capacity in its
volumeClaimTemplates
as in the CRD spec. Since the two values become the same, the PVC expansion logic will never be triggered until the user manually change the PVC capacity in the CRD spec again.To Reproduce
As mentioned above, resize the PVC from 10Gi to 15Gi and make the operator crash between (ii) and (iii). After restart, the PVC expansion will not succeed eventually.
Expected behavior
The operator should recover correctly from the crash and continue the PVC expansion.
Version and environment information
Additional context
We are willing to send a PR to help fix this issue. A simple fix could be to compare capacity in CRD Spec the actual PVC capacity (instead of the capacity in statefulset's
volumeClaimTemplates
), so that even after the untimely crash the PVC expansion can still work.The text was updated successfully, but these errors were encountered: