
Set Pod Management to 'Parallel' and disallow cluster scale down entirely #621

Merged (2 commits, Mar 3, 2021)

Conversation

@ChunyiLyu (Contributor) commented on Mar 1, 2021

This closes #298

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

  • Set podManagementPolicy to Parallel (see the sketch after this list)

  • Use cluster_formation.randomized_startup_delay_range to avoid a race condition between nodes during initial cluster formation

  • This change solves the problem of failing to restart the cluster when all pods are deleted and recreated at once. With podManagementPolicy set to 'OrderedReady', pod 0 is the first one to be recreated, but it may not have been the last node to shut down; since a restarting RabbitMQ node waits for the last node that stopped, pod 0 can block on a peer that the StatefulSet won't create until pod 0 is Ready, deadlocking the restart. This problem was first reported in community slack

  • Disallow cluster scale down entirely. After talking to @mkuratczyk @MirahImage @yaronp68, we agreed that the cluster operator should prevent people from scaling down because it's not a properly supported and tested operation. Once Reconcile() detects a scale down request, it errors, publishes events, and sets ReconcileSuccess to false (see the sketch under the second commit message below)

  • PodManagementPolicy is immutable. For existing clusters, the operator won't be able to update the policy in place. Users would need to manually delete the statefulSet with cascading=false first, and then the operator can recreate the statefulSet with the correct settings. This needs to be mentioned in the release notes
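
For reference, here's a minimal Go sketch of what the two changes in the first bullets amount to. The function names and the delay range values are illustrative assumptions, not the operator's actual builder code:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// applyPodManagementPolicy flips the StatefulSet to Parallel, so all pods
// are created and deleted at once instead of waiting on pod 0 to be Ready.
func applyPodManagementPolicy(sts *appsv1.StatefulSet) {
	sts.Spec.PodManagementPolicy = appsv1.ParallelPodManagement
}

// startupDelayConf sketches the rabbitmq.conf lines that stagger node boot
// to avoid the peer-discovery race during initial cluster formation. The
// min/max values here are assumptions, not necessarily what this PR uses.
func startupDelayConf() string {
	return "cluster_formation.randomized_startup_delay_range.min = 0\n" +
		"cluster_formation.randomized_startup_delay_range.max = 60\n"
}

func main() {
	sts := &appsv1.StatefulSet{}
	applyPodManagementPolicy(sts)
	fmt.Println(sts.Spec.PodManagementPolicy) // Parallel
	fmt.Print(startupDelayConf())
}
```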

Additional Context

  • A cleaner and more elegant solution would have been CRD-level validation on spec.replicas to prevent people from updating it to a lower number. However, that would involve a webhook, which adds another component for us to maintain (see the sketch after this list). The current solution is easier to achieve, and considering we want to support scale down in the future, it's an OK temporary fix.

  • controller-gen was updated in a previous PR, but the crd tag was not updated. I've included the change in this PR since it's a one-liner.
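
For context, a hypothetical sketch of what the webhook alternative could look like, assuming a controller-runtime validating webhook on the existing RabbitmqCluster API type (with spec.replicas as a *int32); none of this is part of this PR:

```go
package v1beta1

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// Hypothetical: reject any decrease of spec.replicas at admission time,
// before the object is ever persisted.
func (r *RabbitmqCluster) ValidateUpdate(old runtime.Object) error {
	oldCluster, ok := old.(*RabbitmqCluster)
	if !ok {
		return apierrors.NewBadRequest(fmt.Sprintf("expected a RabbitmqCluster but got %T", old))
	}
	if r.Spec.Replicas != nil && oldCluster.Spec.Replicas != nil &&
		*r.Spec.Replicas < *oldCluster.Spec.Replicas {
		return apierrors.NewForbidden(
			schema.GroupResource{Group: "rabbitmq.com", Resource: "rabbitmqclusters"},
			r.Name,
			field.Forbidden(field.NewPath("spec", "replicas"), "cluster scale down is not supported"))
	}
	return nil
}
```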

- use cluster_formation.randomized_startup_delay_range
to avoid race condition between nodes during initial
cluster formation
- this change solves the problem of failing to restart
the cluster when all pods are deleted and recreated at
once; with podManagementPolicy set to 'OrderedReady',
pod 0 is also the first one getting recreated, but it
may not be the last node to shut down.
@mkuratczyk (Collaborator) left a comment:
Thanks!

- prevent cluster scale down from happening by checking
current number of replicas vs desired number of replicas
after running statefulSetBuilder.Update()
- return errors, write logs, publish events, and set ReconcileSuccess
to false if a scale down request is detected
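
For illustration, a minimal sketch of the guard this commit describes. The type, field, and helper names below (Reconciler, Log, Recorder, the event reason) are assumptions, not necessarily the operator's real identifiers:

```go
package controllers

import (
	"errors"
	"fmt"

	"github.com/go-logr/logr"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

type Reconciler struct {
	Log      logr.Logger
	Recorder record.EventRecorder
}

// checkForScaleDown runs after statefulSetBuilder.Update() has produced the
// desired StatefulSet: if the desired replica count is lower than the current
// one, log it, publish a warning event on the RabbitmqCluster, and return an
// error so the caller can set ReconcileSuccess to false and stop reconciling.
func (r *Reconciler) checkForScaleDown(cluster runtime.Object, current, desired *appsv1.StatefulSet) error {
	currentReplicas := int32(1) // Kubernetes defaults a nil replicas field to 1
	if current.Spec.Replicas != nil {
		currentReplicas = *current.Spec.Replicas
	}
	desiredReplicas := int32(1)
	if desired.Spec.Replicas != nil {
		desiredReplicas = *desired.Spec.Replicas
	}
	if desiredReplicas < currentReplicas {
		msg := fmt.Sprintf("cluster scale down from %d to %d replicas is not supported",
			currentReplicas, desiredReplicas)
		err := errors.New("unsupported operation")
		r.Log.Error(err, msg)
		r.Recorder.Event(cluster, corev1.EventTypeWarning, "FailedReconcile", msg)
		return fmt.Errorf("%s: %w", msg, err)
	}
	return nil
}
```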
Merging this pull request may close: Look into setting podManagementPolicy to Parallel