
Don't create an impossible disruption budget for smaller clusters #384

Closed
coderanger wants to merge 2 commits into zalando-incubator:master from coderanger:disruption-budget

Conversation

coderanger
Contributor

When the cluster has 1 instance (or probably 0, but that wasn't the case I hit), there is still a disruption budget with minimum=1, so the single existing pod can never be evicted. I hit this because it left those database pods blocking a pre-upgrade node drain :) If you've already opted to create a single-instance cluster, it seems fair to allow it to be disrupted if needed.
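
For illustration, a minimal sketch of the kind of budget in question, built against the policy/v1beta1 API this PR touches; the object name and the label key are illustrative assumptions, not copied from the operator:

package example

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// With a single Postgres pod matching the selector, minAvailable: 1 means that
// pod can never be evicted voluntarily, so a node drain blocks on it forever.
func examplePDB() *policyv1beta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "postgres-example-pdb"},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"spilo-role": "master"},
			},
		},
	}
}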

@coveralls

coveralls commented Sep 11, 2018

Coverage Status

Coverage remained the same at 22.778% when pulling eea5fd5 on coderanger:disruption-budget into 0b75a89 on zalando-incubator:master.

@Jan-M
Member

Jan-M commented Sep 12, 2018

I disagree here, because our disruption budget is always impossible to resolve: it only ever covers the master pod, of which there is always exactly one.

The intent is that the operator, rather than the Kubernetes cluster manager, coordinates moving and failing over database pods.

The operator also contains code to handle those single master pods with no replicas. You can trigger migration by setting / removing labels on nodes.

Our Kubernetes infrastructure does have a timeout set; if the operator does not manage to move the master by that point, it gets killed.

@sdudoladov
Member

Our Kubernetes infrastructure does have a timeout set,

We also allow single-node clusters in our test environment; the masters there should not go away voluntarily even though they only host test PG clusters, hence the need for disruption budgets even for small clusters.

@coderanger
Contributor Author

If that's the case, then the operator needs to watch for nodes being marked for eviction (set unschedulable) or something similar, and implement its own version of draining nodes marked for maintenance :) Otherwise this will block kubectl drain or any comparable mechanism used during rolling upgrades.

@hjacobs
Contributor

hjacobs commented Sep 12, 2018

@coderanger please watch our KubeCon talk "Continuously Deliver your Kubernetes Infrastructure" on how we do rolling cluster upgrades; it's not a naive rolling node upgrade, but involves a bit more coordination (for PostgreSQL).

@coderanger
Contributor Author

@hjacobs That doesn't really apply to things like GKE and EKS though, and there should at least be options to work in those environments since they are very common :) I can move this elsewhere; maybe if the pod disruption budget name is "" then don't create it at all? Or a new flag in the Postgresql object?

@coderanger
Contributor Author

Anyone have a strong feeling on how to structure it? Happy to do either the object name template approach above or a new flag, or something else that's better :)

@alexeyklyukin
Contributor

alexeyklyukin commented Sep 12, 2018

@coderanger

At the moment there is no setting in the operator to turn off creation of PDB for the Postgres cluster. Normally, even for a cluster with one pod, the operator should do the right thing and delete the master pod when the node is being drained and the 'ready' label is removed.

In short, the procedure we use is: remove the ready label from all old Kubernetes nodes, create N new nodes, drain the N old nodes; the operator will pick up all master pods from those old nodes and move them to one of the new ones. How is that specific to AWS, and what should be done differently on GKE or EKS?

The legitimate case I see here is when one wants to opt out completely from our rolling node update scheme for some reason. We could have a workaround of updating the PDB with MinAvailable: 0 if and only if the cluster is declared with the number of instances set to 0.

@@ -1052,6 +1052,11 @@ func (c *Cluster) generateCloneEnvironment(description *acidv1.CloneDescription)
func (c *Cluster) generatePodDisruptionBudget() *policybeta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)

	// Avoid creating an unsatisfiable budget.
	if c.Spec.NumberOfInstances <= 1 {
Contributor

I'd change it to < 1. A minAvailable: 1 for a cluster with one pod is easy to satisfy.
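
Read as a standalone sketch, the suggested check would look roughly like the helper below; the 0 value in the relaxed branch is an assumption here, mirroring what a later revision of this PR does:

package example

import "k8s.io/apimachinery/pkg/util/intstr"

// Keep minAvailable at 1 whenever at least one pod exists; only drop to 0 for
// an empty cluster, where it is trivially satisfied.
func pdbMinAvailable(numberOfInstances int32) intstr.IntOrString {
	if numberOfInstances < 1 {
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}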

@coderanger
Contributor Author

@alexeyklyukin when GKE runs an upgrade, it runs the whole process itself. Having to manually cycle pods is technically doable but creates substantial complexity that doesn't need to be there. The expected behavior for most applications is: if they run multiple replicas, set a disruption budget with a minimum of 1 so that at least one pod will be migrated before the node cycles. For a single-replica service there will have to be downtime during a node cycle, so having a pod disruption budget just blocks the upgrade forever (or until a human goes in and manually kills the pod).

Requiring a human operator to manage these things is kind of against the point of automating the cluster. In a multi-replica database I would expect the normal failover system to kick in and fix things, so I don't want node drains to get blocked pending human interaction. If you have very high uptime requirements I can see the argument for manually triggering the failover, I guess? But for most of my databases, I'm cool with a second or two of downtime as a failover runs because one of the pods was killed, or in the case of a single instance I've accepted that there may be momentary outages when pods get moved around :)

@alexeyklyukin
Contributor

@coderanger so what about setting numberOfInstances to 0 explicitly for the clusters that are expected to have downtime during the upgrade? If the operator reacts by setting minAvailable to 0 on the PDB, I am fine with that, as the user gave explicit consent to the downtime by setting the cluster size to 0.

@coderanger
Contributor Author

I would expect numberOfInstances:0 to scale down the existing database pods? That would also have to be done on every database object every time there is a Kubernetes upgrade to run (or any node maintenance that requires draining).

@coderanger
Contributor Author

(it's very possible I misunderstand how the operator sets the number of replicas in the statefulset :)

@alexeyklyukin
Contributor

I would expect numberOfInstances:0 to scale down the existing database pods? That would also have to be done on every database object every time there is a Kubernetes upgrade to run (or any node maintenance that requires draining).

Yes, setting the number of instances to 0 terminates all pods of the existing cluster.

You can do this for every cluster just before the upgrade if you are fine with upgrade-related downtime. Otherwise, this would be required only for those clusters running just one pod; for the rest, you can either wait for the replica pods to terminate and start on new nodes and then switch over to them manually, or use the existing upgrade procedure in the operator and figure out how to make Google remove the readiness label from the nodes that are about to be drained and assign it to the new nodes.

Anyway, it will not make things worse, as a naive upgrade on GKE will most certainly lead to at least one downtime (when the master is terminated without any prior notice it may take up to 30 seconds to fail over to the replica), and in many cases even more than one (if the master pod fails over, one or more times, to a node that is about to be decommissioned). By devising a special upgrade procedure we avoid exactly the issue of having multiple downtimes for the cluster during a single Kubernetes upgrade.

@Jan-M
Member

Jan-M commented Sep 13, 2018

I would be fine with a feature toggle that sets the minimum number of instances required for a PDB to be created, if this helps the GKE deployment, which we have not had enough time to test right now. (e.g. the default value would be 1 pod, but you could change it to 2 and then skip the PDB for 1-pod clusters)

Otherwise we need some time to look into explaining to you how to trigger the "pod migration" the operator does. To my memory you just need to set the end-of-life label. As Oleksii points out, the important part is to reduce the number of failovers for clusters with 2+ pods, as you do not want to fail over to a replica still running on an old node, but to a replica that has already been moved to an updated node.

@Jan-M
Member

Jan-M commented Sep 13, 2018

A quick look at the call sites for "moveMasterPodsOffNode" would probably help.

Seems we watch for unschedulable nodes and the removal of the ready label. (The absence of the ready label makes Postgres pods start on new nodes, given the affinity config.)
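
Roughly, the trigger condition described here could be sketched as follows; this is an illustration rather than the operator's actual code, and the label key and helper name are assumptions (moveMasterPodsOffNode itself is not reproduced):

package example

import v1 "k8s.io/api/core/v1"

// A node counts as "to be drained" once it is cordoned (unschedulable) or has
// lost the readiness label; at that point the operator would move master pods
// off it.
func nodeNeedsDraining(node *v1.Node, readyLabel string) bool {
	_, hasReadyLabel := node.Labels[readyLabel]
	return node.Spec.Unschedulable || !hasReadyLabel
}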

@coderanger
Contributor Author

While I'm 100% on board that the Zalando-style node upgrade/maintenance workflow is better, stuff like GKE and Kops upgrades are what they are :) I can rework this using some new options. I think the clearest way would be an allowMasterDisruption flag that can be explicitly set. The name at least hints that setting it to true will potentially result in downtime. Seem reasonable?

…ich shouldn't matter anyway but be nicer to the scheduler).
@coderanger
Contributor Author

Updated to add a new allow_master_disruption cluster-level setting. I left in setting the minimum to zero if number_of_instances is 0 but I don't think that technically matters in all but the edge-iest of edge cases so I'm happy to remove it if it seems like the wrong thing :)

@coderanger
Contributor Author

Anything further I should tweak on this?

@erthalion erthalion self-assigned this Dec 12, 2018
@erthalion
Contributor

Looks good. Although I'm a bit concerned about the name of this option - I'm not sure I personally would trust a project with an "allow to screw my db" option :) What about being flexible to the full extent and just having a min_disruption_budget option?
Also, why not have a test for the case where you hit this issue, i.e. NumberOfInstances: 1?

@coderanger
Contributor Author

Heh, there was a test for that back on the previous implementation. I can switch the option to direct control over the budget minimum though :)

@erthalion
Contributor

Heh, there was a test for that back on the previous implementation.

What happened to it?

I can switch the option to direct control over the budget minimum though :)

Yes, that would be nice. Other than that I don't see any issues with this proposal.

@coderanger
Contributor Author

What should it do if someone puts a value other than 0 or 1?

@erthalion
Contributor

What should it do if someone puts a value other than 0 or 1?

Since the whole purpose of this feature is to add a bit more flexibility, I would argue that in this case the operator just needs to log a warning and proceed with the provided value (but of course the default value should be 1). Any other opinions?

@coderanger
Contributor Author

coderanger commented Dec 13, 2018

I mean, it's a hard error if it's more than 1, I think? The disruption budget only targets the master pod, and I think there should only ever be one of those other than during a failover? 2 or more would be a permanently unsatisfiable budget in that case.

@erthalion
Contributor

I mean, it's a hard error if it's more than 1, I think? The disruption budget only targets the master pod, and I think there should only ever be one of those other than during a failover?

With the current setup - yes, but then my vivid imagination helps me think of the operator managing a sharded cluster with more than one master. At the same time it's pretty far-fetched for now, so you're probably right - let's throw a validation error if min availability is more than 1.
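
For the record, the validation being agreed on here might look roughly like this; the option name comes from the discussion above, while the helper itself is hypothetical and not part of this PR:

package example

import "fmt"

// Default to 1, accept 0 as an explicit opt-in to disruption, and reject
// anything larger, since a budget above 1 can never be satisfied by a single
// master pod.
func validateMinDisruptionBudget(value *int32) (int32, error) {
	if value == nil {
		return 1, nil // option not set: keep the current behaviour
	}
	if *value < 0 || *value > 1 {
		return 0, fmt.Errorf("min_disruption_budget must be 0 or 1, got %d", *value)
	}
	return *value, nil
}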

@sdudoladov
Member

@coderanger @erthalion do you have any updates on this PR ?

@FxKu FxKu added this to the v1.2 milestone May 20, 2019
@FxKu
Member

FxKu commented May 24, 2019

Looks like this PR would also resolve #547.

@@ -1052,6 +1052,11 @@ func (c *Cluster) generateCloneEnvironment(description *acidv1.CloneDescription)
func (c *Cluster) generatePodDisruptionBudget() *policybeta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)

	// If master disruption is enabled or there is no master, set the budget to 0.
	if c.Spec.AllowMasterDisruption || c.Spec.NumberOfInstances <= 0 {
Member

Do we create the pod disruption budget when we scale up?

Member

@FxKu May 24, 2019

I'll test with kind

So, simply scaling to 0 and then back to 1 doesn't work out of the box. Syncing gets interrupted while syncing roles:

time="2019-05-24T14:01:14Z" level=debug msg="updating statefulset" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:14Z" level=debug msg="updating statefulset annotations" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:14Z" level=debug msg="syncing roles" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:15Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.110.86.179:5432: connect: connection refused" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:30Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.110.86.179:5432: connect: connection refused" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:44Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:44Z" level=info msg="cluster has been updated" cluster-name=default/acid-minimal-cluster pkg=controller worker=0

Deleting the operator pod helps to sync roles so that pdb is synced, too (and set to minAvailable: 1).

time="2019-05-24T14:10:53Z" level=debug msg="statefulset's rolling update annotation has been set to false" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing roles" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing databases" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing pod disruption budgets" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=info msg="cluster has been synced" cluster-name=default/acid-minimal-cluster pkg=controller worker=0

	Tolerations           []v1.Toleration `json:"tolerations,omitempty"`
	Sidecars              []Sidecar       `json:"sidecars,omitempty"`
	PodPriorityClassName  string          `json:"pod_priority_class_name,omitempty"`
	AllowMasterDisruption bool            `json:"allow_master_disruption,omitempty"`
Member

is "allow" a good word? should we not just stick with "enable"?

Member

And then it should also default to "enabled=True"

Member

Agree. And should it be introduced to all the different config targets?

@FxKu
Member

FxKu commented Jun 18, 2019

Thanks @coderanger for your contribution. I have followed up on your work and created a configuration toggle for enabling/disabling the PDB instead of having a manifest parameter. Hope it helps your workflows.
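
A hedged sketch of how such an operator-level toggle could be wired in; the option name, and the choice to relax the budget to 0 rather than skip the object entirely, are assumptions and not a description of the follow-up change itself:

package example

import "k8s.io/apimachinery/pkg/util/intstr"

// When the toggle is off, or the cluster has no pods, the budget no longer
// blocks voluntary evictions; otherwise the usual master protection applies.
func pdbMinAvailableWithToggle(enablePDB bool, numberOfInstances int32) intstr.IntOrString {
	if !enablePDB || numberOfInstances <= 0 {
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}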
