
Don't create an impossible disruption budget for smaller clusters #384

Closed
coderanger wants to merge 2 commits into zalando-incubator:master from coderanger:disruption-budget

Conversation

coderanger
Contributor

When the cluster has 1 instance (or probably 0, but that wasn't the case I hit), there is still a disruption budget with minimum=1, so the single existing pod can never be evicted. I hit this because it left those database pods blocking a pre-upgrade node drain :) If you've already opted to create a single-instance cluster, it seems fair to allow it to be disrupted if needed.
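
For illustration, a minimal sketch of the kind of budget in question, built against the policy/v1beta1 API this PR touches; the object name and the label key are illustrative assumptions, not copied from the operator:

package example

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// With a single Postgres pod matching the selector, minAvailable: 1 means that
// pod can never be evicted voluntarily, so a node drain blocks on it forever.
func examplePDB() *policyv1beta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "postgres-example-pdb"},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"spilo-role": "master"},
			},
		},
	}
}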

@coveralls

coveralls commented Sep 11, 2018

Coverage Status

Coverage remained the same at 22.778% when pulling eea5fd5 on coderanger:disruption-budget into 0b75a89 on zalando-incubator:master.

@Jan-M
Member

Jan-M commented Sep 12, 2018

I disagree here, because our disruption budget is always impossible to resolve: it only ever covers the master pod, of which there is always exactly one.

The intent is that the operator, rather than the Kubernetes cluster manager, coordinates moving and failing over database pods.

The operator also contains code to handle those single master pods with no replicas. You can trigger migration by setting / removing labels on nodes.

Our Kubernetes infrastructure does have a timeout set; if the operator does not manage to move the master by that point, it gets killed.

@sdudoladov
Member

Our Kubernetes infrastructure does have a timeout set,

We also allow single-node clusters in our test environment; the masters there should not go away voluntarily even though they only host test PG clusters, hence the need for disruption budgets even for small clusters.

@coderanger
Contributor Author

If that's the case, then the operator needs to watch for nodes being marked for eviction (set unschedulable) or something similar, and implement its own version of draining nodes marked for maintenance :) Otherwise this will block kubectl drain or any comparable mechanism used during rolling upgrades.

@hjacobs
Contributor

hjacobs commented Sep 12, 2018

@coderanger please watch our KubeCon talk "Continuously Deliver your Kubernetes Infrastructure" on how we do rolling cluster upgrades; it's not a naive rolling node upgrade, but involves a bit more coordination (for PostgreSQL).

@coderanger
Contributor Author

@hjacobs That doesn't really apply to things like GKE and EKS though, and there should at least be options to work in those environments since they are very common :) I can move this elsewhere; maybe if the pod disruption budget name is "" then don't create it at all? Or a new flag in the Postgresql object?

@coderanger
Contributor Author

Anyone have a strong feeling on how to structure it? Happy to do either the object name template approach above or a new flag, or something else that's better :)

@alexeyklyukin
Contributor

alexeyklyukin commented Sep 12, 2018

@coderanger

At the moment there is no setting in the operator to turn off creation of PDB for the Postgres cluster. Normally, even for a cluster with one pod, the operator should do the right thing and delete the master pod when the node is being drained and the 'ready' label is removed.

In short, the procedure we use is: remove the ready label from all old Kubernetes nodes, create N new nodes, drain the N old nodes; the operator will pick up all master pods from those old nodes and move them to one of the new ones. How is that specific to AWS, and what should be done differently on GKE or EKS?

The legitimate case I see here is when one wants to opt out completely from our rolling node update scheme for some reason. We could have a workaround of updating the PDB with MinAvailable: 0 if and only if the cluster is declared with the number of instances set to 0.

@@ -1052,6 +1052,11 @@ func (c *Cluster) generateCloneEnvironment(description *acidv1.CloneDescription)
func (c *Cluster) generatePodDisruptionBudget() *policybeta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)

	// Avoid creating an unsatisfiable budget.
	if c.Spec.NumberOfInstances <= 1 {
Contributor

I'd change it to < 1. A minAvailable: 1 for a cluster with one pod is easy to satisfy.
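
Read as a standalone sketch, the suggested check would look roughly like the helper below; the 0 value in the relaxed branch is an assumption here, mirroring what a later revision of this PR does:

package example

import "k8s.io/apimachinery/pkg/util/intstr"

// Keep minAvailable at 1 whenever at least one pod exists; only drop to 0 for
// an empty cluster, where it is trivially satisfied.
func pdbMinAvailable(numberOfInstances int32) intstr.IntOrString {
	if numberOfInstances < 1 {
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}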

@coderanger
Contributor Author

@alexeyklyukin when GKE runs an upgrade, it runs the whole process itself. Having to manually cycle pods is technically doable but creates substantial complexity that doesn't need to be there. The expected behavior for most applications is: if they run multiple replicas, set a disruption budget with a minimum of 1 so that at least one pod will be migrated before the node cycles. For a single-replica service there will have to be downtime during a node cycle, so having a pod disruption budget just blocks the upgrade forever (or until a human goes in and manually kills the pod).

Requiring a human operator to manage these things is kind of against the point of automating the cluster. In a multi-replica database I would expect the normal failover system to kick in and fix things, so I don't want node drains to get blocked pending human interaction. If you have very high uptime requirements I can see the argument for manually triggering the failover, I guess? But for most of my databases, I'm cool with a second or two of downtime as a failover runs because one of the pods was killed, or in the case of a single instance I've accepted that there may be momentary outages when pods get moved around :)

@alexeyklyukin
Contributor

@coderanger so what about setting numberOfInstances to 0 explicitly for the clusters that are expected to have downtime during the upgrade? If the operator reacts by setting minAvailable to 0 on the PDB, I am fine with that, as the user gave explicit consent to the downtime by setting the cluster size to 0.

@coderanger
Contributor Author

I would expect numberOfInstances:0 to scale down the existing database pods? That would also have to be done on every database object every time there is a Kubernetes upgrade to run (or any node maintenance that requires draining).

@coderanger
Contributor Author

(it's very possible I misunderstand how the operator sets the number of replicas in the statefulset :)

@alexeyklyukin
Contributor

I would expect numberOfInstances:0 to scale down the existing database pods? That would also have to be done on every database object every time there is a Kubernetes upgrade to run (or any node maintenance that requires draining).

Yes, setting the number of instances to 0 terminates all pods of the existing cluster.

You can do this for every cluster just before the upgrade if you are fine with upgrade-related downtime. Otherwise, this would be required only for those clusters running just one pod; for the rest, you can either wait for the replica pods to terminate and start on new nodes and then switch over to them manually, or use the existing upgrade procedure in the operator and figure out how to make Google remove the readiness label from the nodes that are about to be drained and assign it to the new nodes.

Anyway, it will not make things worse, as a naive upgrade on GKE will most certainly lead to at least one downtime (when the master is terminated without any prior notice it may take up to 30 seconds to fail over to the replica), and in many cases even more than one (if the master pod fails over, one or more times, to a node that is about to be decommissioned). By devising a special upgrade procedure we avoid exactly the issue of having multiple downtimes for the cluster during a single Kubernetes upgrade.

@Jan-M
Member

Jan-M commented Sep 13, 2018

I would be fine with a feature toggle that sets the minimum number of instances required for a PDB to be created, if this helps the GKE deployment, which we have not had enough time to test right now. (e.g. the default value would be 1 pod, but you could change it to 2 and then skip the PDB for 1-pod clusters)

Otherwise we need some time to look into explaining to you how to trigger the "pod migration" the operator does. To my memory you just need to set the end-of-life label. As Oleksii points out, the important part is to reduce the number of failovers for clusters with 2+ pods, as you do not want to fail over to a replica still running on an old node, but to a replica that has already been moved to an updated node.

@Jan-M
Member

Jan-M commented Sep 13, 2018

A quick look at the call sites for "moveMasterPodsOffNode" would probably help.

Seems we watch for unschedulable nodes and the removal of the ready label. (The absence of the ready label makes Postgres pods start on new nodes, given the affinity config.)
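
Roughly, the trigger condition described here could be sketched as follows; this is an illustration rather than the operator's actual code, and the label key and helper name are assumptions (moveMasterPodsOffNode itself is not reproduced):

package example

import v1 "k8s.io/api/core/v1"

// A node counts as "to be drained" once it is cordoned (unschedulable) or has
// lost the readiness label; at that point the operator would move master pods
// off it.
func nodeNeedsDraining(node *v1.Node, readyLabel string) bool {
	_, hasReadyLabel := node.Labels[readyLabel]
	return node.Spec.Unschedulable || !hasReadyLabel
}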

@coderanger
Contributor Author

While I'm 100% on board that the Zalando-style node upgrade/maintenance workflow is better, stuff like GKE and Kops upgrades are what they are :) I can rework this using some new options. I think the clearest way would be an allowMasterDisruption flag that can be explicitly set. The name at least hints that setting it to true will potentially result in downtime. Seem reasonable?

…ich shouldn't matter anyway but be nicer to the scheduler).
@coderanger
Contributor Author

Updated to add a new allow_master_disruption cluster-level setting. I left in setting the minimum to zero if number_of_instances is 0 but I don't think that technically matters in all but the edge-iest of edge cases so I'm happy to remove it if it seems like the wrong thing :)

@coderanger
Contributor Author

Anything further I should tweak on this?

@erthalion erthalion self-assigned this Dec 12, 2018
@erthalion
Contributor

Looks good. Although I'm a bit concerned about the name of this option - I'm not sure I personally would trust a project with an "allow to screw my db" option :) What about being flexible to the full extent and just having a min_disruption_budget option?
Also, why not have a test for the case where you hit this issue, i.e. NumberOfInstances: 1?

@coderanger
Contributor Author

Heh, there was a test for that back on the previous implementation. I can switch the option to direct control over the budget minimum though :)

@erthalion
Contributor

Heh, there was a test for that back on the previous implementation.

What happened to it?

I can switch the option to direct control over the budget minimum though :)

Yes, that would be nice. Other than that I don't see any issues with this proposal.

@coderanger
Contributor Author

What should it do if someone puts a value other than 0 or 1?

@erthalion
Contributor

What should it do if someone puts a value other than 0 or 1?

Since the whole purpose of this feature is to add a bit more flexibility, I would argue that in this case the operator just needs to log a warning and proceed with the provided value (but of course the default value should be 1). Any other opinions?

@coderanger
Contributor Author

coderanger commented Dec 13, 2018

I mean, it's a hard error if it's more than 1, I think? The disruption budget only targets the master pod, and I think there should only ever be one of those other than during a failover? 2 or more would be a permanently unsatisfiable budget in that case.

@erthalion
Contributor

I mean, it's a hard error if it's more than 1, I think? The disruption budget only targets the master pod, and I think there should only ever be one of those other than during a failover?

With the current setup - yes, but then my vivid imagination helps me think of the operator managing a sharded cluster with more than one master. At the same time it's pretty far-fetched for now, so you're probably right - let's throw a validation error if min availability is more than 1.
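
For the record, the validation being agreed on here might look roughly like this; the option name comes from the discussion above, while the helper itself is hypothetical and not part of this PR:

package example

import "fmt"

// Default to 1, accept 0 as an explicit opt-in to disruption, and reject
// anything larger, since a budget above 1 can never be satisfied by a single
// master pod.
func validateMinDisruptionBudget(value *int32) (int32, error) {
	if value == nil {
		return 1, nil // option not set: keep the current behaviour
	}
	if *value < 0 || *value > 1 {
		return 0, fmt.Errorf("min_disruption_budget must be 0 or 1, got %d", *value)
	}
	return *value, nil
}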

@sdudoladov
Member

@coderanger @erthalion do you have any updates on this PR ?

@FxKu FxKu added this to the v1.2 milestone May 20, 2019
@FxKu
Member

FxKu commented May 24, 2019

Looks like this PR would also resolve #547.

@@ -1052,6 +1052,11 @@ func (c *Cluster) generateCloneEnvironment(description *acidv1.CloneDescription)
func (c *Cluster) generatePodDisruptionBudget() *policybeta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)

	// If master disruption is enabled or there is no master, set the budget to 0.
	if c.Spec.AllowMasterDisruption || c.Spec.NumberOfInstances <= 0 {
Member

Do we create the pod disruption budget when we scale up?

Member

@FxKu May 24, 2019

I'll test with kind

So, simply scaling to 0 and then back to 1 doesn't work out of the box. Syncing gets interrupted while syncing roles:

time="2019-05-24T14:01:14Z" level=debug msg="updating statefulset" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:14Z" level=debug msg="updating statefulset annotations" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:14Z" level=debug msg="syncing roles" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:15Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.110.86.179:5432: connect: connection refused" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:30Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.110.86.179:5432: connect: connection refused" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:44Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:44Z" level=info msg="cluster has been updated" cluster-name=default/acid-minimal-cluster pkg=controller worker=0

Deleting the operator pod helps to sync roles so that pdb is synced, too (and set to minAvailable: 1).

time="2019-05-24T14:10:53Z" level=debug msg="statefulset's rolling update annotation has been set to false" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing roles" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing databases" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing pod disruption budgets" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=info msg="cluster has been synced" cluster-name=default/acid-minimal-cluster pkg=controller worker=0

	Tolerations           []v1.Toleration `json:"tolerations,omitempty"`
	Sidecars              []Sidecar       `json:"sidecars,omitempty"`
	PodPriorityClassName  string          `json:"pod_priority_class_name,omitempty"`
	AllowMasterDisruption bool            `json:"allow_master_disruption,omitempty"`
Member

is "allow" a good word? should we not just stick with "enable"?

Member

And then it should also default to "enabled=True"

Member

Agree. And should it be introduced to all the different config targets?

@FxKu
Member

FxKu commented Jun 18, 2019

Thanks @coderanger for your contribution. I have followed up on your work and created a configuration toggle for enabling/disabling the PDB instead of having a manifest parameter. Hope it helps your workflows.
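
A hedged sketch of how such an operator-level toggle could be wired in; the option name, and the choice to relax the budget to 0 rather than skip the object entirely, are assumptions and not a description of the follow-up change itself:

package example

import "k8s.io/apimachinery/pkg/util/intstr"

// When the toggle is off, or the cluster has no pods, the budget no longer
// blocks voluntary evictions; otherwise the usual master protection applies.
func pdbMinAvailableWithToggle(enablePDB bool, numberOfInstances int32) intstr.IntOrString {
	if !enablePDB || numberOfInstances <= 0 {
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}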
