Add basic retry around switchover #1510

sdudoladov · 2021-05-31T15:13:39Z

The operator should retry switchovers before giving up on moving master pods off non-ready nodes.
That is currently not the case: the operator attempts to move the master pods once and then leaves them as they are, potentially blocking k8s cluster-wide processes such as node rotation. Retries avoid some of that blocking, namely the cases where a replica was moved shortly before the master and is not yet ready at the time of the operator's first switchover attempt.

To test, build and start the operator and one PG cluster in kind as normal. Then:

Tag the replica pod with the nofailover tag

kubectl exec -it $(kubectl get pods -l spilo-role=replica -o jsonpath='{.items[0].metadata.name}') -- su postgres
echo -e "tags:\n nofailover: true" >> postgres.yml
patronictl reload $SCOPE --force

Make the node with the master pod non-schedulable

kubectl cordon $(kubectl get pods -l spilo-role=master -o jsonpath='{..nodeName}')

The operator will log unsuccessful switchover attempts at 1-minute intervals for 5 minutes.

time="2021-05-31T13:16:26Z" level=debug msg="Waiting for any replica pod to become ready" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-05-31T13:16:26Z" level=debug msg="Found 1 running replica pods" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-05-31T13:16:26Z" level=info msg="check failed: pod \"default/acid-minimal-cluster-1\" is already on a live node" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-05-31T13:16:26Z" level=debug msg="switching over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-05-31T13:16:26Z" level=debug msg="making POST http request: http://10.244.2.3:8008/failover" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-05-31T13:16:26Z" level=debug msg="subscribing to pod \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-05-31T13:16:27Z" level=debug msg="unsubscribing from pod \"default/acid-minimal-cluster-1\" events" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-05-31T13:16:27Z" level=error msg="could not failover to pod \"default/acid-minimal-cluster-1\": could not switch over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\": patroni returned 'failover is not possible: no good candidates have been found'" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-05-31T13:17:26Z" level=debug msg="switching over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0

Jan-M · 2021-06-03T09:27:50Z

postgres-operator/pkg/controller/node.go

Line 142 in 0745ce7

if err := cl.MigrateMasterPod(podName); err != nil {

@FxKu @sdudoladov imho the PR currently changes semantics: there is no proper error propagation after 5 failures

pkg/cluster/pod.go

sdudoladov · 2021-06-04T08:51:00Z

In the case when all switchovers are unsuccessful, the operator log looks like this (only the last attempt is shown):

time="2021-06-04T08:47:28Z" level=debug msg="switching over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-06-04T08:47:28Z" level=debug msg="making POST http request: http://10.244.2.3:8008/failover" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-06-04T08:47:28Z" level=debug msg="subscribing to pod \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-06-04T08:47:28Z" level=debug msg="unsubscribing from pod \"default/acid-minimal-cluster-1\" events" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-06-04T08:47:28Z" level=error msg="could not failover to pod \"default/acid-minimal-cluster-1\": could not switch over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\": patroni returned 'failover is not possible: no good candidates have been found'" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2021-06-04T08:47:28Z" level=error msg="could not move master pod \"default/acid-minimal-cluster-0\": could not migrate master pod: still failing after 5 retries" pkg=controller
time="2021-06-04T08:47:28Z" level=info msg="0/1 master pods have been moved out from the \"/kind-worker2\" node" pkg=controller
time="2021-06-04T08:47:28Z" level=warning msg="failed to move master pods from the node \"kind-worker2\": could not move master 1/1 pods from the \"/kind-worker2\" node" pkg=controller

After that, the operator gives up on moving the master pod from the non-schedulable node. That matches the current behavior.

FxKu · 2021-06-11T08:36:17Z

👍

sdudoladov · 2021-06-17T06:47:20Z

👍

add basic retry around switchover

b3b9e52

sdudoladov added enhancement zalando labels May 31, 2021

sdudoladov added this to the 1.7 milestone May 31, 2021

sdudoladov requested a review from CyberDem0n as a code owner May 31, 2021 15:13

sdudoladov self-assigned this May 31, 2021

sdudoladov requested review from erthalion, FxKu, Jan-M and RafiaSabih as code owners May 31, 2021 15:13

sdudoladov commented Jun 3, 2021


propagate the error

570f30b

sdudoladov merged commit 53fb540 into master Jun 17, 2021
