No switchover candidate found #1992

dobrac · 2022-08-10T12:00:48Z

Please, answer some short questions which should help us to understand your problem / question better?

Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s (microk8s)
Are you running Postgres Operator in production? yes
Type of issue? Bug report

Hello,

We are encountering an error during cluster updates (any configuration change that requires a pod restart): the master is not switched over to the synchronous replica. We have 2 instances: one master and one synchronous replica.

Here is the operator log; the problem occurs every time:

time="2022-08-10T11:39:52Z" level=info msg="reason: new statefulset containers's postgres (index 0) environment does not match the current one" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:39:52Z" level=debug msg="updating statefulset" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:39:53Z" level=debug msg="making GET http request: http://10.1.63.119:8008/config" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:05Z" level=debug msg="making GET http request: http://10.1.11.85:8008/patroni" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:05Z" level=debug msg="making GET http request: http://10.1.63.119:8008/patroni" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:05Z" level=debug msg="performing rolling update" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:05Z" level=info msg="there are 2 pods in the cluster to recreate" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:05Z" level=debug msg="subscribing to pod "postgres-qplay/qplay-prod-0"" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:27Z" level=info msg="pod "postgres-qplay/qplay-prod-0" has been recreated" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:27Z" level=debug msg="unsubscribing from pod "postgres-qplay/qplay-prod-0" events" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:27Z" level=debug msg="making GET http request: http://10.1.63.119:8008/cluster" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:27Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not recreate pods: skipping switchover: no switchover candidate found" cluster-name=postgres-qplay/qplay-prod pkg=cluster
time="2022-08-10T11:40:27Z" level=error msg="could not sync cluster: could not sync statefulsets: could not recreate pods: skipping switchover: no switchover candidate found" cluster-name=postgres-qplay/qplay-prod pkg=controller worker=0
time="2022-08-10T11:40:27Z" level=info msg="received add event for already existing Postgres cluster" cluster-name=postgres-qplay/qplay-prod pkg=controller worker=0

Any idea what could be wrong? It could well be a misconfiguration (maybe three instances are the minimum?), but I can't figure out what it is.

FxKu · 2022-08-23T15:12:25Z

Are both instances healthy? "no switchover candidate found" sounds like no healthy replica has been found.
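One way to check member health is to read the same Patroni `/cluster` endpoint that the operator queries in the log above. The following is a minimal sketch, not the operator's code: the helper names `fetch_cluster` and `summarize_members` are hypothetical, and the field names (`name`, `role`, `state`, `lag`) are assumed to match the `/cluster` response.

```python
import json
from urllib.request import urlopen


def fetch_cluster(base_url, timeout=5.0):
    """GET <base_url>/cluster from a Patroni REST endpoint and parse the JSON body."""
    with urlopen(f"{base_url}/cluster", timeout=timeout) as resp:
        return json.load(resp)


def summarize_members(cluster):
    """Return one (name, role, state, lag) tuple per member.

    The leader usually carries no "lag" field, so it is reported as None.
    """
    return [
        (m["name"], m["role"], m["state"], m.get("lag"))
        for m in cluster.get("members", [])
    ]


# Example (needs network access to a Patroni pod, so it is left commented out):
# for row in summarize_members(fetch_cluster("http://10.1.63.119:8008")):
#     print(row)
```

Running this while the rolling update is in progress should show whether the replica has already come back by the time the switchover is attempted.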

dobrac · 2022-08-23T16:38:51Z

I’m not fully sure whether the second one has been successfully restarted (and is healthy) at that moment, but it’s definitely healthy after a short time. If I do a manual switchover using patronictl, everything works well.

I haven’t managed to get a successful switchover during an update; it fails every time.

bootc · 2024-01-14T09:52:16Z

I am also running into this issue, but I'm not sure it's exactly like OP. Like @dobrac I can exec into the pod and force a switchover and all is well, but the Operator fails to do the switchover when pods need evicting:

time="2024-01-14T09:34:53Z" level=info msg="starting process to migrate master pod \"gitlab/gitlab-postgres-1\"" cluster-name=gitlab/gitlab-postgres pkg=cluster
time="2024-01-14T09:34:54Z" level=debug msg="making GET http request: http://10.44.95.149:8008/cluster" cluster-name=gitlab/gitlab-postgres pkg=cluster
time="2024-01-14T09:34:54Z" level=error msg="could not move master pod \"gitlab/gitlab-postgres-1\": could not find suitable replica pod as candidate for failover: no switchover candidate found" pkg=controller

The cluster status from the URI mentioned is:

{
  "members": [
    {
      "name": "gitlab-postgres-0",
      "role": "replica",
      "state": "streaming",
      "api_url": "http://10.44.22.248:8008/patroni",
      "host": "10.44.22.248",
      "port": 5432,
      "timeline": 63,
      "lag": 0
    },
    {
      "name": "gitlab-postgres-1",
      "role": "leader",
      "state": "running",
      "api_url": "http://10.44.95.149:8008/patroni",
      "host": "10.44.95.149",
      "port": 5432,
      "timeline": 63
    },
    {
      "name": "gitlab-postgres-2",
      "role": "replica",
      "state": "streaming",
      "api_url": "http://10.44.98.184:8008/patroni",
      "host": "10.44.98.184",
      "port": 5432,
      "timeline": 63,
      "lag": 0
    }
  ]
}

This is with Postgres Operator 1.10.1 and spilo-15:3.1-p1 on Kubernetes 1.28.5 using Rancher RKE2 on on-prem VMs and bare metal.

It looks like the "state": "streaming" is not accounted for; the code expects "state": "running" to find candidates.
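The effect of the state check can be illustrated with a small sketch. This is not the operator's actual code (the operator is written in Go); the function `switchover_candidates`, the two state sets, and the lowest-lag ordering are assumptions made for illustration. The member data is trimmed from the `/cluster` response above, and "in archive recovery" is the extra state mentioned in the fix's commit message.

```python
# Hypothetical state sets: the first mimics a check for "running" only,
# the second also accepts the replica states named in this thread.
RUNNING_ONLY = {"running"}
RUNNING_OR_REPLICATING = {"running", "streaming", "in archive recovery"}

# Trimmed copy of the members reported by GET /cluster above.
MEMBERS = [
    {"name": "gitlab-postgres-0", "role": "replica", "state": "streaming", "lag": 0},
    {"name": "gitlab-postgres-1", "role": "leader", "state": "running"},
    {"name": "gitlab-postgres-2", "role": "replica", "state": "streaming", "lag": 0},
]


def switchover_candidates(members, accepted_states):
    """Return names of non-leader members in an accepted state, lowest lag first."""
    def lag_key(member):
        lag = member.get("lag")
        # A missing or non-numeric lag sorts last.
        return lag if isinstance(lag, int) else float("inf")

    eligible = [
        m for m in members
        if m["role"] != "leader" and m["state"] in accepted_states
    ]
    return [m["name"] for m in sorted(eligible, key=lag_key)]


print(switchover_candidates(MEMBERS, RUNNING_ONLY))
# → []
print(switchover_candidates(MEMBERS, RUNNING_OR_REPLICATING))
# → ['gitlab-postgres-0', 'gitlab-postgres-2']
```

With only "running" accepted, both streaming replicas are filtered out, which matches the "no switchover candidate found" error even though the cluster is healthy.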


macedigital · 2024-01-19T17:26:45Z

Thank you @bootc for picking this up and providing a solution 👍

It seems this might trace back to a change introduced in patroni/patroni@d46ca88.

We'll aim to merge the fix next week.


macedigital · 2024-01-24T11:00:20Z

Thank you @bootc for your contribution!

monotek · 2024-05-23T11:33:43Z

We still see this problem with operator version v1.11.0.

We are using the "ghcr.io/zalando/spilo-15:3.2-p1" image, where the patronictl list output shows "streaming" instead of "running" as the state.

Going back to the "ghcr.io/zalando/spilo-15:3.0-p1" image solves the problem.

Edit: Never mind, we updated the operator chart but had set the old operator image explicitly.
It works after removing the old image tag and using the chart's default image version.

FxKu added the technical issue label Aug 23, 2022

bootc added a commit to bootc/postgres-operator that referenced this issue Jan 14, 2024

fix: no switchover candidate found with member state "streaming" (zalando#1992)

92d310c

bootc mentioned this issue Jan 14, 2024

fix: no switchover candidate found with member state "streaming" (#1992) #2515

Merged

bootc added a commit to bootc/postgres-operator that referenced this issue Jan 22, 2024

fix: no switchover candidate found with member state "streaming" (zalando#1992)

27ce843

FxKu closed this as completed in #2515 Jan 24, 2024

FxKu pushed a commit that referenced this issue Jan 24, 2024

fix: no switchover candidate found with member state "streaming" (#1992) (#2515)

8f31399

* fix: no switchover candidate found with member state "streaming" (#1992)
* Add test
* Also handle "in archive recovery" state
