failover fails with synchronous_mode and 2 instances #2276

Closed
koelnconcert opened this issue Mar 28, 2023 · 3 comments · Fixed by #2278
@koelnconcert

Please answer some short questions which should help us to understand your problem / question better:

  • Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.9.0
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug report

We installed postgres-operator 1.9.0 via Helm with default values and created a cluster with synchronous_mode: true and numberOfInstances: 2.
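
For reference, a minimal sketch of the manifest settings involved (field names assumed from the acid.zalan.do/v1 postgresql CRD as documented for the operator; the full manifest is in the gist linked further down):

# Sketch only: minimal postgresql manifest with the two settings relevant here.
kubectl apply -f - <<'EOF'
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-testdb
  namespace: default
spec:
  teamId: "postgres"          # cluster name is prefixed with the teamId
  numberOfInstances: 2
  volume:
    size: 1Gi
  postgresql:
    version: "15"
  patroni:
    synchronous_mode: true    # Patroni synchronous replication
EOF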

On initial deployment everything is okay. postgres-testdb-0 is the leader, postgres-testdb-1 is following.

When we force a cluster update, e.g. by updating podAnnotations (foo: "2"; a sample patch command follows the list below):

  1. kubectl get postgresqls shows Updating
  2. postgres-testdb-1 is restarted
  3. kubectl get postgresqls shows UpdateFailed
  4. kubectl describe postgresql shows

    Switchover from "postgres-testdb-1" to "default/postgres-testdb-0" FAILED: could not switch over from "postgres-testdb-1" to "default/postgres-testdb-0": patroni returned 'Failover failed'

  5. postgres-testdb-0 is not restarted
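
The patch we use to force such an update looks roughly like this (a sketch; podAnnotations is a plain string map under spec, resource name and namespace as above):

# Merge-patch the podAnnotations field to trigger the rolling update described above.
kubectl -n default patch postgresql postgres-testdb --type merge \
  -p '{"spec": {"podAnnotations": {"foo": "2"}}}'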

Combined log of the postgres-operator (lines starting with "time=") and the leader pod:

time="2023-03-27T14:32:27Z" level=info msg="there are 2 pods in the cluster to recreate"
2023-03-27 14:32:33,814 INFO: Updating synchronous privilege temporarily from ['postgres-testdb-1'] to []
time="2023-03-27T14:32:52Z" level=info msg="pod \"default/postgres-testdb-1\" has been recreated
time="2023-03-27T14:32:52Z" level=debug msg="making POST http request: http://10.147.180.36:8008/failover"
2023-03-27 14:32:52,621 WARNING: Failover candidate=postgres-testdb-1 does not match with sync_standbys=None
2023-03-27 14:32:52,621 WARNING: manual failover: members list is empty
2023-03-27 14:32:52,621 WARNING: manual failover: no healthy members found, failover is not possible
2023-03-27 14:32:52,669 INFO: Assigning synchronous standby status to ['postgres-testdb-1']
2023-03-27 14:32:54,732 INFO: Synchronous standby status assigned to ['postgres-testdb-1']

It looks to me like the postgres-operator initiates the failover to the (only possible) replica before the restarted replica has been re-assigned synchronous standby status; judging from the log, that assignment only completes about 2 seconds after the failover attempt.
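
For illustration, the sync state can be checked by hand against the Patroni REST API (IP and port taken from the operator debug log above; jq is only used for readability):

# List cluster members and their roles as Patroni sees them. A replica that is
# eligible for a failover in synchronous_mode should appear with role "sync_standby";
# right after the pod recreation it is still a plain "replica".
curl -s http://10.147.180.36:8008/cluster | jq '.members[] | {name, role, state}'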

For more extensive logs and cluster yaml see https://gist.github.com/koelnconcert/30f541aee49b0de163faeefa1bce74a7

Remarks:

  • this worked absolutely fine with postgres-operator 1.6.3
  • with postgres-operator 1.8.0 we have a very similar (probably the same) issue
  • from the release notes of 1.7.0: "Add basic retry around switchover". We do not see any retry here, but we are not sure whether that feature is even meant to apply in this situation (an illustration of the kind of retry we mean follows this list).
  • when using numberOfInstances: 3 we do not face this issue
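
To be explicit about the retry remark above: the following is only an illustration (not the operator's actual code) of retrying the failover call until Patroni accepts it, using the endpoint and candidate from the logs above:

# Illustration only: retry the manual failover a few times instead of giving up
# after the first 'Failover failed' response.
for i in 1 2 3; do
  curl -sf -XPOST http://10.147.180.36:8008/failover \
    -H 'Content-Type: application/json' \
    -d '{"candidate": "postgres-testdb-1"}' && break
  sleep 5
done
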
@kkrasnov1

I have the same problem.
I found the following workarounds:

  • wait for the operator to retry synchronization
  • restart the operator (example command below)
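
For the second workaround, the command looks like this (assuming the operator runs as a Deployment named postgres-operator in the default namespace; adjust to your Helm release):

# Restart the operator pod; on startup it re-syncs the cluster and retries the update.
kubectl -n default rollout restart deployment/postgres-operator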

FxKu added the bug label Mar 30, 2023
@FxKu (Member) commented Mar 30, 2023

Thanks for raising this, @koelnconcert and @kkrasnov1. It indeed looks like we do not wait long enough. I will try to fix it for the next release.

@kkrasnov1

I updated the operator to version 1.10.0, but the problem remains.
