Discussion on Race Conditions in Patroni Failover #536

Closed
jberkus opened this issue Oct 4, 2017 · 7 comments

jberkus commented Oct 4, 2017

@ants @CyberDem0n

This issue is to preserve discussion of possible and known race conditions in Patroni mastering and failover, especially in Kubernetes-native Patroni.

Summary: since any container/pod can be suspended at any time, several different race conditions on the sequence of obtaining the master lock and becoming master are possible.

jberkus commented Oct 4, 2017

Discussion from PR #500:

ants 6 days ago Contributor

Doing this as a callback is fragile. If the API call fails, the cluster is effectively unavailable until failover is externally triggered.

Also, there is no interlock to prevent racy updates. Imaginary scenario:

  1. unhealthy cluster
  2. n-0 acquires the leader lock, starts callback.py
  3. n-0 hangs for a period of time (IO system problems, network issues, etc.)
  4. n-1 acquires the leader lock, starts callback.py
  5. n-1's callback.py completes, directing endpoint traffic to n-1
  6. n-0 becomes active again, runs the API call and overwrites the endpoint with n-0
  7. n-0 demotes itself
  8. until the next failover, traffic is now routed to the standby

I think this needs integration with the HA loop to fix properly.
CyberDem0n 6 days ago Member

That was just a fast way to copy a script we already have in Spilo: https://github.com/zalando/spilo/blob/master/postgres-appliance/callback_role.py

Agreed, it should probably be done from the HA loop.
jberkus 6 days ago Contributor

There's always going to be the possibility of code halting between checking role='master' and patch_master_endpoint, no? Or are you suggesting that the master should have a loop which constantly checks that it still owns the endpoint, so that incoherency can be limited to the loop time?
jberkus 6 days ago Contributor

I suppose the demotion code needs to have an endpoint check as well. Although what would a demoting node do if it found it owned the endpoint?
ants 6 days ago Contributor

Yes, I think the loop should check that it still owns the endpoint. Also, I think it should release the leader key if it fails to get the endpoint within a reasonable amount of time. This might have some overlap with my todo item of implementing basic OCF resource support. Similar considerations would apply to moving a virtual IP around, etc.
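
A minimal sketch of that loop-based check, assuming the kubernetes Python client; MY_IP, ENDPOINT, DEADLINE and release_leader_key() are hypothetical placeholders, not Patroni internals:

```python
# Sketch only, not Patroni code: once per HA loop the leader verifies that the
# master Endpoints object still points at it, re-patches it if not, and gives
# up the leader key if the endpoint cannot be fixed within a deadline.
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_incluster_config()
v1 = client.CoreV1Api()

MY_IP = '10.2.3.4'         # this pod's IP (placeholder)
ENDPOINT = 'demo-cluster'  # name of the master Endpoints object (placeholder)
NAMESPACE = 'default'
DEADLINE = 10              # seconds: "a reasonable amount of time"


def release_leader_key():
    """Placeholder: in Patroni this would remove/expire the DCS leader lock."""


def endpoint_points_to_me():
    ep = v1.read_namespaced_endpoints(ENDPOINT, NAMESPACE)
    return any(a.ip == MY_IP
               for s in (ep.subsets or []) for a in (s.addresses or []))


def check_endpoint_ownership():
    """Run by the current leader on every HA loop iteration."""
    start = time.time()
    while time.time() - start < DEADLINE:
        try:
            if endpoint_points_to_me():
                return True
            v1.patch_namespaced_endpoints(ENDPOINT, NAMESPACE, {
                'subsets': [{'addresses': [{'ip': MY_IP}],
                             'ports': [{'name': 'postgresql', 'port': 5432}]}]})
        except ApiException:
            time.sleep(1)
    release_leader_key()  # routing could not be confirmed: stop being the leader
    return False
```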
CyberDem0n 5 days ago • edited Member

In order not to have race conditions, the master endpoint should be changed only by the leader.
The current implementation uses ConfigMaps to store information about the leader. We could use Endpoints the same way instead, i.e. Patroni could periodically (every HA loop) update the annotations and subsets of the leader Endpoint.

Of course such a solution has its own pros and cons.

Pros: when the new master gets the leader lock, it will atomically change the master service.

Cons:

  1. I have no idea how Kubernetes will react to periodic updates of the leader Endpoint. In theory every update should cause an update of the leader Service. We should really check that it does not trigger a rebuild of iptables rules everywhere.
  2. Right now Patroni does 2 API reads (list_namespaced_config_map and list_namespaced_pod). The new implementation would do 3 calls (adding list_namespaced_endpoints). We could avoid that by converting all ConfigMaps to Endpoints, but I am not sure that is a good idea.
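
A minimal sketch of the per-HA-loop Endpoint update described above, assuming the kubernetes Python client; the annotation keys and names are illustrative, not the ones Patroni actually uses:

```python
# Sketch only: keep the leader information as annotations on the leader
# Endpoints object and set its subsets to the leader's IP in the same PATCH,
# so the "leader lock" update and the routing update cannot diverge.
import time

from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()


def update_leader_endpoint(name, namespace, leader_name, leader_ip, ttl=30):
    body = {
        'metadata': {'annotations': {   # illustrative annotation keys
            'leader': leader_name,
            'renewTime': str(int(time.time())),
            'ttl': str(ttl),
        }},
        'subsets': [{'addresses': [{'ip': leader_ip}],
                     'ports': [{'name': 'postgresql', 'port': 5432}]}],
    }
    # One PATCH updates the lock metadata and the traffic target together.
    v1.patch_namespaced_endpoints(name, namespace, body)


# Called by the current leader once per HA loop, e.g.:
# update_leader_endpoint('demo-cluster', 'default', 'demo-cluster-0', '10.2.3.4')
```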

ants 5 days ago Contributor

Directly patching the Endpoint has another downside: OpenShift does not let normal users create endpoints that point to pods, as documented in the final paragraph here. At least for OpenShift I would like to rely on pod annotations and have the platform take care of routing a Service to the master. But even that still has the same issue: for reliability, a callback is less than optimal.
jberkus 5 days ago Contributor

I'm nervous about the idea of dropping the leader lock if grabbing the endpoint/service fails. That ups the number of sequential operations which need to succeed to three; that seems prone to race-condition type failures.

Some unresolved questions:

  1. What should be the proper ordering of leader lock, failover/promotion, endpoint?
  2. What's the sequence of undoing these actions if any of them fails? What happens if the undo fails?
  3. What do we want to be the result if kubernetes is unresponsive/misconfigured and no node can update the endpoint?

That last case is potentially pathological, and argues that updating the endpoint needs to come before failover/promotion instead of afterwards, since changing the endpoint is easily reversible and failover is not. If we keep it afterwards, RBAC config issues or container networking issues could leave you with a set of disconnected nodes which can't ever be reconnected as replicas, and among which Patroni can't choose a good new master candidate.

I think this needs a deeper discussion than a line-item comment on a patch review.
CyberDem0n a day ago Member

@ants, @jberkus:

I think this commit baaf20b should fix most of the problems.
Since Patroni is already doing some metadata manipulations on the Pod, why not assign a label with the role right from the HA loop? It makes it possible to use a labelSelector without any external callback script.

Unfortunately it still does not solve the biggest problem: when the master Pod hangs for a long time and loses the leader key, it could still happen that two Pods end up with role=master...

It looks like the only reliable way to make sure that the master Endpoint has only one IP is to patch it from the Patroni HA loop together (and atomically) with updating the leader lock.
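
A minimal sketch of the label-from-the-HA-loop idea mentioned above, assuming the kubernetes Python client; the label key and example names are illustrative:

```python
# Sketch only: set the role label on the Pod from inside the HA loop, so that
# a Service whose selector includes role=master routes traffic to the leader
# without any external callback script.
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()


def set_role_label(pod_name, namespace, role):
    # role would be 'master' on promotion and 'replica' on demotion
    v1.patch_namespaced_pod(pod_name, namespace,
                            {'metadata': {'labels': {'role': role}}})


# In the HA loop, after promotion/demotion is confirmed:
# set_role_label('demo-cluster-0', 'default', 'master')
```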

jberkus commented Oct 12, 2017

Possible sources of race conditions:

Race Conditions During Failover:

We have a number of potential race conditions during failover, because there are three actions which need to be performed (master lock, promotion, endpoint update) and no way to perform them in a "transaction". No matter how many master lock checks we perform, there is always the possibility of the pending master's container being suspended between two lines of code. The primary danger areas are:

  1. new master is suspended between the master lock check and promoting itself.
  2. new master is suspended between performing the failover and updating the endpoint

In either case, we can get a period of inconsistency during which another master has promoted itself due to the first pending master losing the lock, but proceeding to promote and/or update the endpoint anyway. During this period, data writes could go to the wrong master and be lost.

Proposal: Add multiple master lock checks in order to minimize the period of inconsistency, e.g. (a rough sketch follows the list):

  1. get master lock
  2. promote
  3. check master lock; if lost, attempt to find the correct master and become a replica of it, otherwise shut down
  4. set endpoint
  5. check master lock; if lost, set endpoint to null and reboot as read-only
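
A rough sketch of that sequence; every helper below is a hypothetical placeholder, not Patroni's actual API:

```python
# Sketch of the proposed check-heavy failover sequence. Every function below
# stands in for the corresponding Patroni/Kubernetes action.

def acquire_master_lock(): return True
def still_holds_master_lock(): return True
def promote(): pass
def become_replica_of_current_master(): return True
def shut_down(): pass
def set_endpoint_to_self(): pass
def clear_endpoint(): pass
def restart_read_only(): pass


def failover_attempt():
    if not acquire_master_lock():            # 1. get master lock
        return
    promote()                                # 2. promote
    if not still_holds_master_lock():        # 3. re-check the lock
        # Lost the lock mid-promotion: rejoin as a replica of whoever holds it
        # now; if that is impossible, shut down rather than accept writes.
        if not become_replica_of_current_master():
            shut_down()
        return
    set_endpoint_to_self()                   # 4. set endpoint
    if not still_holds_master_lock():        # 5. re-check the lock again
        clear_endpoint()                     #    set endpoint to null
        restart_read_only()                  #    and reboot as read-only
```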

jberkus commented Oct 12, 2017

Race Conditions During Operation

Because Etcd TTLs are client-based, we can have inconsistency between nodes if the system clocks on the nodes are sufficiently different: for example, the TTL for the master lock could be seen as expired by a replica but not yet expired by the current master.

However, it is unclear what specific harm this would cause other than an unnecessary failover. Discuss.

@CyberDem0n

Hi @jberkus,

I have a recipe against the first race condition. It is pretty simple but effective: "get leader lock" and "set endpoint" happen absolutely atomically if Patroni uses Endpoints objects instead of ConfigMaps. The config file should be changed to the following:

kubernetes:
  namespace: default
  labels:
    application: spilo
  scope_label: cluster-name
  role_label: role
  use_endpoints: true
  pod_ip: 1.2.3.4
  ports:  # optional, defaults to 5432
    - port: 5432
      name: postgresql  # optional, should not be set or set to the
#                         same value as in corresponding Service
# https://github.com/kubernetes-incubator/client-python/blob/master/kubernetes/docs/V1EndpointPort.md
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 1.2.3.4:5432

Unfortunately it will not really work on OpenShift, because OpenShift disallows setting IPs from the range allocated to Pods as subsets of Endpoints.

@CyberDem0n

And regarding the second one, the client-based TTL:

This implementation can tolerate an arbitrary clock difference between hosts.
Nodes don't pay much attention to the timestamps stored in the annotations; they just check whether the entire annotation has changed or not. If it hasn't changed within the TTL, it is considered "stale".

Since we have TTL=30 and loop_wait=10, the clock on one node would have to run 3 times faster than on the node where the master runs to hit such a race condition.
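
A minimal sketch of that staleness check with illustrative names, not the actual Patroni implementation:

```python
# Sketch only: staleness of the leader key is judged by whether the leader
# annotations changed within TTL seconds of *local* time; timestamps written
# by another host's clock are never compared directly.
import time

TTL = 30        # seconds
LOOP_WAIT = 10  # seconds between HA loop iterations

last_seen_annotations = None
last_change_at = None   # local monotonic time of the last observed change


def leader_key_is_stale(current_annotations):
    global last_seen_annotations, last_change_at
    now = time.monotonic()
    if current_annotations != last_seen_annotations:
        last_seen_annotations = current_annotations
        last_change_at = now
        return False
    return last_change_at is not None and now - last_change_at > TTL

# With TTL=30 and loop_wait=10 the leader refreshes the annotations roughly
# three times per TTL window, so a replica's clock would have to run about 3x
# faster than the leader's before it could wrongly declare the key stale.
```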

jberkus commented Oct 20, 2017

I'll bring up the limitation with the OpenShift team.

jberkus commented May 28, 2019

Closing this since we're no longer discussing it.
