Discussion on Race Conditions in Patroni Failover #536
Discussion from PR #500:

ants (Contributor, 6 days ago): Doing this as a callback is fragile. If the API call fails, the cluster is effectively unavailable until failover is externally triggered. Also, there is no interlock to prevent racy updates. Imaginary scenario:
I think this needs integration with the HA loop to fix properly.

That was just a fast way to copy a script we already have in Spilo: https://github.com/zalando/spilo/blob/master/postgres-appliance/callback_role.py

Agree, it probably should be done from the HA loop. There's always going to be the possibility of code halting between checking role='master' and patch_master_endpoint, no? Or are you suggesting that the master should have a loop which constantly checks that it still owns the endpoint, so that incoherency can be limited to the loop time? I suppose the demotion code needs to have an endpoint check as well. Although what would a demoting node do if it found it owned the endpoint?

Yes, I think the loop should check that it still owns the endpoint. Also, I think it should release the leader key if it fails to get the endpoint within a reasonable amount of time. I think this might have some overlap with my todo item of implementing basic OCF resource support. Similar considerations would apply to moving a virtual IP around, etc.

In order not to have race conditions, the master endpoint should be changed only by the leader. Of course such a solution has its own pros and cons. Pros: when the new master gets the leader lock, it will atomically change the master service. Cons:
@ants Directly patching the Endpoint has another downside: OpenShift does not let normal users make endpoints that point to pods, as documented in the final paragraph here. At least for OpenShift I would like to rely on pod annotations and have the platform take care of routing a Service to the master. But even that still has the same issue that, for reliability, a callback is less than optimal. I'm nervous about the idea of dropping the leader lock if grabbing the endpoint/service fails. That ups the number of sequential operations which need to succeed to three; that seems prone to race-condition-type failures. Some unresolved questions:
That last case is potentially pathological, and argues that updating the endpoint needs to come before failover/promotion instead of afterwards, since changing the endpoint is easily reversible and failover is not. If we keep it afterwards, having RBAC config issues or container networking issues could mean that you have a set of disconnected nodes which can't ever be reconnected as replicas, and among which Patroni can't choose a good new master candidate. I think this needs a deeper discussion than a line-item comment on a patch review.

I think this commit baaf20b should fix most of the problems. Unfortunately it still does not solve the biggest problem: when the master Pod hangs for a long time and loses the leader key, it could still happen that two Pods get role=master... It looks like the only reliable way to make sure that the master Endpoint has only one IP is to patch it from the Patroni HA loop together (and atomically) with updating the leader lock.
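To make the loop-check idea from the comments above concrete, here is a rough sketch. The dcs and k8s objects and all their methods are hypothetical stand-ins, not actual Patroni APIs: the leader renews its lock, confirms on every cycle that the master Endpoints object still points at it, and releases the leader key if it cannot confirm that within a deadline.

import time

ENDPOINT_DEADLINE = 30  # give up leadership if endpoint ownership can't be confirmed

def run_cycle(dcs, k8s, my_ip):
    """One HA-loop iteration on the current leader (sketch only)."""
    if not dcs.update_leader():               # renew the leader lock first
        return "demoted: lost the leader lock"

    deadline = time.monotonic() + ENDPOINT_DEADLINE
    while time.monotonic() < deadline:
        try:
            if k8s.master_endpoint_ips() == [my_ip]:
                return "leader confirmed: endpoint still points at us"
            k8s.patch_master_endpoint(my_ip)  # repair a stale endpoint
        except Exception:
            time.sleep(1)                     # API hiccup: retry until the deadline
    # Could not confirm endpoint ownership in time: release the leader key so
    # that a node with working API access can take over.
    dcs.release_leader()
    return "demoted: could not confirm endpoint ownership"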
Possible sources of race conditions:

Race Conditions During Failover: We have a number of potential race conditions during failover, due to the fact that we have three actions which need to be performed (master lock, promotion, endpoint update) and no way to perform them in a "transaction". No matter how many master lock checks we perform, there is always the possibility of the pending master's container being suspended between two lines of code. The primary danger areas are:
In either case, we can get a period of inconsistency during which another master has promoted itself (because the first pending master lost the lock) while the first one proceeds to promote and/or update the endpoint anyway. During this period, data writes could go to the wrong master and be lost. Proposal: add multiple master lock checks in order to minimize the period of inconsistency, e.g.
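As an illustration only (the helper callables passed in below are hypothetical, not Patroni's actual failover code), the general idea is to re-check the lock immediately before each step, doing the reversible endpoint update before the irreversible promotion:

def failover(holds_master_lock, update_master_endpoint, promote_postgres, log):
    """Promote this node, re-checking the master lock before every action (sketch)."""
    if not holds_master_lock():
        return log("aborting: lost the lock before the endpoint update")
    update_master_endpoint()        # reversible: the endpoint can be repointed later

    if not holds_master_lock():     # we may have been suspended in between
        return log("aborting: lost the lock before promotion")
    promote_postgres()              # effectively irreversible without a rewind/reclone

    # A suspension between any check and the following action is still possible,
    # so this narrows the window of inconsistency rather than eliminating it.
    return log("promoted")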
Race Conditions During Operation: Because Etcd TTLs are client-based, we can have an issue of inconsistency between nodes if the system clocks on the nodes are sufficiently different, for example if the TTL for the master lock is seen as expired by a replica but non-expired by the current master. However, it is unclear what specific harm this would cause other than an unnecessary failover. Discuss.
Hi @jberkus, I have a recipe against the first race condition. It is pretty simple but effective. "Get leader lock" and "set endpoint" will happen absolutely atomically if Patroni uses a configuration like this:

kubernetes:
  namespace: default
  labels:
    application: spilo
  scope_label: cluster-name
  role_label: role
  use_endpoints: true
  pod_ip: 1.2.3.4
  ports:  # optional, defaults to 5432
  - port: 5432
    name: postgresql  # optional; should not be set, or set to the same value
                      # as in the corresponding Service, see
                      # https://github.com/kubernetes-incubator/client-python/blob/master/kubernetes/docs/V1EndpointPort.md
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 1.2.3.4:5432

Unfortunately it will not really work on OpenShift, because it disallows setting IPs from the range allocated to Pods as subsets of Endpoints.
And regarding the second one, the client-based TTL: this implementation can tolerate an arbitrary clock offset between hosts, since only the relative clock rate matters. With TTL=30 and loop_wait=10, the clock on one node would have to run three times faster than on the node where the master runs to hit such a race condition.
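In numbers, the claim works out like this (a back-of-the-envelope check, not measured data):

# The master renews the lock every loop_wait seconds of its own clock, so a
# replica only sees the lock expire if its clock advances by at least `ttl`
# seconds during the master's `loop_wait` window.
ttl = 30        # seconds, master lock TTL
loop_wait = 10  # seconds, lock renewal interval on the master
print(ttl / loop_wait)  # 3.0 -> the replica's clock would have to run 3x faster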
I'll bring up the limitation with the OpenShift team.
Closing this since we're no longer discussing it.
@ants @CyberDem0n
This issue is to preserve discussion of possible and known race conditions in Patroni mastering and failover, especially in Kubernetes-native Patroni.
Summary: since any container/pod can be suspended at any time, several different race conditions on the sequence of obtaining the master lock and becoming master are possible.