Cascade replication in OpenShift in case of failover #984

Closed
ainlolcat opened this issue Feb 28, 2019 · 9 comments

Comments

@ainlolcat
Contributor

We tried to set up cascade replication in OpenShift and found several problems:

  1. The replicatefrom tag requires a member name, but that is the pod name, which can change after a DeploymentConfiguration change, a VM restart, or similar. This is a minor concern, because Patroni will fall back to the master and the cluster will survive, but sooner or later it still requires reconfiguration from outside Patroni to restore cascade replication.
  2. Problems with a failed pod: broken replication, a frozen Docker daemon, VM loss. The pod will still be considered a member at least until OpenShift shuts it down. The cluster cannot survive such a problem.

Is there a safe way to set up cascade replication 1>2>3 and make sure everything keeps working even if any of the pods freezes or dies? Maybe we can hook the method patroni.ha.Ha#_get_node_to_follow to propose a node for replication instead of the static tag?
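
A minimal sketch of what such a hook might look like, assuming a future extension point that receives the member list and returns a replication source. The function name, the member dict layout, and the replication-source label are all hypothetical, not existing Patroni code:

def propose_node_to_follow(members, my_name, leader_name):
    """members: list of dicts like {'name': ..., 'labels': {...}}.
    Return the name of the member to stream from, or None to use the leader."""
    for member in members:
        # Match on a stable label instead of the (unstable) pod name.
        if member['name'] not in (my_name, leader_name) \
                and member['labels'].get('replication-source'):
            return member['name']
    return None  # nothing matched: fall back to streaming from the leader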

@ainlolcat
Contributor Author

I suppose this issue has something in common with #422, but with a more specific cluster configuration. We could express this as a label on the pod with a role (like the existing master/replica, but more specific, e.g. main replica/second replica) and set up a chain of priorities like [second replica, main replica, master]. Alternatively, if _get_node_to_follow provided an extension point and passed the list of members along with their replication lag and labels, we could pick the replication target ourselves.
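
A rough sketch of the priority-chain idea, assuming hypothetical role labels and a simple member list; none of these names exist in Patroni today:

# Preference order: follow the second replica if it is alive, otherwise the
# main replica, otherwise the master.
PRIORITY = ['second replica', 'main replica', 'master']

def pick_by_priority(members):
    """members: list of dicts like {'name': ..., 'role': ..., 'alive': True}."""
    for role in PRIORITY:
        for member in members:
            if member['role'] == role and member['alive']:
                return member['name']
    return None  # nothing suitable; let Patroni use its default behaviour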

@CyberDem0n
Member

As proposed in #422, we can introduce a new tag, named for example cascading, which would mark this specific node as a cascading replica.

What should be the behavior of untagged nodes? Right now they always replicate from the master.
The new behavior could be:

if any cascading nodes:
  stream from cascading node
else:
  stream from the master

There is actually another important question: how to choose the cascading replica to stream from? Is random good enough, or should we somehow load-balance across them? What should happen if the number of cascading replicas increases? Should the other replicas rebalance automatically and start streaming from the new node?
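
For illustration, a minimal sketch of that behaviour in Python, assuming a hypothetical cascading tag on each member and a plain random pick (one possible answer to the load-balancing question above); this is not existing Patroni code:

import random

def pick_cascading_source(members, my_name, leader_name):
    """Return a member name to stream from, or None to stream from the master."""
    cascading = [m for m in members
                 if m.get('tags', {}).get('cascading')
                 and m['name'] not in (my_name, leader_name)]
    if cascading:
        # Open question: is a random pick good enough, or should we load-balance?
        return random.choice(cascading)['name']
    return None  # no cascading replicas: keep streaming from the master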

@ainlolcat
Contributor Author

I think cascade replication depends on hardware and/or network. For example, I can allow 1>2>3>4 but don't want 1>3>2>4, because 3 and 4 are in a different network or have slower disks. So I prefer 1>2>3>4, 1>3>4, or 1>2>4. It would be nice to have a mechanism that lets us select the replication source with this internal knowledge in mind.

@CyberDem0n
Member

It would be nice to have a mechanism that lets us select the replication source with this internal knowledge in mind.

Well, the thing is that the current mechanism allows exactly this: every node can specify which node it wants to replicate from.

@ainlolcat
Contributor Author

We can specify it at start time, but we cannot change it if a node fails. For example, with 1>2>3>4, if node 2 is damaged, cannot start replication, or has any other problem (while its pod is still alive), we cannot switch to 1>3>4 without an external orchestrator that restarts pods 3 and 4 with the new tag in mind. Patroni will handle the situation in some cases (for example, if the pod with node2 is removed).

Examples:

  • If node2 cannot start postgres (no space left, FS damage, other problems), Patroni will wait and the cluster will be useless (at least I cannot find a check of whether a member is "healthy" as a replication source). We would need an arbiter that switches replication to 1>3>4 until an administrator fixes node2.
  • If node2 restarts, Patroni will change replication to 1>2 + 1>3>4 because node2 will get a new name in OpenShift. We would need an arbiter to restore replication after node2 is healthy again.
  • If node3 is lost, I would prefer replication 1>2>4, but Patroni will use 1>2 + 1>4 because the default fallback is to use the leader.

So the current mechanism assumes Patroni-over-Patroni or manual operations. One option would be to mimic the bootstrap mechanics, where Patroni has a default behaviour but allows configuring a custom method.
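
A hedged sketch of that custom-method idea, modelled on how custom bootstrap methods are configurable. Patroni exposes no such hook today; the function name, the cluster_view layout, and the health field are assumptions:

def custom_node_to_follow(cluster_view):
    """cluster_view: list of dicts like
    {'name': ..., 'role': ..., 'lag': ..., 'healthy': True}.
    Return the member name to replicate from, or None to follow the leader."""
    candidates = [m for m in cluster_view
                  if m['healthy'] and m['role'] != 'leader' and m['lag'] is not None]
    if not candidates:
        return None  # e.g. 1>3>4 collapses to following the leader directly
    # Prefer the healthy replica with the smallest replication lag.
    return min(candidates, key=lambda m: m['lag'])['name']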

@CyberDem0n
Member

1>2>3>4 is already quite a long chain, but you are right, there is no readiness check in the code that selects the node to replicate from. Therefore, as long as 2 is registered in the DCS, 3 will try to replicate from it.
If 2 goes away, 3 will start replicating from 1.

If node2 restarts, Patroni will change replication to 1>2 + 1>3>4 because node2 will get a new name in OpenShift. We would need an arbiter to restore replication after node2 is healthy again.

That's sad but true.

If node3 is lost, I would prefer replication 1>2>4, but Patroni will use 1>2 + 1>4 because the default fallback is to use the leader.

Yeah, this is really how it is programmed to work. Complex topologies are not easy to describe.

If you come up with a nice way of describing such topologies we would be very happy to support it.

@ainlolcat
Contributor Author

Without fixed names and roles it will be hard to describe such a topology. I can design an algorithm for our topology, but I cannot generalize it for every possible one. My best proposal is an extension point with default implementations like these:

  • none - no cascade replication (default if no other settings are provided)
  • replicatefrom - the current implementation based on replicatefrom tags (default if replicatefrom is specified)
  • cascade - an implementation based on other mechanics (for example minimum ping to the replication source, role of the node, and health checks).
  • custom - a method that takes the current cluster description and returns a single member name or None, so the user can specify a method just like for bootstrap.

For the cascade method we need to build a graph without cycles, with the best load spread and minimum overhead. I suppose we can use network lag, and maybe overall node performance (I/O wait?) and load, to put each node in the right place in the graph. For a few nodes we can simply check every possible configuration with the given master and select the one with the lowest penalty. For our cluster such a graph would eventually look like 1 (master) > 2 (closest to master) > 3 (can replicate from 2; we prefer to leave the master alone, as it is already under load) > 4 (3 is the closest free node available). If any member is shut down or damaged, we need to calculate another topology and apply it.
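
An illustrative sketch of the lowest-penalty idea, restricted to simple chains (not arbitrary trees) and scored by a caller-supplied penalty function such as measured network latency; this is hypothetical, not Patroni code:

from itertools import permutations

def best_chain(master, replicas, penalty):
    """penalty(upstream, downstream) -> cost of streaming downstream from upstream."""
    best, best_cost = None, float('inf')
    for order in permutations(replicas):
        chain = [master] + list(order)
        cost = sum(penalty(chain[i], chain[i + 1]) for i in range(len(chain) - 1))
        if cost < best_cost:
            best, best_cost = chain, cost
    return best  # e.g. ['node1', 'node2', 'node3', 'node4']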

@ainlolcat
Contributor Author

I want to implement this feature, so it would be nice to come to an agreement so we can sync branches in the future.

@walbertus
Contributor

Cascade replication without a specific replicatefrom seems useful.
If the implementation issue is the topology, can we replicate from a random standby, without rebalancing, for an initial implementation? This random standby choice would also help with issue #422.
