Standby Cluster questions #1151

Closed
anikin-aa opened this issue Aug 19, 2019 · 10 comments

Comments

@anikin-aa
Contributor

Hi!

I was playing around with the standby cluster feature and have some questions about it.

What is the best way to "promote" a standby cluster in case the primary DC is not available?

What I did was remove the standby cluster keys from the DCS. Is that correct?

@RafiaSabih
Contributor

Hello,
As mentioned in the documentation, the best way to promote a standby cluster is through patronictl edit-config.
Have a look there and let us know if you have more questions.
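
For reference, a minimal sketch of that promotion step; the config path and cluster name below are placeholders, not taken from this issue:

    patronictl -c /etc/patroni.yml edit-config my-cluster
    # In the editor, remove the whole standby_cluster section, e.g.:
    #   standby_cluster:
    #     host: 10.105.32.128
    #     port: 5432
    # Once the change is applied, Patroni promotes the standby leader.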

@anikin-aa
Contributor Author

@RafiaSabih thanks. Yes, I did the same, but by changing the keys in the DCS directly.
I also have another question: so there is no automatic promotion of the standby cluster when the primary is not available?

@CyberDem0n
Member

The standby cluster doesn't know anything about the primary.
It doesn't even need a connection to it, because you can feed the standby cluster from a WAL archive.
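
For illustration, a minimal sketch of a standby_cluster section fed only from a WAL archive; the archive path is an assumption, not part of this issue:

    standby_cluster:
      # no host/port: the standby leader replays WAL from the archive
      # instead of streaming from a primary
      restore_command: cp /path/to/wal_archive/%f %p
      # replicas would also need an archive/backup-based create_replica_method
      # (e.g. a wal-e or barman script) rather than basebackup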

@anikin-aa
Contributor Author

anikin-aa commented Aug 20, 2019

The standby cluster doesn't know anything about the primary.

My configuration is:

    standby_cluster:
      host: 10.105.32.128
      port: 5432
      primary_slot_name: patroni
      create_replica_methods:
      - basebackup

where 10.105.32.128 is the IP of the primary cluster.

When I turn off the primary cluster, the following errors occur:

[2019-08-20 09:40:09,314][INFO] no action.  i am the standby leader with the lock
[2019-08-20 09:40:14.120 UTC]FATAL:  could not connect to the primary server: could not connect to server: Connection refused

@RafiaSabih
Contributor

Yeah, since you specified the host, such errors are expected. With just a restore command (feeding from a WAL archive instead of a host) there won't be such errors.

Regarding auto promotion: there is nothing like automatic promotion of the standby cluster yet, since there is no way to know whether the primary has died or there is just a network issue.

@anikin-aa
Contributor Author

since there is no way to know whether the primary has died or there is just a network issue.

So maybe it is worth adding some kind of configurable timeout, after which the standby cluster can be promoted?

@RafiaSabih
Contributor

As far as I know, for automatic promotion there is pull request #672, which decides whether promotion is possible using quorum commit.

@CyberDem0n
Member

So maybe it is worth adding some kind of configurable timeout, after which the standby cluster can be promoted?

No way. Do it right or don't do it at all. In order to do it right there must be a global quorum across multiple data-centers (two is not enough).

@RafiaSabih #672 has nothing to do with it. It is about making use of the quorum commit feature in Postgres.
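
For context, the quorum commit feature referred to here is PostgreSQL's quorum-based synchronous replication (the ANY form of synchronous_standby_names, available since PostgreSQL 10); the standby names below are placeholders:

    # postgresql.conf: commit waits for acknowledgement from any 2 of the 3 listed standbys
    synchronous_standby_names = 'ANY 2 (node1, node2, node3)'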

@alexeyklyukin
Contributor

alexeyklyukin commented Aug 21, 2019

since there is no way to know whether the primary has died or there is just a network issue.

So maybe it is worth adding some kind of configurable timeout, after which the standby cluster can be promoted?

The problem is that you would also have to tell the former "primary" cluster to demote (otherwise, hello split-brain), and how are you going to do that while it's unavailable? So this has to be done via some fencing mechanism that is tied to the cluster's participation in a consistency layer (e.g. Raft) over multiple DCs. The current standby cluster implementation in Patroni does none of this. One reason is that running Etcd over multiple DCs with high-latency links may lead to a significant performance impact and requires tuning, and other similar systems likely behave the same way. Another is that such a failover will force the DB clients to connect over high-latency links, which may break critical architectural assumptions of your applications. Yet another is that it will not provide any additional reliability guarantees over what you can achieve with a single Patroni cluster spread over multiple DCs.

Which leads to the point that if you need a failover between DCs with reasonable latencies (say, less than 50-100 ms), you should simply run a single Patroni cluster spread over those DCs. Make sure you have at least 3 independent datacenters (given that you also run Etcd or a similar supported system there); otherwise you simply lose the majority once a DC goes down, rendering the whole setup pointless.
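
As an illustration only, the DCS part of a Patroni configuration for such a single cluster spread over three DCs might look like the sketch below; the etcd endpoints are placeholders and assume at least one etcd member per data center:

    etcd:
      hosts:
      - etcd-dc1.example.com:2379
      - etcd-dc2.example.com:2379
      - etcd-dc3.example.com:2379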

@anikin-aa
Contributor Author

@CyberDem0n @alexeyklyukin thanks for your answers, everything is clear to me now.
