Standby Cluster questions #1151

Closed
anikin-aa opened this issue Aug 19, 2019 · 10 comments

Comments

@anikin-aa
Contributor

Hi!

I was playing around with the standby cluster feature and have some questions about it.

What is the best way to "promote" a standby cluster in case the primary DC is not available?

What I did was remove the standby cluster keys from the DCS. Is that correct?

@RafiaSabih
Contributor

Hello,
As mentioned in the documentation, the best way to promote a standby cluster is through patronictl edit-config.
Have a look there and let us know if you have more questions.
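
For reference, a minimal sketch of that promotion step; the config path and cluster name below are placeholders, not taken from this issue:

    patronictl -c /etc/patroni.yml edit-config my-cluster
    # In the editor, remove the whole standby_cluster section, e.g.:
    #   standby_cluster:
    #     host: 10.105.32.128
    #     port: 5432
    # Once the change is applied, Patroni promotes the standby leader.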

@anikin-aa
Contributor Author

@RafiaSabih thanks. Yes, I did the same, but by changing the keys in the DCS directly.
I also have another question: so there is no automatic promotion of the standby cluster when the primary is not available?

@CyberDem0n
Member

The standby cluster doesn't know anything about the primary.
It doesn't even need a connection to it, because you can feed the standby cluster from a WAL archive.
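
For illustration, a minimal sketch of a standby_cluster section fed only from a WAL archive; the archive path is an assumption, not part of this issue:

    standby_cluster:
      # no host/port: the standby leader replays WAL from the archive
      # instead of streaming from a primary
      restore_command: cp /path/to/wal_archive/%f %p
      # replicas would also need an archive/backup-based create_replica_method
      # (e.g. a wal-e or barman script) rather than basebackup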

@anikin-aa
Contributor Author

anikin-aa commented Aug 20, 2019

The standby cluster doesn't know anything about the primary.

My configuration is:

    standby_cluster:
      host: 10.105.32.128
      port: 5432
      primary_slot_name: patroni
      create_replica_methods:
      - basebackup

where 10.105.32.128 is the IP of the primary cluster.

When I turn off the primary cluster, the following errors occur:

[2019-08-20 09:40:09,314][INFO] no action.  i am the standby leader with the lock
[2019-08-20 09:40:14.120 UTC]FATAL:  could not connect to the primary server: could not connect to server: Connection refused

@RafiaSabih
Contributor

Yeah, since you specified the host, such errors are expected. With just a restore command (feeding from a WAL archive instead of a host) there won't be such errors.

Regarding auto promotion: there is nothing like automatic promotion of the standby cluster yet, since there is no way to know whether the primary has died or there is just a network issue.

@anikin-aa
Contributor Author

since there is no way to know whether the primary has died or there is just a network issue.

So maybe it is worth adding some kind of configurable timeout, after which the standby cluster can be promoted?

@RafiaSabih
Contributor

As far as I know, for automatic promotion there is pull request #672, which decides whether promotion is possible using quorum commit.

@CyberDem0n
Member

So maybe it is worth adding some kind of configurable timeout, after which the standby cluster can be promoted?

No way. Do it right or don't do it at all. In order to do it right there must be a global quorum across multiple data-centers (two is not enough).

@RafiaSabih #672 has nothing to do with it. It is about making use of the quorum commit feature in Postgres.
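
For context, the quorum commit feature referred to here is PostgreSQL's quorum-based synchronous replication (the ANY form of synchronous_standby_names, available since PostgreSQL 10); the standby names below are placeholders:

    # postgresql.conf: commit waits for acknowledgement from any 2 of the 3 listed standbys
    synchronous_standby_names = 'ANY 2 (node1, node2, node3)'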

@alexeyklyukin
Contributor

alexeyklyukin commented Aug 21, 2019

since there is no way to know whether the primary has died or there is just a network issue.

So maybe it is worth adding some kind of configurable timeout, after which the standby cluster can be promoted?

The problem is that you would also have to tell the former "primary" cluster to demote (otherwise, hello split-brain), and how are you going to do that while it's unavailable? So this has to be done via some fencing mechanism that is tied to the cluster's participation in a consistency layer (e.g. Raft) over multiple DCs. The current standby cluster implementation in Patroni does none of this. One reason is that running Etcd over multiple DCs with high-latency links may lead to a significant performance impact and requires tuning, and other similar systems likely behave the same way. Another is that such a failover will force the DB clients to connect over high-latency links, which may break critical architectural assumptions of your applications. Yet another is that it will not provide any additional reliability guarantees over what you can achieve with a single Patroni cluster spread over multiple DCs.

Which leads to the point that if you need a failover between DCs with reasonable latencies (say, less than 50-100 ms), you should simply run a single Patroni cluster spread over those DCs. Make sure you have at least 3 independent datacenters (given that you also run Etcd or a similar supported system there); otherwise you simply lose the majority once a DC goes down, rendering the whole setup pointless.
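
As an illustration only, the DCS part of a Patroni configuration for such a single cluster spread over three DCs might look like the sketch below; the etcd endpoints are placeholders and assume at least one etcd member per data center:

    etcd:
      hosts:
      - etcd-dc1.example.com:2379
      - etcd-dc2.example.com:2379
      - etcd-dc3.example.com:2379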

@anikin-aa
Contributor Author

@CyberDem0n @alexeyklyukin thanks for your answers, everything is clear to me now.
