No callbacks on failed start & bringing down a VIP on a failed master #1418

Closed
blogh opened this issue Feb 25, 2020 · 3 comments · Fixed by #1420

Comments

@blogh
Contributor

blogh commented Feb 25, 2020

Hi,

First and foremost: thank you for the software!

We have built a three-node cluster with the following software on each node:

  • etcd
  • patroni
  • postgresql

We use a VIP to direct connections to the master node. For this we use Patroni's callbacks.

  • We delete the VIP on:

    • any callback on a replica
    • the on_stop callback on the master
  • We create the VIP on:

    • any callback on the master except on_stop
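
A minimal sketch of these rules as a callback script, assuming Patroni calls the script with the action, the role, and the cluster name as arguments. The VIP address, the interface name, and the function names are placeholders for illustration and are not taken from the setup described here.

```python
#!/usr/bin/env python3
"""Hypothetical Patroni callback: decide whether to add or remove the VIP."""
import subprocess
import sys

# Placeholder values, not from the original setup.
VIP = "10.0.0.100/24"
IFACE = "eth0"


def vip_decision(action: str, role: str) -> str:
    """Return "add" or "remove" for a given callback action and role."""
    if role in ("master", "primary") and action != "on_stop":
        return "add"
    # Any callback on a replica, or on_stop on the master.
    return "remove"


def apply_vip(decision: str) -> None:
    """Run `ip addr add|del`; a non-zero exit (e.g. VIP already absent) is ignored."""
    verb = "add" if decision == "add" else "del"
    subprocess.run(["ip", "addr", verb, VIP, "dev", IFACE], check=False)


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Assumed argument order: <action> <role> <cluster name>
    apply_vip(vip_decision(sys.argv[1], sys.argv[2]))
```

Keeping the decision in a pure function makes the rules easy to check without touching the network stack. Note that such a script only runs when Patroni actually fires a callback, which is the limitation this issue is about.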

One of our crash scenarios involves removing $PGDATA/global/pg_control on the master server.

Patroni crashes with the following errors, since the data from pg_controldata is not available:

2020-02-25 14:41:18,104 ERROR: Error when calling pg_controldata
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/__init__.py", line 641, in controldata
    data = subprocess.check_output([self.pgcommand('pg_controldata'), self._data_dir], env=env)
  File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib64/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/pgsql-11/bin/pg_controldata', '/var/lib/pgsql/11/data']' returned non-zero exit status 1.

2020-02-25 14:41:18,105 ERROR: Exception during execution of long running task restarting after failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/async_executor.py", line 97, in run
    wakeup = func(*args) if args else func()
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/__init__.py", line 715, in follow
    self.start(timeout=timeout, block_callbacks=change_role, role=role)
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/__init__.py", line 418, in start
    configuration = self.config.effective_configuration
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/config.py", line 1027, in effective_configuration
    cvalue = parse_int(data[cname])
KeyError: 'max_connections setting'
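
The second traceback appears to follow from the first: once pg_controldata fails there is no parsed output, so the later lookup of 'max_connections setting' raises KeyError. A small sketch of that chain, using a hypothetical parser (parse_controldata and read_max_connections are illustrative names, not Patroni functions):

```python
def parse_controldata(output: str) -> dict:
    """Parse "key: value" lines, as printed by pg_controldata, into a dict."""
    data = {}
    for line in output.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            data[key.strip()] = value.strip()
    return data


def read_max_connections(data: dict):
    """Return max_connections from parsed control data, or None if absent."""
    value = data.get("max_connections setting")
    return int(value) if value is not None else None


healthy = parse_controldata("max_connections setting:              100\n")
broken = parse_controldata("")  # pg_controldata failed: nothing to parse

print(read_max_connections(healthy))  # 100
print(read_max_connections(broken))   # None; broken["max_connections setting"] would raise KeyError
```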

The failure is handled correctly by Patroni: the cluster fails over and the crashed primary is marked as failed. But in that case the on_start callback is never fired, since PostgreSQL is never started as a standby. In our case this means the VIP is not brought down on the crashed primary.

Is there another way to achieve our goal (bringing down the VIP) with this kind of crash?

It feels like the failure could be handled more gracefully.
What do you think about adding an on_failed callback?

Thank you in advance for your feedback.

Benoit

@ghost

ghost commented Feb 25, 2020

I’m interested in learning more about these callbacks - could you point me to relevant documentation?

We use vip-manager (https://github.com/cybertec-postgresql/vip-manager) to manage VIPs. It runs on each node and checks Consul (could also use etcd) regularly. This may work better for your needs, as there is no dependency on Patroni functioning correctly.
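
Roughly, the idea is that each node compares the leader key stored in the DCS with its own name and holds the VIP only while they match. A sketch of that decision, with made-up function and node names (this is not vip-manager's code, and reading the key from etcd or Consul is left out):

```python
from typing import Optional


def should_hold_vip(leader_key_value: Optional[str], node_name: str) -> bool:
    """True when the DCS leader key names this node; False otherwise."""
    return leader_key_value is not None and leader_key_value == node_name


# The leader key would be read from etcd or Consul on each poll (not shown).
print(should_hold_vip("node1", "node1"))  # True
print(should_hold_vip("node2", "node1"))  # False
print(should_hold_vip(None, "node1"))     # False
```

Because the decision depends only on what the DCS reports, a node whose Patroni has crashed would still release the VIP once the leader key changes or expires.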

@CyberDem0n
Member

@blogh yeah, looks like a bug.
It is supposed to call the on_role_change callback, but it didn't, due to a lack of exception handling in #703.

I will prepare a fix tomorrow.

As @caseyallenshobe correctly mentioned, it is better to rely on Cybertec vip-manager.

@blogh
Contributor Author

blogh commented Feb 25, 2020

Thank you guys for the feedback.

It felt weird to build layers upon layers of HA code, so we left out the Cybertec solution for now and used a custom script: https://github.com/daamien/ansible-role-patroni-vip

We will look into the cybertec solution again.
