
No transfer master role if current failed #1728

Closed
and111016 opened this issue Oct 5, 2020 · 4 comments

Comments

@and111016

Describe the bug
Patroni doesn't demote a failed master after the postmaster stops unexpectedly.
Consul de-registered the master tag for the service.
The cluster was restored only after executing a manual failover with patronictl.
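The mismatch between Patroni and Consul can be observed with a couple of quick checks (a minimal sketch; the addresses and service name are taken from the logs below and may differ in other setups):

# The leader's Patroni REST API hangs or errors while the postmaster is stopped
curl -s --max-time 5 http://10.0.0.1:8008/master; echo "exit=$?"

# Consul still lists the db-main service, but without the 'master' tag
curl -s http://localhost:8500/v1/catalog/service/db-main | python3 -m json.tool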

To Reproduce
Steps to reproduce the behavior:
As described in #1371 (thanks to @kmsarabu):
ps -ef | grep 'postgres .* --listen_addresses=.* --port=' | grep -v grep | awk '{print $2}' | xargs kill -STOP
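To roll the reproduction back and check the cluster state afterwards (assuming the same process-matching pattern and a default configuration path; adjust as needed):

# Resume the stopped postgres processes
ps -ef | grep 'postgres .* --listen_addresses=.* --port=' | grep -v grep | awk '{print $2}' | xargs kill -CONT

# Ask Patroni which node holds the leader lock now
patronictl -c /etc/patroni.yml list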

Expected behavior
The failed leader should be demoted and a new leader promoted after the service check fails.

Environment

  • Patroni version: 1.6.4
  • PostgreSQL version: PostgreSQL 11.8
  • DCS (and its version): Consul 1.7.1

Patroni configuration file
patronictl show-config

loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_command: mkdir -p ../wal_archive && test ! -f ../wal_archive/%f && cp %p ../wal_archive/%f
    archive_mode: 'on'
    archive_timeout: 1800s
    hot_standby: 'on'
    max_replication_slots: 5
    max_wal_senders: 10
    wal_keep_segments: 8
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
ttl: 30


**Have you checked Patroni logs?**

Sep 30 18:55:50 app-1 patroni[5395]: 2020-09-30 18:55:50,916 INFO: Lock owner: app-1; I am app-1
Sep 30 18:55:50 app-1 patroni[5395]: 2020-09-30 18:55:50,918 INFO: no action. i am the leader with the lock
Sep 30 18:56:00 app-1 patroni[5395]: 2020-09-30 18:56:00,916 INFO: Lock owner: app-1; I am app-1
Sep 30 18:56:00 app-1 patroni[5395]: 2020-09-30 18:56:00,918 INFO: no action. i am the leader with the lock
Sep 30 18:56:10 app-1 patroni[5395]: 2020-09-30 18:56:10,916 INFO: Lock owner: app-1; I am app-1
Sep 30 18:56:10 app-1 patroni[5395]: 2020-09-30 18:56:10,918 INFO: no action. i am the leader with the lock
Sep 30 18:56:17 app-1 patroni[5395]: 2020-09-30 18:56:17,637 ERROR: get_postgresql_status
Sep 30 18:56:17 app-1 patroni[5395]: Traceback (most recent call last):
Sep 30 18:56:17 app-1 patroni[5395]: File "/usr/local/lib/python3.6/site-packages/patroni/api.py", line 443, in get_postgresql_status
Sep 30 18:56:17 app-1 patroni[5395]: self.server.patroni.postgresql.lsn_name), retry=retry)[0]
Sep 30 18:56:17 app-1 patroni[5395]: File "/usr/local/lib/python3.6/site-packages/patroni/api.py", line 416, in query
Sep 30 18:56:17 app-1 patroni[5395]: return self.server.query(sql, *params)
Sep 30 18:56:17 app-1 patroni[5395]: File "/usr/local/lib/python3.6/site-packages/patroni/api.py", line 503, in query
Sep 30 18:56:17 app-1 patroni[5395]: raise e
Sep 30 18:56:17 app-1 patroni[5395]: File "/usr/local/lib/python3.6/site-packages/patroni/api.py", line 499, in query
Sep 30 18:56:17 app-1 patroni[5395]: cursor.execute(sql, params)
Sep 30 18:56:17 app-1 patroni[5395]: psycopg2.OperationalError: server closed the connection unexpectedly
Sep 30 18:56:17 app-1 patroni[5395]: This probably means the server terminated abnormally
Sep 30 18:56:17 app-1 patroni[5395]: before or while processing the request.
Sep 30 18:56:17 app-1 patroni[5395]: 2020-09-30 18:56:17,638 ERROR: get_postgresql_status
Sep 30 18:56:17 app-1 patroni[5395]: Traceback (most recent call last):
Sep 30 18:56:17 app-1 patroni[5395]: File "/usr/local/lib/python3.6/site-packages/patroni/api.py", line 499, in query
Sep 30 18:56:17 app-1 patroni[5395]: cursor.execute(sql, params)
Sep 30 18:56:17 app-1 patroni[5395]: psycopg2.OperationalError: server closed the connection unexpectedly
Sep 30 18:56:17 app-1 patroni[5395]: This probably means the server terminated abnormally
Sep 30 18:56:17 app-1 patroni[5395]: before or while processing the request.
Sep 30 18:56:17 app-1 patroni[5395]: During handling of the above exception, another exception occurred:
Sep 30 18:56:17 app-1 patroni[5395]: Traceback (most recent call last):
Sep 30 18:56:17 app-1 patroni[5395]: File "/usr/local/lib/python3.6/site-packages/patroni/api.py", line 443, in get_postgresql_status
Sep 30 18:56:17 app-1 patroni[5395]: self.server.patroni.postgresql.lsn_name), retry=retry)[0]
Sep 30 18:56:17 app-1 patroni[5395]: File "/usr/local/lib/python3.6/site-packages/patroni/api.py", line 416, in query
Sep 30 18:56:17 app-1 patroni[5395]: return self.server.query(sql, *params)
Sep 30 18:56:17 app-1 patroni[5395]: File "/usr/local/lib/python3.6/site-packages/patroni/api.py", line 504, in query
Sep 30 18:56:17 app-1 patroni[5395]: raise PostgresConnectionException('connection problems')
Sep 30 18:56:17 app-1 patroni[5395]: patroni.exceptions.PostgresConnectionException: 'connection problems'
...
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44,450 INFO: Lock owner: app-1; I am app-1
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44,450 INFO: Lock owner: app-1; I am app-1
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44,451 INFO: starting as readonly because i had the session lock
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44.652 MSK [4753] LOG: listening on IPv4 address "0.0.0.0", port 5432
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44.653 MSK [4753] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44.655 MSK [4753] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44,664 INFO: postmaster pid=4753
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44.689 MSK [4753] LOG: redirecting log output to logging collector process
Sep 30 20:10:44 app-1 patroni[5395]: 2020-09-30 20:10:44.689 MSK [4753] HINT: Future log output will appear in directory "log".
Sep 30 20:10:44 app-1 patroni[5395]: localhost:5432 - rejecting connections
Sep 30 20:10:44 app-1 patroni[5395]: localhost:5432 - rejecting connections
Sep 30 20:10:45 app-1 patroni[5395]: localhost:5432 - accepting connections
Sep 30 20:10:45 app-1 patroni[5395]: 2020-09-30 20:10:45,848 INFO: Lock owner: app-1; I am app-1
Sep 30 20:10:45 app-1 patroni[5395]: 2020-09-30 20:10:45,848 INFO: establishing a new patroni connection to the postgres cluster
Sep 30 20:10:45 app-1 patroni[5395]: 2020-09-30 20:10:45,864 INFO: Got response from app-2 http://10.0.0.3:8008/patroni: {"state": "running", "postmaster_start_time": "2020-08-27 15:56:54.462 MSK", "role": "replica", "server_version": 110008, "cluster_unlocked": false, "xlog": {"received_location": 529186947072, "replayed_location": 529186944744, "replayed_timestamp": "2020-09-30 18:53:55.857 MSK", "paused": false}, "timeline": 10, "database_system_identifier": "6839947303312909016", "patroni": {"version": "1.6.4", "scope": "db-main"}}
Sep 30 20:10:45 app-1 patroni[5395]: 2020-09-30 20:10:45,963 INFO: Register service db-main, params {'service_id': 'db-main/app-1', 'address': '10.0.0.1', 'port': 5432, 'check': {'http': 'http://10.0.0.1:8008/replica', 'interval': '5s', 'DeregisterCriticalServiceAfter': '150.0s'}, 'tags': ['replica']}
Sep 30 20:10:45 app-1 patroni[5395]: 2020-09-30 20:10:45,971 INFO: manual failover: demoting myself
Sep 30 20:10:46 app-1 patroni[5395]: 2020-09-30 20:10:46,511 WARNING: Connection pool is full, discarding connection: localhost
Sep 30 20:10:46 app-1 patroni[5395]: 2020-09-30 20:10:46,599 INFO: Deregister service db-main/app-1
Sep 30 20:10:46 app-1 patroni[5395]: 2020-09-30 20:10:46,725 INFO: Leader key released
Sep 30 20:10:48 app-1 patroni[5395]: 2020-09-30 20:10:48,733 INFO: Local timeline=10 lsn=7B/35FFF758
Sep 30 20:10:48 app-1 patroni[5395]: 2020-09-30 20:10:48,737 INFO: master_timeline=11
Sep 30 20:10:48 app-1 patroni[5395]: 2020-09-30 20:10:48,737 INFO: master: history=1 0/7000098 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 2 0/8000098 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 3 0/9000098 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 4 0/A000098 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 5 3/E9000098 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 6 3/EA000098 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 7 10/E2000098 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 8 46/79FFEF70 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 9 55/D4000098 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 10 7B/35FFF6E8 no recovery target specified
Sep 30 20:10:48 app-1 patroni[5395]: 2020-09-30 20:10:48,740 WARNING: Postgresql is not running.
Sep 30 20:10:48 app-1 patroni[5395]: 2020-09-30 20:10:48,740 INFO: Lock owner: app-2; I am app-1


**Have you checked PostgreSQL logs?**

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-30 18:56:16.722 MSK [6757] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2020-09-30 18:56:16.732 MSK [27959] FATAL: could not connect to the primary server: FATAL: the database system is in recovery mode
2020-09-30 18:56:21.732 MSK [28056] FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "10.0.0.1" and accepting
TCP/IP connections on port 5432?
...
2020-09-30 18:57:26.804 MSK [28522] FATAL: could not connect to the primary server: FATAL: the database system is starting up
2020-09-30 18:57:31.808 MSK [28555] FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "10.0.0.1" and accepting

...
2020-09-30 20:10:46.674 MSK [27026] FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "10.0.0.1" and accepting
TCP/IP connections on port 5432?
2020-09-30 20:10:46.675 MSK [2609] LOG: received promote request
2020-09-30 20:10:46.675 MSK [2609] LOG: redo done at 7B/35FFD6C0
2020-09-30 20:10:46.675 MSK [2609] LOG: last completed transaction was at log time 2020-09-30 18:53:55.857465+03
2020-09-30 20:10:46.686 MSK [2609] LOG: selected new timeline ID: 11
2020-09-30 20:10:46.811 MSK [2609] LOG: archive recovery complete
2020-09-30 20:10:47.028 MSK [2606] LOG: database system is ready to accept connections
2020-09-30 20:11:30.000 MSK [4927] LOG: database system was interrupted while in recovery at log time 2020-09-30 20:10:47 MSK
2020-09-30 20:11:30.000 MSK [4927] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2020-09-30 20:11:30.000 MSK [4928] FATAL: the database system is starting up
2020-09-30 20:11:30.009 MSK [4930] FATAL: the database system is starting up
2020-09-30 20:11:30.200 MSK [4927] LOG: entering standby mode
2020-09-30 20:11:30.211 MSK [4927] LOG: redo starts at 7B/351219C0
2020-09-30 20:11:30.271 MSK [4927] LOG: consistent recovery state reached at 7B/360266E8
2020-09-30 20:11:30.271 MSK [4927] LOG: invalid record length at 7B/360266E8: wanted 24, got 0
2020-09-30 20:11:30.272 MSK [4924] LOG: database system is ready to accept read only connections
2020-09-30 20:11:30.280 MSK [4934] LOG: started streaming WAL from primary at 7B/36000000 on timeline 11


**Have you tried to use GitHub issue search?**
I found only #1371


**Additional context**
@CyberDem0n
Member

And what is the reason for opening a duplicate?

@and111016
Author

and111016 commented Oct 5, 2020

I agree this issue is a duplicate, but in #1371 your comment dismissed it as an unrealistic case:

Sorry, but no. When you send SIGSTOP you just shoot your own foot, and it hurts badly.

This issue is a real-life case that happened last week. The application that killed the postmaster is under investigation now.
But I would expect the cluster manager to resolve this situation itself rather than wait for manual action from ops.
Can you recommend a solution based on existing functionality?
I can't find one.
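A possible stopgap outside of Patroni itself would be an external watchdog along these lines (a minimal sketch, not built-in functionality; the REST endpoint, config path, cluster and candidate names are assumptions for this setup, and patronictl flags may differ between versions):

#!/bin/bash
# Hypothetical external watchdog; run it from a node other than the current leader.
LEADER_API=http://10.0.0.1:8008/master
CLUSTER=db-main
CANDIDATE=app-2
FAILS=0
while true; do
    # /master answers 200 only while the node still acts as the leader
    if curl -sf --max-time 5 "$LEADER_API" > /dev/null; then
        FAILS=0
    else
        FAILS=$((FAILS + 1))
    fi
    if [ "$FAILS" -ge 3 ]; then
        # Flags may differ between Patroni versions; see `patronictl failover --help`
        patronictl -c /etc/patroni.yml failover "$CLUSTER" --candidate "$CANDIDATE" --force
        FAILS=0
    fi
    sleep 10
done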

@and111016
Author

and111016 commented Oct 6, 2020

If you think it is necessary and possible, I will not object to closing this issue once all the necessary information has been transferred to #1371.

@CyberDem0n
Member

The main question here is who is going to work on it.
We have about 2000 clusters managed by Patroni and have never seen the postmaster "stopped" or not accepting connections without a clear reason, therefore we are not really interested in working on it.

Maybe you will volunteer, @and111016?
