Race Condition With restore_command Breaks Standby Cluster Replication #1747
Comments
Hi @bradnicholson, thank you for the detailed report. It is Postgres that decides which file to restore, not Patroni.

There is also one little issue with the standby cluster configuration: you need to define the … The difference between …
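To illustrate the first point: during recovery it is Postgres that invokes the configured restore_command, not Patroni. A rough sketch of the probing it does around a timeline switch, using the file names from this report and a placeholder pgBackRest stanza:

```sh
# Postgres first asks the archive for the next timeline's history file...
pgbackrest --stanza=demo archive-get 0000000A.history "pg_wal/RECOVERYHISTORY"
# ...and when that is not found, it asks for the next segment on the current timeline:
pgbackrest --stanza=demo archive-get 0000000900000000000000B6 "pg_wal/RECOVERYXLOG"
```

The destination paths shown are only approximations of the %p value Postgres substitutes.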
Thanks @CyberDem0n - I did suspect a PG problem, but wanted to float it over here in case there was anything overt happening on the Patroni side. The different endpoints are an implementation detail - they are fine and will end up in the same place.

Thanks for this - I'll correct it. I'll take this to PG.
Describe the bug
When performing a clean Patroni switchover on a formation that is using WAL archiving and has a restore_command defined on the standby cluster to use that WAL archive, there is a race condition where the standby cluster looks for the .history file in the WAL archive before it is generated. It doesn't find it, and then looks for the next WAL file on the current timeline (which will never exist). At that point, the standby cluster waits on that non-existent WAL file and replication stops. Restarting the standby cluster fixes replication.
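For context, a minimal sketch of the kind of setup described, assuming pgBackRest as the archive tool; the stanza name, host, and port are placeholders, not the actual configuration from this report:

```yaml
# Source formation: push WAL to the archive.
postgresql:
  parameters:
    archive_mode: "on"
    archive_command: pgbackrest --stanza=demo archive-push %p

# Standby cluster: stream from the source formation and fall back to the archive.
bootstrap:
  dcs:
    standby_cluster:
      host: source-primary.example.com
      port: 5432
      restore_command: pgbackrest --stanza=demo archive-get %f "%p"
```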
Here is a summarized timeline of events. m-0 and m-1 are the source formation; m-0 is the primary to start. Both the primary and standby clusters are on timeline 9.
2020-11-04 14:51:11 m-0 switchover request
2020-11-04 14:51:11 m-0 is shutdown
2020-11-04 14:51:13.015 replication connection terminated on both m-1 and the standby cluster
2020-11-04 14:51:13.015 m-1 and standby cluster look for 0000000A.history in archive, do not find it
2020-11-04 14:51:17,137 m-1 and standby cluster look for 0000000900000000000000B6 in archive (this file won't exist)
2020-11-04 14:51:17,119 m-1 promotes to primary INFO: promoted self to leader by acquiring session lock
2020-11-04 14:51:17,137 m-1 looks for 0000000900000000000000B6 in the archive again - does not find it
2020-11-04 14:51:17.557 m-1 promotion at the DB: received promote request
2020-11-04 14:51:17.557 m-1 promotion at the DB: redo done at 0/B6000028
2020-11-04 14:51:19.025 m-1 selected new timeline ID: 10
2020-11-04 14:51:19.852 m-1 archive recovery complete
2020-11-04 14:51:20 m-1 gets 00000009.history in the archive
2020-11-04 14:51:20,177 m-1 pushes 0000000A.history to archive
At this point, the standby cluster is no longer looking for 0000000A.history and is stuck on timeline 9, waiting for 0000000900000000000000B6.

Possibly interesting: we have pgBackRest configured to log the archive-get command called by the restore_command, and the standby cluster is not issuing additional archive-get calls for 0000000900000000000000B6 after the initial attempt. It just logs "no action. i am the standby leader with the lock" in its loop, with no indication of any errors.
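A sketch of how that archive-get logging could be wired up; the log-level option shown and the stanza name are assumptions, not the actual command used here:

```yaml
standby_cluster:
  restore_command: pgbackrest --stanza=demo --log-level-file=detail archive-get %f "%p"
```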
To Reproduce
This is happening with no DB activity in testing.
Perform a controlled switchover via Patroni. In my tests, sometimes it works and replication continues on the standby cluster. Other times, the race condition is hit and replication stalls. The standby cluster waits on the next WAL file from the old timeline (which will never exist) and does not connect back to the primary formation.
When I observe this, the switchover takes around 6 seconds from the switchover request on the old primary until the promote request is issued to Postgres on the new leader.
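A hedged example of the kind of controlled switchover used here; the config path and cluster name are placeholders:

```sh
# patronictl prompts for the current leader, a candidate, and when to run it ("now" for immediate).
patronictl -c /etc/patroni.yml switchover my-cluster
```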
Expected behavior
A clean Patroni switchover should not break replication to the standby cluster
Environment
Patroni configuration file
Source formation relevant bits
Standby cluster relevant bits:
Have you checked Patroni logs?
Yes, summarized timeline is above. Details below.
Have you checked PostgreSQL logs?
m-0 logs:
m-1 logs:
Standby Cluster logs:
Have you tried to use GitHub issue search?
Yes, didn't find anything.