Description
Overview
Created a Postgres cluster with 3 replicas. After the pgBackRest repo contents and one replica's $PGDATA directory were removed, that replica failed to start.
Environment
- Platform: Kubernetes
- Platform Version: v1.23.7
- PGO Image Tag: ubi8-5.6.0-0
- Postgres Version: 16
- Storage: hostpath
Steps to Reproduce
REPRO
- Run 'kubectl apply -k kustomize/postgres'.
- Run 'kubectl get pv | grep repo'.
- Run 'kubectl get pv pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281 -o yaml'.
below is the result:
.............
the hostPath is:
hostPath:
  path: /var/lib/data/local-path-provisioner/pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281_postgres-operator_hippo-repo1
..............
- Delete everything under the repo's hostPath directory:
run 'cd /var/lib/data/local-path-provisioner/pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281_postgres-operator_hippo-repo1'
run 'rm -rf *'
- Run 'kubectl get pods -n postgres-operator'.
below is the result:
NAME READY STATUS RESTARTS AGE
hippo-backup-xt6v-k445d 0/1 Completed 0 36m
hippo-instance1-24sl-0 4/4 Running 0 21m
hippo-instance1-9n7g-0 4/4 Running 0 14m
hippo-instance1-l7h5-0 3/4 Running 0 19m
hippo-repo-host-0 2/2 Running 0 36m
pgo-7784d579df-glpz6 1/1 Running 0 47m
- Find the replica that is not ready (hippo-instance1-l7h5-0 shows 3/4) and check the cluster state from inside it:
run 'kubectl exec -it hippo-instance1-l7h5-0 -n postgres-operator -c database -- /bin/sh', then run 'patronictl list' in that shell
below is the result:
[root@k8s-master postgres]# kubectl exec -it hippo-instance1-l7h5-0 -n postgres-operator -c database -- /bin/sh
sh-4.4$ patronictl list
+ Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
| hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader | running | 5 | |
| hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming | 5 | 0 |
| hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped | | unknown |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
sh-4.4$
- Go to the replica's PV and remove $PGDATA:
1) run 'kubectl get pv | grep l7h5'
below is the result:
pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5 1Gi RWO Delete Bound postgres-operator/hippo-instance1-l7h5-pgdata hostpath 39m
2) run 'kubectl get pv pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5 -o yaml'
below is the result:
................
hostPath:
  path: /var/lib/data/local-path-provisioner/pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5_postgres-operator_hippo-instance1-l7h5-pgdata
3) run 'cd /var/lib/data/local-path-provisioner/pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5_postgres-operator_hippo-instance1-l7h5-pgdata'
4) run 'rm -rf pg16'
- Check the Postgres cluster status:
run 'patronictl list'
below is the result:
sh-4.4$ patronictl list
+ Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
| hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader | running | 5 | |
| hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming | 5 | 0 |
| hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped | | unknown |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
sh-4.4$
EXPECTED
- the state of member 'hippo-instance1-l7h5-0' should be streaming.
ACTUAL
- the state of member 'hippo-instance1-l7h5-0' is stopped.
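The expected/actual check above can be scripted. Below is a minimal sketch, assuming the table layout shown in the `patronictl list` output in this report; the helper names `parse_patronictl_list` and `unhealthy_members` are hypothetical, and treating only `running` and `streaming` as healthy states is an assumption made for this report, not a Patroni rule.

```python
# Sketch: flag members that are neither "running" nor "streaming".
# Assumes the ASCII table layout of `patronictl list` as pasted in this issue.

SAMPLE = """\
+ Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
| hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader | running | 5 | |
| hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming | 5 | 0 |
| hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped | | unknown |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
"""

HEALTHY_STATES = ("running", "streaming")  # assumption for this report


def parse_patronictl_list(text):
    """Turn the table into a list of dicts keyed by the header cells."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, shell prompts, and the "+ Cluster:" title
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first "|" line is the column header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def unhealthy_members(text):
    """Return the names of members whose State is not considered healthy."""
    return [
        row["Member"]
        for row in parse_patronictl_list(text)
        if row["State"] not in HEALTHY_STATES
    ]


if __name__ == "__main__":
    print(unhealthy_members(SAMPLE))
```

Run against the output above, this reports only 'hippo-instance1-l7h5-0', which matches the ACTUAL result.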
Logs
2025-03-04 06:47:35,702 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:35,720 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg16' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/repo1/backup/db/backup.info' or '/pgbackrest/repo1/backup/db/backup.info.copy':
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info' for read
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info.copy' for read
HINT: backup.info cannot be opened and is required to perform a backup.
HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2025-03-04 06:47:35,755 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg16_wal' '--type=standby' exited with code=75
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:35,783 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:35,783 WARNING: Trying again in 5 seconds
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:40,817 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:40,818 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
2025-03-04 06:47:40,818 INFO: Removing data directory: /pgdata/pg16
2025-03-04 06:47:45,702 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:45,702 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg16' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/repo1/backup/db/backup.info' or '/pgbackrest/repo1/backup/db/backup.info.copy':
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info' for read
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info.copy' for read
HINT: backup.info cannot be opened and is required to perform a backup.
HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2025-03-04 06:47:45,728 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg16_wal' '--type=standby' exited with code=75
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:45,755 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:45,755 WARNING: Trying again in 5 seconds
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:50,796 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:50,796 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
2025-03-04 06:47:50,797 INFO: Removing data directory: /pgdata/pg16
2025-03-04 06:47:55,703 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:55,704 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
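The log repeats two distinct failures: the pgbackrest restore finds no backup set (the repo contents were deleted), and pg_basebackup refuses to write into '/pgdata/pg16_wal' because it is not empty (only 'pg16' was removed). A minimal sketch for counting both in a captured log follows; the pattern names and the `summarize` helper are hypothetical, and the regular expressions only match the message text seen in this report.

```python
import re

# Three lines copied from the log above, used as sample input.
LOG = """\
ERROR: [075]: no backup set found to restore
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
"""

# Hypothetical labels mapped to the exact messages seen in this issue.
PATTERNS = {
    "pgbackrest_no_backup": re.compile(
        r"ERROR: \[075\]: no backup set found to restore"
    ),
    "basebackup_waldir_not_empty": re.compile(
        r'pg_basebackup: error: directory "[^"]+" exists but is not empty'
    ),
}


def summarize(log_text):
    """Count how many log lines match each known failure message."""
    counts = {name: 0 for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    print(summarize(LOG))
```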
Additional Information
- cat /etc/patroni/~postgres-operator_cluster.yaml
# Generated by postgres-operator. DO NOT EDIT.
# Your changes will not be saved.
ctl:
  cacert: /etc/patroni/~postgres-operator/patroni.ca-roots
  certfile: /etc/patroni/~postgres-operator/patroni.crt+key
  insecure: false
  keyfile: null
kubernetes:
  labels:
    postgres-operator.crunchydata.com/cluster: hippo
  namespace: postgres-operator
  role_label: postgres-operator.crunchydata.com/role
  scope_label: postgres-operator.crunchydata.com/patroni
  use_endpoints: true
postgresql:
  authentication:
    replication:
      sslcert: /tmp/replication/tls.crt
      sslkey: /tmp/replication/tls.key
      sslmode: verify-ca
      sslrootcert: /tmp/replication/ca.crt
      username: _crunchyrepl
    rewind:
      sslcert: /tmp/replication/tls.crt
      sslkey: /tmp/replication/tls.key
      sslmode: verify-ca
      sslrootcert: /tmp/replication/ca.crt
      username: _crunchyrepl
restapi:
  cafile: /etc/patroni/~postgres-operator/patroni.ca-roots
  certfile: /etc/patroni/~postgres-operator/patroni.crt+key
  keyfile: null
  verify_client: optional
scope: hippo-ha
watchdog:
  mode: "off"
- cat /etc/patroni/~postgres-operator_instance.yaml
# Generated by postgres-operator. DO NOT EDIT.
# Your changes will not be saved.
kubernetes: {}
postgresql:
  basebackup:
  - waldir=/pgdata/pg16_wal
  create_replica_methods:
  - pgbackrest
  - basebackup
  pgbackrest:
    command: '''bash'' ''-ceu'' ''--'' ''install --directory --mode=0700 "${PGDATA?}"
      && exec "$@"'' ''-'' ''pgbackrest'' ''restore'' ''--delta'' ''--stanza=db''
      ''--repo=1'' ''--link-map=pg_wal=/pgdata/pg16_wal'' ''--type=standby'''
    keep_data: true
    no_master: true
    no_params: true
  pgpass: /tmp/.pgpass
  use_unix_socket: true
restapi: {}
tags: {}
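For reference, 'create_replica_methods' lists 'pgbackrest' first and 'basebackup' second, which matches the order of the failures in the logs. The sketch below is a simplified model of that fallback, not Patroni's actual code: the `create_replica` function and the stub runners are hypothetical, the exit codes are the ones from the logs above, and the retry of pg_basebackup after 5 seconds is left out.

```python
# Simplified model: try each replica creation method in the configured order
# and stop at the first one that exits with 0. Not Patroni's implementation.

METHODS = ["pgbackrest", "basebackup"]  # order taken from the instance config

# Hypothetical stubs that return the exit codes seen in the logs above.
RUNNERS = {
    "pgbackrest": lambda: 75,  # "no backup set found to restore"
    "basebackup": lambda: 1,   # "/pgdata/pg16_wal exists but is not empty"
}


def create_replica(methods, runners):
    """Return (success, attempts), where attempts is a list of (method, exit_code)."""
    attempts = []
    for method in methods:
        exit_code = runners[method]()
        attempts.append((method, exit_code))
        if exit_code == 0:
            return True, attempts
    return False, attempts


if __name__ == "__main__":
    print(create_replica(METHODS, RUNNERS))
```

With both stubs failing, every method is attempted and the result is a failure, which is the loop the replica is stuck in.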