replica node db failed to start after removing its $PGDATA #4117

Open
@lynlvcheng

Overview

Created a Postgres cluster with three instances (one leader, two replicas). After the pgBackRest repo contents and then one replica's $PGDATA were removed, that replica fails to start.

Environment

Please provide the following details:

  • Platform: (Kubernetes)
  • Platform Version: (v1.23.7)
  • PGO Image Tag: (e.g. ubi8-5.6.0-0)
  • Postgres Version (e.g. 16)
  • Storage: (hostpath)

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

  1. run 'kubectl apply -k kustomize/postgres'

  2. run 'kubectl get pv | grep repo'

  3. run 'kubectl get pv pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281 -o yaml'

    below is the result:
    .............
    the hostPath is:
    hostPath:
      path: /var/lib/data/local-path-provisioner/pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281_postgres-operator_hippo-repo1
    ..............

  4. delete everything under the repo's hostPath:
    run 'cd /var/lib/data/local-path-provisioner/pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281_postgres-operator_hippo-repo1'
    run 'rm -rf *'

  5. run 'kubectl get pods -n postgres-operator'
    below is the result:
    NAME                      READY   STATUS      RESTARTS   AGE
    hippo-backup-xt6v-k445d   0/1     Completed   0          36m
    hippo-instance1-24sl-0    4/4     Running     0          21m
    hippo-instance1-9n7g-0    4/4     Running     0          14m
    hippo-instance1-l7h5-0    3/4     Running     0          19m
    hippo-repo-host-0         2/2     Running     0          36m
    pgo-7784d579df-glpz6      1/1     Running     0          47m

  6. find a replica node
    run 'kubectl exec -it hippo-instance1-l7h5-0 -n postgres-operator -c database -- /bin/sh'
    below is the result:
    [root@k8s-master postgres]# kubectl exec -it hippo-instance1-l7h5-0 -n postgres-operator -c database -- /bin/sh
    sh-4.4$ patronictl list
    + Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
    | Member                 | Host                              | Role    | State     | TL | Lag in MB |
    +------------------------+-----------------------------------+---------+-----------+----+-----------+
    | hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader  | running   |  5 |           |
    | hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming |  5 |         0 |
    | hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped   |    |   unknown |
    +------------------------+-----------------------------------+---------+-----------+----+-----------+
    sh-4.4$

  7. go to the replica's pv, and remove $PGDATA

    1. run 'kubectl get pv | grep l7h5'
      below is the result:
      pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5   1Gi   RWO   Delete   Bound   postgres-operator/hippo-instance1-l7h5-pgdata   hostpath   39m

    2. run 'kubectl get pv pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5 -o yaml'
      below is the result:
      ................
      hostPath:
        path: /var/lib/data/local-path-provisioner/pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5_postgres-operator_hippo-instance1-l7h5-pgdata

    3. run 'cd /var/lib/data/local-path-provisioner/pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5_postgres-operator_hippo-instance1-l7h5-pgdata'

    4. run 'rm -rf pg16'

  8. check postgres cluster status
    run 'patronictl list'
    below is the result:
    sh-4.4$ patronictl list
    + Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
    | Member                 | Host                              | Role    | State     | TL | Lag in MB |
    +------------------------+-----------------------------------+---------+-----------+----+-----------+
    | hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader  | running   |  5 |           |
    | hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming |  5 |         0 |
    | hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped   |    |   unknown |
    +------------------------+-----------------------------------+---------+-----------+----+-----------+
    sh-4.4$
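
For anyone who wants to check the member states from a script rather than by eye, here is a minimal Python sketch that parses the `patronictl list` table shown in step 8. The function names and the choice of "running" and "streaming" as the healthy states are assumptions made for this illustration, not part of Patroni or PGO.

```python
def parse_patronictl_list(text):
    """Parse the ASCII table printed by `patronictl list` into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Header and data rows start with '|'; border rows start with '+'.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def unhealthy_members(rows, ok_states=("running", "streaming")):
    """Return the names of members whose State is not in ok_states (an assumed set)."""
    return [row["Member"] for row in rows if row["State"] not in ok_states]


# Table copied from step 8 of this report.
SAMPLE = """\
+ Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
| Member                 | Host                              | Role    | State     | TL | Lag in MB |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
| hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader  | running   |  5 |           |
| hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming |  5 |         0 |
| hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped   |    |   unknown |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
"""

if __name__ == "__main__":
    print(unhealthy_members(parse_patronictl_list(SAMPLE)))
```

Run against the table above, this reports only hippo-instance1-l7h5-0, the member whose $PGDATA was removed.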

EXPECTED

  1. the member 'hippo-instance1-l7h5-0' state should be streaming.

ACTUAL

  1. the member 'hippo-instance1-l7h5-0' state is stopped.

Logs

2025-03-04 06:47:35,702 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:35,720 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg16' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/repo1/backup/db/backup.info' or '/pgbackrest/repo1/backup/db/backup.info.copy':
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info' for read
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info.copy' for read
HINT: backup.info cannot be opened and is required to perform a backup.
HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2025-03-04 06:47:35,755 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg16_wal' '--type=standby' exited with code=75
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:35,783 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:35,783 WARNING: Trying again in 5 seconds
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:40,817 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:40,818 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
2025-03-04 06:47:40,818 INFO: Removing data directory: /pgdata/pg16
2025-03-04 06:47:45,702 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:45,702 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg16' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/repo1/backup/db/backup.info' or '/pgbackrest/repo1/backup/db/backup.info.copy':
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info' for read
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info.copy' for read
HINT: backup.info cannot be opened and is required to perform a backup.
HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2025-03-04 06:47:45,728 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg16_wal' '--type=standby' exited with code=75
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:45,755 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:45,755 WARNING: Trying again in 5 seconds
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:50,796 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:50,796 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
2025-03-04 06:47:50,797 INFO: Removing data directory: /pgdata/pg16
2025-03-04 06:47:55,703 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:55,704 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
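
The log above repeats the same cycle every 10 seconds, and two separate failures are visible in it: the pgbackrest restore exits with code 75 because backup.info is missing from the repo (the repo was emptied in step 4), and pg_basebackup then fails because /pgdata/pg16_wal still exists and is not empty (only pg16 was removed in step 7). The Python sketch below counts these failure types in a log excerpt. The regular expressions are copied from the messages above; the labels and function name are this sketch's own wording.

```python
import re

# (label, pattern) pairs; the patterns match messages seen in the Logs section.
PATTERNS = [
    ("pgbackrest: no backup set in repo",
     re.compile(r"ERROR: \[075\]: no backup set found to restore")),
    ("pg_basebackup: WAL directory not empty",
     re.compile(r'pg_basebackup: error: directory "[^"]+" exists but is not empty')),
    ("patroni: bootstrap from leader failed",
     re.compile(r"ERROR: failed to bootstrap from leader")),
]


def summarize(log_text):
    """Count how many log lines match each known failure pattern."""
    counts = {label: 0 for label, _ in PATTERNS}
    for line in log_text.splitlines():
        for label, pattern in PATTERNS:
            if pattern.search(line):
                counts[label] += 1
    return counts


# A few lines taken from the first retry cycle in the Logs section.
SAMPLE_LOG = """\
ERROR: [075]: no backup set found to restore
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:35,783 ERROR: Error when fetching backup: pg_basebackup exited with code=1
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
2025-03-04 06:47:40,818 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
"""

if __name__ == "__main__":
    for label, count in summarize(SAMPLE_LOG).items():
        print(f"{count:3d}  {label}")
```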

Additional Information

  1. cat /etc/patroni/~postgres-operator_cluster.yaml

    # Generated by postgres-operator. DO NOT EDIT.
    # Your changes will not be saved.
    ctl:
      cacert: /etc/patroni/~postgres-operator/patroni.ca-roots
      certfile: /etc/patroni/~postgres-operator/patroni.crt+key
      insecure: false
      keyfile: null
    kubernetes:
      labels:
        postgres-operator.crunchydata.com/cluster: hippo
      namespace: postgres-operator
      role_label: postgres-operator.crunchydata.com/role
      scope_label: postgres-operator.crunchydata.com/patroni
      use_endpoints: true
    postgresql:
      authentication:
        replication:
          sslcert: /tmp/replication/tls.crt
          sslkey: /tmp/replication/tls.key
          sslmode: verify-ca
          sslrootcert: /tmp/replication/ca.crt
          username: _crunchyrepl
        rewind:
          sslcert: /tmp/replication/tls.crt
          sslkey: /tmp/replication/tls.key
          sslmode: verify-ca
          sslrootcert: /tmp/replication/ca.crt
          username: _crunchyrepl
    restapi:
      cafile: /etc/patroni/~postgres-operator/patroni.ca-roots
      certfile: /etc/patroni/~postgres-operator/patroni.crt+key
      keyfile: null
      verify_client: optional
    scope: hippo-ha
    watchdog:
      mode: "off"

  2. cat /etc/patroni/~postgres-operator_instance.yaml

    # Generated by postgres-operator. DO NOT EDIT.
    # Your changes will not be saved.
    kubernetes: {}
    postgresql:
      basebackup:
      - waldir=/pgdata/pg16_wal
      create_replica_methods:
      - pgbackrest
      - basebackup
      pgbackrest:
        command: '''bash'' ''-ceu'' ''--'' ''install --directory --mode=0700 "${PGDATA?}"
          && exec "$@"'' ''-'' ''pgbackrest'' ''restore'' ''--delta'' ''--stanza=db''
          ''--repo=1'' ''--link-map=pg_wal=/pgdata/pg16_wal'' ''--type=standby'''
        keep_data: true
        no_master: true
        no_params: true
      pgpass: /tmp/.pgpass
      use_unix_socket: true
    restapi: {}
    tags: {}
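
To relate the instance config to the Logs section: `create_replica_methods` is an ordered list, and Patroni tries each method in turn until one succeeds. The Python sketch below is a simplified model of that fallback, not Patroni's actual code. The function name and the stub runners are hypothetical, the stubs simply return the exit codes seen in the logs (75 for pgbackrest, 1 for pg_basebackup), and the "Trying again in 5 seconds" retry of basebackup is left out.

```python
def create_replica(methods, runners):
    """Try each replica-creation method in order.

    Returns (name_of_first_successful_method_or_None, list_of_attempts),
    where each attempt is a (method_name, exit_code) tuple.
    """
    attempts = []
    for name in methods:
        exit_code = runners[name]()
        attempts.append((name, exit_code))
        if exit_code == 0:
            return name, attempts
    return None, attempts


# Order taken from create_replica_methods in the instance config above.
METHODS = ["pgbackrest", "basebackup"]

# Hypothetical stubs standing in for the real commands; the exit codes
# are the ones reported in the Logs section.
STUB_RUNNERS = {
    "pgbackrest": lambda: 75,  # "no backup set found to restore"
    "basebackup": lambda: 1,   # "/pgdata/pg16_wal" exists but is not empty
}

if __name__ == "__main__":
    used, attempts = create_replica(METHODS, STUB_RUNNERS)
    print("method used:", used)
    print("attempts:", attempts)
```

In this model both methods fail, so no replica is created, which matches the repeating "failed to bootstrap from leader" message in the logs.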
