
databases keep restarting after the upgrade from 5.3 to 5.7.2 #4069

Open
@batulziiy

Description


Questions

I upgraded postgres-operator from 5.2 to 5.7.2 yesterday, and it seemed to be working fine afterwards: I was able to connect to the database and run queries. However, this morning I found that the clusters keep restarting after running normally for ~1-2 minutes. I performed the exact same upgrade in two different environments; one works fine while the other keeps restarting.

The only difference I found is the Kubernetes version: the cluster running v1.26.3+k3s1 works fine, while the issue occurs on the cluster running v1.25.4+k3s1. I'm not sure whether that makes much of a difference.

In the pgo pod log, I can see the following error:

time="2025-01-13T09:11:21Z" level=error msg="Reconciler error" PostgresCluster=postgres-operator/hippo controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster error="Operation cannot be fulfilled on Pod \"hippo-instance1-f9rv-0\": the ResourceVersion in the precondition (1043902053) does not match the ResourceVersion in record (1043902088). The object might have been modified" file="internal/controller/postgrescluster/instance.go:879" func="postgrescluster.(*Reconciler).rolloutInstance" name=hippo namespace=postgres-operator reconcileID=6f692192-9b74-4921-891e-ac91a86b00ea version=5.7.2-0

Also, in the database pod log I see exit code 137, which is a bit strange since I have enough memory available on the cluster.

2025-01-13 09:19:49.116 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 09:19:59.121 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 09:20:09.127 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 09:20:14.758 UTC [98] LOG:  checkpoint complete: wrote 90256 buffers (2.9%); 0 WAL file(s) added, 0 removed, 58 recycled; write=159.616 s, sync=0.007 s, total=159.711 s; sync files=70, longest=0.004 s, average=0.001 s; distance=695145 kB, estimate=695145 kB
2025-01-13 09:20:14.759 UTC [98] LOG:  checkpoint starting: immediate force wait
2025-01-13 09:20:15.001 UTC [98] LOG:  checkpoint complete: wrote 11836 buffers (0.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.233 s, sync=0.005 s, total=0.243 s; sync files=10, longest=0.003 s, average=0.001 s; distance=100283 kB, estimate=635659 kB
2025-01-13 09:20:15.417 UTC [95] LOG:  received fast shutdown request
2025-01-13 09:20:15.418 UTC [95] LOG:  aborting any active transactions
2025-01-13 09:20:15.418 UTC [752] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [211] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [210] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [181] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [187] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [179] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [175] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [177] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [173] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [169] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [171] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [167] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [161] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [115] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [163] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.419 UTC [165] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.420 UTC [95] LOG:  background worker "logical replication launcher" (PID 131) exited with exit code 1
2025-01-13 09:20:15.427 UTC [98] LOG:  shutting down
2025-01-13 09:20:15.456 UTC [98] LOG:  checkpoint starting: shutdown immediate
2025-01-13 09:20:15.477 UTC [98] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.015 s, sync=0.001 s, total=0.022 s; sync files=0, longest=0.000 s, average=0.000 s; distance=7386 kB, estimate=572832 kB
2025-01-13 09:20:16.278 UTC [95] LOG:  database system is shut down
command terminated with exit code 137
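From what I understand, exit code 137 means the container received SIGKILL (128 + 9), which is usually either an OOM kill or the kubelet force-killing the container after the termination grace period when the pod is deleted. Since the log above shows Postgres handling a fast shutdown and reporting "database system is shut down" before the 137, I suspect the pod deletion during the operator's rollout rather than an OOM, but to rule out an OOM kill I intend to check the container's last terminated state with something like this (pod and container names from my cluster; adjust as needed):

  # Reason/exit code of the previous termination of the "database" container
  kubectl -n postgres-operator get pod hippo-instance1-f9rv-0 \
    -o jsonpath='{.status.containerStatuses[?(@.name=="database")].lastState.terminated}'

  # Same information in human-readable form
  kubectl -n postgres-operator describe pod hippo-instance1-f9rv-0 | grep -A 6 'Last State'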

Has anyone had the same experience? I would appreciate it if you could share your thoughts here. Thanks.

Environment

Please provide the following details:

  • Platform: Kubernetes
  • Platform Version: v1.25.4+k3s1
  • PGO Image Tag: postgres-operator:ubi8-5.7.2-0
  • Postgres Version: 15
  • Storage: Longhorn
