Description
Questions
I did a postgres-operator version upgrade from 5.2 to 5.7.2 yesterday and it seemed to be working fine afterwards; I was able to connect to the database and run queries. However, this morning I found that the clusters keep restarting after running normally for ~1-2 minutes. I did the exact same upgrade in 2 different environments: one works fine while the other keeps restarting.
The only difference I found is the Kubernetes version: the cluster on v1.26.3+k3s1 works fine, while the issue occurs on the cluster running v1.25.4+k3s1. I'm not sure whether that makes much of a difference.
In the pgo pod log, I can see the following error being generated:
time="2025-01-13T09:11:21Z" level=error msg="Reconciler error" PostgresCluster=postgres-operator/hippo controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster error="Operation cannot be fulfilled on Pod \"hippo-instance1-f9rv-0\": the ResourceVersion in the precondition (1043902053) does not match the ResourceVersion in record (1043902088). The object might have been modified" file="internal/controller/postgrescluster/instance.go:879" func="postgrescluster.(*Reconciler).rolloutInstance" name=hippo namespace=postgres-operator reconcileID=6f692192-9b74-4921-891e-ac91a86b00ea version=5.7.2-0
Also, the pod terminates with exit code 137, which is a bit strange since there is plenty of memory available on the cluster (see the checks sketched after the log below).
2025-01-13 09:19:49.116 UTC [95] LOG: received SIGHUP, reloading configuration files
2025-01-13 09:19:59.121 UTC [95] LOG: received SIGHUP, reloading configuration files
2025-01-13 09:20:09.127 UTC [95] LOG: received SIGHUP, reloading configuration files
2025-01-13 09:20:14.758 UTC [98] LOG: checkpoint complete: wrote 90256 buffers (2.9%); 0 WAL file(s) added, 0 removed, 58 recycled; write=159.616 s, sync=0.007 s, total=159.711 s; sync files=70, longest=0.004 s, average=0.001 s; distance=695145 kB, estimate=695145 kB
2025-01-13 09:20:14.759 UTC [98] LOG: checkpoint starting: immediate force wait
2025-01-13 09:20:15.001 UTC [98] LOG: checkpoint complete: wrote 11836 buffers (0.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.233 s, sync=0.005 s, total=0.243 s; sync files=10, longest=0.003 s, average=0.001 s; distance=100283 kB, estimate=635659 kB
2025-01-13 09:20:15.417 UTC [95] LOG: received fast shutdown request
2025-01-13 09:20:15.418 UTC [95] LOG: aborting any active transactions
2025-01-13 09:20:15.418 UTC [752] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [211] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [210] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [181] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [187] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [179] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [175] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [177] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [173] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [169] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [171] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [167] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [161] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [115] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [163] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.419 UTC [165] FATAL: terminating connection due to administrator command
2025-01-13 09:20:15.420 UTC [95] LOG: background worker "logical replication launcher" (PID 131) exited with exit code 1
2025-01-13 09:20:15.427 UTC [98] LOG: shutting down
2025-01-13 09:20:15.456 UTC [98] LOG: checkpoint starting: shutdown immediate
2025-01-13 09:20:15.477 UTC [98] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.015 s, sync=0.001 s, total=0.022 s; sync files=0, longest=0.000 s, average=0.000 s; distance=7386 kB, estimate=572832 kB
2025-01-13 09:20:16.278 UTC [95] LOG: database system is shut down
command terminated with exit code 137
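Exit code 137 means the container was killed with SIGKILL (128 + 9), which typically points at either the OOM killer or Kubernetes terminating the container (for example after a failed liveness probe or during the rollout above). Here is a rough sketch of how I am checking which one it is; the container name "database" is an assumption based on PGO's instance pod layout, and the pod name is taken from the log above:

# Show the last termination state (reason OOMKilled vs. Error) of the Postgres container.
kubectl -n postgres-operator get pod hippo-instance1-f9rv-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="database")].lastState.terminated}'

# Look at probe failures, kill events, and restarts around the same time window.
kubectl -n postgres-operator describe pod hippo-instance1-f9rv-0
kubectl -n postgres-operator get events --sort-by=.lastTimestamp

If the terminated state shows reason OOMKilled, the container's memory limit is being hit even though the node itself has free memory; otherwise the SIGKILL is more likely coming from Kubernetes or the operator restarting the pod.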
Have you ever had the same experience? I would appreciate it if you could share your thoughts here, thanks.
Environment
- Platform: Kubernetes
- Platform Version: v1.25.4+k3s1
- PGO Image Tag: postgres-operator:ubi8-5.7.2-0
- Postgres Version: 15
- Storage: Longhorn