What happened:
I have a new baremetal k8s cluster on my desk. Due to a power failure it rebooted twice. I didn't check the stolon status after the first reboot. If it is important I can extract logs from Elastic. After the second reboot the stolon cluster didn't come up.
I was able to bring it up by deleting the sentinel pods one by one. This behaviour is strange and should be addressed.
Honestly, I have no idea why it broke and why it recovered. Since I have a few production installations of stolon, I'm very interested in tracking down the source so you can fix it.
Image: 127.0.0.1:5000/stolon, built with a Dockerfile (Helm vars and the Dockerfile itself omitted here).
Failure state
stolon-keeper-0
2020-10-30T08:36:13.599Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon-replication-cred/pg_repl_password", "mode": "01000000777"}
2020-10-30T08:36:13.600Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon-superuser-cred/pg_su_password", "mode": "01000000777"}
2020-10-30T08:36:13.600Z INFO cmd/keeper.go:2039 exclusive lock on data dir taken
2020-10-30T08:36:13.605Z INFO cmd/keeper.go:525 keeper uid {"uid": "keeper0"}
2020-10-30T08:36:13.642Z ERROR cmd/keeper.go:673 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:36:13.653Z INFO cmd/keeper.go:1047 our db boot UID is different than the cluster data one, waiting for it to be updated {"bootUUID": "69b3fc3f-7914-4a4e-8d3b-afdf443c6d18", "clusterBootUUID": "bda6e01c-139e-4be4-96fa-4fbe003c9db3"}
stolon-keeper-1
2020-10-30T08:31:59.938Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon-replication-cred/pg_repl_password", "mode": "01000000777"}
2020-10-30T08:31:59.947Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon-superuser-cred/pg_su_password", "mode": "01000000777"}
2020-10-30T08:31:59.948Z INFO cmd/keeper.go:2039 exclusive lock on data dir taken
2020-10-30T08:31:59.960Z INFO cmd/keeper.go:525 keeper uid {"uid": "keeper1"}
2020-10-30T08:32:04.963Z ERROR cmd/keeper.go:765 error retrieving cluster data {"error": "failed to get latest version of configmap: Get https://10.12.0.1:443/api/v1/namespaces/db/configmaps/stolon-cluster-stolon?timeout=5s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:04.974Z ERROR cmd/keeper.go:673 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:07.474Z ERROR cmd/keeper.go:673 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:09.973Z ERROR cmd/keeper.go:995 error retrieving cluster data {"error": "failed to get latest version of configmap: Get https://10.12.0.1:443/api/v1/namespaces/db/configmaps/stolon-cluster-stolon?timeout=5s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:09.975Z ERROR cmd/keeper.go:673 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:10.058Z ERROR cmd/keeper.go:824 failed to update keeper info {"error": "update failed: failed to get latest version of pod: Get https://10.12.0.1:443/api/v1/namespaces/db/pods/stolon-keeper-1?timeout=5s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:12.476Z ERROR cmd/keeper.go:673 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:14.977Z ERROR cmd/keeper.go:673 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:15.008Z INFO cmd/keeper.go:1047 our db boot UID is different than the cluster data one, waiting for it to be updated {"bootUUID": "d686cc6e-f513-4dba-a8c1-84a98759018e", "clusterBootUUID": "4a2b3274-b174-4733-817d-9c45d79f6479"}
For the sentinels I got only the last 3 log lines on my terminal, but if it is important I can extract the startup messages from Elastic.
kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-bjpt7
E1030 10:45:50.732489 1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
2020-10-30T10:45:53.992Z INFO cmd/sentinel.go:1830 no cluster data available, waiting for it to appear
E1030 10:45:54.538881 1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-s2k68
2020-10-30T10:46:02.512Z INFO cmd/sentinel.go:1830 no cluster data available, waiting for it to appear
E1030 10:46:05.094403 1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
2020-10-30T10:46:07.514Z INFO cmd/sentinel.go:1830 no cluster data available, waiting for it to appear
kubectl -n db logs --tail=3 stolon-proxy-756cb878f-j6v9d
2020-10-30T10:46:17.615Z INFO cmd/proxy.go:203 no clusterdata available, closing connections to master
2020-10-30T10:46:22.616Z INFO cmd/proxy.go:203 no clusterdata available, closing connections to master
2020-10-30T10:46:27.618Z INFO cmd/proxy.go:203 no clusterdata available, closing connections to master
root@stolon-sentinel-85969d666d-s2k68:/# stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap status
failed to get leader election record: the server could not find the requested resource (get configmaps stolon-cluster-stolon)
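In case it helps the debugging, a minimal sketch of the checks I would run at this point (namespace db and cluster name stolon as in my setup; adjust as needed) to tell whether the configmap is really missing or the configmaps API itself is misbehaving:
kubectl -n db get configmap stolon-cluster-stolon -o yaml    # does the object exist and what does the API return for it?
kubectl get --raw /api/v1/namespaces/db/configmaps | head -c 300    # is the plain configmaps API answering at all?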
(blind) Recovery steps
kubectl -n db delete pod stolon-sentinel-85969d666d-bjpt7
pod "stolon-sentinel-85969d666d-bjpt7" deleted
kubectl -n db logs stolon-sentinel-85969d666d-dtqc5
2020-10-30T11:11:41.213Z INFO cmd/sentinel.go:1964 sentinel uid {"uid": "012634c9"}
2020-10-30T11:11:41.229Z INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership
I1030 11:11:41.231909 1 leaderelection.go:242] attempting to acquire leader lease db/stolon-cluster-stolon...
I1030 11:11:58.717346 1 leaderelection.go:252] successfully acquired lease db/stolon-cluster-stolon
2020-10-30T11:11:58.717Z INFO cmd/sentinel.go:89 sentinel leadership acquired
E1030 11:12:06.857376 1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on configmaps "stolon-cluster-stolon": the object has been modified; please apply your changes to the latest version and try again
kubectl -n db logs stolon-sentinel-85969d666d-s2k68 --tail
(no changes)
Delete the second sentinel:
kubectl -n db delete pod stolon-sentinel-85969d666d-s2k68
pod "stolon-sentinel-85969d666d-s2k68" deleted
kubectl -n db logs stolon-sentinel-85969d666d-pd2vx
2020-10-30T11:12:41.266Z INFO cmd/sentinel.go:1964 sentinel uid {"uid": "ed08f2bc"}
2020-10-30T11:12:41.275Z INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership
I1030 11:12:41.275991 1 leaderelection.go:242] attempting to acquire leader lease db/stolon-cluster-stolon...
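A possible way to check which sentinel currently owns the lease (assuming the client-go configmap lock, which as far as I know keeps the leader-election record in an annotation on the cluster configmap):
kubectl -n db get configmap stolon-cluster-stolon -o yaml | grep control-plane.alpha.kubernetes.io/leader    # shows holderIdentity and the lease timestamps if the record exists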
Partial logs of stolon-keeper-0
2020-10-30T11:12:06.438Z INFO cmd/keeper.go:1047 our db boot UID is different than the cluster data one, waiting for it to be updated {"bootUUID": "69b3fc3f-7914-4a4e-8d3b-afdf443c6d18", "clusterBootUUID": "bda6e01c-139e-4be4-96fa-4fbe003c9db3"}
2020-10-30T11:12:08.668Z ERROR cmd/keeper.go:673 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T11:12:11.169Z ERROR cmd/keeper.go:673 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T11:12:11.511Z INFO cmd/keeper.go:1457 our db requested role is master
2020-10-30T11:12:11.542Z INFO postgresql/postgresql.go:319 starting database
2020-10-30 11:12:11.591 UTC [3761] LOG: listening on IPv4 address "10.12.73.96", port 5432
2020-10-30 11:12:11.594 UTC [3761] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-30 11:12:11.844 UTC [3762] LOG: database system was interrupted; last known up at 2020-10-30 08:22:10 UTC
2020-10-30 11:12:11.851 UTC [3764] FATAL: the database system is starting up
2020-10-30 11:12:11.861 UTC [3763] FATAL: the database system is starting up
2020-10-30 11:12:11.872 UTC [3765] FATAL: the database system is starting up
2020-10-30 11:12:11.881 UTC [3766] FATAL: the database system is starting up
2020-10-30 11:12:11.969 UTC [3762] LOG: database system was not properly shut down; automatic recovery in progress
2020-10-30 11:12:11.974 UTC [3762] LOG: redo starts at 0/29475B58
2020-10-30 11:12:11.975 UTC [3762] LOG: invalid record length at 0/29477D18: wanted 24, got 0
2020-10-30 11:12:11.975 UTC [3762] LOG: redo done at 0/29477CC0
2020-10-30 11:12:11.975 UTC [3762] LOG: last completed transaction was at log time 2020-10-30 08:22:28.71837+00
2020-10-30 11:12:11.992 UTC [3767] FATAL: the database system is starting up
2020-10-30 11:12:12.037 UTC [3761] LOG: database system is ready to accept connections
2020-10-30T11:12:12.079Z INFO cmd/keeper.go:1493 already master
2020-10-30T11:12:12.111Z INFO cmd/keeper.go:1626 postgres parameters not changed
kubectl -n db get configmaps stolon-cluster-stolon -o yaml
What you expected to happen:
Cluster should survive a sudden reboot without losses.

How to reproduce it (as minimally and precisely as possible):
Play around with the power source ¯\_(ツ)_/¯

Anything else we need to know?:

Environment:

@baznikin since in our tests a full k8s API restart always worked, can you provide a simple reproducer of this issue so we can debug it and add a related test case?
Hmm, it's hard, this failure was unintentional. The only way to recreate it would be to run similar virtual machines at a smaller scale on a workstation and abort them. It was not a restart, it was a hard crash.
Is it possible to extract some information from the configmap and correlate it with the sentinel and keeper complaints?
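If a reproducer is really needed, a rough sketch of how I would try to recreate the crash on a workstation (libvirt VMs assumed, node names below are placeholders; virsh destroy is a hard power-off, not a clean shutdown, and in my case the failure only showed up after the second crash):
for vm in k8s-node-0 k8s-node-1 k8s-node-2; do virsh destroy "$vm"; done    # kill all node VMs at once, like a power failure
for vm in k8s-node-0 k8s-node-1 k8s-node-2; do virsh start "$vm"; done      # power them back on
kubectl -n db get pods -w    # watch whether the stolon pods converge or get stuck as above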