Cluster down after all k8s nodes rebooted #806

baznikin · 2020-10-30T12:00:14Z

What happened:
I have new baremetal k8s cluster on my desk. Due power failure it rebooted twice. I didn't check stolon status after first reboot. If it is important I can extract logs from Elastic. After second reboot stolon cluster didn't came up.

I was able to bring it up by deleting sentinel pods one by one. This behaviour is strange and have to be addressed.

Honestly, I have no idea why it break and why it recovered. Since I have few production instalations stolon I'm very interested to track source so you can fix it.

Helm vars

#   helm install stolon stable/stolon --version 1.6.1 --namespace db -f deploy/helm/stolon-postgresql.yaml
image:
  repository: 127.0.0.1:5000/stolon
  tag: "v0.16.0-pg11"
pgParameters:
  max_connections: "800"
  shared_buffers: 5GB
  effective_cache_size: 15GB
  maintenance_work_mem: 1280MB
  checkpoint_completion_target: "0.9"
  wal_buffers: 16MB
  default_statistics_target: "100"
  random_page_cost: "1.1"
  effective_io_concurrency: "300"
  work_mem: 3276kB
  min_wal_size: 2GB
  max_wal_size: 8GB
  max_worker_processes: "8"
  max_parallel_workers_per_gather: "4"
  max_parallel_workers: "8"
  max_parallel_maintenance_workers: "4"
persistence:
  storageClassName: local-storage
  size: 20G
keeper:
  resources:
    requests:
      cpu: 2000m
      memory: 2Gi
superuserSecret:
  name: superuser-cred
replicationSecret:
  name: replication-cred

serviceMonitor:
  enabled: true

Image 127.0.0.1:5000/stolon build with Dockerfile:

FROM sorintlab/stolon:v0.16.0-pg11
RUN localedef -i ru_RU -c -f UTF-8 -A /usr/share/locale/locale.alias ru_RU.UTF-8

Failure state

kubectl -n db get pod

NAME                               READY   STATUS    RESTARTS   AGE
stolon-keeper-0                    1/1     Running   4          3d18h
stolon-keeper-1                    1/1     Running   1          3h6m
stolon-proxy-756cb878f-j6v9d       1/1     Running   1          3h11m
stolon-proxy-756cb878f-vc2vz       1/1     Running   1          3h11m
stolon-sentinel-85969d666d-bjpt7   1/1     Running   1          3h11m
stolon-sentinel-85969d666d-s2k68   1/1     Running   1          3h11m

stolon-keeper-0

2020-10-30T08:36:13.599Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-replication-cred/pg_repl_password", "mode": "01000000777"}
2020-10-30T08:36:13.600Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-superuser-cred/pg_su_password", "mode": "01000000777"}
2020-10-30T08:36:13.600Z	INFO	cmd/keeper.go:2039	exclusive lock on data dir taken
2020-10-30T08:36:13.605Z	INFO	cmd/keeper.go:525	keeper uid	{"uid": "keeper0"}
2020-10-30T08:36:13.642Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:36:13.653Z	INFO	cmd/keeper.go:1047	our db boot UID is different than the cluster data one, waiting for it to be updated	{"bootUUID": "69b3fc3f-7914-4a4e-8d3b-afdf443c6d18", "clusterBootUUID": "bda6e01c-139e-4be4-96fa-4fbe003c9db3"}

stolon-keeper-1

2020-10-30T08:31:59.938Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-replication-cred/pg_repl_password", "mode": "01000000777"}
2020-10-30T08:31:59.947Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-superuser-cred/pg_su_password", "mode": "01000000777"}
2020-10-30T08:31:59.948Z	INFO	cmd/keeper.go:2039	exclusive lock on data dir taken
2020-10-30T08:31:59.960Z	INFO	cmd/keeper.go:525	keeper uid	{"uid": "keeper1"}
2020-10-30T08:32:04.963Z	ERROR	cmd/keeper.go:765	error retrieving cluster data	{"error": "failed to get latest version of configmap: Get https://10.12.0.1:443/api/v1/namespaces/db/configmaps/stolon-cluster-stolon?timeout=5s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:04.974Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:07.474Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:09.973Z	ERROR	cmd/keeper.go:995	error retrieving cluster data	{"error": "failed to get latest version of configmap: Get https://10.12.0.1:443/api/v1/namespaces/db/configmaps/stolon-cluster-stolon?timeout=5s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:09.975Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:10.058Z	ERROR	cmd/keeper.go:824	failed to update keeper info	{"error": "update failed: failed to get latest version of pod: Get https://10.12.0.1:443/api/v1/namespaces/db/pods/stolon-keeper-1?timeout=5s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:12.476Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:14.977Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:15.008Z	INFO	cmd/keeper.go:1047	our db boot UID is different than the cluster data one, waiting for it to be updated	{"bootUUID": "d686cc6e-f513-4dba-a8c1-84a98759018e", "clusterBootUUID": "4a2b3274-b174-4733-817d-9c45d79f6479"}

For sentenels I got only last 3 log lines on my terminal, but if is important - I can extract startup messages from Elastic.

kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-bjpt7

E1030 10:45:50.732489       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
2020-10-30T10:45:53.992Z	INFO	cmd/sentinel.go:1830	no cluster data available, waiting for it to appear
E1030 10:45:54.538881       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)

kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-s2k68

2020-10-30T10:46:02.512Z	INFO	cmd/sentinel.go:1830	no cluster data available, waiting for it to appear
E1030 10:46:05.094403       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
2020-10-30T10:46:07.514Z	INFO	cmd/sentinel.go:1830	no cluster data available, waiting for it to appear

kubectl -n db logs --tail=3 stolon-proxy-756cb878f-j6v9d

2020-10-30T10:46:17.615Z	INFO	cmd/proxy.go:203	no clusterdata available, closing connections to master
2020-10-30T10:46:22.616Z	INFO	cmd/proxy.go:203	no clusterdata available, closing connections to master
2020-10-30T10:46:27.618Z	INFO	cmd/proxy.go:203	no clusterdata available, closing connections to master

root@stolon-sentinel-85969d666d-s2k68:/# stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap status

failed to get leader election record: the server could not find the requested resource (get configmaps stolon-cluster-stolon)

root@stolon-sentinel-85969d666d-s2k68:/# stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap clusterdata read

nil cluster data: <nil>

kubectl -n db get configmaps stolon-cluster-stolon -o yaml

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"4a5ed418","leaseDurationSeconds":15,"acquireTime":"2020-10-30T07:32:14Z","renewTime":"2020-10-30T08:22:29Z","leaderTransitions":4}'
    stolon-clusterdata: '{"formatVersion":1,"changeTime":"2020-10-30T08:22:28.33612147Z","cluster":{"uid":"99e55388","generation":1,"changeTime":"2020-10-30T07:33:18.825516594Z","spec":{"additionalWalSenders":null,"additionalMasterReplicationSlots":null,"initMode":"new","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"automaticPgRestart":null},"status":{"phase":"normal","master":"0db3798e"}},"keepers":{"keeper0":{"uid":"keeper0","generation":1,"changeTime":"2020-10-30T08:22:28.33659473Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T08:22:28.323974995Z","bootUUID":"bda6e01c-139e-4be4-96fa-4fbe003c9db3","postgresBinaryVersion":{"Maj":11,"Min":7}}},"keeper1":{"uid":"keeper1","generation":1,"changeTime":"2020-10-30T08:22:28.336588896Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T08:22:28.323975862Z","bootUUID":"4a2b3274-b174-4733-817d-9c45d79f6479","postgresBinaryVersion":{"Maj":11,"Min":7}}}},"dbs":{"0db3798e":{"uid":"0db3798e","generation":4,"changeTime":"2020-10-30T08:22:28.336414945Z","spec":{"keeperUID":"keeper0","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"master","followers":["f1fd6966"],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":4,"listenAddress":"10.12.73.74","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":692549936,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000020000000000000021"}},"f1fd6966":{"uid":"f1fd6966","generation":2,"changeTime":"2020-10-30T08:22:28.336579423Z","spec":{"keeperUID":"keeper1","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"standby","followConfig":{"type":"internal","dbuid":"0db3798e"},"followers":[],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":2,"listenAddress":"10.12.99.199","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":692549936,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000030000000000000028"}}},"proxy":{"generation":13,"changeTime":"2020-10-30T07:33:23.96407502Z","spec":{"masterDbUid":"0db3798e","enabledProxies":["029c5918","5ce0e683"]},"status":{}}}'
  creationTimestamp: "2020-10-16T07:38:59Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:control-plane.alpha.kubernetes.io/leader: {}
          f:stolon-clusterdata: {}
    manager: stolon-sentinel
    operation: Update
    time: "2020-10-30T08:22:29Z"
  name: stolon-cluster-stolon
  namespace: db
  resourceVersion: "14867071"
  selfLink: /api/v1/namespaces/db/configmaps/stolon-cluster-stolon
  uid: 13d4175d-7930-4632-8c77-0ab583f94e33

(blind) Recovery steppes

delete one sentinel

kubectl -n db delete pod stolon-sentinel-85969d666d-bjpt7

pod "stolon-sentinel-85969d666d-bjpt7" deleted

kubectl -n db logs stolon-sentinel-85969d666d-dtqc5

2020-10-30T11:11:41.213Z	INFO	cmd/sentinel.go:1964	sentinel uid	{"uid": "012634c9"}
2020-10-30T11:11:41.229Z	INFO	cmd/sentinel.go:82	Trying to acquire sentinels leadership
I1030 11:11:41.231909       1 leaderelection.go:242] attempting to acquire leader lease  db/stolon-cluster-stolon...
I1030 11:11:58.717346       1 leaderelection.go:252] successfully acquired lease db/stolon-cluster-stolon
2020-10-30T11:11:58.717Z	INFO	cmd/sentinel.go:89	sentinel leadership acquired
E1030 11:12:06.857376       1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on configmaps "stolon-cluster-stolon": the object has been modified; please apply your changes to the latest version and try again

kubectl -n db logs stolon-sentinel-85969d666d-s2k68 --tail
(no changes)

delete second sentinel

kubectl -n db delete pod stolon-sentinel-85969d666d-s2k68

pod "stolon-sentinel-85969d666d-s2k68" deleted

kubectl -n db logs stolon-sentinel-85969d666d-pd2vx

2020-10-30T11:12:41.266Z	INFO	cmd/sentinel.go:1964	sentinel uid	{"uid": "ed08f2bc"}
2020-10-30T11:12:41.275Z	INFO	cmd/sentinel.go:82	Trying to acquire sentinels leadership
I1030 11:12:41.275991       1 leaderelection.go:242] attempting to acquire leader lease  db/stolon-cluster-stolon...

Part logs of stolon-keeper-0

2020-10-30T11:12:06.438Z	INFO	cmd/keeper.go:1047	our db boot UID is different than the cluster data one, waiting for it to be updated	{"bootUUID": "69b3fc3f-7914-4a4e-8d3b-afdf443c6d18", "clusterBootUUID": "bda6e01c-139e-4be4-96fa-4fbe003c9db3"}
2020-10-30T11:12:08.668Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T11:12:11.169Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T11:12:11.511Z	INFO	cmd/keeper.go:1457	our db requested role is master
2020-10-30T11:12:11.542Z	INFO	postgresql/postgresql.go:319	starting database
2020-10-30 11:12:11.591 UTC [3761] LOG:  listening on IPv4 address "10.12.73.96", port 5432
2020-10-30 11:12:11.594 UTC [3761] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-30 11:12:11.844 UTC [3762] LOG:  database system was interrupted; last known up at 2020-10-30 08:22:10 UTC
2020-10-30 11:12:11.851 UTC [3764] FATAL:  the database system is starting up
2020-10-30 11:12:11.861 UTC [3763] FATAL:  the database system is starting up
2020-10-30 11:12:11.872 UTC [3765] FATAL:  the database system is starting up
2020-10-30 11:12:11.881 UTC [3766] FATAL:  the database system is starting up
2020-10-30 11:12:11.969 UTC [3762] LOG:  database system was not properly shut down; automatic recovery in progress
2020-10-30 11:12:11.974 UTC [3762] LOG:  redo starts at 0/29475B58
2020-10-30 11:12:11.975 UTC [3762] LOG:  invalid record length at 0/29477D18: wanted 24, got 0
2020-10-30 11:12:11.975 UTC [3762] LOG:  redo done at 0/29477CC0
2020-10-30 11:12:11.975 UTC [3762] LOG:  last completed transaction was at log time 2020-10-30 08:22:28.71837+00
2020-10-30 11:12:11.992 UTC [3767] FATAL:  the database system is starting up
2020-10-30 11:12:12.037 UTC [3761] LOG:  database system is ready to accept connections
2020-10-30T11:12:12.079Z	INFO	cmd/keeper.go:1493	already master
2020-10-30T11:12:12.111Z	INFO	cmd/keeper.go:1626	postgres parameters not changed

kubectl -n db get configmaps stolon-cluster-stolon -o yaml

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"91aa7090","leaseDurationSeconds":15,"acquireTime":"2020-10-30T11:15:41Z","renewTime":"2020-10-30T11:36:54Z","leaderTransitions":6}'
    stolon-clusterdata: '{"formatVersion":1,"changeTime":"2020-10-30T11:36:52.818078244Z","cluster":{"uid":"99e55388","generation":1,"changeTime":"2020-10-30T07:33:18.825516594Z","spec":{"additionalWalSenders":null,"additionalMasterReplicationSlots":null,"initMode":"new","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"automaticPgRestart":null},"status":{"phase":"normal","master":"0db3798e"}},"keepers":{"keeper0":{"uid":"keeper0","generation":1,"changeTime":"2020-10-30T11:36:52.818190214Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T11:36:52.811756433Z","bootUUID":"69b3fc3f-7914-4a4e-8d3b-afdf443c6d18","postgresBinaryVersion":{"Maj":11,"Min":7}}},"keeper1":{"uid":"keeper1","generation":1,"changeTime":"2020-10-30T11:36:52.818192157Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T11:36:52.811757286Z","bootUUID":"d686cc6e-f513-4dba-a8c1-84a98759018e","postgresBinaryVersion":{"Maj":11,"Min":7}}}},"dbs":{"0db3798e":{"uid":"0db3798e","generation":4,"changeTime":"2020-10-30T11:36:52.818162694Z","spec":{"keeperUID":"keeper0","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"master","followers":["f1fd6966"],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":4,"listenAddress":"10.12.73.96","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":696785736,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000020000000000000021"}},"f1fd6966":{"uid":"f1fd6966","generation":2,"changeTime":"2020-10-30T11:36:52.818188051Z","spec":{"keeperUID":"keeper1","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"standby","followConfig":{"type":"internal","dbuid":"0db3798e"},"followers":[],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":2,"listenAddress":"10.12.99.218","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":696785736,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000030000000000000028"}}},"proxy":{"generation":17,"changeTime":"2020-10-30T11:16:15.29551675Z","spec":{"masterDbUid":"0db3798e","enabledProxies":["52052d64"]},"status":{}}}'
  creationTimestamp: "2020-10-16T07:38:59Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:control-plane.alpha.kubernetes.io/leader: {}
          f:stolon-clusterdata: {}
    manager: stolon-sentinel
    operation: Update
    time: "2020-10-30T11:36:54Z"
  name: stolon-cluster-stolon
  namespace: db
  resourceVersion: "14940223"
  selfLink: /api/v1/namespaces/db/configmaps/stolon-cluster-stolon
  uid: 13d4175d-7930-4632-8c77-0ab583f94e33

What you expected to happen:
Cluster should survive sudden reboot without looses.

How to reproduce it (as minimally and precisely as possible):
Play around with power source ¯_(ツ)_/¯

Anything else we need to know?:

Environment:

Stolon version: v0.16.0-pg11
Stolon running environment (if useful to understand the bug): k8s baremetal, installed with Helm
Others: I just notice my Helm chart for 1.6.1 and my actual stolon version is 1.6.0

The text was updated successfully, but these errors were encountered:

sgotti · 2020-10-30T15:11:45Z

@baznikin since in our tests a full k8s api restart always worked can you provide a simple reproducer of this issue so we can debug it and add a related test case?

baznikin · 2020-10-30T15:39:17Z

Hmm, it's hard, this failure was unintentional. The only solution to recreate it is to run similar virtual machines in smaller scale on workstation and abort them. It was not restart, it was hard crash. Is it possible to extract some information from configmap and correlate it with sentinel and keeper complains? пт, 30 окт. 2020 г., 22:12 Simone Gotti <notifications@github.com>:

…

@baznikin <https://github.com/baznikin> since in our tests a full k8s api restart always worked can you provide a simple reproducer of this issue so we can debug it and add a related test case? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#806 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABHROZXJU72T5A5E4JINA53SNLJUDANCNFSM4TE4M37Q> .

3tem · 2020-11-13T07:29:29Z

I have encountered this problem, but it is accidental. I have been trying to reproduce this problem, but I have not ,what happened my servers reboot.

baznikin added the bug label Oct 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cluster down after all k8s nodes rebooted #806

Cluster down after all k8s nodes rebooted #806

baznikin commented Oct 30, 2020

sgotti commented Oct 30, 2020

baznikin commented Oct 30, 2020 via email

3tem commented Nov 13, 2020

Cluster down after all k8s nodes rebooted #806

Cluster down after all k8s nodes rebooted #806

Comments

baznikin commented Oct 30, 2020

sgotti commented Oct 30, 2020

baznikin commented Oct 30, 2020 via email

3tem commented Nov 13, 2020