
Cluster down after all k8s nodes rebooted #806

Open
baznikin opened this issue Oct 30, 2020 · 3 comments

@baznikin

What happened:
I have a new bare-metal k8s cluster on my desk. Due to a power failure it rebooted twice. I didn't check stolon's status after the first reboot; if it is important, I can extract the logs from Elastic. After the second reboot the stolon cluster didn't come up.

I was able to bring it up by deleting the sentinel pods one by one. This behaviour is strange and has to be addressed.

Honestly, I have no idea why it broke or why it recovered. Since I have a few production installations of stolon, I'm very interested in tracking down the source so you can fix it.

Helm vars

#   helm install stolon stable/stolon --version 1.6.1 --namespace db -f deploy/helm/stolon-postgresql.yaml
image:
  repository: 127.0.0.1:5000/stolon
  tag: "v0.16.0-pg11"
pgParameters:
  max_connections: "800"
  shared_buffers: 5GB
  effective_cache_size: 15GB
  maintenance_work_mem: 1280MB
  checkpoint_completion_target: "0.9"
  wal_buffers: 16MB
  default_statistics_target: "100"
  random_page_cost: "1.1"
  effective_io_concurrency: "300"
  work_mem: 3276kB
  min_wal_size: 2GB
  max_wal_size: 8GB
  max_worker_processes: "8"
  max_parallel_workers_per_gather: "4"
  max_parallel_workers: "8"
  max_parallel_maintenance_workers: "4"
persistence:
  storageClassName: local-storage
  size: 20G
keeper:
  resources:
    requests:
      cpu: 2000m
      memory: 2Gi
superuserSecret:
  name: superuser-cred
replicationSecret:
  name: replication-cred

serviceMonitor:
  enabled: true

Image 127.0.0.1:5000/stolon is built with this Dockerfile:

FROM sorintlab/stolon:v0.16.0-pg11
RUN localedef -i ru_RU -c -f UTF-8 -A /usr/share/locale/locale.alias ru_RU.UTF-8

Failure state

kubectl -n db get pod

NAME                               READY   STATUS    RESTARTS   AGE
stolon-keeper-0                    1/1     Running   4          3d18h
stolon-keeper-1                    1/1     Running   1          3h6m
stolon-proxy-756cb878f-j6v9d       1/1     Running   1          3h11m
stolon-proxy-756cb878f-vc2vz       1/1     Running   1          3h11m
stolon-sentinel-85969d666d-bjpt7   1/1     Running   1          3h11m
stolon-sentinel-85969d666d-s2k68   1/1     Running   1          3h11m

stolon-keeper-0

2020-10-30T08:36:13.599Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-replication-cred/pg_repl_password", "mode": "01000000777"}
2020-10-30T08:36:13.600Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-superuser-cred/pg_su_password", "mode": "01000000777"}
2020-10-30T08:36:13.600Z	INFO	cmd/keeper.go:2039	exclusive lock on data dir taken
2020-10-30T08:36:13.605Z	INFO	cmd/keeper.go:525	keeper uid	{"uid": "keeper0"}
2020-10-30T08:36:13.642Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:36:13.653Z	INFO	cmd/keeper.go:1047	our db boot UID is different than the cluster data one, waiting for it to be updated	{"bootUUID": "69b3fc3f-7914-4a4e-8d3b-afdf443c6d18", "clusterBootUUID": "bda6e01c-139e-4be4-96fa-4fbe003c9db3"}

stolon-keeper-1

2020-10-30T08:31:59.938Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-replication-cred/pg_repl_password", "mode": "01000000777"}
2020-10-30T08:31:59.947Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-superuser-cred/pg_su_password", "mode": "01000000777"}
2020-10-30T08:31:59.948Z	INFO	cmd/keeper.go:2039	exclusive lock on data dir taken
2020-10-30T08:31:59.960Z	INFO	cmd/keeper.go:525	keeper uid	{"uid": "keeper1"}
2020-10-30T08:32:04.963Z	ERROR	cmd/keeper.go:765	error retrieving cluster data	{"error": "failed to get latest version of configmap: Get https://10.12.0.1:443/api/v1/namespaces/db/configmaps/stolon-cluster-stolon?timeout=5s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:04.974Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:07.474Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:09.973Z	ERROR	cmd/keeper.go:995	error retrieving cluster data	{"error": "failed to get latest version of configmap: Get https://10.12.0.1:443/api/v1/namespaces/db/configmaps/stolon-cluster-stolon?timeout=5s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:09.975Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:10.058Z	ERROR	cmd/keeper.go:824	failed to update keeper info	{"error": "update failed: failed to get latest version of pod: Get https://10.12.0.1:443/api/v1/namespaces/db/pods/stolon-keeper-1?timeout=5s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:12.476Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:14.977Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:15.008Z	INFO	cmd/keeper.go:1047	our db boot UID is different than the cluster data one, waiting for it to be updated	{"bootUUID": "d686cc6e-f513-4dba-a8c1-84a98759018e", "clusterBootUUID": "4a2b3274-b174-4733-817d-9c45d79f6479"}

For the sentinels I got only the last 3 log lines in my terminal, but if it is important, I can extract the startup messages from Elastic.

kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-bjpt7

E1030 10:45:50.732489       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
2020-10-30T10:45:53.992Z	INFO	cmd/sentinel.go:1830	no cluster data available, waiting for it to appear
E1030 10:45:54.538881       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)

kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-s2k68

2020-10-30T10:46:02.512Z	INFO	cmd/sentinel.go:1830	no cluster data available, waiting for it to appear
E1030 10:46:05.094403       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
2020-10-30T10:46:07.514Z	INFO	cmd/sentinel.go:1830	no cluster data available, waiting for it to appear

kubectl -n db logs --tail=3 stolon-proxy-756cb878f-j6v9d

2020-10-30T10:46:17.615Z	INFO	cmd/proxy.go:203	no clusterdata available, closing connections to master
2020-10-30T10:46:22.616Z	INFO	cmd/proxy.go:203	no clusterdata available, closing connections to master
2020-10-30T10:46:27.618Z	INFO	cmd/proxy.go:203	no clusterdata available, closing connections to master

root@stolon-sentinel-85969d666d-s2k68:/# stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap status

failed to get leader election record: the server could not find the requested resource (get configmaps stolon-cluster-stolon)

root@stolon-sentinel-85969d666d-s2k68:/# stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap clusterdata read

nil cluster data: <nil>
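(Aside: for this "nil cluster data" state, the stolon documentation describes re-initializing the cluster from an existing keeper's data directory rather than restarting components blindly. A hedged sketch, assuming it is run from a pod with `stolonctl` and API access, and taking `keeper0` — the former master per the logs above — as the keeper to recover from:)

```shell
# Re-create cluster data from keeper0's existing data dir
# ("existing" init mode, per the stolon cluster initialization docs).
# WARNING: verify which keeper holds the most recent data first.
stolonctl --cluster-name stolon \
          --store-backend kubernetes \
          --kube-resource-kind configmap \
          init '{ "initMode": "existing", "existingConfig": { "keeperUID": "keeper0" } }'
```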

kubectl -n db get configmaps stolon-cluster-stolon -o yaml

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"4a5ed418","leaseDurationSeconds":15,"acquireTime":"2020-10-30T07:32:14Z","renewTime":"2020-10-30T08:22:29Z","leaderTransitions":4}'
    stolon-clusterdata: '{"formatVersion":1,"changeTime":"2020-10-30T08:22:28.33612147Z","cluster":{"uid":"99e55388","generation":1,"changeTime":"2020-10-30T07:33:18.825516594Z","spec":{"additionalWalSenders":null,"additionalMasterReplicationSlots":null,"initMode":"new","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"automaticPgRestart":null},"status":{"phase":"normal","master":"0db3798e"}},"keepers":{"keeper0":{"uid":"keeper0","generation":1,"changeTime":"2020-10-30T08:22:28.33659473Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T08:22:28.323974995Z","bootUUID":"bda6e01c-139e-4be4-96fa-4fbe003c9db3","postgresBinaryVersion":{"Maj":11,"Min":7}}},"keeper1":{"uid":"keeper1","generation":1,"changeTime":"2020-10-30T08:22:28.336588896Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T08:22:28.323975862Z","bootUUID":"4a2b3274-b174-4733-817d-9c45d79f6479","postgresBinaryVersion":{"Maj":11,"Min":7}}}},"dbs":{"0db3798e":{"uid":"0db3798e","generation":4,"changeTime":"2020-10-30T08:22:28.336414945Z","spec":{"keeperUID":"keeper0","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"master","followers":["f1fd6966"],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":4,"listenAddress":"10.12.73.74","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":692549936,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000020000000000000021"}},"f1fd6966":{"uid":"f1fd6966","generation":2,"changeTime":"2020-10-30T08:22:28.336579423Z","spec":{"keeperUID":"keeper1","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"standby","followConfig":{"type":"internal","dbuid":"0db3798e"},"followers":[],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":2,"listenAddress":"10.12.99.199","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":692549936,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000030000000000000028"}}},"proxy":{"generation":13,"changeTime":"2020-10-30T07:33:23.96407502Z","spec":{"masterDbUid":"0db3798e","enabledProxies":["029c5918","5ce0e683"]},"status":{}}}'
  creationTimestamp: "2020-10-16T07:38:59Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:control-plane.alpha.kubernetes.io/leader: {}
          f:stolon-clusterdata: {}
    manager: stolon-sentinel
    operation: Update
    time: "2020-10-30T08:22:29Z"
  name: stolon-cluster-stolon
  namespace: db
  resourceVersion: "14867071"
  selfLink: /api/v1/namespaces/db/configmaps/stolon-cluster-stolon
  uid: 13d4175d-7930-4632-8c77-0ab583f94e33

(blind) Recovery steps

  1. delete one sentinel

kubectl -n db delete pod stolon-sentinel-85969d666d-bjpt7

pod "stolon-sentinel-85969d666d-bjpt7" deleted

kubectl -n db logs stolon-sentinel-85969d666d-dtqc5

2020-10-30T11:11:41.213Z	INFO	cmd/sentinel.go:1964	sentinel uid	{"uid": "012634c9"}
2020-10-30T11:11:41.229Z	INFO	cmd/sentinel.go:82	Trying to acquire sentinels leadership
I1030 11:11:41.231909       1 leaderelection.go:242] attempting to acquire leader lease  db/stolon-cluster-stolon...
I1030 11:11:58.717346       1 leaderelection.go:252] successfully acquired lease db/stolon-cluster-stolon
2020-10-30T11:11:58.717Z	INFO	cmd/sentinel.go:89	sentinel leadership acquired
E1030 11:12:06.857376       1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on configmaps "stolon-cluster-stolon": the object has been modified; please apply your changes to the latest version and try again

kubectl -n db logs stolon-sentinel-85969d666d-s2k68 --tail
(no changes)

  2. delete second sentinel

kubectl -n db delete pod stolon-sentinel-85969d666d-s2k68

pod "stolon-sentinel-85969d666d-s2k68" deleted

kubectl -n db logs stolon-sentinel-85969d666d-pd2vx

2020-10-30T11:12:41.266Z	INFO	cmd/sentinel.go:1964	sentinel uid	{"uid": "ed08f2bc"}
2020-10-30T11:12:41.275Z	INFO	cmd/sentinel.go:82	Trying to acquire sentinels leadership
I1030 11:12:41.275991       1 leaderelection.go:242] attempting to acquire leader lease  db/stolon-cluster-stolon...

Partial logs of stolon-keeper-0

2020-10-30T11:12:06.438Z	INFO	cmd/keeper.go:1047	our db boot UID is different than the cluster data one, waiting for it to be updated	{"bootUUID": "69b3fc3f-7914-4a4e-8d3b-afdf443c6d18", "clusterBootUUID": "bda6e01c-139e-4be4-96fa-4fbe003c9db3"}
2020-10-30T11:12:08.668Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T11:12:11.169Z	ERROR	cmd/keeper.go:673	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T11:12:11.511Z	INFO	cmd/keeper.go:1457	our db requested role is master
2020-10-30T11:12:11.542Z	INFO	postgresql/postgresql.go:319	starting database
2020-10-30 11:12:11.591 UTC [3761] LOG:  listening on IPv4 address "10.12.73.96", port 5432
2020-10-30 11:12:11.594 UTC [3761] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-30 11:12:11.844 UTC [3762] LOG:  database system was interrupted; last known up at 2020-10-30 08:22:10 UTC
2020-10-30 11:12:11.851 UTC [3764] FATAL:  the database system is starting up
2020-10-30 11:12:11.861 UTC [3763] FATAL:  the database system is starting up
2020-10-30 11:12:11.872 UTC [3765] FATAL:  the database system is starting up
2020-10-30 11:12:11.881 UTC [3766] FATAL:  the database system is starting up
2020-10-30 11:12:11.969 UTC [3762] LOG:  database system was not properly shut down; automatic recovery in progress
2020-10-30 11:12:11.974 UTC [3762] LOG:  redo starts at 0/29475B58
2020-10-30 11:12:11.975 UTC [3762] LOG:  invalid record length at 0/29477D18: wanted 24, got 0
2020-10-30 11:12:11.975 UTC [3762] LOG:  redo done at 0/29477CC0
2020-10-30 11:12:11.975 UTC [3762] LOG:  last completed transaction was at log time 2020-10-30 08:22:28.71837+00
2020-10-30 11:12:11.992 UTC [3767] FATAL:  the database system is starting up
2020-10-30 11:12:12.037 UTC [3761] LOG:  database system is ready to accept connections
2020-10-30T11:12:12.079Z	INFO	cmd/keeper.go:1493	already master
2020-10-30T11:12:12.111Z	INFO	cmd/keeper.go:1626	postgres parameters not changed

kubectl -n db get configmaps stolon-cluster-stolon -o yaml

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"91aa7090","leaseDurationSeconds":15,"acquireTime":"2020-10-30T11:15:41Z","renewTime":"2020-10-30T11:36:54Z","leaderTransitions":6}'
    stolon-clusterdata: '{"formatVersion":1,"changeTime":"2020-10-30T11:36:52.818078244Z","cluster":{"uid":"99e55388","generation":1,"changeTime":"2020-10-30T07:33:18.825516594Z","spec":{"additionalWalSenders":null,"additionalMasterReplicationSlots":null,"initMode":"new","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"automaticPgRestart":null},"status":{"phase":"normal","master":"0db3798e"}},"keepers":{"keeper0":{"uid":"keeper0","generation":1,"changeTime":"2020-10-30T11:36:52.818190214Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T11:36:52.811756433Z","bootUUID":"69b3fc3f-7914-4a4e-8d3b-afdf443c6d18","postgresBinaryVersion":{"Maj":11,"Min":7}}},"keeper1":{"uid":"keeper1","generation":1,"changeTime":"2020-10-30T11:36:52.818192157Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T11:36:52.811757286Z","bootUUID":"d686cc6e-f513-4dba-a8c1-84a98759018e","postgresBinaryVersion":{"Maj":11,"Min":7}}}},"dbs":{"0db3798e":{"uid":"0db3798e","generation":4,"changeTime":"2020-10-30T11:36:52.818162694Z","spec":{"keeperUID":"keeper0","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"master","followers":["f1fd6966"],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":4,"listenAddress":"10.12.73.96","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":696785736,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000020000000000000021"}},"f1fd6966":{"uid":"f1fd6966","generation":2,"changeTime":"2020-10-30T11:36:52.818188051Z","spec":{"keeperUID":"keeper1","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"standby","followConfig":{"type":"internal","dbuid":"0db3798e"},"followers":[],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":2,"listenAddress":"10.12.99.218","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":696785736,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000030000000000000028"}}},"proxy":{"generation":17,"changeTime":"2020-10-30T11:16:15.29551675Z","spec":{"masterDbUid":"0db3798e","enabledProxies":["52052d64"]},"status":{}}}'
  creationTimestamp: "2020-10-16T07:38:59Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:control-plane.alpha.kubernetes.io/leader: {}
          f:stolon-clusterdata: {}
    manager: stolon-sentinel
    operation: Update
    time: "2020-10-30T11:36:54Z"
  name: stolon-cluster-stolon
  namespace: db
  resourceVersion: "14940223"
  selfLink: /api/v1/namespaces/db/configmaps/stolon-cluster-stolon
  uid: 13d4175d-7930-4632-8c77-0ab583f94e33
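The blind recovery above amounts to restarting the sentinel pods one at a time and letting a fresh leader election repopulate the cluster data. Roughly, as a sketch (the label selector `app=stolon-sentinel` and deployment name are assumptions that depend on the chart):

```shell
# Delete sentinel pods one by one; wait for the Deployment to
# recreate each pod before deleting the next.
for pod in $(kubectl -n db get pods -l app=stolon-sentinel -o name); do
  kubectl -n db delete "$pod"
  kubectl -n db rollout status deployment/stolon-sentinel
done
```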

What you expected to happen:
The cluster should survive a sudden reboot without losses.

How to reproduce it (as minimally and precisely as possible):
Play around with the power source ¯\_(ツ)_/¯

Anything else we need to know?:

Environment:

  • Stolon version: v0.16.0-pg11
  • Stolon running environment (if useful to understand the bug): k8s baremetal, installed with Helm
  • Others: I just noticed I wrote Helm chart 1.6.1 above, but my actual chart version is 1.6.0
@baznikin baznikin added the bug label Oct 30, 2020
@sgotti
Member

sgotti commented Oct 30, 2020

@baznikin since in our tests a full k8s API restart always worked, can you provide a simple reproducer of this issue so we can debug it and add a related test case?

@baznikin
Author

baznikin commented Oct 30, 2020 via email

@3tem

3tem commented Nov 13, 2020

I have encountered this problem, but only sporadically. I have been trying to reproduce it, without success; it happened when my servers rebooted.
