
Cluster initialisation fails with Permission Denied on GKE #676

Closed
to-kn opened this issue Sep 30, 2019 · 19 comments

to-kn commented Sep 30, 2019

I tried to set up a PG cluster with your provided minimal-database.yaml.

When the database pod starts, an error is thrown while initializing the database:
initdb: could not access directory "/home/postgres/pgdata/pgroot/data": Permission denied

I did a quick:
kubectl exec -it acid-test-cluster-0 -- /bin/bash
chown postgres.postgres pgdata/

and everything is fine and working (/home/postgres/pgdata had root.root as owner).
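
For reference, a one-shot form of that workaround (hedged: the container and data path are taken from the error message above and may differ in your setup):

# temporary fix only: hand the data tree back to the postgres user
kubectl exec -it acid-test-cluster-0 -- chown -R postgres:postgres /home/postgres/pgdata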


FxKu commented Oct 1, 2019

Never seen this error before. It would be good to know what your K8s environment looks like. Where do you run it? Which K8s version? Are there PodSecurityPolicies defined? etc.


FxKu commented Dec 20, 2019

Closing due to missing response. Anybody facing this problem, feel free to reopen.

@FxKu FxKu closed this as completed Dec 20, 2019

sessfeld commented Jan 6, 2020

Something similar happens to me when the pods restart after being killed. I'm running version 1.3.0 on GKE with the dirty image. I don't have any PodSecurityPolicies. The K8s version is v1.14.8-gke.12. When my pods start, this is the log:

decompressing spilo image...
2020-01-06 09:22:09,063 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2020-01-06 09:22:09,111 - bootstrapping - INFO - Looks like your running google
2020-01-06 09:22:09,125 - bootstrapping - WARNING - could not parse kubernetes labels as a JSON: Expecting value: line 1 column 1 (char 0), reverting to the default: {"application": "spilo"}
2020-01-06 09:22:09,143 - bootstrapping - INFO - Configuring certificate
2020-01-06 09:22:09,143 - bootstrapping - INFO - Generating ssl certificate
2020-01-06 09:22:09,224 - bootstrapping - INFO - Configuring patronictl
2020-01-06 09:22:09,225 - bootstrapping - INFO - Configuring patroni
2020-01-06 09:22:09,234 - bootstrapping - INFO - Writing to file /home/postgres/postgres.yml
2020-01-06 09:22:09,235 - bootstrapping - INFO - Configuring wal-e
2020-01-06 09:22:09,235 - bootstrapping - INFO - Configuring standby-cluster
2020-01-06 09:22:09,235 - bootstrapping - INFO - Configuring pgbouncer
2020-01-06 09:22:09,235 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2020-01-06 09:22:09,235 - bootstrapping - INFO - Configuring crontab
2020-01-06 09:22:09,281 - bootstrapping - INFO - Configuring bootstrap
2020-01-06 09:22:09,281 - bootstrapping - INFO - Configuring log
2020-01-06 09:22:09,281 - bootstrapping - INFO - Configuring pgqd
2020-01-06 09:22:09,281 - bootstrapping - INFO - Configuring pam-oauth2
2020-01-06 09:22:09,282 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping
2020-01-06 09:22:09,282 - bootstrapping - INFO - Configuring renice
2020-01-06 09:22:09,286 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of permissions
2020-01-06 09:22:11,078 INFO: No PostgreSQL configuration items changed, nothing to reload.
2020-01-06 09:22:11,276 WARNING: Postgresql is not running.
2020-01-06 09:22:11,276 INFO: Lock owner: None; I am orders-postgresql-0
2020-01-06 09:22:11,280 INFO: pg_controldata:
  pg_control version number: 1100
  Catalog version number: 201809051
  Database system identifier: 6778181925813633094
  Database cluster state: in archive recovery
  pg_control last modified: Sun Jan  5 01:41:45 2020
  Latest checkpoint location: 0/40000C8
  Latest checkpoint's REDO location: 0/4000058
  Latest checkpoint's REDO WAL file: 000000020000000000000004
  Latest checkpoint's TimeLineID: 2
  Latest checkpoint's PrevTimeLineID: 2
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:789
  Latest checkpoint's NextOID: 24576
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 562
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 789
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Sun Jan  5 01:41:45 2020
  Fake LSN counter for unlogged rels: 0/1
  Minimum recovery ending location: 0/4128128
  Min recovery ending loc's timeline: 2
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: d139fc53c86edda1b64b94ac0d4738d4e87c366399447ff0b567490d09eaa951

2020-01-06 09:22:11,281 INFO: Lock owner: None; I am orders-postgresql-0
2020-01-06 09:22:11,295 INFO: Lock owner: None; I am orders-postgresql-0
2020-01-06 09:22:11,554 INFO: starting as a secondary
2020-01-06 09:22:12,032 INFO: postmaster pid=59
/var/run/postgresql:5432 - no response
2020-01-06 09:22:12 UTC [59]: [1-1] 5e12fc43.3b 0     FATAL:  data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2020-01-06 09:22:12 UTC [59]: [2-1] 5e12fc43.3b 0     DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2020-01-06 09:22:13,046 ERROR: postmaster is not running
2020-01-06 09:22:13,048 WARNING: Postgresql is not running.
2020-01-06 09:22:13,048 INFO: Lock owner: None; I am orders-postgresql-0
2020-01-06 09:22:13,053 INFO: pg_controldata:

and after that it loops. I can run chmod -R 0700 on the directory, but that doesn't feel like a solution.
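
For the record, the manual workaround described above as a single command (pod name, container, and path taken from the log; purely a stopgap, not a fix):

# stopgap: tighten the data directory so the postmaster will start
kubectl exec -it orders-postgresql-0 -c postgres -- chmod -R 0700 /home/postgres/pgdata/pgroot/data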

@FxKu FxKu reopened this Jan 15, 2020

haf commented Mar 18, 2020

Same here

grafana-pg-0 postgres /var/run/postgresql:5432 - no response
grafana-pg-1 postgres 2020-03-18 13:28:52,145 ERROR: postmaster is not running
grafana-pg-1 postgres 2020-03-18 13:28:52,170 WARNING: Postgresql is not running.
grafana-pg-1 postgres 2020-03-18 13:28:52,171 INFO: Lock owner: None; I am grafana-pg-1
grafana-pg-1 postgres 2020-03-18 13:28:52,175 INFO: pg_controldata:
grafana-pg-1 postgres   pg_control version number: 1100
grafana-pg-1 postgres   Catalog version number: 201809051
grafana-pg-1 postgres   Database system identifier: 6800310494243180603
grafana-pg-1 postgres   Database cluster state: shut down in recovery
grafana-pg-1 postgres   pg_control last modified: Wed Mar  4 14:57:30 2020
grafana-pg-1 postgres   Latest checkpoint location: 0/3022060
grafana-pg-1 postgres   Latest checkpoint's REDO location: 0/3022028
grafana-pg-1 postgres   Latest checkpoint's REDO WAL file: 000000010000000000000003
grafana-pg-1 postgres   Latest checkpoint's TimeLineID: 1
grafana-pg-1 postgres   Latest checkpoint's PrevTimeLineID: 1
grafana-pg-1 postgres   Latest checkpoint's full_page_writes: on
grafana-pg-1 postgres   Latest checkpoint's NextXID: 0:782
grafana-pg-1 postgres   Latest checkpoint's NextOID: 24576
grafana-pg-1 postgres   Latest checkpoint's NextMultiXactId: 1
grafana-pg-1 postgres   Latest checkpoint's NextMultiOffset: 0
grafana-pg-1 postgres   Latest checkpoint's oldestXID: 562
grafana-pg-1 postgres   Latest checkpoint's oldestXID's DB: 1
grafana-pg-1 postgres   Latest checkpoint's oldestActiveXID: 782
grafana-pg-1 postgres   Latest checkpoint's oldestMultiXid: 1
grafana-pg-1 postgres   Latest checkpoint's oldestMulti's DB: 1
grafana-pg-1 postgres   Latest checkpoint's oldestCommitTsXid: 0
grafana-pg-1 postgres   Latest checkpoint's newestCommitTsXid: 0
grafana-pg-1 postgres   Time of latest checkpoint: Wed Mar  4 11:23:41 2020
grafana-pg-1 postgres   Fake LSN counter for unlogged rels: 0/1
grafana-pg-1 postgres   Minimum recovery ending location: 0/4000000
grafana-pg-1 postgres   Min recovery ending loc's timeline: 1
grafana-pg-1 postgres   Backup start location: 0/0
grafana-pg-1 postgres   Backup end location: 0/0
grafana-pg-1 postgres   End-of-backup record required: no
grafana-pg-1 postgres   wal_level setting: replica
grafana-pg-1 postgres   wal_log_hints setting: on
grafana-pg-1 postgres   max_connections setting: 100
grafana-pg-1 postgres   max_worker_processes setting: 8
grafana-pg-1 postgres   max_prepared_xacts setting: 0
grafana-pg-1 postgres   max_locks_per_xact setting: 64
grafana-pg-1 postgres   track_commit_timestamp setting: off
grafana-pg-1 postgres   Maximum data alignment: 8
grafana-pg-1 postgres   Database block size: 8192
grafana-pg-1 postgres   Blocks per segment of large relation: 131072
grafana-pg-1 postgres   WAL block size: 8192
grafana-pg-1 postgres   Bytes per WAL segment: 16777216
grafana-pg-1 postgres   Maximum length of identifiers: 64
grafana-pg-1 postgres   Maximum columns in an index: 32
grafana-pg-1 postgres   Maximum size of a TOAST chunk: 1996
grafana-pg-1 postgres   Size of a large-object chunk: 2048
grafana-pg-1 postgres   Date/time type storage: 64-bit integers
grafana-pg-1 postgres   Float4 argument passing: by value
grafana-pg-1 postgres   Float8 argument passing: by value
grafana-pg-1 postgres   Data page checksum version: 1
grafana-pg-1 postgres   Mock authentication nonce: 8364537b7781ec199f42bda2453047f6c5fefa226a9d3711ad8e4f0d24351e60
grafana-pg-1 postgres
grafana-pg-1 postgres 2020-03-18 13:28:52,199 INFO: Lock owner: None; I am grafana-pg-1
grafana-pg-1 postgres 2020-03-18 13:28:52,211 INFO: Lock owner: None; I am grafana-pg-1
grafana-pg-1 postgres 2020-03-18 13:28:52,212 INFO: starting as a secondary
grafana-pg-1 postgres 2020-03-18 13:28:52,224 INFO: postmaster pid=107
grafana-pg-1 postgres /var/run/postgresql:5432 - no response
grafana-pg-1 postgres 2020-03-18 13:28:52 UTC [107]: [1-1] 5e722214.6b 0     FATAL:  data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
grafana-pg-1 postgres 2020-03-18 13:28:52 UTC [107]: [2-1] 5e722214.6b 0     DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).

This happens after Istio CNI fails pod initialisation on a preemptible GKE node; force-shutting down the pg pod with k delete pod grafana-pg-1 --grace-period 0 --force then seems to leave the wrong permissions on the data folder.

@FxKu FxKu changed the title Cluster initialisation fails with Permission Denied Cluster initialisation fails with Permission Denied on GKE Mar 19, 2020

FxKu commented Mar 19, 2020

Seems to be related to GKE. What is the securityContext of the pod? Does it run in privileged mode or not? I also remember issues with readOnlyRootFilesystem set to true, but that was on OpenShift. Any ideas @CyberDem0n?
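
For anyone gathering the details asked for above, these standard kubectl queries show the effective security context and any PodSecurityPolicies (the pod name below is taken from the logs in this thread; adjust to your cluster):

# pod- and container-level securityContext
kubectl get pod orders-postgresql-0 -o jsonpath='{.spec.securityContext}'
kubectl get pod orders-postgresql-0 -o jsonpath='{.spec.containers[0].securityContext}'
# PodSecurityPolicies, if the admission controller is enabled
kubectl get psp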

@ReSearchITEng
Contributor

One may want to try this new Spilo image, which applies the appropriate chmod at startup.
Image: registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p114 # or newer

To get the latest versions of the images of both operator and spilo, do:
https://registry.opensource.zalan.do/v2/acid/postgres-operator/tags/list
https://registry.opensource.zalan.do/v2/acid/spilo-cdp-12/tags/list

It solved similar issues in OpenShift.
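
Those tags/list URLs are plain Docker Registry v2 API endpoints, so the available tags can be listed with curl, for example:

curl -s https://registry.opensource.zalan.do/v2/acid/spilo-cdp-12/tags/list
curl -s https://registry.opensource.zalan.do/v2/acid/postgres-operator/tags/list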


haf commented Apr 22, 2020

@ReSearchITEng What's the -cpd infix in the image names? I'm checking your suggestion against the config on the master branch and it's not in there?


haf commented Apr 23, 2020

These are the permissions that are causing the crash:

$ k exec -it grafana-pg-1 -- ls -lah /home/postgres/pgdata/pgroot/data
Defaulting container name to postgres.
Use 'kubectl describe pod/grafana-pg-1 -n monitoring' to see all of the containers in this pod.
total 180K
drwxrws--- 19 postgres postgres 4.0K Apr 22 11:12 .
drwxrwsr-x  4 postgres postgres 4.0K Apr 21 09:31 ..
-rw-rw----  1 postgres postgres  224 Apr 21 09:31 backup_label.old
drwxrws---  6 postgres postgres 4.0K Apr 21 09:31 base
-rw-rw----  1 postgres postgres   34 Apr 21 09:31 current_logfiles
drwxrws---  2 postgres postgres 4.0K Apr 21 10:27 global
-rw-rw----  1 postgres postgres  886 Apr 21 09:31 patroni.dynamic.json
drwxrws---  2 postgres postgres 4.0K Apr 21 09:31 pg_commit_ts
drwxrws---  2 postgres postgres 4.0K Apr 21 09:31 pg_dynshmem
-rw-rw----  1 postgres postgres  641 Apr 23 07:12 pg_hba.conf
-rw-rw----  1 postgres postgres  641 Apr 23 07:12 pg_hba.conf.backup
-rw-rw----  1 postgres postgres 1.6K Apr 21 09:31 pg_ident.conf
-rw-rw----  1 postgres postgres 1.6K Apr 23 07:12 pg_ident.conf.backup
drwxrws---  4 postgres postgres 4.0K Apr 21 17:16 pg_logical
drwxrws---  4 postgres postgres 4.0K Apr 21 09:31 pg_multixact
drwxrws---  2 postgres postgres 4.0K Apr 21 09:31 pg_notify
drwxrws---  3 postgres postgres 4.0K Apr 21 10:27 pg_replslot
drwxrws---  2 postgres postgres 4.0K Apr 21 09:31 pg_serial
drwxrws---  2 postgres postgres 4.0K Apr 21 09:31 pg_snapshots
drwxrws---  2 postgres postgres 4.0K Apr 21 17:16 pg_stat
drwxrws---  2 postgres postgres 4.0K Apr 21 17:16 pg_stat_tmp
drwxrws---  2 postgres postgres 4.0K Apr 21 09:36 pg_subtrans
drwxrws---  2 postgres postgres 4.0K Apr 21 09:31 pg_tblspc
drwxrws---  2 postgres postgres 4.0K Apr 21 09:31 pg_twophase
-rw-rw----  1 postgres postgres    3 Apr 21 09:31 PG_VERSION
drwxrws---  3 postgres postgres 4.0K Apr 21 17:16 pg_wal
drwxrws---  2 postgres postgres 4.0K Apr 21 09:31 pg_xact
-rw-rw----  1 postgres postgres   88 Apr 21 09:31 postgresql.auto.conf
-rw-rw----  1 postgres postgres  24K Apr 21 09:31 postgresql.base.conf
-rw-rw----  1 postgres postgres  24K Apr 23 07:12 postgresql.base.conf.backup
-rw-rw----  1 postgres postgres 1.8K Apr 23 07:12 postgresql.conf
-rw-rw----  1 postgres postgres 1.8K Apr 23 07:12 postgresql.conf.backup
-rw-rw----  1 postgres postgres  475 Apr 21 09:31 postmaster.opts
-rw-------  1 postgres postgres  127 Apr 23 07:12 recovery.conf
-rw-rw----  1 postgres postgres  269 Apr 21 09:31 recovery.done


haf commented Apr 23, 2020

Also

$ docker pull registry.opensource.zalan.do/acid/spilo-cpd-12:1.6-p114
Error response from daemon: manifest for registry.opensource.zalan.do/acid/spilo-cpd-12:1.6-p114 not found: manifest unknown: manifest unknown


FxKu commented Apr 23, 2020

cdp not cpd 😃

That's the acronym for our internal CI pipeline tool: continuous delivery pipeline.


haf commented Apr 23, 2020

We have a saying in Swedish: skit bakom spakarna; shit behind the levers (in this case me). 😃


haf commented Apr 27, 2020

We started running the latest cdp container docker-pullable://registry.opensource.zalan.do/acid/spilo-cdp-12@sha256:3bb9b58a6370fb091a1be3067250b6e131f0a1bfee130062ec1811505d9522f0, and one of our pgsql nodes just started dumping continuously and failing to start with this log:

grafana-pg-1 postgres 2020-04-27 08:12:20,418 INFO: Lock owner: grafana-pg-0; I am grafana-pg-1
grafana-pg-1 postgres 2020-04-27 08:12:20,419 INFO: Lock owner: grafana-pg-0; I am grafana-pg-1
grafana-pg-1 postgres 2020-04-27 08:12:20,419 WARNING: Postgresql is not running.
grafana-pg-1 postgres 2020-04-27 08:12:20,419 INFO: Lock owner: grafana-pg-0; I am grafana-pg-1
grafana-pg-1 postgres 2020-04-27 08:12:20,425 INFO: pg_controldata:
grafana-pg-1 postgres   pg_control version number: 1100
grafana-pg-1 postgres   Catalog version number: 201809051
grafana-pg-1 postgres   Database system identifier: 6810760545460228154
grafana-pg-1 postgres   Database cluster state: shut down in recovery
grafana-pg-1 postgres   pg_control last modified: Fri Apr 24 01:19:14 2020
grafana-pg-1 postgres   Latest checkpoint location: 0/5000028
grafana-pg-1 postgres   Latest checkpoint's REDO location: 0/5000028
grafana-pg-1 postgres   Latest checkpoint's REDO WAL file: 000000020000000000000005
grafana-pg-1 postgres   Latest checkpoint's TimeLineID: 2
grafana-pg-1 postgres   Latest checkpoint's PrevTimeLineID: 2
grafana-pg-1 postgres   Latest checkpoint's full_page_writes: on
grafana-pg-1 postgres   Latest checkpoint's NextXID: 0:1025
grafana-pg-1 postgres   Latest checkpoint's NextOID: 24606
grafana-pg-1 postgres   Latest checkpoint's NextMultiXactId: 1
grafana-pg-1 postgres   Latest checkpoint's NextMultiOffset: 0
grafana-pg-1 postgres   Latest checkpoint's oldestXID: 562
grafana-pg-1 postgres   Latest checkpoint's oldestXID's DB: 1
grafana-pg-1 postgres   Latest checkpoint's oldestActiveXID: 0
grafana-pg-1 postgres   Latest checkpoint's oldestMultiXid: 1
grafana-pg-1 postgres   Latest checkpoint's oldestMulti's DB: 1
grafana-pg-1 postgres   Latest checkpoint's oldestCommitTsXid: 0
grafana-pg-1 postgres   Latest checkpoint's newestCommitTsXid: 0
grafana-pg-1 postgres   Time of latest checkpoint: Thu Apr  2 15:11:56 2020
grafana-pg-1 postgres   Fake LSN counter for unlogged rels: 0/1
grafana-pg-1 postgres   Minimum recovery ending location: 0/5000098
grafana-pg-1 postgres   Min recovery ending loc's timeline: 2
grafana-pg-1 postgres   Backup start location: 0/0
grafana-pg-1 postgres   Backup end location: 0/0
grafana-pg-1 postgres   End-of-backup record required: no
grafana-pg-1 postgres   wal_level setting: replica
grafana-pg-1 postgres   wal_log_hints setting: on
grafana-pg-1 postgres   max_connections setting: 100
grafana-pg-1 postgres   max_worker_processes setting: 8
grafana-pg-1 postgres   max_prepared_xacts setting: 0
grafana-pg-1 postgres   max_locks_per_xact setting: 64
grafana-pg-1 postgres   track_commit_timestamp setting: off
grafana-pg-1 postgres   Maximum data alignment: 8
grafana-pg-1 postgres   Database block size: 8192
grafana-pg-1 postgres   Blocks per segment of large relation: 131072
grafana-pg-1 postgres   WAL block size: 8192
grafana-pg-1 postgres   Bytes per WAL segment: 16777216
grafana-pg-1 postgres   Maximum length of identifiers: 64
grafana-pg-1 postgres   Maximum columns in an index: 32
grafana-pg-1 postgres   Maximum size of a TOAST chunk: 1996
grafana-pg-1 postgres   Size of a large-object chunk: 2048
grafana-pg-1 postgres   Date/time type storage: 64-bit integers
grafana-pg-1 postgres   Float4 argument passing: by value
grafana-pg-1 postgres   Float8 argument passing: by value
grafana-pg-1 postgres   Data page checksum version: 1
grafana-pg-1 postgres   Mock authentication nonce: a7f66a0719e679d98d03ba8cbde582cfe0b21119509766dbeed4c7b2efb46c0c
grafana-pg-1 postgres
grafana-pg-1 postgres 2020-04-27 08:12:20,426 INFO: Lock owner: grafana-pg-0; I am grafana-pg-1
grafana-pg-1 postgres 2020-04-27 08:12:20,444 INFO: Local timeline=2 lsn=0/5000098
grafana-pg-1 postgres 2020-04-27 08:12:20,456 INFO: master_timeline=14
grafana-pg-1 postgres 2020-04-27 08:12:20,457 INFO: master: history=1	0/4000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 2	0/20000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 3	0/26000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 4	0/3E000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 5	0/54000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 6	0/55000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 7	0/56000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 8	0/57000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 9	0/66000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 10	0/68000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 11	0/69000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 12	0/6A000098	no recovery target specified
grafana-pg-1 postgres
grafana-pg-1 postgres 13	0/6C000098	no recovery target specified

Would you like one of the dump files as well?


haf commented Apr 27, 2020

Also, we have two developer laptops; the one with Windows Subsystem for Linux works with the latest cdp image, but on macOS:

grafana-pg-0 postgres 2020-04-27 13:34:38,333 INFO: Lock owner: None; I am grafana-pg-0
grafana-pg-0 postgres 2020-04-27 13:34:38,334 INFO: waiting for leader to bootstrap
grafana-pg-1 postgres 2020-04-27 13:34:40,561 INFO: Lock owner: None; I am grafana-pg-1
grafana-pg-1 postgres 2020-04-27 13:34:40,561 INFO: waiting for leader to bootstrap


kubectl get postgresql
NAME         TEAM      VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
grafana-pg   grafana   11        2      1Gi                                     19m   SyncFailed


Current services:
kubectl get svc -l application=spilo -L spilo-role
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE   SPILO-ROLE
grafana-pg          ClusterIP   10.106.86.246   <none>        5432/TCP   19m   master
grafana-pg-config   ClusterIP   None            <none>        <none>     19m
grafana-pg-repl     ClusterIP   10.109.92.24    <none>        5432/TCP   19m   replica


Current pods:
kubectl get pods -l application=spilo -L spilo-role
NAME           READY   STATUS    RESTARTS   AGE     SPILO-ROLE
grafana-pg-0   2/2     Running   0          2m58s
grafana-pg-1   2/2     Running   0          19m

Both Postgres instances are waiting for something.

Also this warning is new:

grafana-pg-0 postgres 2020-04-27 13:38:49,781 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"

and this one:

grafana-pg-1 postgres 2020-04-27 13:38:50,207 - bootstrapping - WARNING - could not parse kubernetes labels as a JSON: Expecting value: line 1 column 1 (char 0), reverting to the default: {"application": "spilo"}

HOWEVER: deleting the postgresql resource and adding it back in made one of them become the leader.
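
For reference, that delete/re-apply cycle looks roughly like this (the manifest filename is a placeholder; note that the operator will tear down and recreate the pods):

kubectl delete postgresql grafana-pg
kubectl apply -f grafana-pg.yaml   # your original cluster manifest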


FxKu commented Apr 28, 2020

Can you maybe raise this in the Spilo repository? The latter should already be fixed in the latest operator.

@stefanhenseler

I can confirm this issue is not fixed in the latest version of Spilo and the Operator v1.5.0.
2020-05-12 06:10:31 UTC [2910]: [1-1] 5eba3dd7.b5e 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has group or world access
2020-05-12 06:10:31 UTC [2910]: [2-1] 5eba3dd7.b5e 0 DETAIL: Permissions should be u=rwx (0700).
2020-05-12 06:10:31,781 INFO: postmaster pid=2910
/var/run/postgresql:5432 - no response

I experience the same behavior on GKE and PKS with PodSecurityPolicies enabled (privileged policy enabled). It seems to work on EKS without PodSecurityPolicies. This issue is quite a problem for us because it requires manual intervention every time the Postgres pod restarts (we have two single-node Postgres instances per namespace). To get the DB online again, the following command resolves the issue:

kubectl exec -it <pod-name> -n app-prod -c postgres -- chmod 0700 -R /home/postgres/pgdata/pgroot/data

It seems like Spilo requires zero group permissions on the data directory to start up? I tried the spilo 1.6-p114 image, but for some reason it didn't change the permissions either.
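
A quick, hedged way to check whether a pod is in this state before or after running the chmod above (same default data path assumed):

# expect "700 postgres postgres" once the permissions are correct
kubectl exec -it <pod-name> -n app-prod -c postgres -- stat -c '%a %U %G' /home/postgres/pgdata/pgroot/data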


stoetti commented Jun 8, 2020

Any news or plans regarding this issue? We started using the operator and are running into the same problem.


CyberDem0n commented Jun 8, 2020

@stoetti help us to help you:

  1. Clone https://github.com/zalando/spilo

  2. Apply the following patch:

diff --git a/postgres-appliance/launch.sh b/postgres-appliance/launch.sh
index 56d4c68..4d47f2d 100755
--- a/postgres-appliance/launch.sh
+++ b/postgres-appliance/launch.sh
@@ -18,7 +18,7 @@ fi
 sysctl -w vm.dirty_background_bytes=67108864 > /dev/null 2>&1
 sysctl -w vm.dirty_bytes=134217728 > /dev/null 2>&1
 
-mkdir -p "$PGLOG" "$RW_DIR/postgresql" "$RW_DIR/tmp" "$RW_DIR/certs"
+mkdir -p "$PGLOG" "$PGDATA" "$RW_DIR/postgresql" "$RW_DIR/tmp" "$RW_DIR/certs"
 if [ "$(id -u)" -ne 0 ]; then
     sed -e "s/^postgres:x:[^:]*:[^:]*:/postgres:x:$(id -u):$(id -g):/" /etc/passwd > "$RW_DIR/tmp/passwd"
     cat "$RW_DIR/tmp/passwd" > /etc/passwd
@@ -35,6 +35,7 @@ done
 chown -R postgres: "$PGROOT" "$RW_DIR/certs"
 chmod -R go-w "$PGROOT"
 chmod 01777 "$RW_DIR/tmp"
+chmod 0700 "$PGDATA"
 
 if [ "$DEMO" = "true" ]; then
     python3 /scripts/configure_spilo.py patroni pgqd certificate pam-oauth2
  3. Build your own image: cd postgres-appliance && docker build -t my-spilo-image . (see the end-to-end sketch below)

  4. Test the image you built and see if it helps.

  5. If it indeed helps, open a PR to Spilo with a permanent fix.
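
A rough end-to-end sketch of those steps, assuming you push the image to your own registry (the registry name and tag below are placeholders):

git clone https://github.com/zalando/spilo.git
cd spilo/postgres-appliance
# apply the launch.sh patch shown above, then build and push
docker build -t <my-registry>/spilo-patched:test .
docker push <my-registry>/spilo-patched:test
# finally, point the operator's docker_image setting (or your cluster manifest) at the new image so the pods pick it up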


stoetti commented Jun 9, 2020

@CyberDem0n thanks for the quick reply and the suggestion.

I had to update your patch, using the permission '0700' from a previous comment instead of the suggested '07000', because a permission error came up upon initial startup.

Opened a pull request zalando/spilo#447

@CyberDem0n
Contributor

'0700' instead of the suggested '07000'

Right, it was a stupid copy&paste error on my side. The permission should clearly be set to 0700, not 07000 :)
