OpenShift Installation #985

Closed
steveyang95 opened this issue May 19, 2020 · 16 comments

@steveyang95

steveyang95 commented May 19, 2020

Hi!

Is there any formal documentation, or a set of directions someone could write up, for getting set up on OpenShift?

I have tried the following without much luck:
#852 (comment)

I am running on OpenShift 4.4 and my OpenShift cluster creation log says: API v1.17.1 up

Error

  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 498, in _load_cluster                                                                                                       
    self._wait_caches()                                                                                                                                                                                     
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 492, in _wait_caches                                                                                                        
    raise RetryFailedError('Exceeded retry deadline')                                                                                                                                                       
patroni.utils.RetryFailedError: 'Exceeded retry deadline'                                                                                                                                                   
2020-05-19 05:47:47,803 ERROR: Error communicating with DCS                                                                                                                                                 
2020-05-19 05:47:47,803 INFO: DCS is not accessible                                                                                                                                                         
2020-05-19 05:47:47,805 WARNING: Loop time exceeded, rescheduling immediately.                                                                                                                              
2020-05-19 05:47:48,470 ERROR: ObjectCache.run ApiException()                                                                                                                                               
2020-05-19 05:47:48,473 ERROR: ObjectCache.run ApiException()                                                                                                                                               
2020-05-19 05:47:49,474 ERROR: ObjectCache.run ApiException()                                                                                                                                               
2020-05-19 05:47:49,475 ERROR: ObjectCache.run ApiException()                                                                                                                                               
2020-05-19 05:47:50,477 ERROR: ObjectCache.run ApiException()                                                                                                                                               
2020-05-19 05:47:50,479 ERROR: ObjectCache.run ApiException()                                                                                                                                               
2020-05-19 05:47:51,480 ERROR: ObjectCache.run ApiException()                                                                                                                                               
2020-05-19 05:47:51,482 ERROR: ObjectCache.run ApiException() 

I have also set kubernetes_use_configmaps: "true".

These are the commands that I run:

oc apply -f postgres-operator/manifests/operator-service-account-rbac.yaml
oc apply -f postgres-operator/manifests/postgres-operator.yaml
oc apply -f postgres-operator/manifests/api-service.yaml
oc apply -f postgres-operator/manifests/minimal-postgres-manifest.yaml

My configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  # additional_secret_mount: "some-secret-name"
  # additional_secret_mount_path: "/some/dir"
  api_port: "8080"
  aws_region: eu-central-1
  cluster_domain: cluster.local
  cluster_history_entries: "1000"
  cluster_labels: application:spilo
  cluster_name_label: cluster-name
  # connection_pooler_default_cpu_limit: "1"
  # connection_pooler_default_cpu_request: "500m"
  # connection_pooler_default_memory_limit: 100Mi
  # connection_pooler_default_memory_request: 100Mi
  connection_pooler_image: "registry.opensource.zalan.do/acid/pgbouncer:master-7"
  # connection_pooler_max_db_connections: 60
  # connection_pooler_mode: "transaction"
  # connection_pooler_number_of_instances: 2
  # connection_pooler_schema: "pooler"
  # connection_pooler_user: "pooler"
  # custom_service_annotations: "keyx:valuez,keya:valuea"
  # custom_pod_annotations: "keya:valuea,keyb:valueb"
  db_hosted_zone: db.example.com
  debug_logging: "true"
  # default_cpu_limit: "1"
  # default_cpu_request: 100m
  # default_memory_limit: 500Mi
  # default_memory_request: 100Mi
  docker_image: registry.opensource.zalan.do/acid/spilo-12:1.6-p3
  # downscaler_annotations: "deployment-time,downscaler/*"
  # enable_admin_role_for_users: "true"
  # enable_crd_validation: "true"
  # enable_database_access: "true"
  # enable_init_containers: "true"
  # enable_lazy_spilo_upgrade: "false"
  enable_master_load_balancer: "false"
  # enable_pod_antiaffinity: "false"
  # enable_pod_disruption_budget: "true"
  enable_replica_load_balancer: "false"
  # enable_shm_volume: "true"
  # enable_sidecars: "true"
  # enable_team_superuser: "false"
  enable_teams_api: "false"
  # etcd_host: ""
  kubernetes_use_configmaps: "true"
  # infrastructure_roles_secret_name: postgresql-infrastructure-roles
  # inherited_labels: application,environment
  # kube_iam_role: ""
  # log_s3_bucket: ""
  logical_backup_docker_image: "registry.opensource.zalan.do/acid/logical-backup"
  # logical_backup_s3_access_key_id: ""
  logical_backup_s3_bucket: "my-bucket-url"
  # logical_backup_s3_region: ""
  # logical_backup_s3_endpoint: ""
  # logical_backup_s3_secret_access_key: ""
  logical_backup_s3_sse: "AES256"
  logical_backup_schedule: "30 00 * * *"
  master_dns_name_format: "{cluster}.{team}.{hostedzone}"
  # master_pod_move_timeout: 20m
  # max_instances: "-1"
  # min_instances: "-1"
  # min_cpu_limit: 250m
  # min_memory_limit: 250Mi
  # node_readiness_label: ""
  # oauth_token_secret_name: postgresql-operator
  # pam_configuration: |
  #  https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
  # pam_role_name: zalandos
  pdb_name_format: "postgres-{cluster}-pdb"
  # pod_antiaffinity_topology_key: "kubernetes.io/hostname"
  pod_deletion_wait_timeout: 10m
  # pod_environment_configmap: "default/my-custom-config"
  pod_label_wait_timeout: 10m
  pod_management_policy: "ordered_ready"
  pod_role_label: spilo-role
  # pod_service_account_definition: ""
  pod_service_account_name: "postgres-pod"
  # pod_service_account_role_binding_definition: ""
  pod_terminate_grace_period: 5m
  # postgres_superuser_teams: "postgres_superusers"
  # protected_role_names: "admin"
  ready_wait_interval: 3s
  ready_wait_timeout: 30s
  repair_period: 5m
  replica_dns_name_format: "{cluster}-repl.{team}.{hostedzone}"
  replication_username: standby
  resource_check_interval: 3s
  resource_check_timeout: 10m
  resync_period: 30m
  ring_log_lines: "100"
  secret_name_template: "{username}.{cluster}.credentials"
  # sidecar_docker_images: ""
  # set_memory_request_to_limit: "false"
  spilo_privileged: "false"
  super_username: postgres
  # team_admin_role: "admin"
  # team_api_role_configuration: "log_statement:all"
  # teams_api_url: http://fake-teams-api.default.svc.cluster.local
  # toleration: ""
  # wal_s3_bucket: ""
  watched_namespace: "*"  # listen to all namespaces
  workers: "4"

I have also tried the following and got the same ApiException():

Operator Image should be at least: registry.opensource.zalan.do/acid/postgres-operator:v1.4.0-21-g1249626-dirty
Operator should be configured with these values:
kubernetes_use_configmaps: "true"
docker_image: registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p114 #or newer

@yaroslavkasatikov

Hello,
I totally support this.
Got the same issue on OpenShift 4.4.3.

Tried with:

kubernetes_use_configmaps: "true"
docker_image: registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p119
operator: registry.opensource.zalan.do/acid/postgres-operator:v1.5.0

All outputs are the same as @steveyang95's.

I also tried to start the cluster in privileged mode without kubernetes_use_configmaps, but got this error:

2020-05-23 13:58:32,423 ERROR: failed to update leader lock
2020-05-23 13:58:32,505 INFO: not promoting because failed to update leader lock in DCS
2020-05-23 13:58:42,369 INFO: Lock owner: acid-cluster-0; I am acid-cluster-0
2020-05-23 13:58:42,418 ERROR: Permission denied
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 288, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 634, in patch_or_create
    ret = self.retry(func, self._namespace, body) if retry else func(self._namespace, body)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 455, in retry
    return self._retry.copy()(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/utils.py", line 331, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 277, in wrapper
    return getattr(self._api, func)(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 203, in wrapper
    raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8SClient.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '025a21dc-753a-4082-b64f-c4b9689e04e7', 'Content-Type': 'application/json', 'Date': 'Sat, 23 May 2020 13:58:42 GMT', 'Content-Length': '251'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \\"acid-cluster\\" is forbidden: endpoint address 10.129.0.87 is not allowed","reason":"Forbidden","details":{"name":"acid-cluster","kind":"endpoints"},"code":403}\n'

2020-05-23 13:58:42,419 ERROR: failed to update leader lock
2020-05-23 13:58:42,419 INFO: not promoting because failed to update leader lock in DCS

So, could you please test your installation on OpenShift 4.4.3 and give feedback?

Many thanks,
Yaroslav

@FxKu
Member

FxKu commented May 25, 2020

Maybe @ReSearchITEng, you can weigh in here as our OpenShift user. I think you cannot simply take the Spilo image as is, but must make sure it runs in rootless mode.

@yaroslavkasatikov

yaroslavkasatikov commented May 31, 2020

Hey team,

Do you have any updates here?
We're looking for a PostgreSQL operator as a strategic solution and want to test yours, but OpenShift is the cornerstone of our entire infrastructure. It would be too sad if we couldn't run your postgres operator :-(

@FxKu
Member

FxKu commented Jun 3, 2020

@steveyang95 and @yaroslavkasatikov can you check if the solution described here helps? It suggests setting the spiloFSGroup parameter.

On the other hand, as per the docs this parameter is also not required for OpenShift. Maybe you can choose a previous Spilo release with the v1.5.0 operator? Then we can better tell where the incompatibility is coming from.
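
For reference, a minimal sketch of what setting it in a cluster manifest could look like (the cluster name comes from the minimal manifest above; the value 103 is an assumption matching the postgres group in Spilo images):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  # fsGroup applied to the pod, so a non-root Spilo process can write the data volume
  spiloFSGroup: 103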

@Jan-M
Member

Jan-M commented Jun 4, 2020

I would look into the privileges of the pod; it seems Patroni (within the Postgres pod) is not allowed to perform leader election.

So you may lack pod privileges to update/write the ConfigMaps which are used on OpenShift for leader election.

https://github.com/zalando/postgres-operator/blob/master/manifests/operator-service-account-rbac.yaml#L220

@CyberDem0n
Contributor

@yaroslavkasatikov on OpenShift you have to set kubernetes_use_configmaps.

When using Endpoints, Patroni tries to update the subsets with the IP address of the pod that is running as primary, and on OpenShift that is not allowed :(
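
For reference, Spilo exposes the same switch to Patroni as the KUBERNETES_USE_CONFIGMAPS environment variable (the name appears in spilo#449 later in this thread); a sketch of setting it on a plain pod template:

# pod template excerpt: make Patroni keep its state in ConfigMaps instead of Endpoints
env:
- name: KUBERNETES_USE_CONFIGMAPS
  value: "true"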

@FxKu
Member

FxKu commented Jun 5, 2020

@steveyang95 and @yaroslavkasatikov can you try extending the cluster role used by the pods, and hence Patroni, to be able to read and update ConfigMaps? I guess simply replacing endpoints here with configmaps should be fine. Can you try?
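
A sketch of such a rule for the postgres-pod cluster role (the verbs are an assumption based on what Patroni needs for leader election):

- apiGroups:
  - ""
  resources:
  - configmaps  # was: endpoints
  verbs:
  - get
  - list
  - patch
  - update
  - watch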

@ReSearchITEng
Contributor

ReSearchITEng commented Jun 11, 2020

@steveyang95 @yaroslavkasatikov
I can confirm it works on OCP 4.3, tested and working. I have not tested on 4.4 (yet).
oc version returns "kubernetes v1.16.2" (aka OpenShift 4.3).

Please enable DEBUG messages, so we'll get a better understanding of which resource OCP is rejecting.

More on the OCP 4.3 setup we use:
(On a security-hardened cluster where PV read is not allowed, after 30 min (the resync period) oc get pg will change the status from Running to "SyncFailed". The cluster still works as expected; just the resync is impacted. This should be solved by PR #958. Meanwhile you can set resync_period to some big number.)
When you install the Helm chart (1.5.0), use configTarget: "ConfigMap".

resync_period: 987654321 # some big number
spilo_privileged: "false"
kubernetes_use_configmaps: "true"
docker_image: registry.opensource.zalan.do/acid/spilo-12:1.6-p3
watched_namespace: "" # if you want one operator per namespace

When you install the cluster, make sure you comment out:

#  enableShmVolume: true
#  spiloFSGroup: 103

This is because the OCP SCC will dynamically allocate the user/group, and newer Spilo images know how to chown to it at startup.
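
A minimal sketch of the corresponding Helm values (the configTarget/configGeneral key layout is an assumption to verify against the 1.5.0 chart's values.yaml):

# values.yaml excerpt for the postgres-operator Helm chart
configTarget: "ConfigMap"
configGeneral:
  kubernetes_use_configmaps: "true"
  docker_image: registry.opensource.zalan.do/acid/spilo-12:1.6-p3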

@ReSearchITEng
Contributor

Yes, on OCP 4.4 (based on k8s 1.17) the cluster pods (Spilo) give errors:

  1. I can confirm the error: ERROR: ObjectCache.run ApiException()
     Solution: giving more perms to the postgres-pod role (I also gave it all the postgres-operator perms). I did not identify the exact missing perm.

  2. After that, I ran into this error (Spilo fails to run callback_endpoint.py when KUBERNETES_USE_CONFIGMAPS is set to true, spilo#449):

2020-07-03 19:23:05,818 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
Traceback (most recent call last):
  File "/scripts/callback_endpoint.py", line 9, in <module>
    from kubernetes import client as k8s_client, config as k8s_config
ModuleNotFoundError: No module named 'kubernetes'
2020-07-03 19:23:05,915 INFO: promoted self to leader by acquiring session lock
server promoting
2020-07-03 19:23:05,919 INFO: cleared rewind state after becoming the leader
Traceback (most recent call last):
  File "/scripts/callback_endpoint.py", line 9, in <module>
    from kubernetes import client as k8s_client, config as k8s_config
ModuleNotFoundError: No module named 'kubernetes'

<grants, etc all ok>

2020-07-03 19:23:16,956 INFO: Lock owner: postgres-operator-cluster-0; I am postgres-operator-cluster-0
2020-07-03 19:23:17,053 INFO: no action.  i am the leader with the lock

The DB appears to be up (the psql command in the pod works), but the cluster is in "SyncFailed" status.

@ReSearchITEng
Contributor

ReSearchITEng commented Jul 3, 2020

Solution:

  1. postgres-pod perms:
     Make sure the postgres-pod serviceAccount also has the permissions below. If you don't have them already, add:
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - create
  - patch
  - get
  - list
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - patch
  - update
  - watch

on top of existing:

- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - get
# Patroni needs to watch pods
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - patch
  - update
  - watch
# to let Patroni create a headless service
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - create
  - patch
  2. postgres-pod perms:
     If not already there, add nodes perms:
  # to check nodes for node readiness label
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch

on top of existing.

  3. crd perms:
     Make sure you have at least "get", if not the entire set like get,create,patch,update,...
     More in: limit perms for crds #1044

  4. Ignorable error:
     As for the from kubernetes import client as k8s_client, config as k8s_config error: it can be safely ignored. It will be fixed via spilo#449 (Spilo fails to run callback_endpoint.py when KUBERNETES_USE_CONFIGMAPS is set to true).

  5. Result:

$ oc version
Client Version: 4.5.0-202005291417-9933eb9
Kubernetes Version: v1.17.1
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.5     True        False         34d     Cluster version is 4.4.5
$ oc get pg
NAME                        TEAM       VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
postgres-cluster   postgres   12        1      1Gi      10m           100Mi            120m   Running
$ helm version
version.BuildInfo{Version:"v3.2.4"

@stewartshea

I think you also need the create verb in the postgres-pod perms on configmaps:

- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - patch
  - update
  - watch
  - create
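
A quick way to verify the resulting permissions (a sketch; substitute your namespace, and note that postgres-pod is the pod_service_account_name from the operator config above):

# check that the pod service account is now allowed to create ConfigMaps
oc auth can-i create configmaps \
  --as=system:serviceaccount:<namespace>:postgres-pod -n <namespace>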

@davidkarlsen

Related: #1327

@davidkarlsen

davidkarlsen commented Jan 22, 2021

Shouldn't the operator provide whatever RBAC permissions are needed? Otherwise it becomes a bit hackish and not so automated.

@ghost

ghost commented Mar 4, 2022

Even when kubernetes_use_configmaps is set, the operator still tries to create endpoints, which is not allowed on OpenShift.

Probably related to this PR:

#1760

Can one of the maintainers have a look at this PR? Thx!

The only way I'm able to install the operator on OpenShift is to use an older version (1.6.3), unset kubernetes_use_configmaps so the deployment fails, and then add endpoints/restricted to the postgres-pod cluster role:

resources:
- endpoints
- endpoints/restricted

endpoints/restricted cannot be there from the start.

This trick doesn't seem to work with the latest version (1.7.x) or the main branch.

If somebody else is able to install the operator on OpenShift please share your config. Thx!

@FxKu
Member

FxKu commented Apr 4, 2022

This has now been fixed with #1760 and #1825 and will be included in the next release this week.

@FxKu FxKu closed this as completed Apr 4, 2022
@ghost

ghost commented Apr 5, 2022

This has now been fixed with #1760 and #1825 and will be included in the next release this week.

Thanks! We tested the code and I confirm that it works on OpenShift.
