
add monitoring #264

Open
theRealWardo opened this issue Mar 7, 2018 · 40 comments

Comments

@theRealWardo

what does zalando do for postgres monitoring with any databases run via this operator?

I was thinking of building https://github.com/wrouesnel/postgres_exporter into the database container and having that be monitored via our prometheus operator.

are there any existing plans to add monitoring directly into this project in some way? if not, is there a need for a more detailed discussion/approach prior to contribution, or shall I do as the contribution guidelines say and just hack away and send a PR?

@Jan-M
Member

Jan-M commented Mar 7, 2018

Quick answer is no, there is no intent to make the operator "monitor" anything. Ideally the operator focuses on "operation" and more specifically on the provisioning and modifying part. The "ops" part we largely leave to Patroni which is very well suited for taking care of the cluster itself.

The operator however does contain a very slim API to allow monitoring it from the outside.

At Zalando we use ZMON (zmon.io) for all monitoring. But there are other options here, like Prometheus.

We are running Postgres with the bg_mon extension, which exposes a lot of Postgres data via a REST API on port 8080, so this helps a lot I think.

@theRealWardo
Author

thanks for the quick reply! to be clear I'm not proposing monitoring the operator itself but rather the database it is operating on. if there is something in the operator that you monitor and feel others should monitor please do let me know! otherwise our system will probably just be monitoring that the pod is up and running.

what I'd like to add to this operator to facilitate that is a flag that would add a simple named monitoring port on the ServiceSpec. that would enable me to have a ServiceMonitor (custom resource) which my Prometheus operator would then be able to turn into scrape targets for my Prometheus instance. does that sound reasonable?
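for context, the ServiceMonitor side could look roughly like this (just a sketch - the port name "monitoring" and the application: spilo label are assumptions on my part, it would really be whatever labels/port the operator puts on the Service):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: postgres-clusters
spec:
  selector:
    matchLabels:
      application: spilo
  endpoints:
    - port: monitoring   # the named monitoring port proposed above
      interval: 30s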

@Jan-M
Member

Jan-M commented Mar 7, 2018

I forgot one tool here, just teasing it as we have not released it yet, but teams rely on our new pgview web interface to monitor their DBs too and it has proven very useful.

[image: screenshot of the pgview web interface]

@theRealWardo
Author

for that kind of web dashboard thing we've been running https://github.com/ankane/pghero which has definitely helped us a couple of times, but it doesn't hook into our alerting systems, which is what I'm really trying to achieve here.

@Jan-M
Member

Jan-M commented Mar 7, 2018

Operator monitoring: We have not figured this out completely. One part here is definitely user experience, making sure the operator is quick to provision new clusters and to apply changes triggered by the user, but other than that we more or less monitor that the pod is running, which is not that helpful or informative.

Database monitoring: We don't consider this a task of the operator and our operator is not required once the database is "deployed" as Patroni does all the magic for high availability and failover, which makes the operator itself much smaller in scope and much less important.

To monitor clusters, as said above, both Postgres and Patroni have REST APIs that are easy to monitor.

@stefanhipfel

I adapted the operator to deploy the postgres exporter as a sidecar container (instead of running it inside the Spilo container). With this we can get metrics to Prometheus, so the operator is not monitoring anything, it just helps with the deployment. What do you guys think?

@Jan-M
Member

Jan-M commented Apr 13, 2018

We had the discussion once about arbitrary sidecar definition support, but scratched it until the need arises. Feel free to PR this or frame it in an issue, as this could become anything from simple to very generic.

Maybe we can also go for a "prometheus" sidecar, similarly static as the Scalyr sidecar. Can you dump your sidecar definition here so we can have a look?

@Jan-M
Member

Jan-M commented May 29, 2018

I am closing this.

The sidecar feature that we currently use only for Scalyr in a hard-coded way may see some improvements and become more generic, and then also serve the purpose of adding e.g. the postgres exporter as a sidecar via the operator.

@theRealWardo
Author

how about we keep this open and I send you a PR? I'll try to get you one this week which will add a monitoring sidecar option, if you are okay with that.

@Jan-M
Member

Jan-M commented May 30, 2018

Sure, PRs or idea sketches are very welcome. Maybe you can outline your idea briefly, as we have some ongoing discussions internally on what sidecars should look like: from toggled, hard-coded examples like Scalyr now, to a very generic approach.

@hjacobs
Contributor

hjacobs commented May 30, 2018

@Jan-M would be great to see that discussion here in the Open Source project, so others can comment/join.

@theRealWardo
Author

sure! so if I were to bring up the most important things for adding monitoring to this project:

  • make it easy for some common use case(s)
  • make it clear how to add other monitoring solutions

I think we should start by focusing on 2 common use cases, documenting them, and changing the project's current language of "Monitoring of clusters is not in scope, for this good tools already exist from ZMON to Prometheus and more Postgres specific options":

  • your case of using ZMON + pg_view and friends seems like it can be achieved simply via a modified image, right? I think this case is supported in the current design. this is interesting because it doesn't require additional permissions and instead builds it into postgres. let's document how to do this one.
  • I think a common use case for a lot of us is a sidecar container. this would enable my goal of prometheus monitoring with something like the exporter I linked above or a telegraf container. I'd propose we start by extending the current sidecar support with a monitoring-specific sidecar that can be enabled. this will be trickier than the baked-in approach because most of these processes running in the sidecar will require a connection URL. I believe using the superuser here is a bad idea as it can impact Patroni failovers, correct? so using the correct user/permissions has to be figured out for this...

a bit more technical detail of what I am proposing for monitoring sidecars specifically:

  • no one wants to copy-paste a ton of config, so provide two options: configure monitoring sidecars on the operator or on the cluster.
  • the default should just work: simply setting monitoring_docker_image to whatever image should run as a sidecar would be enough, assuming:
    • the image is passed the following environment variables: POSTGRES_USER, POSTGRES_PASSWORD (and it obviously is configured to use them correctly)
    • that POSTGRES_USER is granted the correct permissions
  • for those of us running the Prometheus Operator, we'll apply a specific label to make our ServiceMonitor pick up these pods

going to sketch some code and share it shortly to get a bit more specific and hopefully keep the discussion going. thoughts here though?
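to make the bullets above a bit more concrete, a very rough strawman of the cluster-level option (purely illustrative - monitoring_docker_image is only the option proposed here, not something the operator supports today):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-example-cluster
spec:
  teamId: "acid"
  numberOfInstances: 2
  volume:
    size: 1Gi
  postgresql:
    version: "10"
  # hypothetical option: run this image as a monitoring sidecar and pass it
  # POSTGRES_USER / POSTGRES_PASSWORD for a role with suitable (non-superuser) permissions
  monitoring_docker_image: "wrouesnel/postgres_exporter:v0.4.7"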

@Jan-M
Member

Jan-M commented May 31, 2018

Just a very quick remark: imho monitoring is still not in scope of the operator; that said, sidecars should be supported and are a good idea.

For me the essence is that the operator should itself not start to "monitor" metrics or become e.g. a metric gateway/proxy.

@alexeyklyukin
Contributor

Hi @theRealWardo,

I had some similar thoughts along the lines of supporting any sidecar, not necessarily monitoring (for instance, ours is doing log exporting, and others may do things like regular manual vacuuming, index rebuilds or backups, or even run 3rd-party applications that, for example, export the data somewhere else). Most of them, in general, need access to the PGDATA/logs and many also need access to the database itself.

The set of parameters you came up with looks good to me. We could also pass the role name that should be defined inside the infrastructure roles, and the operator would perform the job of passing the role name and the password from there to the cluster. However, in some cases it might be necessary to connect as a superuser, whose password is per-cluster.

Another idea is to expose the Unix socket inside the volume mount of github.com/zalando/spilo, so that other containers running in the same pod can connect over the Unix socket as user postgres without a password.

In order to fully support this, we would also need something along the lines of pod_environment_configmap (custom environment variables injected into every pod) to be propagated to the sidecar, and also a similar option for passing a global secret object (as in many cases values like external API keys cannot be entrusted to mere ConfigMaps) to expose the secrets inside it to each container as environment variables.
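(For reference, pod_environment_configmap is just the name of a plain ConfigMap set in the operator configuration; its key/value pairs end up as environment variables in the Postgres pods. A rough sketch, with placeholder names:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
data:
  # every key/value pair becomes an environment variable in the database pods
  LOG_SHIPPING_HOST: "logs.example.com"
  LOG_SHIPPING_PORT: "5044"

referenced as pod_environment_configmap: "postgres-pod-config" in the operator configuration; whether and how these variables reach sidecars depends on the operator version.)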

I am not sure about the labels. It is not possible to apply labels to individual containers within the pod; what we could do is apply a sidecar label with the name of the sidecar. However, that looks redundant to me, since one can always instruct monitoring to look for pods with the set of cluster_labels configured in the operator.

I'll look into your PR and will also do the global secrets when I have time.

@theRealWardo
Author

so I modified my PR to add generic sidecar support. it allows users to add as many sidecars as they like to each of the pods running their clusters. this is sufficient to meet our use cases, and could be used by your team in place of the current Scalyr specific stuff.

we are going to try and run 2 sidecar containers actually. we'll be running one that does log shipping via Filebeat and another that does monitoring via Postgres Exporter.

hopefully this PR will enable other interesting uses too.

@pitabwire

@theRealWardo how are you passing in the env vars to Postgres Exporter like DATA_SOURCE_NAME, given that the ones available from the postgres operator are different (i.e. POSTGRES_*)? Or do you create another container, based on the one available for postgres exporter, for inclusion as a sidecar?

@theRealWardo
Author

right @pitabwire - we use a sidecar, 2 of them actually. one that ships logs and one that does monitoring.

@pitabwire

@theRealWardo could you give some guidance on this? I tried to pass in the environment variables, but for some reason they are not being picked up in the postgres exporter container; I get the error below

kubectl logs -n datastore -f tester-events-cluster-0 pg-exporter
time="2019-03-07T07:13:56Z" level=info msg="Established new database connection." source="postgres_exporter.go:1035"
time="2019-03-07T07:13:56Z" level=info msg="Error while closing non-pinging DB connection: <nil>" source="postgres_exporter.go:1041"
time="2019-03-07T07:13:56Z" level=info msg="Error opening connection to database (postgresql://:PASSWORD_REMOVED@127.0.0.1:5432/postgres?sslmode=disable): pq: Could not detect default username. Please provide one explicitly" source="postgres_exporter.go:1070"
time="2019-03-07T07:13:56Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1178"

my Dockerfile is shown below:

FROM ubuntu:18.04 as builder

ENV PG_EXPORTER_VERSION=v0.4.7
RUN apt-get update && apt-get install -y curl \
 && curl -sL https://github.com/wrouesnel/postgres_exporter/releases/download/${PG_EXPORTER_VERSION}/postgres_exporter_${PG_EXPORTER_VERSION}_linux-amd64.tar.gz \
  | tar -xz

FROM scratch

ENV PG_EXPORTER_VERSION=v0.4.7
ENV POSTGRES_USER=""
ENV POSTGRES_PASSWORD=""
ENV DATA_SOURCE_NAME="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@127.0.0.1:5432/postgres?sslmode=disable"

COPY --from=builder /postgres_exporter_${PG_EXPORTER_VERSION}_linux-amd64/postgres_exporter /postgres_exporter

EXPOSE 9187

ENTRYPOINT [ "/postgres_exporter" ]

@tritruong

I'm using a sidecar to run postgres_exporter. The config looks like this

apiVersion: "acid.zalan.do/v1"
kind: postgresql
spec:
    ...
    sidecars:
    - name: "prometheus-postgres-exporter"
      image: "wrouesnel/postgres_exporter:v0.4.7"
      env:
        - name: "PG_EXPORTER_EXTEND_QUERY_PATH"
          value: "/etc/config.yaml"
        - name: "DATA_SOURCE_NAME"
          value: "postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable"
      ports:
        - name: http
          containerPort: 9187
          protocol: TCP
    ...

Unfortunately, the endpoints don't expose the sidecar's port (9187 in this case)

@pitabwire

@tritruong the challenge with doing it this way is that you have to do it for every cluster definition. I would like to do it globally and in an automated way, so that any new cluster definitions are automatically picked up by the Prometheus monitoring and alerting system.

@Jan-M
Member

Jan-M commented Mar 12, 2019

And don't put the password into env vars like this.

I am in general in favor of having a global generic sidecar definition for whatever you need.

For monitoring though, or other tooling, the K8s API gives you a nice way to discover the services and clusters you want to monitor, so one exporter or tool per cluster may not be the best idea anymore. But this arguably depends.

@tritruong

@Jan-M Yes, I could mount a secret file. Is there any way to disable the default environment variables that are always passed to sidecars (POSTGRES_USER and POSTGRES_PASSWORD)?
https://github.com/zalando/postgres-operator/blob/31e568157b336592debbb37f2c44c1ca1769c00d/docs/user.md#sidecar-support

@frittentheke
Copy link
Contributor

@tritruong Maybe using a trust configuration with role-mapping in pg_hba.conf could grant the exporter sidecar just the required read-only access, potentially even without password-based authentication?

And yes @Jan-M, I believe @tritruong does have a point. Giving every little sidecar containing just a piece of monitoring software full on admin rights to the database might not be desired :-)
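Something along these lines, for example (only a sketch - the pg_exporter role is made up, the default pg_hba entries including the replication ones need to be kept when overriding, and whether pg_hba can be set via the manifest like this depends on your operator/Patroni version, so check the docs):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
spec:
    ...
    users:
      pg_exporter: []   # plain login role, no extra privileges
    patroni:
      pg_hba:
        # keep the default entries (replication etc.) here as well
        - local   all   pg_exporter                 trust
        - host    all   pg_exporter   127.0.0.1/32  trust
        - host    all   all           0.0.0.0/0     md5
    ...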

@rporres

rporres commented Mar 18, 2019

Unfortunately, the endpoints don't expose the sidecar's port (9187 in this case)

@tritruong I created a separate service for the exporter to work around that fact.

@jtomsa

jtomsa commented Apr 1, 2019

If anyone is interested in monitoring Patroni itself, I've written a patroni-exporter for Prometheus that scrapes the Patroni API. Someone might find it useful :)
https://github.com/Showmax/patroni-exporter

@Yannig

Yannig commented Aug 2, 2019

Here is a complete example we use internally to enable the Prometheus exporter:

---
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: postgres
spec:
  teamId: "myteam"
  numberOfInstances: 1
  enableMasterLoadBalancer: false
  volume:
    size: 200Mi
  users:
    user_database: ["superuser", "createdb"]
  databases:
    database: user_database
  postgresql:
    version: "11"

  sidecars:
    - name: "exporter"
      image: "wrouesnel/postgres_exporter"
      ports:
        - name: exporter
          containerPort: 9187
          protocol: TCP
      resources:
        limits:
          cpu: 500m
          memory: 256M
        requests:
          cpu: 100m
          memory: 200M
      env:
        - name: "DATA_SOURCE_URI"
          value: "postgres/database?sslmode=disable"
        - name: "DATA_SOURCE_USER"
          valueFrom:
            secretKeyRef:
              name: postgres.postgres.credentials
              key: username
        - name: "DATA_SOURCE_PASS"
          valueFrom:
            secretKeyRef:
              name: postgres.postgres.credentials
              key: password

---
apiVersion: v1
kind: Service
metadata:
  name: pg-exporter
  labels:
    app: pg-exporter
spec:
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
    - name: exporter
      port: 9187
      targetPort: exporter
  selector:
    application: spilo
    team: myteam

@tanordheim

I opted for baking postgres_exporter into a custom-built Spilo image and having the supervisord in the Spilo image automatically start it up. Then I tweaked the Prometheus job rules to add a custom scrape target that scrapes the postgres_exporter metrics on all application=spilo pods - it seems to work quite well and lets me configure monitoring as an operator-wide feature instead of having each cluster define this themselves.
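In case it helps anyone, the scrape job for that setup can be roughly like this (a sketch, assuming the exporter listens on port 9187 in every pod labelled application=spilo):

scrape_configs:
  - job_name: postgres-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only spilo pods
      - source_labels: [__meta_kubernetes_pod_label_application]
        action: keep
        regex: spilo
      # point the scrape address at the exporter port
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: $1:9187
        target_label: __address__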

@ekeih

ekeih commented Dec 24, 2019

When we upgraded our Kubernetes cluster to 1.16 the postgres-operator (1.2.0, #674) was not able to find the existing StatefulSets anymore (because of the API changes between 1.15 and 1.16).
This led to a situation where all postgres clusters were marked as SyncFailed.

Status:
  Postgres Cluster Status:  SyncFailed

I think it would be very helpful if the operator exposed a /metrics endpoint for Prometheus, which would make it possible to alert on such things. This is not an issue of the database cluster but of the operator, so monitoring the database does not expose this kind of issue.

@frittentheke
Contributor

@theRealWardo there are two open PRs that, combined, should allow most monitoring / log-shipping use cases to be configured:

@theRealWardo
Author

awesome thanks @frittentheke!

@vitobotta
Contributor

@Yannig Hi! Can you suggest a Grafana dashboard that works with your config? Thanks!

@jkroepke

jkroepke commented Jul 5, 2021

Has anyone tried that with the OperatorConfiguration?

https://github.com/zalando/postgres-operator/blob/ebb3204cdd7002742499c85b1df15b43a68f005b/docs/administrator.md#sidecars-for-postgres-clusters

apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-configuration
configuration:
  sidecars:
  - name: exporter
    image: prometheuscommunity/postgres-exporter:v0.9.0
    ports:
    - name: exporter
      containerPort: 9187
      protocol: TCP
    resources:
      requests:
        cpu: 50m
        memory: 200M
    env:
    - name: "DATA_SOURCE_URI"
      value: "$(POD_NAME)/postgres?sslmode=disable"
    - name: "DATA_SOURCE_USER"
      value: "$(POSTGRES_USER)"
    - name: "DATA_SOURCE_PASS"
      value: "$(POSTGRES_PASSWORD)"
    - name: "PG_EXPORTER_AUTO_DISCOVER_DATABASES"
      value "true"

Is a Service for port 9187 required? Any disadvantage to using a PodMonitor? Recently, I used a PodMonitor for our Kafka operator monitoring setup, too.

@davidkarlsen

@jkroepke that works - but not via configmaps in the later versions of the operator. But yes, you will need to add *Monitor resources to activate scraping
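For example, a PodMonitor along these lines should pick up the exporter port from the sidecar above (a sketch, assuming the container port is named exporter and the pods carry the usual application: spilo label):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: postgres-exporter
spec:
  selector:
    matchLabels:
      application: spilo
  podMetricsEndpoints:
    - port: exporter
      interval: 30s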

@binhnguyenduc

binhnguyenduc commented Jul 24, 2021

For anyone looking for a Grafana Dashboard to get started with Yannig's config, try this: https://grafana.com/grafana/dashboards/9628

Simply set up a Prometheus target to scrape /metrics from the pg-exporter service, import the Grafana dashboard, and voilà!


@MPV

MPV commented Sep 17, 2021

I think it would be very helpful if the operator exposed a /metrics endpoint for Prometheus, which would make it possible to alert on such things. This is not an issue of the database cluster but of the operator, so monitoring the database does not expose this kind of issue.

@ekeih Take a look at #1529

@cdmikechen
Contributor

cdmikechen commented Mar 2, 2022

I added a config like @jkroepke's, but connecting over the Unix socket instead. I feel this approach is cleaner.

  sidecars:
  - name: exporter
    image: postgres_exporter:v0.10.1
    ports:
    - name: pg-exporter
      containerPort: 9187
      protocol: TCP
    resources:
      requests:
        cpu: 50m
        memory: 200M
    env:
    - name: CLUSTER_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['cluster-name']
    - name: DATA_SOURCE_NAME
      value: >-
        host=/var/run/postgresql user=postgres
        application_name=postgres_exporter
    - name: PG_EXPORTER_CONSTANT_LABELS
      value: 'release=$(CLUSTER_NAME),namespace=$(POD_NAMESPACE)'

You can add a volume in the CRD like this:

  additionalVolumes:
    - name: socket-directory
      mountPath: /var/run/postgresql
      targetContainers:
        - all
      volumeSource:
        emptyDir: {}

@vitargelo

vitargelo commented Mar 30, 2022

I'm using this helm chart:
https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-postgres-exporter

Connected to pooler-replica service. Works fine

@jkroepke

How does that work for you?

If you create a new database, how will the new exporter be deployed? Running helm install after applying the CR is not the idea of an operator.

@abh
Contributor

abh commented Sep 7, 2023

@jkroepke you can configure a sidecar in the operator configuration that gets applied to all postgres pods the operator starts.
