chore: Added PostgreSQL RFC #3612 (merged, 11 commits, Aug 31, 2020)

rfcs/2020-08-27-3603-postgres-metrics.md (154 additions, 0 deletions)
# RFC 3603 - 2020-08-27 - Collecting metrics from PostgreSQL

This RFC introduces a new metrics source to consume metrics from PostgreSQL database servers. The high-level plan is to implement a single source that collects metrics from PostgreSQL server instances.

Background reading on PostgreSQL monitoring:

- https://www.datadoghq.com/blog/postgresql-monitoring/

## Scope

This RFC will cover:

- A new source for PostgreSQL server metrics.

This RFC will not cover:

- Other databases.

## Motivation

Users want to collect, transform, and forward metrics to better observe how their PostgreSQL databases are performing.

## Internal Proposal

Build a single source called `postgresql_metrics` (name to be confirmed) to collect PostgreSQL metrics.

The recommended implementation is to use the Rust PostgreSQL client to connect to the target database server at the address specified in the configuration.

- https://docs.rs/postgres/0.17.5/postgres/index.html

The source would then run the following queries:

- `SELECT * FROM pg_stat_database`
- `SELECT * FROM pg_stat_database_conflicts`
- `SELECT * FROM pg_stat_bgwriter`

It would then parse the query results and convert them into the following metrics, named using the database name and column names:

- `postgresql_up` -> used as an uptime metric (0/1)? This merits a broader discussion.
- `pg_stat_database_blk_read_time` tagged with db, host, server, user (counter)
- `pg_stat_database_blk_write_time` tagged with db, host, server, user (counter)
- `pg_stat_database_blks_hit` tagged with db, host, server, user (counter)
- `pg_stat_database_blks_read` tagged with db, host, server, user (counter)
- `pg_stat_database_stats_reset` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_buffers_alloc` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_buffers_backend` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_buffers_backend_fsync` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_buffers_checkpoint` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_buffers_clean` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_checkpoint_sync_time` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_checkpoint_write_time` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_checkpoints_req` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_checkpoints_time` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_maxwritten_clean` tagged with db, host, server, user (counter)
- `pg_stat_bgwriter_stats_reset` (counter)
- `pg_stat_database_conflicts` tagged with db, host, server, user (counter)
- `pg_stat_database_datid` tagged with db, host, server, user (counter)
- `pg_stat_database_deadlocks` tagged with db, host, server, user (counter)
- `pg_stat_database_numbackends` tagged with db, host, server, user (gauge)
- `pg_stat_database_temp_bytes` tagged with db, host, server, user (counter)
- `pg_stat_database_temp_files` tagged with db, host, server, user (counter)
- `pg_stat_database_tup_deleted` tagged with db, host, server, user (counter)
- `pg_stat_database_tup_fetched` tagged with db, host, server, user (counter)
- `pg_stat_database_tup_inserted` tagged with db, host, server, user (counter)
- `pg_stat_database_tup_returned` tagged with db, host, server, user (counter)
- `pg_stat_database_tup_updated` tagged with db, host, server, user (counter)
- `pg_stat_database_xact_commit` tagged with db, host, server, user (counter)
- `pg_stat_database_xact_rollback` tagged with db, host, server, user (counter)
- `postgresql_database_conflicts` tagged by db name and server (counter)
- `postgresql_database_conflicts_confl_bufferpin` tagged by db name and server (counter)
- `postgresql_database_conflicts_confl_deadlock` tagged by db name and server (counter)
- `postgresql_database_conflicts_confl_lock` tagged by db name and server (counter)
- `postgresql_database_conflicts_confl_snapshot` tagged by db name and server (counter)
- `postgresql_database_conflicts_confl_tablespace` tagged by db name and server (counter)
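To make the conversion concrete, here is a hedged sketch of turning one `pg_stat_database` row into tagged metrics. The types and function names are invented for illustration; the real source would emit Vector's internal metric event types, and the counter/gauge classification would cover the full column set rather than the single example here:

```rust
use std::collections::BTreeMap;

/// Illustrative metric value; stands in for Vector's internal metric types.
#[derive(Debug, PartialEq)]
enum MetricValue {
    Counter(f64),
    Gauge(f64),
}

/// Hypothetical conversion of one `pg_stat_database` row into metrics.
/// Columns listed in `GAUGES` become gauges; everything else a counter,
/// matching the classification in the metric list above.
fn row_to_metrics(
    columns: &[(&str, f64)],
    tags: &BTreeMap<String, String>,
) -> Vec<(String, MetricValue, BTreeMap<String, String>)> {
    const GAUGES: &[&str] = &["numbackends"];
    columns
        .iter()
        .map(|(col, val)| {
            let name = format!("pg_stat_database_{}", col);
            let value = if GAUGES.contains(col) {
                MetricValue::Gauge(*val)
            } else {
                MetricValue::Counter(*val)
            };
            (name, value, tags.clone())
        })
        .collect()
}

fn main() {
    let mut tags = BTreeMap::new();
    tags.insert("db".to_string(), "production".to_string());
    let metrics = row_to_metrics(&[("xact_commit", 42.0), ("numbackends", 3.0)], &tags);
    println!("{:?}", metrics);
}
```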

Naming of metrics is determined via:

- `pg_<statistics view>_<column name>` - for example, the `blks_hit` column of `pg_stat_database` becomes `pg_stat_database_blks_hit`

This is in line with the naming convention used by the Prometheus PostgreSQL exporter.

## Doc-level Proposal

The following additional source configuration will be added:

```toml
[sources.my_source_id]
type = "postgresql_metrics" # required
address = "postgres://postgres@localhost" # required - address of the PG server
databases = ["production", "testing"] # optional - databases to query; defaults to all if not specified
scrape_interval_secs = 15 # optional - default 15, in seconds
namespace = "postgresql" # optional - default "postgresql"; namespace to put metrics under
```

Review discussion on the `address` option:

> **Member:** Just noting we'll probably want to call this `endpoint` per #3590. We've also leaned towards allowing multiple endpoints for sources (#3519 (comment)), so this would be `endpoints`. This does make the `databases` config a bit weird when used with multiple DBs. You could just have the database in the URL, JDBC-style, to allow for things like `postgres://postgres@localhost/database`. If we do that, we'll probably also want to label metrics with the endpoint and host to allow differentiation downstream.
>
> **Contributor Author:** That's a bit messy as a UI. :( And adding the DB to the URL only solves for one DB per connection; what if I want two?
>
> **Member:** Yeah, it does feel messier. You could just specify the URL twice if you wanted two databases (or maybe use commas?), but I can see this would be weird for the common case of just wanting to monitor all databases on a given server. To that end, we should probably make it easy to monitor all databases (or even provide a list of exclusions). It's probably pretty common to want to monitor all of them aside from the `template*` ones. We should also come up with a rule of thumb for when a given source should use a single vs. multiple endpoints, to ease the decision in the future.
>
> **Contributor Author:** Hmmm, I've always assumed you'd specify the source twice for separate servers... Specifying the connection URL twice would double (or more, for other DBs) the connections and mean multiple collections against a single DB server. I don't think that's a good performance experience for folks.
>
> **Member:** Yeah, I definitely agree in this case. I'm wondering what you think about the `apache_metrics` case, though. I did change it to `endpoints` in response to #3519 (comment). The Prometheus scraper also allows multiple endpoints.
>
> **Contributor Author (Aug 29, 2020):** I feel like Apache and PG are architected, and hence monitored, differently. Apache is a single server you're monitoring; monitoring PG means monitoring one or more (or all) databases in a PG server. That additional level of granularity makes `endpoints` odd to me.
>
> **Contributor Author:** I've reverted this to `endpoint` for now. We can discuss.
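Assuming the options above, a minimal sketch of how the optional-field defaults might be modeled. The struct and constructor are illustrative only, not Vector's actual configuration code:

```rust
/// Illustrative config struct mirroring the proposed TOML options.
#[derive(Debug, Clone)]
struct PostgresqlMetricsConfig {
    address: String,
    /// `None` means "collect from all databases on the server".
    databases: Option<Vec<String>>,
    scrape_interval_secs: u64,
    namespace: String,
}

impl PostgresqlMetricsConfig {
    /// Apply the proposed defaults for the optional fields.
    fn new(address: &str) -> Self {
        Self {
            address: address.to_string(),
            databases: None,
            scrape_interval_secs: 15,            // proposed default
            namespace: "postgresql".to_string(), // proposed default
        }
    }
}

fn main() {
    let cfg = PostgresqlMetricsConfig::new("postgres://postgres@localhost");
    println!("{:?}", cfg);
}
```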

- We'd also add a guide for doing this without root permissions.

## Rationale

PostgreSQL is a commonly adopted modern database. Users frequently want to monitor its state and performance.

Additionally, as part of Vector's vision to be the "one tool" for ingesting and shipping observability data, it makes sense to add as many sources as possible to reduce the likelihood that a user will not be able to ingest metrics from their tools.

## Prior Art

- https://github.com/wrouesnel/postgres_exporter/
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/postgresql
- https://collectd.org/wiki/index.php/Plugin:PostgreSQL
- https://collectd.org/documentation/manpages/collectd.conf.5.shtml#plugin_postgresql

## Drawbacks

- Additional maintenance and integration testing burden of a new source

## Alternatives

### Having users run Telegraf or the Prometheus postgres_exporter and using Vector's prometheus source to scrape it

We could choose not to add the source directly to Vector and instead instruct users to run Telegraf's `postgresql` input or Prometheus' `postgres_exporter` and point Vector at the resulting data. This would leverage the already-supported inputs from those projects.

I decided against this as it would be in contrast with one of the listed [Vector principles](https://vector.dev/docs/about/what-is-vector/#who-should-use-vector):

> One Tool. All Data. - One simple tool gets your logs, metrics, and traces
> (coming soon) from A to B.

On the same page, it is mentioned that Vector should be a replacement for
Telegraf.

> You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or
> similar tools.

If users are already running Telegraf or PostgreSQL Exporter though, they could opt for this path.

## Outstanding Questions

- SSL: should it be configurable, and should it default to disabled?
- Which PostgreSQL versions should be supported? There are some functional differences between versions.
- Should we also collect `pg_settings`?

## Plan Of Attack

Incremental steps to execute this change:

- [ ] Submit a PR with the initial source implementation

## Future Work

- Extend source to collect additional database metrics:
- Replication
- Locks?
- `pg_stat_user_tables`?