chore(rfcs): Add RFC for Apache HTTP Server metrics source #3519

Merged
merged 13 commits into from
Aug 28, 2020
231 changes: 231 additions & 0 deletions rfcs/2020-08-21-3092-apache-metrics-source.md
@@ -0,0 +1,231 @@
# RFC 3092 - 2020-08-21 - Apache HTTP Server metrics source

This RFC is to introduce a new metrics source to consume metrics from the
[Apache HTTP Server] (httpd). The high level plan is to implement a scraper
similar to the existing [prometheus
source](https://vector.dev/docs/reference/sources/prometheus/) that will scrape
the Apache HTTP Server stats endpoint (provided by
[`mod_status`](https://httpd.apache.org/docs/2.4/mod/mod_status.html)) on an
interval and publish metrics to the defined pipeline.

## Scope

This RFC will cover:

- A new source for Apache metrics

This RFC will not cover:

- Generating metrics from Apache logs

## Motivation

Users running httpd want to collect, transform, and forward metrics to better
observe how their webservers are performing.

## Internal Proposal

I expect to largely copy the existing [prometheus
source](https://github.com/timberio/vector/blob/61e806d01d4cc6d2a527b52aa9388d4547f1ebc2/src/sources/prometheus/mod.rs)
and modify it to parse the output of the httpd status page which looks like:

```
localhost
ServerVersion: Apache/2.4.46 (Unix)
ServerMPM: event
Server Built: Aug 5 2020 23:20:17
CurrentTime: Friday, 21-Aug-2020 18:41:34 UTC
RestartTime: Friday, 21-Aug-2020 18:41:08 UTC
ParentServerConfigGeneration: 1
ParentServerMPMGeneration: 0
ServerUptimeSeconds: 26
ServerUptime: 26 seconds
Load1: 0.00
Load5: 0.03
Load15: 0.03
Total Accesses: 30
Total kBytes: 217
Total Duration: 11
CPUUser: .2
CPUSystem: .02
CPUChildrenUser: 0
CPUChildrenSystem: 0
CPULoad: .846154
Uptime: 26
ReqPerSec: 1.15385
BytesPerSec: 8546.46
BytesPerReq: 7406.93
DurationPerReq: .366667
BusyWorkers: 1
IdleWorkers: 74
Processes: 3
Stopping: 0
BusyWorkers: 1
IdleWorkers: 74
ConnsTotal: 1
ConnsAsyncWriting: 0
ConnsAsyncKeepAlive: 0
ConnsAsyncClosing: 0
Scoreboard: ________________________________________________________W__________________.....................................................................................................................................................................................................................................................................................................................................
```

I'll use this to generate the following metrics:

* `apache.uptime_seconds` (counter)
* `apache.total_accesses` (counter; extended)
* `apache.total_kilobytes` (counter; extended)
* `apache.total_duration` (counter; extended)
* `apache.cpu_user` (gauge; extended)
* `apache.cpu_system` (gauge; extended)
* `apache.cpu_children_user` (gauge; extended)
* `apache.cpu_children_system` (gauge; extended)
* `apache.cpu_load` (gauge; extended)
* `apache.requests_per_second` (gauge; extended)
* `apache.bytes_per_second` (gauge; extended)
* `apache.bytes_per_request` (gauge; extended)
* `apache.duration_per_request` (gauge; extended)
* `apache.busy_workers` (gauge)
* `apache.idle_workers` (gauge)
* `apache.processes` (gauge)
* `apache.stopping` (gauge)
* `apache.conns_total` (gauge)
* `apache.conns_async_writing` (gauge)
* `apache.conns_async_keepalive` (gauge)
* `apache.conns_async_closing` (gauge)
* `apache.scoreboard_waiting` (gauge)
* `apache.scoreboard_starting` (gauge)
* `apache.scoreboard_reading` (gauge)
* `apache.scoreboard_sending` (gauge)
* `apache.scoreboard_keepalive` (gauge)
* `apache.scoreboard_dnslookup` (gauge)
* `apache.scoreboard_closing` (gauge)
* `apache.scoreboard_logging` (gauge)
* `apache.scoreboard_finishing` (gauge)
* `apache.scoreboard_idle_cleanup` (gauge)
* `apache.scoreboard_open` (gauge)
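
The `apache.scoreboard_*` gauges above could be derived by counting characters in the `Scoreboard` line. The following is a minimal sketch, not the source's actual implementation: the character meanings follow the key documented by `mod_status`, while the function names `scoreboard_state` and `tally_scoreboard` are hypothetical.

```rust
use std::collections::BTreeMap;

/// Metric-name suffixes for every scoreboard state listed in the RFC.
const STATES: [&str; 11] = [
    "waiting", "starting", "reading", "sending", "keepalive", "dnslookup",
    "closing", "logging", "finishing", "idle_cleanup", "open",
];

/// Map one scoreboard character to a metric-name suffix (hypothetical helper).
fn scoreboard_state(c: char) -> Option<&'static str> {
    match c {
        '_' => Some("waiting"),
        'S' => Some("starting"),
        'R' => Some("reading"),
        'W' => Some("sending"),
        'K' => Some("keepalive"),
        'D' => Some("dnslookup"),
        'C' => Some("closing"),
        'L' => Some("logging"),
        'G' => Some("finishing"),
        'I' => Some("idle_cleanup"),
        '.' => Some("open"),
        _ => None, // unknown characters are ignored in this sketch
    }
}

/// Count how many worker slots are in each state.
/// Every state starts at zero so a gauge can be emitted even when no slot is in it.
fn tally_scoreboard(scoreboard: &str) -> BTreeMap<&'static str, u64> {
    let mut counts = BTreeMap::new();
    for state in STATES.iter() {
        counts.insert(*state, 0u64);
    }
    for c in scoreboard.chars() {
        if let Some(state) = scoreboard_state(c) {
            *counts.entry(state).or_insert(0) += 1;
        }
    }
    counts
}
```

Pre-populating the map with zeros is a design choice made for this sketch: it keeps the set of published gauges stable between scrapes instead of having series appear and disappear.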

Metrics labeled `extended` are only available if `ExtendedStatus` is enabled
for Apache. This is the default in newer versions (>= 2.4; released 2012), but
purportedly [increases CPU
load](https://www.datadoghq.com/blog/collect-apache-performance-metrics/#a-note-about-extendedstatus)
so some users may turn it off. If it is off, they simply won't have those
metrics published.
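
One way to get the "simply not published" behavior is to treat every field as optional when parsing. The sketch below assumes the `Key: Value` layout of the status output shown earlier; `parse_status` and `numeric_field` are hypothetical names, not functions from the Vector codebase.

```rust
use std::collections::HashMap;

/// Parse the `Key: Value` lines of the status page into a map.
/// Lines without a `: ` separator (such as the leading hostname) are skipped.
fn parse_status(body: &str) -> HashMap<&str, &str> {
    body.lines()
        .filter_map(|line| line.split_once(": "))
        .map(|(key, value)| (key.trim(), value.trim()))
        .collect()
}

/// Look up a numeric field. Returns `None` when the key is missing
/// (for example an extended field when `ExtendedStatus` is off)
/// or when the value is not a number.
fn numeric_field(fields: &HashMap<&str, &str>, key: &str) -> Option<f64> {
    fields.get(key)?.parse::<f64>().ok()
}
```

A caller would emit a metric only when `numeric_field` returns `Some`, so a missing extended field needs no special handling. Values that httpd prints without a leading zero, such as `.2`, are accepted by Rust's `f64` parser.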

I figure we probably don't want metrics for:

* Load (should be handled by a `cpu` or similar metrics source)

## Doc-level Proposal

Users will be instructed to set up
[`mod_status`](https://httpd.apache.org/docs/2.4/mod/mod_status.html) and
enable
[`ExtendedStatus`](https://httpd.apache.org/docs/2.4/mod/core.html#extendedstatus).

The following additional source configuration will be added:

```toml
[sources.my_source_id]
type = "apache" # required
endpoint = "http://localhost:8080/server-status" # required
scrape_interval_secs = 15 # optional, default, seconds
```

Some possible configuration improvements we could add in the future would be:

* `response_timeout`: to cap how long a scrape request may take
* `tls`: settings to allow setting specific chains of trust and client certs

But I chose to leave those out for now given the prometheus source doesn't
Contributor

I think the big missing piece here is Basic auth - a lot of people still use this as security for mod_status. Prom just adds it to the scrape URL.

Member Author

Agreed, I think we could support this in the same way (via the URL), but that it'd be useful to add basic auth options for all HTTP-based client sources. I propose deferring that until after we refactor them as mentioned in "Future Work".

Contributor

Ah - missed that. +1.

support them either. We could add support to both at the same time (see Future
Work section below).

[Datadog's
plugin](https://github.com/DataDog/integrations-core/blob/master/apache/datadog_checks/apache/data/conf.yaml.example)
has many more options we could also consider in the future.

## Rationale

Apache HTTP Server is a fairly common webserver. If we do not support ingesting
metrics from it, it is likely to push people to use another tool to forward
metrics from httpd to the desired sink.

As part of Vector's vision to be the "one tool" for ingesting and shipping
observability data, it makes sense to add as many sources as possible to reduce
the likelihood that a user will not be able to ingest metrics from their tools.

## Prior Art

- [Datadog collection](https://www.datadoghq.com/blog/monitor-apache-web-server-datadog/#set-up-datadogs-apache-integration)
- [Telegraf](https://github.com/influxdata/telegraf/tree/release-1.15/plugins/inputs/apache)

## Drawbacks

- Additional maintenance and integration testing burden of a new source

## Alternatives

### Having users run Telegraf and using Vector's prometheus source to scrape it

Instead of adding the source directly to Vector, we could instruct users to run
Telegraf and point Vector at the exposed Prometheus scrape endpoint. This would
leverage the already supported [Telegraf Apache input
plugin](https://github.com/influxdata/telegraf/tree/release-1.15/plugins/inputs/apache).

I decided against this as it would conflict with one of the listed
principles of Vector:

> One Tool. All Data. - One simple tool gets your logs, metrics, and traces
> (coming soon) from A to B.

[Vector
principles](https://vector.dev/docs/about/what-is-vector/#who-should-use-vector)

On the same page, it is mentioned that Vector should be a replacement for
Telegraf.

> You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or
> similar tools.

If users are already running Telegraf though, they could opt for this path.

### Have a generic HTTP scrape source

We could model this as a generic `http_scrape` source that has a `type` (or
`codec`?) that would determine what type of endpoint it is scraping.

I would steer away from this as I think there is some risk that some "HTTP
scrape" endpoints will need source specific configuration. We could do this
later if this does not end up being the case and they really are all the same.

One downside of this is that I think it'd be less discoverable than a
first-class source for each type of endpoint we support scraping.
Contributor

Yeah, I agree with this.

I can also see us (at some point) having "aliased" components, that merely defer to some other generic component with a default set of configurations.

In this case, we'd have an http_scrape source that is highly configurable, and an apache alias source that makes it more discoverable and easier to configure but defers its actual operations to the http_scrape source.

Having said that; all of this is irrelevant for this initial source implementation, which I like 👍

Contributor

I believe this is actually what you described in "Future Work", isn't it?

Member Author

The future work is just to do this internally, but I do like your idea of eventually adding an http_scrape source that would be more flexible and allow people to scrape arbitrary endpoints. I imagine it might require more configuration which the "aliases" would simply default.

Contributor

> I can also see us (at some point) having "aliased" components, that merely defer to some other generic component with a default set of configurations.

We do this currently. Sources and sinks can wrap other sources and sinks. For example, the syslog source wraps the socket source.

Contributor

Given this precedent, do we want to create a generic http_scrape source and the apache source based on this RFC, or do we want to start with the latter, and transition it to the former as soon as we need another source based on HTTP scraping?

I assume we'll just start with the singular apache source, but want to make sure we're on the same page.

Member Author

I think I'd like to defer the http_scrape source as I think it'll require more thought and discussion about how to parse the incoming requests (formats? codecs? etc.).

I guess I figured I'd start with this one and the nginx one (#3091) for now and refactor them internally to share a component (along with the prometheus source).


## Outstanding Questions

- Do we want to apply any metric labels based on the other information
available via the status page? I could see labeling the `url` at least
Contributor

Yes. But I assume #2660 would solve this for both logs and metrics, so it's probably blocked on that. Agree?

Member Author

Hmm, I think that is just for internal metrics, yes? It looks like the prometheus source adds labels (looks like we call them tags):

https://github.com/timberio/vector/blob/61e806d01d4cc6d2a527b52aa9388d4547f1ebc2/src/sources/prometheus/parser.rs#L43-L51

Contributor

Good point. I do think many of these event attributes would be better represented as trace context, but we can defer that for a later discussion.

- Do we want to have one apache source able to scrape multiple endpoints?
- Are there preferences between `apache` or `httpd` for the nomenclature? I
feel like `apache` is more well-known though `httpd` is more accurate
Contributor

I prefer httpd or apache_httpd? Apache feels a little generic since I think of the entire foundation. If we go with the apache_ prefix we should think about apache_kafka and so on :/.

Member Author

Yeah, I had the same angst about using apache, but it does match how the other two collectors I looked at, datadog and telegraf, do it so people may be more familiar with that keyword as the component name if they are coming from those tools.

I'm open to either of those options though. I guess I prefer httpd over apache_httpd given that kafka is just kafka and not apache_kafka.


## Plan Of Attack

Incremental steps that execute this change. Generally this is in the form of:

- [ ] Submit a PR with the initial source implementation

## Future Work

### Refactor HTTP-scraping-based sources

I think one thing that would make sense would be to refactor the sources based
on HTTP scraping to share a base similar to how our sinks that rely on `http`
are factored (`splunk_hec`, `http`, `loki`, etc.). This allows them to share
common configuration options for their behavior.

My recommendation is to implement this and the
[`nginx`](https://github.com/timberio/vector/issues/3091) metrics source and
then figure out where the seams are to pull out an `HttpScrapeSource` module
that could be used by this source, the `nginx` source, and the `prometheus`
source.
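
To make the idea of a shared base concrete, here is one possible shape for the seam, written as a dependency-free sketch. Everything in it is an assumption for illustration: the `Metric` struct stands in for Vector's real metric type, and `ScrapeParser`, `scrape_once`, and `ApacheParser` are hypothetical names rather than a proposed API. The HTTP client is passed in as a closure so the shared part can be exercised without a network.

```rust
/// A metric as a name/value pair; a stand-in for Vector's real metric type.
#[derive(Debug, PartialEq)]
struct Metric {
    name: String,
    value: f64,
}

/// The per-source part: turning a response body into metrics.
trait ScrapeParser {
    fn parse(&self, body: &str) -> Vec<Metric>;
}

/// The shared part: fetch the endpoint, then hand the body to the parser.
/// `fetch` stands in for the HTTP client; errors are plain strings here.
fn scrape_once<P, F>(parser: &P, endpoint: &str, fetch: F) -> Result<Vec<Metric>, String>
where
    P: ScrapeParser,
    F: Fn(&str) -> Result<String, String>,
{
    let body = fetch(endpoint)?;
    Ok(parser.parse(&body))
}

/// Example per-source parser: reads only `BusyWorkers` from a status page.
struct ApacheParser;

impl ScrapeParser for ApacheParser {
    fn parse(&self, body: &str) -> Vec<Metric> {
        body.lines()
            .filter_map(|line| line.strip_prefix("BusyWorkers: "))
            .filter_map(|value| value.trim().parse::<f64>().ok())
            .map(|value| Metric {
                name: "apache.busy_workers".to_string(),
                value,
            })
            .take(1) // the status page can list BusyWorkers more than once
            .collect()
    }
}
```

Under this split, options such as the scrape interval, timeouts, TLS, and basic auth would live with the shared fetch logic, while each source contributes only a parser.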