chore(rfcs): Add RFC for Apache HTTP Server metrics source #3519

Merged Aug 28, 2020 (13 commits).

Adds `rfcs/2020-08-21-3092-apache-metrics-source.md` (225 additions, 0 deletions).

# RFC 3092 - 2020-08-21 - Apache HTTP Server metrics source

This RFC is to introduce a new metrics source to consume metrics from the
[Apache HTTP Server](https://httpd.apache.org/) (httpd). The high-level plan is
to implement a scraper similar to the existing [prometheus
source](https://vector.dev/docs/reference/sources/prometheus/) that will scrape
the Apache HTTP Server stats endpoint (provided by
[`mod_status`](https://httpd.apache.org/docs/2.4/mod/mod_status.html)) on an
interval and publish metrics to the defined pipeline.
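
As a rough sketch of that flow (illustrative only; the real source would be
built on Vector's internal source machinery, and the use of `reqwest` and the
function names here are assumptions for the example):

```rust
use std::time::Duration;

// Illustrative scrape loop: fetch the status endpoint on an interval and hand
// the body off to a (hypothetical) parser that emits metrics downstream.
async fn scrape_loop(endpoint: String, interval_secs: u64) {
    let mut ticker = tokio::time::interval(Duration::from_secs(interval_secs));
    loop {
        ticker.tick().await;
        match reqwest::get(endpoint.as_str()).await {
            Ok(response) => {
                let body = response.text().await.unwrap_or_default();
                // parse_status(&body) and publish metrics to the pipeline here
                println!("scraped {} bytes from {}", body.len(), endpoint);
            }
            Err(error) => eprintln!("scrape of {} failed: {}", endpoint, error),
        }
    }
}
```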

## Scope

This RFC will cover:

- A new source for Apache metrics

This RFC will not cover:

- Generating metrics from Apache logs

## Motivation

Users running httpd want to collect, transform, and forward metrics to better
observe how their webservers are performing.

## Internal Proposal

I expect to largely copy the existing [prometheus
source](https://github.com/timberio/vector/blob/61e806d01d4cc6d2a527b52aa9388d4547f1ebc2/src/sources/prometheus/mod.rs)
and modify it to parse the output of the httpd status page which looks like:

```text
localhost
ServerVersion: Apache/2.4.46 (Unix)
ServerMPM: event
Server Built: Aug 5 2020 23:20:17
CurrentTime: Friday, 21-Aug-2020 18:41:34 UTC
RestartTime: Friday, 21-Aug-2020 18:41:08 UTC
ParentServerConfigGeneration: 1
ParentServerMPMGeneration: 0
ServerUptimeSeconds: 26
ServerUptime: 26 seconds
Load1: 0.00
Load5: 0.03
Load15: 0.03
Total Accesses: 30
Total kBytes: 217
Total Duration: 11
CPUUser: .2
CPUSystem: .02
CPUChildrenUser: 0
CPUChildrenSystem: 0
CPULoad: .846154
Uptime: 26
ReqPerSec: 1.15385
BytesPerSec: 8546.46
BytesPerReq: 7406.93
DurationPerReq: .366667
BusyWorkers: 1
IdleWorkers: 74
Processes: 3
Stopping: 0
BusyWorkers: 1
IdleWorkers: 74
ConnsTotal: 1
ConnsAsyncWriting: 0
ConnsAsyncKeepAlive: 0
ConnsAsyncClosing: 0
Scoreboard: ________________________________________________________W__________________.....................................................................................................................................................................................................................................................................................................................................
```
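
As a starting point, the `?auto` output above is line-oriented `key: value`
pairs, so parsing could be as simple as the following sketch (the function name
is hypothetical; only the field names come from the output above):

```rust
use std::collections::HashMap;

/// Parse `mod_status` `?auto` output into a field -> value map.
/// Lines without a `:` (e.g. the leading hostname line) are skipped, and
/// repeated keys such as `BusyWorkers` keep the last value seen.
fn parse_status(body: &str) -> HashMap<String, String> {
    body.lines()
        .filter_map(|line| {
            let (key, value) = line.split_once(':')?;
            Some((key.trim().to_string(), value.trim().to_string()))
        })
        .collect()
}
```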

I'll use this to generate the following metrics:

- `apache_up` (gauge)
- `apache_uptime_seconds_total` (counter)
- `apache_accesses_total` (counter; extended)
- `apache_sent_kilobytes_total` (counter; extended)
- `apache_duration_seconds_total` (counter; extended)
- `apache_cpu_seconds_total{type=(system|user|cpu_children_user|cpu_children_system)}` (gauge; extended)
- `apache_cpu_load` (gauge; extended)
- `apache_workers{state=(busy|idle)}` (gauge)
- `apache_connections{state=(closing|keepalive|writing|total)}` (gauge)
- `apache_scoreboard_waiting{state=(waiting|starting|reading|sending|keepalive|dnslookup|closing|logging|finishing|idle_cleanup|open)}` (gauge; see the scoreboard sketch below)
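
For the `Scoreboard` field in particular, the per-state gauge values would come
from counting the characters `mod_status` documents for each worker state. A
minimal sketch (the helper is hypothetical; the real source would emit Vector's
internal metric events rather than a map):

```rust
use std::collections::HashMap;

/// Count scoreboard characters into the worker states documented by
/// `mod_status` ("_" waiting, "W" sending, "." open slot, and so on).
fn scoreboard_counts(scoreboard: &str) -> HashMap<&'static str, u64> {
    let mut counts: HashMap<&'static str, u64> = HashMap::new();
    for c in scoreboard.chars() {
        let state = match c {
            '_' => "waiting",
            'S' => "starting",
            'R' => "reading",
            'W' => "sending",
            'K' => "keepalive",
            'D' => "dnslookup",
            'C' => "closing",
            'L' => "logging",
            'G' => "finishing",
            'I' => "idle_cleanup",
            '.' => "open",
            _ => continue,
        };
        *counts.entry(state).or_insert(0) += 1;
    }
    counts
}
```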

Metrics labeled `extended` are only available if `ExtendedStatus` is enabled
for Apache. This is the default in newer versions (>= 2.4; released 2012), but
purportedly [increases CPU
load](https://www.datadoghq.com/blog/collect-apache-performance-metrics/#a-note-about-extendedstatus)
so some users may turn it off. If it is off, they simply won't have those
metrics published.

I figure we probably don't want metrics for:

- System Load (should be handled by a `cpu` or similar metrics source)

Metrics will be labeled with:

- `endpoint`: the full endpoint URL (sans any basic auth credentials)
- `host`: the hostname and port portions of the endpoint (see the sketch below)
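
A sketch of deriving those two labels from a configured endpoint, assuming the
`url` crate (the helper name is hypothetical):

```rust
use url::Url;

/// Derive the `endpoint` and `host` label values from a configured endpoint,
/// dropping any basic auth credentials embedded in the URL.
fn endpoint_labels(endpoint: &str) -> Option<(String, String)> {
    let mut url = Url::parse(endpoint).ok()?;
    // Strip userinfo so credentials never end up on metric labels.
    url.set_username("").ok()?;
    url.set_password(None).ok()?;
    let host = match (url.host_str()?, url.port()) {
        (host, Some(port)) => format!("{}:{}", host, port),
        (host, None) => host.to_string(),
    };
    Some((url.to_string(), host))
}
```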

## Doc-level Proposal

Users will be instructed to setup
[`mod_status`](https://httpd.apache.org/docs/2.4/mod/mod_status.html) and
enable
[`ExtendedStatus`](https://httpd.apache.org/docs/2.4/mod/core.html#extendedstatus).

The following additional source configuration will be added:

```toml
[sources.my_source_id]
type = "apache_metrics" # required
endpoints = ["http://localhost/server-status?auto"] # required, default
scrape_interval_secs = 15 # optional, default, seconds
namespace = "apache" # optional, default, namespace to put metrics under
```
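
Internally, that configuration might deserialize into something like the
following (a sketch using `serde`; the struct and default-function names are
assumptions mirroring the TOML above, not the final implementation):

```rust
use serde::Deserialize;

/// Hypothetical configuration struct mirroring the TOML example above.
#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct ApacheMetricsConfig {
    endpoints: Vec<String>,
    #[serde(default = "default_scrape_interval_secs")]
    scrape_interval_secs: u64,
    #[serde(default = "default_namespace")]
    namespace: String,
}

fn default_scrape_interval_secs() -> u64 {
    15
}

fn default_namespace() -> String {
    "apache".to_string()
}
```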

Some possible configuration improvements we could add in the future would be:

- `response_timeout`; to cap request lengths
- `tls`: settings to allow setting specific chains of trust and client certs
- `basic_auth`: to set username/password for use with HTTP basic auth; we'll
allow this to be set in the URL too which will work for now

But I chose to leave those out for now given the prometheus source doesn't
support them either. We could add support to both at the same time (see Future
Work section below).

> **Contributor:** I think the big missing piece here is basic auth - a lot of
> people still use this as security for `mod_status`. Prom just adds it to the
> scrape URL.

> **Member Author:** Agreed, I think we could support this in the same way (via
> the URL), but it'd be useful to add basic auth options for all HTTP-based
> client sources. I propose deferring that until after we refactor them as
> mentioned in "Future Work".

> **Contributor:** Ah - missed that. +1.

[Datadog's
plugin](https://github.com/DataDog/integrations-core/blob/master/apache/datadog_checks/apache/data/conf.yaml.example)
has many more options we could also consider in the future.

The `host` key will be set to the host parsed out of the `endpoint`.

## Rationale

Apache HTTP Server is a fairly common webserver. If we do not support ingesting
metrics from it, it is likely to push people to use another tool to forward
metrics from httpd to the desired sink.

As part of Vector's vision to be the "one tool" for ingesting and shipping
observability data, it makes sense to add as many sources as possible to reduce
the likelihood that a user will not be able to ingest metrics from their tools.

## Prior Art

- [Datadog collection](https://www.datadoghq.com/blog/monitor-apache-web-server-datadog/#set-up-datadogs-apache-integration)
- [Telegraf](https://github.com/influxdata/telegraf/tree/release-1.15/plugins/inputs/apache)

## Drawbacks

- Additional maintenance and integration testing burden of a new source

## Alternatives

### Having users run telegraf and using Vector's prometheus source to scrape it

We could forgo adding the source directly to Vector and instead instruct users
to run Telegraf and point Vector at the exposed Prometheus scrape endpoint.
This would leverage the already supported [telegraf Apache input
plugin](https://github.com/influxdata/telegraf/tree/release-1.15/plugins/inputs/apache).

I decided against this as it would be in contrast with one of the listed
principles of Vector:

> One Tool. All Data. - One simple tool gets your logs, metrics, and traces
> (coming soon) from A to B.

[Vector
principles](https://vector.dev/docs/about/what-is-vector/#who-should-use-vector)

On the same page, it is mentioned that Vector should be a replacement for
Telegraf.

> You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or
> similar tools.

If users are already running Telegraf, though, they could opt for this path.

### Have a generic HTTP scrape source

We could model this as a generic `http_scrape` source that has a `type` (or
`codec`?) option that would determine what type of endpoint it is scraping.

I would shy away from this, as I think there is some risk that some "HTTP
scrape" endpoints will need source-specific configuration. We could do this
later if that does not end up being the case and they really are all the same.

One downside of this is that I think it'd be less discoverable than a
first-class source for each type of endpoint we support scraping.

> **Contributor:** Yeah, I agree with this.
>
> I can also see us (at some point) having "aliased" components that merely defer to some other generic component with a default set of configurations.
>
> In this case, we'd have an `http_scrape` source that is highly configurable, and an `apache` alias source that makes it more discoverable and easier to configure but defers its actual operations to the `http_scrape` source.
>
> Having said that, all of this is irrelevant for this initial source implementation, which I like 👍

> **Contributor:** I believe this is actually what you described in "Future Work", isn't it?

> **Member Author:** The future work is just to do this internally, but I do like your idea of eventually adding an `http_scrape` source that would be more flexible and allow people to scrape arbitrary endpoints. I imagine it might require more configuration which the "aliases" would simply default.

> **Contributor:**
>
> > I can also see us (at some point) having "aliased" components that merely defer to some other generic component with a default set of configurations.
>
> We do this currently. Sources and sinks can wrap other sources and sinks. For example, the `syslog` source wraps the `socket` source.

> **Contributor:** Given this precedent, do we want to create a generic `http_scrape` source and base the `apache` source from this RFC on it, or do we want to start with the latter and transition it to the former as soon as we need another source based on HTTP scraping?
>
> I assume we'll just start with the singular `apache` source, but want to make sure we're on the same page.

> **Member Author:** I think I'd like to defer the `http_scrape` source as I think it'll require more thought and discussion about how to parse the incoming requests (formats? codecs? etc.).
>
> I guess I figured I'd start with this one and the nginx one (#3091) for now and refactor them internally to share a component (along with the prometheus source).


## Outstanding Questions

- Do we want to apply any metric labels based on the other information
  available via the status page? I could see labeling the `url` at least.
  Answer: label with `host` and `endpoint` as described above.
- Do we want to have one `apache_metrics` source able to scrape multiple
  endpoints? Answer: yes, the config has been updated to allow multiple
  endpoints.
- Are there preferences between `apache` or `httpd` for the nomenclature? I
  feel like `apache` is more well-known though `httpd` is more accurate.
  Answer: standardize on `apache`.

> **Contributor:** Just noting, we'll want `apache_metrics` for clarity.

> **Member Author (@jszwedko, Aug 25, 2020):** I actually think that calling it apache_metrics rather than just apache, while making it clearer that the source ingests metrics, also makes it stand out against the other sources, which are not suffixed with their data type. For example, we have statsd rather than statsd_metrics and docker rather than docker_logs. One could argue that statsd is less ambiguously metrics, but docker, in my mind, could be ambiguous: it could be either docker metrics or docker logs (or both).
>
> This is actually standing out to me as something where the precedent will stick around for a while, so it's probably worth having some explicit guidance for source naming. Sources could be one of, more than one of, or all of metrics, logs, and traces. It seems like the vision of Vector is to treat all observability data (logs, metrics, and traces) as first-class, and so it seems like we should reflect that in the component naming and terminology.
>
> I actually kinda like Datadog's model where they just have one apache integration that ingests both metrics and logs from Apache. They configure it via a separate log section: https://github.com/DataDog/integrations-core/blob/master/apache/datadog_checks/apache/data/conf.yaml.example#L326-L347. Granted, Datadog started with metrics and moved to logs, which I think is reflected in the fact that logs is a separate subsection rather than metrics being so. For us, we might want to have metrics/log/trace configuration at the same level. Users forwarding to Datadog may want functionality similar to what dd-agent currently supports.

> **Contributor:** Good points.
>
> > For example, we have statsd rather than statsd_metrics and docker rather than docker_logs
>
> Then we should rename these sources to statsd_metrics and docker_logs or change our model.
>
> > Sources could be one of, more than one of, or all of metrics, logs, and traces. It seems like the vision of Vector is to treat all observability data (logs, metrics, and traces) as first-class and so it seems like we should reflect that in the component naming and terminology.
>
> It's a good point. From a UX perspective, drawing lines around components based on types makes it easier to build pipelines. For example, what happens if you connected a multi-type docker source to a regex_parser that only operates on logs? Setting aside the fact that we can't technically do this, I think it complicates the UX. I am very happy to discuss this change in a separate issue, and then an RFC if we deem it worthy. Would also like other people's thoughts on the matter.

> **Contributor:** I've been going back and forth on this all afternoon. On one hand, having a single source name, docker/apache/etc, references that single point of observability. But I also started thinking "are there circumstances in which I'd only collect metrics from a service and I'd find apache on its own confusing?" Also, I wonder whether having differently named sources might make pipeline configurations more clear?

> **Member:**
>
> > Also, I wonder whether having differently named sources might make pipeline configurations more clear?
>
> This is the main concern for me. Combined sources may seem convenient, but how often do you want to route logs and metrics through the same pipeline?

> **Member Author (@jszwedko, Aug 26, 2020):**
>
> > This is the main concern for me. Combined sources may seem convenient, but how often do you want to route logs and metrics through the same pipeline?
>
> It is a good question. I note that we do have one source (vector), a few sinks (vector, console, blackhole), and a few transforms (filter, lua, remap (soon)) that can handle both logs and metrics, but I can imagine it would be common to need to transform them differently.
>
> I realize this is going off on quite a tangent; we could break this discussion off to a separate issue. I think it is important though.
>
> I currently see 3 approaches:
>
> 1. Call this apache_metrics and move towards renaming all sources/sinks that only handle a single type to have a suffix (like sematext_logs). If/when we add trace as another type, we'd add things like sematext_traces.
> 2. Move towards one component for each "integration type" for sinks/sources and do what I describe below to control the flow of event types through the pipeline when needed.
> 3. Keep doing what we are doing, which seems like a mix of strategies with respect to type handling; defining some sort of "test" to determine when the type suffix should be applied.
>
> I find the current state to be a little confusing. Some source/sink integrations can handle multiple event types (vector, console) and others are split into _metrics and _logs (like datadog_logs/datadog_metrics). The clearest rule I can pull out for deciding is "can the integration handle pulling or pushing both logs and metrics with the same configuration? If yes, no suffix; if no, two components with _logs and _metrics". One issue with that rule is that downstream support for observability types can change, making it difficult to predict if they will add metrics or logs support in the future.
>
> I'd personally appreciate some guidance laid out for component naming to avoid this question cropping up in the future.
>
> As a thought experiment, maybe it is useful to imagine we add metrics support for clickhouse (#3435). Where would that go? A new component (clickhouse_metrics)? If so, would we rename clickhouse to clickhouse_logs? Or would we add it as part of the existing clickhouse sink?
>
> Expanding on approach 2:
>
> One way we could approach that with a single source is to allow transforms and sinks to ask for only certain data types with something like:
>
> ```toml
> [transforms.regex]
> type = "regex_parser"
> inputs = ["mysource.logs"]
> ```
>
> where it would only receive logs from the source and not metrics (or traces). If there was no suffix, it would simply receive both. In the case of a transform that only works with a subset of "observability types", we could error at start-up if there is an invalid pipeline.
>
> Another way to limit the event types, if you only want certain types from the source, would be something like:
>
> ```toml
> [sources.apache]
> type = "apache"
> event_types = ["log", "metric"]
> ```
>
> Perhaps with syntax sugar for one type:
>
> ```toml
> [sources.apache]
> type = "apache.logs"
> ```
>
> I think there is something attractive about one component to observe one "integration" (like apache, nginx, etc.).

> **Member:**
>
> > The clearest rule I can pull out for deciding is "can the integration handle pulling or pushing both logs and metrics with the same configuration?"
>
> I agree we should have a stronger convention here and this is a good start. Right now, we handle multi-type sources by automatically inserting a type filter where necessary in front of downstream components. We should make sure that doesn't lead to confusing scenarios if we start using it more widely.

> **Member Author:**
>
> > Right now, we handle multi-type sources by automatically inserting a type filter where necessary in front of downstream components.
>
> Interesting, I wasn't aware of that. I could see that being a bit confusing if it is implicit.
>
> As for the convention, I'm ok with starting with that one, but it does mean we'll need to deprecate sources/sinks occasionally. I'm imagining, had Vector existed a few years ago, before Datadog had support for logs, we may have just called the sink datadog and then had to deprecate it later in lieu of datadog_metrics when datadog_logs was added.
>
> I'm ok with just calling this apache_metrics for now since, prompted by this discussion, I've observed we already have a mix of suffixed and unsuffixed sinks and sources, so this doesn't seem to make the situation noticeably worse and my concern about precedent is unfounded. I can create a new issue to come up with a convention for component naming. I'm still curious to hear thoughts on separate vs. integrated sources/sinks as well.

> **Contributor:**
>
> > I can create a new issue to come up with a convention for component naming. I'm still curious to hear thoughts on separate vs. integrated sources/sinks as well.
>
> 👍 it's worth a discussion. I'm not leaning strongly in either direction, but as you can see there are scenarios we should discuss before making a decision. I like the simplicity of single, multi-type components from a documentation standpoint, but I want to make sure it doesn't overcomplicate the actual UX.

- Should the `host` key include the port from the `endpoint`, if any? Or just
  the hostname? Answer: include the port.

## Plan Of Attack

Incremental steps that execute this change. Generally this is in the form of:

- [ ] Submit a PR with the initial source implementation

## Future Work

### Refactor HTTP-scraping-based sources

I think one thing that would make sense would be to refactor the sources based
on HTTP scraping to share a base similar to how our sinks that rely on `http`
are factored (`splunk_hec`, `http`, `loki`, etc.). This allows them to share
common configuration options for their behavior.

My recommendation is to implement this and the
[`nginx`](https://github.com/timberio/vector/issues/3091) metrics source and
then figure out where the seams are to pull out an `HttpScrapeSource` module
that could be used by this source, the `nginx` source, and the `prometheus`
source.
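
One possible shape for that shared piece, purely as a sketch (the trait,
struct, and method names are assumptions, not an existing Vector API):

```rust
use std::time::Duration;

/// Hypothetical trait each HTTP-scraping source (apache, nginx, prometheus)
/// could implement, leaving the scrape loop, HTTP client, TLS, and auth
/// handling to a shared driver.
trait HttpScraper {
    /// Metric namespace, e.g. "apache".
    fn namespace(&self) -> &str;
    /// Turn one response body into metric (name, value) pairs.
    fn parse(&self, body: &str) -> Vec<(String, f64)>;
}

/// Shared settings a common scrape driver would own for every such source.
struct HttpScrapeConfig {
    endpoints: Vec<String>,
    interval: Duration,
}
```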