chore: Draft host-metrics RFC #3581

Merged Aug 29, 2020 (14 commits)

rfcs/2020-08-26-3191-host-metrics.md: 169 additions, 0 deletions
# RFC 3191 - 2020-08-26 - Collecting host-based metrics

This RFC is to introduce a new metrics source to consume host-based metrics. The high-level plan is to implement one (or more?) sources that collect CPU, disk, and memory metrics from Linux-based hosts.

## Scope

This RFC will cover:

- A new source for host-based metrics, specifically:
- CPU
- Memory
- Disk
- Collection on Linux, Windows, and OSX.

This RFC will not cover:

- Other host metrics.
- Other platforms.

## Motivation

Users want to collect, transform, and forward metrics to better observe how their hosts are performing.

## Internal Proposal

Build a single source called `host_metrics` (name to be confirmed) to collect host/system level metrics.

> **Contributor:** I'm 👍 on `host_metrics` unless others disagree. An alternative could be `node_metrics`, but that seems less precise to me. I'm curious why Prometheus adopted this nomenclature.


I've found a number of possible Rust-based solutions for implementing this collection, cross-platform.

- https://crates.io/crates/heim (Rust)

> **Member:** heim looks very promising and would likely be a good starting point.

- https://lib.rs/crates/sys-info (Rust + C)
- https://github.com/myfreeweb/systemstat (pure Rust)
- https://docs.rs/sysinfo/0.3.19/sysinfo/index.html

The heim crate has a useful comparison document showing the available tools and their capabilities:

- https://github.com/heim-rs/heim/blob/master/COMPARISON.md

For this implementation we recommend the heim crate, based on its platform and feature coverage.
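
For a rough sense of the ergonomics, a minimal sketch of reading memory statistics with heim might look like the following. This assumes heim's pre-1.0 API (which has shifted between releases) and is illustrative only, not Vector's actual implementation:

```rust
// Illustrative sketch only, not Vector's implementation. Assumes heim's
// pre-1.0 API (e.g. 0.0.x); module paths and types may differ by version.
use heim::memory;
use heim::units::information::byte;

fn main() -> Result<(), heim::Error> {
    // heim's futures are executor-agnostic; block_on suffices for a one-shot read.
    futures::executor::block_on(async {
        let mem = memory::memory().await?;
        println!("memory_total_bytes {}", mem.total().get::<byte>());
        println!("memory_free_bytes {}", mem.free().get::<byte>());
        Ok(())
    })
}
```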

We'd use one of these to collect the following metrics:

- `cpu_seconds_total` by mode (idle, nice, system, user) per CPU. (counter)
- `disk_read_bytes_total` by disk (counter)
- `disk_read_errors_total` by disk (counter)
- `disk_read_retries_total` by disk (counter)
- `disk_read_sectors_total` by disk (counter)
- `disk_read_time_seconds_total` by disk (counter)
- `disk_reads_completed_total` by disk (counter)
- `disk_write_errors_total` by disk (counter)
- `disk_write_retries_total` by disk (counter)
- `disk_write_time_seconds_total` by disk (counter)
- `disk_writes_completed_total` by disk (counter)
- `disk_written_bytes_total` by disk (counter)
- `disk_written_sectors_total` by disk (counter)
- `filesystem_avail_bytes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_device_error` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_total_file_nodes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_free_file_nodes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_free_bytes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_size_bytes` labeled with device, filesystem type, and mountpoint (gauge)

> **Member:** What is the distinction between avail and free or size? Also, there seems to be an inconsistency between avail/free/size for bytes, but free/total for file nodes.

> **Contributor (Author):** I've drawn these from Prometheus' approach specifically:
>
> - `filesystem_avail_bytes` = Filesystem space available to non-root users in bytes.
> - `filesystem_free_bytes` = Filesystem free space in bytes.
> - `filesystem_size_bytes` = Filesystem size in bytes.
>
> See https://www.robustperception.io/filesystem-metrics-from-the-node-exporter. I'll add an explainer to the RFC.

> **Member:** My comment about the inconsistency was to suggest, for example, that we could use avail/free/total for both:
>
> - `filesystem_avail_bytes`
> - `filesystem_free_bytes`
> - `filesystem_total_bytes`
> - `filesystem_avail_file_nodes`
> - `filesystem_free_file_nodes`
> - `filesystem_total_file_nodes`
>
> Unless, of course, there is a preference for sticking to Prometheus' terminology.

> **Contributor (Author):** That's an interesting question. I replicated Prometheus' naming because a lot of folks are familiar with it. Also, if folks have dashboards/alerts/etc. that calculate things like free space on a filesystem, they don't need to rewrite them with new metric names.

> **Member (@jszwedko, Aug 27, 2020):** 🤔 I think the naming you (and Prometheus) have feels natural to me. I'm used to referring to filesystems as having a size, probably because of my association with a physical disk and the fact that files themselves are referred to as having "sizes" (and not "total bytes"). Inodes, on the other hand, feel more like a discrete, countable thing, so total makes sense to me for the cap.
>
> I can see an argument for consistency though.

- `load1` (gauge)
- `load5` (gauge)
- `load15` (gauge)
- `memory_active_bytes` (gauge)
- `memory_compressed_bytes` (gauge)
- `memory_free_bytes` (gauge)
- `memory_inactive_bytes` (gauge)
- `memory_swap_total_bytes` (gauge)
- `memory_swap_used_bytes` (gauge)
- `memory_swapped_in_bytes_total` (gauge)
- `memory_swapped_out_bytes_total` (gauge)
- `memory_total_bytes` (gauge)
- `memory_wired_bytes` (gauge)
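
As a hedged sketch of gathering the per-CPU `cpu_seconds_total` modes listed above via heim (again assuming heim's pre-1.0 streaming API; platform-specific modes such as `nice` are omitted, and this is not Vector's actual code):

```rust
// Illustrative sketch only. Assumes heim 0.0.x, where cpu::times() yields
// a stream with one CpuTime per logical CPU; platform-specific modes omitted.
use futures::{pin_mut, stream::StreamExt};
use heim::cpu;
use heim::units::time::second;

fn main() -> Result<(), heim::Error> {
    futures::executor::block_on(async {
        let times = cpu::times();
        pin_mut!(times);
        let mut cpu_id = 0;
        while let Some(time) = times.next().await {
            let time = time?;
            // Emit one sample per (cpu, mode) pair, as in the proposed metric.
            println!("cpu_seconds_total{{cpu=\"{}\",mode=\"user\"}} {}", cpu_id, time.user().get::<second>());
            println!("cpu_seconds_total{{cpu=\"{}\",mode=\"system\"}} {}", cpu_id, time.system().get::<second>());
            println!("cpu_seconds_total{{cpu=\"{}\",mode=\"idle\"}} {}", cpu_id, time.idle().get::<second>());
            cpu_id += 1;
        }
        Ok(())
    })
}
```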

Users should be able to limit the collection of metrics to specific classes, here: `cpu`, `memory`, `disk/filesystem`, `load`.

Metrics will also be labeled with:

- `host`: the hostname of the host being monitored.

> **Member:** I think we'll probably need some more labels for some of the metrics to be useful:
>
> - `device` for disk-based metrics
> - `cpu` for CPU-based metrics
> - `filesystem` for filesystem-based metrics

> **Member:** Sorry, I'm realizing I was unclear. I think you've addressed this above though with the update to the metrics list. I had meant, for example, that disk-based metrics should be labeled (or tagged; not sure which term we prefer) with the device the metric is associated with (e.g. `device=/dev/sda`).
>
> I think I like these "collector" labels too though. I think they'd be better as simply `collector`, so you'd have things like `collector=disk`.

> **Contributor (Author):** Ah, yeah, this is a terminology mismatch. I'll update to say "tagged" and make it clearer.

- `disk` for disk-based metrics.
- `cpu` for CPU-based metrics.
- `filesystem` for filesystem-based metrics.
- `memory` for memory-based metrics.
- `load` for load-based metrics.

## Doc-level Proposal

The following additional source configuration will be added:

```toml
[sources.my_source_id]
type = "host_metrics" # required
collecting = [ "all" ] # optional, defaults to collecting all metrics.
filesystem.mountpoints = [ "all" ] # optional, defaults to all mountpoints.
disk.devices = [ "all" ] # optional, defaults to all disk devices.
scrape_interval_secs = 15 # optional, default, in seconds
namespace = "host" # optional, default is "host"; the namespace to put metrics under
```

> **Member:** I'd recommend against magic words within the list to represent collecting all. We could:
>
> 1. not allow wildcard configuration like that, and just say the default is all metrics;
> 2. use an actual wildcard to represent all, like `"*"`, which may be useful to allow globs like `"filesystem_read*"`; or
> 3. use a custom serializer to allow either `"ALL"` outside of the list for all, or a list to enumerate items (i.e. `collecting = "ALL"` vs `collecting = [ "a", "b" ]`).

> **Contributor (Author):** I like the globbing. +1.

For `collecting` (name open to discussion) we'd specify an array of metric collections:

```toml
collecting = [ "cpu", "memory" ]
```
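
To make the proposed shape concrete, here's a hypothetical sketch of deserializing this configuration with serde. The struct and field names mirror the TOML keys above; they are placeholders, not Vector's actual config types:

```rust
// Hypothetical sketch of the proposed configuration; not Vector's actual code.
// Requires the `serde` (with derive) and `toml` crates.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct HostMetricsConfig {
    #[serde(default = "default_collecting")]
    collecting: Vec<String>, // e.g. ["cpu", "memory"]
    #[serde(default = "default_scrape_interval_secs")]
    scrape_interval_secs: u64,
    #[serde(default = "default_namespace")]
    namespace: String,
    #[serde(default)]
    filesystem: FilesystemConfig,
    #[serde(default)]
    disk: DiskConfig,
}

#[derive(Debug, Default, Deserialize)]
struct FilesystemConfig {
    mountpoints: Option<Vec<String>>, // None = all mountpoints
}

#[derive(Debug, Default, Deserialize)]
struct DiskConfig {
    devices: Option<Vec<String>>, // None = all devices
}

fn default_collecting() -> Vec<String> { vec!["all".into()] }
fn default_scrape_interval_secs() -> u64 { 15 }
fn default_namespace() -> String { "host".into() }

fn main() {
    let config: HostMetricsConfig =
        toml::from_str(r#"collecting = ["cpu", "memory"]"#).unwrap();
    println!("{:?}", config);
}
```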

- We'd also add a guide for running this source on Docker (similar to https://github.com/prometheus/node_exporter#using-docker).

## Rationale

CPU, Memory, and Disk are the basic building blocks of host-based monitoring. They are considered "table stakes" for most metrics-based monitoring of hosts. Additionally, if we do not support collecting these metrics, we are likely to push people to use another tool.

As part of Vector's vision to be the "one tool" for ingesting and shipping
observability data, it makes sense to add as many sources as possible to reduce
the likelihood that a user will not be able to ingest metrics from their tools.

## Prior Art

- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/disk
- https://github.com/elastic/beats/tree/master/metricbeat
- https://github.com/prometheus/node_exporter
- https://docs.fluentbit.io/manual/pipeline/inputs/cpu-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/disk-io-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/memory-metrics

## Drawbacks

- Additional maintenance and integration testing burden of a new source

## Alternatives

### Having users run Telegraf or Prometheus' node exporter and using Vector's `prometheus` source to scrape it

Instead of adding the source directly to Vector, we could instruct users to run
Telegraf or Prometheus' node exporter and point Vector at the exposed Prometheus scrape endpoint. This would leverage the already-supported inputs from those projects.

I decided against this as it would be in contrast with one of the listed
principles of Vector:

> One Tool. All Data. - One simple tool gets your logs, metrics, and traces
> (coming soon) from A to B.

[Vector principles](https://vector.dev/docs/about/what-is-vector/#who-should-use-vector)

On the same page, it is mentioned that Vector should be a replacement for
Telegraf.

> You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or
> similar tools.

If users are already running Telegraf or Node Exporter though, they could opt for this path.

## Outstanding Questions

- One source or many? Should we have `host_metrics` or `cpu_metrics`, `mem_metrics`, `disk_metrics`, or `load_metrics`?

> **Member:** I think this is the big one.
>
> I personally like your approach of a single source though. I think we could allow for fine-grained control over the metric "families" via the table-based TOML configuration, like:
>
>     [sources.my_source_id]
>     type = "host_metrics"
>
>     disk.devices = ["/dev/sda"]
>
>     filesystem.mountpoints = ["/home"]
>
> to exclude trying to collect all of them (though you could also do this in a filter transform).

> **Member:** From the point of view of user experience, I think a single source is better. It will end up being large in terms of source code, but we can manage that with sub-modules internally.

> **Member:** Definitely agree on having a single source.


## Plan Of Attack

Incremental steps that execute this change. Generally this is in the form of:

- [ ] Submit a PR with the initial source implementation

## Future Work

- Extend source to collect additional system-level metrics.
- Identify additional potential platforms.