chore: Draft host-metrics RFC #3581
# RFC 3191 - 2020-08-26 - Collecting host-based metrics

This RFC introduces a new metrics source to consume host-based metrics. The high-level plan is to implement one (or more?) sources that collect CPU, disk, network, and memory metrics from Linux, Windows, and macOS hosts.

## Scope

This RFC will cover:

- A new source for host-based metrics, specifically:
  - CPU
  - Memory
  - Disk
  - Network
- Collection on Linux, Windows, and macOS.

This RFC will not cover:

- Other host metrics.
- Other platforms.

## Motivation

Users want to collect, transform, and forward metrics to better observe how their hosts are performing.

## Internal Proposal

Build a single source called `host_metrics` (name to be confirmed) to collect host/system-level metrics.

I've found a number of possible Rust-based solutions for implementing this collection cross-platform:

- https://crates.io/crates/heim (Rust)
- https://lib.rs/crates/sys-info (Rust + C)
- https://github.com/myfreeweb/systemstat (pure Rust)
- https://docs.rs/sysinfo/0.3.19/sysinfo/index.html

The heim crate has a useful comparison doc showing the available tools and their capabilities:

- https://github.com/heim-rs/heim/blob/master/COMPARISON.md

For this implementation we're recommending the heim crate, based on its platform and feature coverage.

We'd use one of these crates to collect the following metrics:

- `cpu_seconds_total` tagged with mode (idle, nice, system, user) and CPU (counter)
- `disk_read_bytes_total` tagged with the disk (counter)
- `disk_read_errors_total` tagged with the disk (counter)
- `disk_read_retries_total` tagged with the disk (counter)
- `disk_read_sectors_total` tagged with the disk (counter)
- `disk_read_time_seconds_total` tagged with the disk (counter)
- `disk_reads_completed_total` tagged with the disk (counter)
- `disk_write_errors_total` tagged with the disk (counter)
- `disk_write_retries_total` tagged with the disk (counter)
- `disk_write_time_seconds_total` tagged with the disk (counter)
- `disk_writes_completed_total` tagged with the disk (counter)
- `disk_written_bytes_total` tagged with the disk (counter)
- `disk_written_sectors_total` tagged with the disk (counter)
- `filesystem_avail_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_device_error` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_free_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_size_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_total_file_nodes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_free_file_nodes` tagged with the device, filesystem type, and mountpoint (gauge)
- `load1` (gauge)
- `load5` (gauge)
- `load15` (gauge)
- `memory_active_bytes` (gauge)
- `memory_compressed_bytes` (gauge)
- `memory_free_bytes` (gauge)
- `memory_inactive_bytes` (gauge)
- `memory_swap_total_bytes` (gauge)
- `memory_swap_used_bytes` (gauge)
- `memory_swapped_in_bytes_total` (gauge)
- `memory_swapped_out_bytes_total` (gauge)
- `memory_total_bytes` (gauge)
- `memory_wired_bytes` (gauge)
- `network_receive_bytes_total` tagged with device (counter)
- `network_receive_errs_total` tagged with device (counter)
- `network_receive_multicast_total` tagged with device (counter)
- `network_receive_packets_total` tagged with device (counter)
- `network_transmit_bytes_total` tagged with device (counter)
- `network_transmit_errs_total` tagged with device (counter)
- `network_transmit_multicast_total` tagged with device (counter)
- `network_transmit_packets_total` tagged with device (counter)

Users should be able to limit the collection of metrics to specific classes: `cpu`, `memory`, `disk`, `filesystem`, `load`, and `network`.

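Resolving the configured classes into an enabled set could look something like the following minimal sketch. The function name `enabled_collectors`, the `ALL_COLLECTORS` constant, and the `"all"` shorthand handling are assumptions for illustration, not the final implementation:

```rust
use std::collections::HashSet;

// The metric classes proposed above. (Names taken from this RFC; the
// constant itself is a made-up helper.)
const ALL_COLLECTORS: &[&str] = &["cpu", "memory", "disk", "filesystem", "load", "network"];

// Hypothetical: expand the user's `collectors` setting into the set of
// classes to actually collect, treating "all" as a catch-all.
fn enabled_collectors(configured: &[&str]) -> HashSet<String> {
    if configured.contains(&"all") {
        ALL_COLLECTORS.iter().map(|s| s.to_string()).collect()
    } else {
        configured.iter().map(|s| s.to_string()).collect()
    }
}
```
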
Metrics will also be tagged with:

- `host`: the host name of the host being monitored.

And with `collector`, identifying the class of metric:

- `disk` for disk-based metrics.
- `cpu` for CPU-based metrics.
- `filesystem` for filesystem-based metrics.
- `memory` for memory-based metrics.
- `load` for load-based metrics.
- `network` for network-based metrics.

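To illustrate, assembling a metric's tag set might look like this minimal sketch, where `host` and `collector` are always present and device-scoped metrics (disk, network) also carry a `device` tag. `metric_tags` is a made-up helper, not Vector's API:

```rust
use std::collections::BTreeMap;

// Hypothetical: build the tag map for a single metric. Every metric gets
// `host` and `collector`; device-scoped metrics also get `device`.
fn metric_tags(host: &str, collector: &str, device: Option<&str>) -> BTreeMap<String, String> {
    let mut tags = BTreeMap::new();
    tags.insert("host".to_string(), host.to_string());
    tags.insert("collector".to_string(), collector.to_string());
    if let Some(dev) = device {
        tags.insert("device".to_string(), dev.to_string());
    }
    tags
}
```
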
Specific explanation of some of the filesystem metrics:

- `filesystem_avail_bytes` = Filesystem space available to non-root users in bytes (excluding reserved blocks).
- `filesystem_free_bytes` = Filesystem free space in bytes (including reserved blocks).
- `filesystem_size_bytes` = Filesystem size in bytes.

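These three gauges map naturally onto `statvfs(3)`-style fields: `avail` derives from the blocks available to non-root users, `free` from all free blocks (reserved included), and `size` from the total block count. A minimal sketch with made-up values — the struct and field names are assumptions mirroring `statvfs`, not Vector's types:

```rust
// Hypothetical mirror of the statvfs(3) fields these gauges derive from.
struct FsStats {
    blocks: u64,        // f_blocks: total blocks on the filesystem
    blocks_free: u64,   // f_bfree: free blocks (including reserved)
    blocks_avail: u64,  // f_bavail: free blocks available to non-root users
    fragment_size: u64, // f_frsize: fundamental block size in bytes
}

fn filesystem_size_bytes(s: &FsStats) -> u64 {
    s.blocks * s.fragment_size
}

fn filesystem_free_bytes(s: &FsStats) -> u64 {
    s.blocks_free * s.fragment_size
}

fn filesystem_avail_bytes(s: &FsStats) -> u64 {
    s.blocks_avail * s.fragment_size
}
```
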
## Doc-level Proposal

The following additional source configuration will be added:

```toml
[sources.my_source_id]
type = "host_metrics" # required
collectors = [ "all" ] # optional, defaults to collecting all metrics.
filesystem.mountpoints = [ "*" ] # optional, defaults to all mountpoints.
disk.devices = [ "*" ] # optional, defaults to all disk devices.
network.devices = [ "*" ] # optional, defaults to all network devices.
scrape_interval_secs = 15 # optional, default 15, in seconds.
namespace = "host" # optional, defaults to "host"; namespace to put metrics under.
```

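The defaults above could be captured in a config struct along these lines. This is a sketch only — the struct and field names are assumed from the TOML above, not taken from Vector's source:

```rust
// Hypothetical config struct mirroring the proposed TOML options.
#[derive(Debug, Clone)]
struct HostMetricsConfig {
    collectors: Vec<String>,
    filesystem_mountpoints: Vec<String>,
    disk_devices: Vec<String>,
    network_devices: Vec<String>,
    scrape_interval_secs: u64,
    namespace: String,
}

impl Default for HostMetricsConfig {
    // Defaults as proposed: collect everything, everywhere, every 15s,
    // namespaced under "host".
    fn default() -> Self {
        Self {
            collectors: vec!["all".to_string()],
            filesystem_mountpoints: vec!["*".to_string()],
            disk_devices: vec!["*".to_string()],
            network_devices: vec!["*".to_string()],
            scrape_interval_secs: 15,
            namespace: "host".to_string(),
        }
    }
}
```
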
For `collectors` (name open to discussion) we'd specify an array of metric collectors:

```toml
collectors = [ "cpu", "memory", "network" ]
```

For disk and network devices, or filesystem mountpoints, the default is to collect for all (`"*"`) devices and mountpoints. Alternatively, you can configure Vector to only collect from specific devices, for example:

```toml
network.devices = [ "eth0" ]
```

Or, if we think it's feasible, using globbing like so:

```toml
network.devices = [ "eth*" ]
```

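Trailing-wildcard matching of this kind is simple to implement. A minimal sketch, assuming only a trailing `*` (plus the bare `"*"` catch-all) needs to be supported — `device_matches` is a made-up helper:

```rust
// Hypothetical: match a device name against a pattern where `*` is only
// supported as a trailing wildcard (or as the bare catch-all "*").
fn device_matches(pattern: &str, device: &str) -> bool {
    if pattern == "*" {
        return true; // catch-all: every device matches
    }
    match pattern.strip_suffix('*') {
        Some(prefix) => device.starts_with(prefix), // e.g. "eth*" matches "eth0"
        None => pattern == device,                  // no wildcard: exact match
    }
}
```
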
- We'd also add a guide for doing this on Docker (similar to https://github.com/prometheus/node_exporter#using-docker).

## Rationale

CPU, memory, disk, and network metrics are the basic building blocks of host-based monitoring. They are considered "table stakes" for most metrics-based monitoring of hosts. Additionally, if we do not support ingesting these metrics, we are likely to push users toward another tool.

As part of Vector's vision to be the "one tool" for ingesting and shipping observability data, it makes sense to add as many sources as possible to reduce the likelihood that a user will not be able to ingest metrics from their tools.

## Prior Art

- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/disk
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/net
- https://github.com/elastic/beats/tree/master/metricbeat
- https://github.com/prometheus/node_exporter
- https://docs.fluentbit.io/manual/pipeline/inputs/cpu-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/disk-io-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/memory-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/network-io-metrics

## Drawbacks

- Additional maintenance and integration testing burden of a new source.

## Alternatives

### Having users run Telegraf or the Prometheus node exporter and using Vector's `prometheus` source to scrape it

Instead of adding the source directly to Vector, we could instruct users to run Telegraf or Prometheus' node exporter and point Vector at the exposed Prometheus scrape endpoint. This would leverage the already-supported inputs from those projects.

I decided against this as it would be in contrast with one of the listed principles of Vector:

> One Tool. All Data. - One simple tool gets your logs, metrics, and traces (coming soon) from A to B.

[Vector principles](https://vector.dev/docs/about/what-is-vector/#who-should-use-vector)

On the same page, it is mentioned that Vector should be a replacement for Telegraf:

> You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or similar tools.

If users are already running Telegraf or Node Exporter, though, they could opt for this path.

## Outstanding Questions

- One source or many? Should we have a single `host_metrics` source, or separate `cpu_metrics`, `mem_metrics`, `disk_metrics`, and `load_metrics` sources?

## Plan Of Attack

Incremental steps that execute this change. Generally this is in the form of:

- [ ] Submit a PR with the initial source implementation

## Future Work

- Extend the source to collect additional system-level metrics.
- Identify additional potential platforms.