diff --git a/rfcs/2020-08-26-3191-host-metrics.md b/rfcs/2020-08-26-3191-host-metrics.md
new file mode 100644
index 0000000000000..b00e3af634f26
--- /dev/null
+++ b/rfcs/2020-08-26-3191-host-metrics.md
@@ -0,0 +1,206 @@
+# RFC 3191 - 2020-08-26 - Collecting host-based metrics
+
+This RFC proposes a new metrics source to collect host-based metrics. The high-level plan is to implement one (or more?) sources that collect CPU, disk, network, and memory metrics from Linux, Windows, and macOS hosts.
+
+## Scope
+
+This RFC will cover:
+
+- A new source for host-based metrics, specifically:
+  - CPU
+  - Memory
+  - Disk
+  - Network
+- Collection on Linux, Windows, and macOS.
+
+This RFC will not cover:
+
+- Other host metrics.
+- Other platforms.
+
+## Motivation
+
+Users want to collect, transform, and forward metrics to better observe how their hosts are performing.
+
+## Internal Proposal
+
+Build a single source called `host_metrics` (name to be confirmed) to collect host/system-level metrics.
+
+I've found a number of possible Rust-based solutions for implementing this collection cross-platform:
+
+- https://crates.io/crates/heim (Rust)
+- https://lib.rs/crates/sys-info (Rust + C)
+- https://github.com/myfreeweb/systemstat (pure Rust)
+- https://docs.rs/sysinfo/0.3.19/sysinfo/index.html
+
+The Heim crate has a useful comparison doc showing the available tools and their capabilities:
+
+- https://github.com/heim-rs/heim/blob/master/COMPARISON.md
+
+For this implementation we're recommending the Heim crate based on its platform and feature coverage.
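+
+As a rough, non-authoritative sketch (based on the examples in Heim's README;
+exact module and method names may differ between Heim versions), collection
+could look something like this:
+
+```rust
+// Assumed dependencies: `heim` (with its `cpu` and `memory` features enabled)
+// and an async runtime such as tokio. Names follow Heim's README examples and
+// are not guaranteed against any particular release.
+use heim::{cpu, memory, units::{information, time}};
+
+#[tokio::main]
+async fn main() -> Result<(), heim::Error> {
+    // Cumulative CPU time per mode; this would feed `cpu_seconds_total`,
+    // tagged with mode (user, system, idle).
+    let cpu_time = cpu::time().await?;
+    println!("cpu user:   {} s", cpu_time.user().get::<time::second>());
+    println!("cpu system: {} s", cpu_time.system().get::<time::second>());
+    println!("cpu idle:   {} s", cpu_time.idle().get::<time::second>());
+
+    // Physical memory usage; this would feed the `memory_*_bytes` gauges.
+    let mem = memory::memory().await?;
+    println!("memory total:     {} B", mem.total().get::<information::byte>());
+    println!("memory available: {} B", mem.available().get::<information::byte>());
+
+    Ok(())
+}
+```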
+
+We'd use Heim to collect the following metrics:
+
+- `cpu_seconds_total` tagged with mode (idle, nice, system, user) and CPU (counter)
+- `disk_read_bytes_total` tagged with the disk (counter)
+- `disk_read_errors_total` tagged with the disk (counter)
+- `disk_read_retries_total` tagged with the disk (counter)
+- `disk_read_sectors_total` tagged with the disk (counter)
+- `disk_read_time_seconds_total` tagged with the disk (counter)
+- `disk_reads_completed_total` tagged with the disk (counter)
+- `disk_write_errors_total` tagged with the disk (counter)
+- `disk_write_retries_total` tagged with the disk (counter)
+- `disk_write_time_seconds_total` tagged with the disk (counter)
+- `disk_writes_completed_total` tagged with the disk (counter)
+- `disk_written_bytes_total` tagged with the disk (counter)
+- `disk_written_sectors_total` tagged with the disk (counter)
+- `filesystem_avail_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
+- `filesystem_device_error` tagged with the device, filesystem type, and mountpoint (gauge)
+- `filesystem_free_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
+- `filesystem_size_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
+- `filesystem_total_file_nodes` tagged with the device, filesystem type, and mountpoint (gauge)
+- `filesystem_free_file_nodes` tagged with the device, filesystem type, and mountpoint (gauge)
+- `load1`, the 1-minute load average (gauge)
+- `load5`, the 5-minute load average (gauge)
+- `load15`, the 15-minute load average (gauge)
+- `memory_active_bytes` (gauge)
+- `memory_compressed_bytes` (gauge)
+- `memory_free_bytes` (gauge)
+- `memory_inactive_bytes` (gauge)
+- `memory_swap_total_bytes` (gauge)
+- `memory_swap_used_bytes` (gauge)
+- `memory_swapped_in_bytes_total` (counter)
+- `memory_swapped_out_bytes_total` (counter)
+- `memory_total_bytes` (gauge)
+- `memory_wired_bytes` (gauge)
+- `network_receive_bytes_total` tagged with device (counter)
+- `network_receive_errs_total` tagged with device (counter)
+- `network_receive_multicast_total` tagged with device (counter)
+- `network_receive_packets_total` tagged with device (counter)
+- `network_transmit_bytes_total` tagged with device (counter)
+- `network_transmit_errs_total` tagged with device (counter)
+- `network_transmit_multicast_total` tagged with device (counter)
+- `network_transmit_packets_total` tagged with device (counter)
+
+Users should be able to limit the collection of metrics to specific classes: `cpu`, `memory`, `disk`, `filesystem`, `load`, and `network`.
+
+Metrics will also be tagged with:
+
+- `host`: the hostname of the host being monitored.
+
+And with `collector`, indicating the class of metric:
+
+- `disk` for disk-based metrics
+- `cpu` for CPU-based metrics
+- `filesystem` for filesystem-based metrics
+- `memory` for memory-based metrics
+- `load` for load-based metrics
+- `network` for network-based metrics
+
+Specific explanations of some of the filesystem metrics:
+
+- `filesystem_avail_bytes` = Filesystem space available to non-root users in bytes (excluding reserved blocks).
+- `filesystem_free_bytes` = Filesystem free space in bytes (including reserved blocks).
+- `filesystem_size_bytes` = Filesystem size in bytes.
+
+(Reserved blocks are usable only by root, which is why `free` can exceed `avail`.)
+
+## Doc-level Proposal
+
+The following additional source configuration will be added:
+
+```toml
+[sources.my_source_id]
+  type = "host_metrics" # required
+  collectors = [ "all" ] # optional, defaults to collecting all metrics
+  filesystem.mountpoints = [ "*" ] # optional, defaults to all mountpoints
+  disk.devices = [ "*" ] # optional, defaults to all disk devices
+  network.devices = [ "*" ] # optional, defaults to all network devices
+  scrape_interval_secs = 15 # optional, default 15, in seconds
+  namespace = "host" # optional, defaults to "host"; the namespace to put metrics under
+```
+
+For `collectors` (name open to discussion) we'd specify an array of metric collectors:
+
+```toml
+collectors = [ "cpu", "memory", "network" ]
+```
+
+For disk and network devices, and for filesystem mountpoints, the default is to collect for all (`"*"`) of them. Alternatively, you can configure Vector to collect only from specific devices, for example:
+
+```toml
+network.devices = [ "eth0" ]
+```
+
+Or, if we decide it's feasible, using globbing like so:
+
+```toml
+network.devices = [ "eth*" ]
+```
+
+And, if feasible, we'd add syntax for excluding resources.
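+
+To make the proposed shape concrete, below is a hypothetical sketch of how the
+source's options could be modeled with serde. All names are illustrative rather
+than final, and the glob matching simply delegates to the `glob` crate:
+
+```rust
+// Assumed dependencies: serde (with the `derive` feature), toml, and glob.
+use serde::Deserialize;
+
+#[derive(Debug, Deserialize)]
+#[serde(default)]
+struct HostMetricsConfig {
+    collectors: Vec<Collector>,   // e.g. [ "cpu", "memory" ]; defaults to all
+    filesystem: FilterConfig,     // filesystem.mountpoints = [ "*" ]
+    disk: FilterConfig,           // disk.devices = [ "*" ]
+    network: FilterConfig,        // network.devices = [ "*" ]
+    scrape_interval_secs: u64,    // defaults to 15
+    namespace: String,            // defaults to "host"
+}
+
+#[derive(Debug, Deserialize)]
+#[serde(rename_all = "snake_case")]
+enum Collector {
+    All,
+    Cpu,
+    Disk,
+    Filesystem,
+    Load,
+    Memory,
+    Network,
+}
+
+#[derive(Debug, Default, Deserialize)]
+#[serde(default)]
+struct FilterConfig {
+    devices: Vec<String>,      // used by `disk` and `network`
+    mountpoints: Vec<String>,  // used by `filesystem`
+}
+
+impl FilterConfig {
+    /// Globbed inclusion check, e.g. `devices = [ "eth*" ]` matches "eth0".
+    fn matches_device(&self, device: &str) -> bool {
+        self.devices.iter().any(|pat| {
+            glob::Pattern::new(pat).map(|p| p.matches(device)).unwrap_or(false)
+        })
+    }
+}
+
+impl Default for HostMetricsConfig {
+    fn default() -> Self {
+        Self {
+            collectors: vec![Collector::All],
+            filesystem: FilterConfig::default(),
+            disk: FilterConfig::default(),
+            network: FilterConfig::default(),
+            scrape_interval_secs: 15,
+            namespace: "host".into(),
+        }
+    }
+}
+
+fn main() {
+    // Keys omitted from the TOML fall back to the defaults above.
+    let cfg: HostMetricsConfig =
+        toml::from_str(r#"network.devices = [ "eth*" ]"#).unwrap();
+    assert!(cfg.network.matches_device("eth0"));
+}
+```
+
+Globbed includes fall out of this directly; exclusions (if we add them) could be
+layered on with a negation prefix or a separate `exclude_devices` list.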
+
+We'd also add a guide for running this source on Docker (similar to https://github.com/prometheus/node_exporter#using-docker).
+
+## Rationale
+
+CPU, memory, disk, and network metrics are the basic building blocks of host-based monitoring. They are considered "table stakes" for most metrics-based monitoring of hosts. Additionally, if we do not support ingesting these metrics, users are likely to reach for another tool.
+
+As part of Vector's vision to be the "one tool" for ingesting and shipping
+observability data, it makes sense to add as many sources as possible to reduce
+the likelihood that a user will not be able to ingest metrics from their tools.
+
+## Prior Art
+
+- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu
+- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem
+- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/disk
+- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/net
+- https://github.com/elastic/beats/tree/master/metricbeat
+- https://github.com/prometheus/node_exporter
+- https://docs.fluentbit.io/manual/pipeline/inputs/cpu-metrics
+- https://docs.fluentbit.io/manual/pipeline/inputs/disk-io-metrics
+- https://docs.fluentbit.io/manual/pipeline/inputs/memory-metrics
+- https://docs.fluentbit.io/manual/pipeline/inputs/network-io-metrics
+
+## Drawbacks
+
+- Additional maintenance and integration-testing burden of a new source.
+
+## Alternatives
+
+### Having users run Telegraf or Prometheus' node exporter and scraping it with Vector's `prometheus` source
+
+We could forgo adding the source directly to Vector and instead instruct users
+to run Telegraf or Prometheus' node exporter and point Vector at the exposed
+Prometheus scrape endpoint. This would leverage the already-supported inputs
+from those projects.
+
+I decided against this as it would conflict with one of the listed principles
+of Vector:
+
+> One Tool. All Data. - One simple tool gets your logs, metrics, and traces
+> (coming soon) from A to B.
+
+[Vector principles](https://vector.dev/docs/about/what-is-vector/#who-should-use-vector)
+
+On the same page, it is mentioned that Vector should be a replacement for
+Telegraf:
+
+> You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or
+> similar tools.
+
+If users are already running Telegraf or node exporter, though, they can still
+opt for this path.
+
+## Outstanding Questions
+
+- One source or many? That is, should we have a single `host_metrics` source, or separate `cpu_metrics`, `mem_metrics`, `disk_metrics`, `load_metrics`, etc. sources?
+
+## Plan Of Attack
+
+Incremental steps that execute this change. Generally this is in the form of:
+
+- [ ] Submit a PR with the initial source implementation
+
+## Future Work
+
+- Extend the source to collect additional system-level metrics.
+- Identify additional potential platforms.