chore: Draft host-metrics RFC #3581
Signed-off-by: James Turnbull <james@lovedthanlost.net>
🎉 I think this will be a very useful component.
Metrics will also be labeled with:

- `host`: the host name of the host being monitored.
I think we'll probably need some more labels for some of the metrics to be useful:
- `device` for disk based metrics
- `cpu` for CPU based metrics
- `filesystem` for filesystem based metrics
Sorry, I'm realizing I was unclear. I think you've addressed this above though with the update to the metrics list. I had meant, for example, that disk based metrics should be labeled (or tagged; not sure which term we prefer) with the device the metric is associated with (e.g. `device=/dev/sda`).
I think I like these "collector" labels too though. I think they'd be better as simply `collector` so you'd have things like `collector=disk`.
Ah. Yeah - this is a terminology mismatch. I'll update to say tagged and make it clearer.
## Outstanding Questions

- One source or many? Should we have `host_metrics` or `cpu_metrics`, `mem_metrics`, `disk_metrics`, or `load_metrics`?
I think this is the big one.
I personally like your approach of a single source though. I think we could allow for fine grained control over the metric "families" via the table-based TOML configuration; like:
[sources.my_source_id]
type = "host_metrics"
disk.devices = ["/dev/sda"]
filesystem.mountpoints = ["/home"]
That would avoid collecting all of them (though you could also do this in a `filter` transform).
From the point of view of user experience, I think a single source is better. It will end up being large in terms of source code, but we can manage that with sub-modules internally.
Definitely agree on having a single source
@jszwedko Thanks! Updated to reflect your feedback.
rfcs/2020-08-26-3191-host-metrics.md
Outdated
- `filesystem_avail_bytes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_device_error` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_total_file_nodes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_free_file_nodes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_free_bytes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_size_bytes` labeled with device, filesystem type, and mountpoint (gauge)
What is the distinction between `avail` and `free` or `size`? Also, there seems to be an inconsistency between avail/free/size for bytes, but free/total for file nodes.
I've drawn these from Prometheus' approach specifically.
- `filesystem_avail_bytes` = Filesystem space available to non-root users in bytes.
- `filesystem_free_bytes` = Filesystem free space in bytes.
- `filesystem_size_bytes` = Filesystem size in bytes.

See https://www.robustperception.io/filesystem-metrics-from-the-node-exporter. I'll add an explainer to the RFC.
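To make the distinction concrete, here is a rough sketch of how the three byte gauges could be derived from `statvfs`-style block counts. The `FsStats` and `FsGauges` types, their field names, and the `byte_gauges` function are hypothetical and only for illustration; they are not taken from the RFC or from any library.

```rust
/// Block counts in the style of a POSIX `statvfs` result (hypothetical struct).
struct FsStats {
    block_size: u64,       // bytes per block
    blocks_total: u64,     // all blocks in the filesystem
    blocks_free: u64,      // free blocks, including those reserved for root
    blocks_available: u64, // free blocks usable by non-root users
}

/// The three byte gauges discussed above (hypothetical struct).
#[derive(Debug, PartialEq)]
struct FsGauges {
    size_bytes: u64,
    free_bytes: u64,
    avail_bytes: u64,
}

/// Converts block counts into the byte values each gauge would report.
fn byte_gauges(s: &FsStats) -> FsGauges {
    FsGauges {
        size_bytes: s.blocks_total * s.block_size,
        free_bytes: s.blocks_free * s.block_size,
        avail_bytes: s.blocks_available * s.block_size,
    }
}
```

Since blocks reserved for root count as free but not as available, `avail_bytes` would typically be no larger than `free_bytes`.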
My comment about the inconsistency was to suggest, for example, we could use avail/free/total for both:
filesystem_avail_bytes
filesystem_free_bytes
filesystem_total_bytes
filesystem_avail_file_nodes
filesystem_free_file_nodes
filesystem_total_file_nodes
Unless, of course, there is a preference for sticking to Prometheus' terminology.
That's an interesting question - I replicated Prometheus' naming as a lot of folks are familiar with that. Also if folks have dashboards/alerts/etc that calculate things, like free space on a filesystem, then they don't need to rewrite them with new metric names.
🤔 I think the naming you (and Prometheus) have feels natural to me. I'm used to referring to filesystems as having a size, probably because of my association with a physical disk and the fact that files themselves are referred to as having "sizes" (and not "total bytes"). Inodes, on the other hand, feel more like a "discrete, countable" thing so total makes sense to me for the cap.
I can see an argument for consistency though.
rfcs/2020-08-26-3191-host-metrics.md
Outdated
collecting = [ "all" ] # optional, defaults to collecting all metrics.
filesystem.mountpoints = [ "all" ] # optional, defaults to all mountpoints.
disk.devices = [ "all" ] # optional, defaults to all disk devices.
I'd recommend against magic words within the list to represent collecting all. We could:
- not allow wildcard configuration like that, just say the default is all metrics;
- use an actual wildcard to represent all, like `"*"`, which may be useful to allow for globs like `"filesystem_read*"`; or
- use a custom serializer to allow either `"ALL"` outside of the list for all, or a list to enumerate items (i.e. `collecting = "ALL"` vs `collecting = [ "a", "b" ]`)
I like the globbing. +1.
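For illustration, a minimal sketch of what glob-style selection could look like, assuming `*` is the only wildcard and matches any run of characters. The `glob_match` and `select_metrics` functions are hypothetical names for this example, not part of Vector or the RFC; a real implementation might well use an existing glob crate instead.

```rust
/// Returns true if `text` matches `pattern`, where `*` matches any run of
/// characters (including none). Every other character matches literally.
fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Pattern index of the last `*` seen and the text index it currently covers up to.
    let mut backtrack: Option<(usize, usize)> = None;

    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            // Record the star and first try matching it against nothing.
            backtrack = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((star_pi, star_ti)) = backtrack {
            // Mismatch: let the last star absorb one more character and retry.
            pi = star_pi + 1;
            ti = star_ti + 1;
            backtrack = Some((star_pi, star_ti + 1));
        } else {
            return false;
        }
    }
    // The text is consumed; only trailing stars may remain in the pattern.
    p[pi..].iter().all(|&c| c == '*')
}

/// Keeps the metric names matched by at least one pattern in `collecting`.
fn select_metrics<'a>(collecting: &[&str], names: &[&'a str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| collecting.iter().any(|pat| glob_match(pat, name)))
        .collect()
}
```

With this approach `collecting = ["*"]` selects everything, so no magic word like `"all"` is needed in the list.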
I've found a number of possible Rust-based solutions for implementing this collection, cross-platform.

- https://crates.io/crates/heim (Rust)
`heim` looks very promising and would likely be a good starting point.
## Internal Proposal

Build a single source called `host_metrics` (name to be confirmed) to collect host/system level metrics.
I'm 👍 on `host_metrics` unless others disagree. An alternative could be `node_metrics`, but that seems less precise to me. I'm curious why Prometheus adopted this nomenclature.
Nice work on this. Only one comment on the `collector` name, otherwise looks great.
```
collectors = [ "cpu", "memory", "network" ]
```

For disk and network devices or filesystem mountpoints the default is to collect for all ("*") devices and mountpoints. Or you can configure Vector to only collect from specific devices, for example:
I'm wondering if it will be useful to also allow users to specify a denylist rather than an allowlist, in the future. Maybe we could model this like `network.devices = ["*", "!eth0"]` to monitor everything but eth0? We could expand this pattern generally to this type of "filter" config.
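As a rough sketch of how such a list might be evaluated: the semantics below are an assumption, not something the RFC specifies. Entries starting with `!` exclude, all other entries include, and an exclusion always wins. The `is_selected` function is a hypothetical name, and only a bare `*` is treated as a wildcard to keep the example short.

```rust
/// Returns true if `device` should be monitored under `filters`.
/// Assumed semantics: entries starting with `!` exclude, all other entries
/// include, and an exclusion always wins over an inclusion.
fn is_selected(filters: &[&str], device: &str) -> bool {
    // A bare `*` is the only wildcard in this sketch; anything else is an exact match.
    let matches = |pattern: &str| pattern == "*" || pattern == device;

    // True if any `!`-prefixed entry matches the device.
    let excluded = filters
        .iter()
        .filter_map(|f| f.strip_prefix('!'))
        .any(|p| matches(p));

    // True if any entry without the `!` prefix matches the device.
    let included = filters
        .iter()
        .copied()
        .filter(|f| !f.starts_with('!'))
        .any(|p| matches(p));

    included && !excluded
}
```

Under these assumptions `["*", "!eth0"]` selects every device except `eth0`, while a list containing only exclusions selects nothing; whether an exclusion-only list should instead imply `*` is a design choice left open here.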
I just had one last thought around letting people exclude resources rather than an allowlist (#3581 (comment)).
I think that could be tackled separately though, if we want. This looks good to me.
Signed-off-by: Brian Menges <brian.menges@anaplan.com>