Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Draft host-metrics RFC #3581

Merged
merged 14 commits into from
Aug 29, 2020
Merged

chore: Draft host-metrics RFC #3581

merged 14 commits into from
Aug 29, 2020

Conversation

jamtur01
Copy link
Contributor

Signed-off-by: James Turnbull james@lovedthanlost.net

Signed-off-by: James Turnbull <james@lovedthanlost.net>
Signed-off-by: James Turnbull <james@lovedthanlost.net>
@jamtur01 jamtur01 self-assigned this Aug 26, 2020
@jamtur01 jamtur01 added domain: metrics Anything related to Vector's metrics events needs: approval Needs review & approval before work can begin. labels Aug 26, 2020
@jamtur01 jamtur01 marked this pull request as ready for review August 26, 2020 18:30
@jamtur01 jamtur01 requested a review from jszwedko August 26, 2020 18:30
Signed-off-by: James Turnbull <james@lovedthanlost.net>
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 I think this will be a very useful component.

rfcs/2020-08-26-3191-host-metrics.md Show resolved Hide resolved

Metrics will also be labeled with:

- `host`: the host name of the host being monitored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll probably need some more labels for some of the metrics to be useful:

  • device for disk based metrics
  • cpu for CPU based metrics
  • filesystem for filesystem based metrics

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm realizing I was unclear. I think you've addressed this above though with the update to the metrics list. I had meant, for example, that disk based metrics should be labeled (or tagged; not sure which term we prefer) with the device the metric is associated with (e.g. device=/dev/sda).

I think I like these "collector" labels too though. I think they'd be better as simply collector so you'd have things like collector=disk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Yeah - this is a terminology mismatch. I'll update to say tagged and make it clearer.

rfcs/2020-08-26-3191-host-metrics.md Outdated Show resolved Hide resolved

## Outstanding Questions

- One source or many? Should we have `host_metrics` or `cpu_metrics`, `mem_metrics`, `disk_metrics`, or `load_metrics`?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the big one.

I personally like your approach of a single source though. I think we could allow for fine grained control over the metric "families" via the table-based TOML configuration; like:

[sources.my_source_id]
  type = "host_metrics"

  disk.devices = ["/dev/sda"]

  filesystem.mountpoints = ["/home"]

To exclude trying to collect all of them (though you could also do this in a filter transform).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the point of view of user experience, I think a single source is better. It will end up being a large in terms of source code, but we can manage that with sub-modules internally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree on having a single source

rfcs/2020-08-26-3191-host-metrics.md Show resolved Hide resolved
rfcs/2020-08-26-3191-host-metrics.md Outdated Show resolved Hide resolved
Signed-off-by: James Turnbull <james@lovedthanlost.net>
@jamtur01
Copy link
Contributor Author

@jszwedko Thanks! Updated to reflect your feedback.

Signed-off-by: James Turnbull <james@lovedthanlost.net>
Signed-off-by: James Turnbull <james@lovedthanlost.net>
Signed-off-by: James Turnbull <james@lovedthanlost.net>
Signed-off-by: James Turnbull <james@lovedthanlost.net>
Signed-off-by: James Turnbull <james@lovedthanlost.net>
Comment on lines 56 to 61
- `filesystem_avail_bytes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_device_error` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_total_file_nodes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_free_file_nodes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_free_bytes` labeled with device, filesystem type, and mountpoint (gauge)
- `filesystem_size_bytes` labeled with device, filesystem type, and mountpoint (gauge)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the distinction between avail and free or size? Also, there seems to be an inconsistency between avail/free/size for bytes, but free/total for file nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've drawn these from Prometheus' approach specifically.

  • filesystem_avail_bytes = Filesystem space available to non-root users in bytes.
  • filesystem_free_bytes = Filesystem free space in bytes.
  • filesystem_size_bytes = Filesystem size in bytes.

See https://www.robustperception.io/filesystem-metrics-from-the-node-exporter. I'll add an explainer to the RFC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment about the inconsistency was to suggest, for example, we could use avail/free/total for both:

filesystem_avail_bytes
filesystem_free_bytes
filesystem_total_bytes
filesystem_avail_file_nodes
filesystem_free_file_nodes
filesystem_total_file_nodes

Unless, of course, there is a preference for sticking to Prometheus' terminology.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting question - I replicated Prometheus' naming as a lot of folks are familiar with that. Also if folks have dashboards/alerts/etc that calculate things, like free space on a filesystem, then they don't need to rewrite them with new metric names.

Copy link
Member

@jszwedko jszwedko Aug 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I think the naming you (and prometheus) have feels natural to me. I'm used to referring to filesystems as having a size, probably because of my association with a physical disk and the fact that files themselves are referred to as having "sizes" (and not "total bytes"). Inodes, on the other hand, feels more like a "discrete, countable" thing so total makes sense to me for the cap.

I can see an argument for consistency though.

Comment on lines 94 to 96
collecting = [ "all"] # optional, defaults collecting all metrics.
filesystem.mountpoints = [ "all" ] # optional, defaults to all mountpoints.
disk.devices = [ "all" ] # optional, defaults to all to disk devices.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend against magic words within the list to represent collecting all. We could:

  1. not allow wildcard configuration like that, just say the default is all metrics;
  2. use an actual wildcard to represent all, like "*", which may be useful to allow for glob like "filesystem_read*"; or
  3. use a custom serializer to allow either "ALL" outside of the list for all, or a list to enumerite items (ie collecting = "ALL" vs collecting = [ "a", "b" ])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the globbing. +1.


## Outstanding Questions

- One source or many? Should we have `host_metrics` or `cpu_metrics`, `mem_metrics`, `disk_metrics`, or `load_metrics`?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the point of view of user experience, I think a single source is better. It will end up being a large in terms of source code, but we can manage that with sub-modules internally.

Signed-off-by: James Turnbull <james@lovedthanlost.net>
Signed-off-by: James Turnbull <james@lovedthanlost.net>
Signed-off-by: James Turnbull <james@lovedthanlost.net>

## Outstanding Questions

- One source or many? Should we have `host_metrics` or `cpu_metrics`, `mem_metrics`, `disk_metrics`, or `load_metrics`?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree on having a single source


I've found a number of possible Rust-based solutions for implementing this collection, cross-platform.

- https://crates.io/crates/heim (Rust)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heim looks very promising and would likely be a good starting point.


## Internal Proposal

Build a single source called `host_metrics` (name to be confirmed) to collect host/system level metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 👍 on host_metrics unless others disagree. An alternative could be node_metrics, but that seems less precise to me. I'm curious why Prometheus adopted this nomenclature.

@binarylogic binarylogic self-requested a review August 28, 2020 01:05
Copy link
Contributor

@binarylogic binarylogic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this. Only one comment on the collector name, otherwise looks great.

Signed-off-by: James Turnbull <james@lovedthanlost.net>
collectors = [ "cpu", "memory", "network" ]
```

For disk and network devices or filesystem mountpoints the default is to collect for all ("*") devices and mountpoints. Or you can configure Vector to only collect from specific devices, for example:
Copy link
Member

@jszwedko jszwedko Aug 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if will be useful to also allow users to specify a denylist rather than allowlist, in the future. Maybe we could model this like network.devices = ["*", "!eth0"] to monitor everything but eth0? We could expand this pattern generally to this type of "filter" config.

Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had one last thought around letting people exclude resources rather than an allowlist (#3581 (comment)).

I think that could be tackled separately though, if we want. This looks good to me.

Signed-off-by: James Turnbull <james@lovedthanlost.net>
@jamtur01 jamtur01 merged commit 3cbf4af into master Aug 29, 2020
@jamtur01 jamtur01 deleted the host_metrics_rfc branch August 29, 2020 02:30
@binarylogic binarylogic mentioned this pull request Aug 30, 2020
3 tasks
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request Dec 9, 2020
Signed-off-by: Brian Menges <brian.menges@anaplan.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: metrics Anything related to Vector's metrics events needs: approval Needs review & approval before work can begin.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants