
Specification for metrics collection #2007

Closed
binarylogic opened this issue Mar 9, 2020 · 14 comments
Assignees
JeanMertz
Labels
  • domain: data model (Anything related to Vector's internal data model)
  • domain: metrics (Anything related to Vector's metrics events)
  • domain: observability (Anything related to monitoring/observing Vector)
  • type: task (Generic non-code related tasks)

Comments

@binarylogic (Contributor)

As a follow-up to #1761, we'll need a spec for which metrics we want to collect. I would like to inventory all of these metrics so we can reach consensus and document them properly. This can be as simple as a list of metric names and labels.

binarylogic added the domain: observability, type: task, and event type: metric labels on Mar 9, 2020
binarylogic added this to the Improve observability milestone on Mar 9, 2020
@lukesteensen (Member)

Metrics Collection

Philosophy

For inspiration, we'll look at the RED and USE methodologies. Rate and errors are virtually always relevant, and depending on the component, utilization and duration can be as well.

For rate, we should maintain an events counter that increments for each event passing through a component. This will provide a baseline against which to compare other numbers, let us derive event rates per time period, etc. Where relevant, we should also record a total_bytes counter. This will give us a rough idea of total data volume, let us calculate average event size, etc.
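
To make this concrete, here is a minimal, self-contained Rust sketch of the two rate counters described above. It uses plain std atomics rather than Vector's actual metrics machinery, and all names are illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-component rate counters (illustrative only, not Vector's types).
#[derive(Default)]
struct ComponentCounters {
    events: AtomicU64,
    total_bytes: AtomicU64,
}

impl ComponentCounters {
    /// Called once for each event passing through the component.
    fn observe_event(&self, byte_size: u64) {
        self.events.fetch_add(1, Ordering::Relaxed);
        self.total_bytes.fetch_add(byte_size, Ordering::Relaxed);
    }
}

fn main() {
    let counters = ComponentCounters::default();
    // Simulate three events of varying sizes flowing through a component.
    for size in [120_u64, 480, 64] {
        counters.observe_event(size);
    }
    let events = counters.events.load(Ordering::Relaxed);
    let bytes = counters.total_bytes.load(Ordering::Relaxed);
    // Average event size falls out of the two counters, as described above.
    println!("events={} total_bytes={} avg_size={}", events, bytes, bytes / events);
}
```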

Errors are pretty self-explanatory. We don't need metrics for every possible kind, but we should look over error! locations in the codebase and instrument those that have a chance to happen repeatedly over time.

Utilization is important for components that consume a significant amount of some non-CPU system resource. For example, we measure the memory utilization of the Lua transform. Memory utilization of in-flight batches is another good example.

Duration applies to components like HTTP-based sinks, where there's some request-response cycle we want to time. It can also be used around things like the Lua transform where runtime can depend heavily on the configuration.

Implementation

Naming

As much as possible, names for the same type of metric should be consistent across components. They should be namespaced by the category of component (i.e. source, transform, sink, internal) and use common suffixes for the same data (e.g. events and total_bytes).

The example instrumentation so far in #1953 uses a rough {category}.{type}.{name} scheme. We could alternatively break out one or more of the namespacing components into tags. I think this could make sense for type especially. Opinions wanted.
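
For concreteness, under the current scheme the syslog source's event counter might be keyed as `source.syslog.events_processed`, while the tag-based alternative would look more like `events_processed{component_kind="source", component_type="syslog"}`. (These names are illustrative only, not the final spec.)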

Shared components

The naming scheme above runs into some complications with shared subcomponents or those that are simple wrappers around another. Since we don't know the whole runtime context at the callsite, we can't include things like type.

The current examples simply omit that portion of the key and rely on the name. A perhaps better alternative is to make type always a tag (as discussed above) so that we can add it seamlessly later with a tracing-based metrics backend.

Durations

In certain areas of the code, measuring durations is currently very complex due to the pre-async/await style. There are two ongoing pieces of work that should simplify this greatly: refactoring to use async/await and building the tracing-backed metrics backend. Where possible, we should prefer advancing one of those two items over doing the hard work of wiring timestamps through the existing structure.
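
As a rough illustration of why async/await helps here, a request/response cycle collapses into a single expression that can be timed with std::time::Instant. This is a generic sketch (using a tokio runtime as a stand-in), not Vector's actual sink code:

```rust
use std::time::{Duration, Instant};

// Stand-in for an HTTP-based sink's request/response cycle.
async fn send_request() {
    tokio::time::sleep(Duration::from_millis(50)).await;
}

#[tokio::main] // assumes tokio = { version = "1", features = ["full"] }
async fn main() {
    let start = Instant::now();
    send_request().await; // the whole cycle is one awaitable expression
    let elapsed = start.elapsed();
    // In real instrumentation this duration would feed a timing/histogram metric.
    println!("request duration: {:?}", elapsed);
}
```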

Checklist

This is a rough skeleton of an instrumentation plan. Once we settle on a naming scheme we can go through and expand each item into the actual names of the metrics we want to gather. We can also drop the optional Utilization and Duration bits where they're not relevant.

For now, I've checked off the ones added as examples in #1953.

Sources

docker

  • Rate
  • Errors
  • Utilization?
  • Duration?

file

  • Rate
  • Errors
  • Utilization?
  • Duration?

journald

  • Rate
  • Errors
  • Utilization?
  • Duration?

kafka

  • Rate
  • Errors
  • Utilization?
  • Duration?

socket

  • Rate
  • Errors
  • Utilization?
  • Duration?

stdin

  • Rate
  • Errors
  • Utilization?
  • Duration?

syslog

  • Rate
  • Errors
  • Utilization?
  • Duration?

vector

  • Rate
  • Errors
  • Utilization?
  • Duration?

Transforms

add_fields

  • Rate
  • Errors
  • Utilization?
  • Duration?

add_tags

  • Rate
  • Errors
  • Utilization?
  • Duration?

ansi_stripper

  • Rate
  • Errors
  • Utilization?
  • Duration?

aws_ec2_metadata

  • Rate
  • Errors
  • Utilization?
  • Duration?

coercer

  • Rate
  • Errors
  • Utilization?
  • Duration?

concat

  • Rate
  • Errors
  • Utilization?
  • Duration?

field_filter

  • Rate
  • Errors
  • Utilization?
  • Duration?

grok_parser

  • Rate
  • Errors
  • Utilization?
  • Duration?

json_parser

  • Rate
  • Errors
  • Utilization?
  • Duration?

log_to_metric

  • Rate
  • Errors
  • Utilization?
  • Duration?

logfmt_parser

  • Rate
  • Errors
  • Utilization?
  • Duration?

lua

  • Rate
  • Errors
  • Utilization?
  • Duration?

merge

  • Rate
  • Errors
  • Utilization?
  • Duration?

regex_parser

  • Rate
  • Errors
  • Utilization?
  • Duration?

remove_fields

  • Rate
  • Errors
  • Utilization?
  • Duration?

remove_tags

  • Rate
  • Errors
  • Utilization?
  • Duration?

rename_fields

  • Rate
  • Errors
  • Utilization?
  • Duration?

sampler

  • Rate
  • Errors
  • Utilization?
  • Duration?

split

  • Rate
  • Errors
  • Utilization?
  • Duration?

swimlanes

  • Rate
  • Errors
  • Utilization?
  • Duration?

tokenizer

  • Rate
  • Errors
  • Utilization?
  • Duration?

Sinks

aws_cloudwatch_metrics

  • Rate
  • Errors
  • Utilization?
  • Duration?

aws_kinesis_firehose

  • Rate
  • Errors
  • Utilization?
  • Duration?

aws_kinesis_streams

  • Rate
  • Errors
  • Utilization?
  • Duration?

aws_s3

  • Rate
  • Errors
  • Utilization?
  • Duration?

blackhole

  • Rate
  • Errors
  • Utilization?
  • Duration?

clickhouse

  • Rate
  • Errors
  • Utilization?
  • Duration?

console

  • Rate
  • Errors
  • Utilization?
  • Duration?

datadog_metrics

  • Rate
  • Errors
  • Utilization?
  • Duration?

elasticsearch

  • Rate
  • Errors
  • Utilization?
  • Duration?

gcp_cloud_storage

  • Rate
  • Errors
  • Utilization?
  • Duration?

gcp_pubsub

  • Rate
  • Errors
  • Utilization?
  • Duration?

gcp_stackdriver_logging

  • Rate
  • Errors
  • Utilization?
  • Duration?

http

  • Rate
  • Errors
  • Utilization?
  • Duration?

humio_logs

  • Rate
  • Errors
  • Utilization?
  • Duration?

influxdb_metrics

  • Rate
  • Errors
  • Utilization?
  • Duration?

kafka

  • Rate
  • Errors
  • Utilization?
  • Duration?

logdna

  • Rate
  • Errors
  • Utilization?
  • Duration?

loki

  • Rate
  • Errors
  • Utilization?
  • Duration?

new_relic_logs

  • Rate
  • Errors
  • Utilization?
  • Duration?

prometheus

  • Rate
  • Errors
  • Utilization?
  • Duration?

sematext

  • Rate
  • Errors
  • Utilization?
  • Duration?

socket

  • Rate
  • Errors
  • Utilization?
  • Duration?

splunk_hec

  • Rate
  • Errors
  • Utilization?
  • Duration?

statsd

  • Rate
  • Errors
  • Utilization?
  • Duration?

vector

  • Rate
  • Errors
  • Utilization?
  • Duration?

@Hoverbear (Contributor)

@lukesteensen Can you help me understand the status of this ticket? It hasn't seen activity in 3 weeks.

@lukesteensen (Member)

@Hoverbear This is step 3 in the plan of attack as laid out in the RFC. Once #1953 is merged, we will update this issue with a list of events to be added.

@lukesteensen (Member)

The initial implementation has been merged! 🎉

So far, the "spec" has been pretty minimal:

  1. Each component (source, transform, or sink) should add to the events_processed counter for each event it encounters. Where possible, it should use component_kind ("source", "transform", or "sink") and component_type (e.g. "syslog" or "file") labels for that counter (see the sketch at the end of this comment).

  2. Transforms that can drop events in certain circumstances (e.g. missing field, failed regex match) should add to a processing_errors counter with the same labels as above and an additional error_type label.

  3. Other types of errors should increment their own counters, also with the relevant component labels.

These all focus on the metrics that are generated via events, since those are user-facing and need to be consistent. We will expand this moving forward, including more patterns for the events themselves.
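
A rough sketch of what rules 1 and 2 might look like at a callsite, assuming the `metrics` crate's older `counter!(name, value, key => value, ...)` macro form (exact macro signatures vary across `metrics` versions, and the component values here are made up):

```rust
use metrics::counter;

// Hypothetical instrumentation inside a transform, following the rules above.
fn process_event(raw: &str) {
    // Rule 1: count every event the component sees.
    counter!(
        "events_processed", 1,
        "component_kind" => "transform",
        "component_type" => "regex_parser"
    );

    // Rule 2: count dropped events as processing errors with an error_type label.
    if raw.is_empty() {
        counter!(
            "processing_errors", 1,
            "component_kind" => "transform",
            "component_type" => "regex_parser",
            "error_type" => "missing_field"
        );
    }
}

fn main() {
    // With no metrics recorder installed these calls are no-ops, which is enough
    // to show the naming and labeling pattern.
    process_event("");
    process_event("some log line");
}
```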

@Alexx-G (Contributor) commented Apr 21, 2020

Currently I'm using 0.8.2 (hopefully I'll have the bandwidth to upgrade to 0.9.0 in a week or so), and while it's enough for a validation phase, it's good to see that more detailed monitoring is coming.
Recently I had a pretty unpleasant surprise and investigation involving the fluent* stack, and my requirements for metrics mostly derive from that experience.
There was a significant discrepancy between the logs emitted by a source (a huge number of k8s pods) and the logs that actually got indexed by the output (Splunk).
The "high availability config" recommended by fluentd complicates such investigations, since it adds another possible point of failure.
Thus, it's quite important to be able to validate that the number of collected events matches the number of records successfully emitted by the sink (with the possibility of accounting for events dropped by a transform). It's also useful to be able to measure input/output traffic and compare it to the event count.
Yet another problem is that log forwarders usually exclude their own logs (for obvious reasons), so it's quite important to have internal errors (which usually go to the logs) represented as metrics that can be used to define alerts. Normally, once the logging operator is deployed and validated, its logs are checked only when there are observable problems, so it's really important to be able to define alerts for errors that are logged (e.g. sink X returned Y errors during the last 30 min).

IIRC the proposed metrics cover almost everything I would need for a similar investigation with Vector. I'm not sure about topologies other than "distributed", but since all components share the same metrics, I guess it's a question of labels.

@binarylogic (Contributor, Author) commented Apr 21, 2020

Thanks @Alexx-G, that's helpful, and what you described will be possible with our approach. At a minimum, Vector will expose two counters:

  1. events_processed{component_type, component_name}
  2. events_errored{component_type, component_name, type, dropped}

These names and labels will change; that's what we're working through. If you have any other requirements, please let us know. I'll also ping you on the PR that introduces all of this.
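
For example, assuming these counters end up scraped by a Prometheus-compatible backend, a dashboard could compare something like `sum(rate(events_processed{component_name="my_source"}[5m]))` against the same expression for a sink to surface exactly the kind of source/sink discrepancy described above (`my_source` and the metric/label names here are illustrative and may not match the final spec).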

@Alexx-G (Contributor) commented Apr 23, 2020

I've checked exactly which metrics helped a lot with the fluentd investigation:

  • events emitted (at output/sink level)
  • events failed (at output/sink level)
  • event flush retries (also at output level)
  • flush duration (a histogram)

It's quite helpful when these metrics are available not only per sink type (e.g. splunk_hec) but per sink id as well.

There's one metric that might not be strictly required, but it helps a lot with fine-tuning and avoiding hard limits at the events' destination:

  • output/sink queue length (I guess it's a gauge)

Also, the ability to define custom metrics (e.g. an event counter for a specific source/transform/sink) and add them to the built-in metrics is highly valuable.
In my case the problem was a lack of auto-scaling for the "log aggregators"; however, I needed a few custom metrics to confirm it and find a temporary solution.

IIRC all of these, except flush retries and queue length, are covered by this spec. I haven't had a chance to check it yet, but the "log_to_metric" transform seems to cover the "custom metric" use case.
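
As a toy illustration of the flush-duration, flush-retry, and queue-length metrics described above, here is a self-contained sketch in plain Rust; the structure and names are illustrative and do not reflect Vector's actual sink internals:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Toy sink model; field names and behavior are illustrative only.
struct ToySink {
    queue: VecDeque<String>,        // its length would be exported as a gauge
    flush_durations: Vec<Duration>, // samples that would feed a histogram
    flush_retries: u64,             // a plain counter
}

impl ToySink {
    fn flush(&mut self) {
        let start = Instant::now();
        let mut attempts = 0;
        // Pretend the downstream accepts the batch on the second attempt.
        loop {
            attempts += 1;
            if attempts >= 2 {
                self.queue.clear();
                break;
            }
            self.flush_retries += 1;
        }
        self.flush_durations.push(start.elapsed());
    }
}

fn main() {
    let mut sink = ToySink {
        queue: VecDeque::from(vec!["a".to_string(), "b".to_string()]),
        flush_durations: Vec::new(),
        flush_retries: 0,
    };
    println!("queue length (gauge): {}", sink.queue.len());
    sink.flush();
    println!(
        "flush retries (counter): {}, last flush duration (histogram sample): {:?}",
        sink.flush_retries,
        sink.flush_durations.last()
    );
}
```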

@binarylogic (Contributor, Author)

That's very helpful. We'll take all of that into account when defining all of the metrics.

@Alexx-G (Contributor) commented Apr 28, 2020

@binarylogic One quick question: is there a metric (existing or planned) for tracking exceeded rate limits?
I noticed that we're receiving `Apr 28 07:08:36.116 TRACE sink{name=splunk type=splunk_hec}: tower_limit::rate::service: rate limit exceeded, disabling service`, so such a metric would definitely end up on a dashboard.

@binarylogic (Contributor, Author)

Definitely. We're also addressing the higher-level problem of picking rate limits with #2329.

@Alexx-G (Contributor) commented Apr 28, 2020

Oh, I somehow missed that rate is mentioned so many times in the comment with the metrics attack plan. Forgive my ignorance :)
Awesome, thanks! I'm following that RFC; it's quite promising.

@Alexx-G (Contributor) commented May 6, 2020

Hey @binarylogic @lukesteensen,
I can lend a hand with adding rate/error counters to some sinks (splunk, aws_kinesis, prometheus) and maybe to some transforms. Luke has done a great job on the initial implementation, so it should be easy to contribute.
Do you mind if I create a PR for a couple of components?

@lukesteensen (Member)

@Alexx-G that would be wonderful, thank you!

binarylogic removed this from the Vector Observability milestone on Jul 20, 2020
JeanMertz self-assigned this on Jul 22, 2020
JeanMertz added a commit that referenced this issue Jul 24, 2020
ref: #2007
Signed-off-by: Jean Mertz <git@jeanmertz.com>
JeanMertz added a commit that referenced this issue Jul 24, 2020
ref: #2007
Signed-off-by: Jean Mertz <git@jeanmertz.com>
@binarylogic (Contributor, Author)

Closing this since it will be superseded by the work in #3192. We've since switched to an event-driven system, and we need specific issues for implementing the remaining events. We are defining that remaining work now.

lukesteensen pushed a commit that referenced this issue Jul 30, 2020
ref: #2007
Signed-off-by: Jean Mertz <git@jeanmertz.com>
binarylogic added the domain: data model and domain: metrics labels and removed the event type: metric label on Aug 6, 2020