
Specification for metrics collection #2007

Closed
binarylogic opened this issue Mar 9, 2020 · 14 comments
Assignees
JeanMertz
Labels
  • domain: data model (Anything related to Vector's internal data model)
  • domain: metrics (Anything related to Vector's metrics events)
  • domain: observability (Anything related to monitoring/observing Vector)
  • type: task (Generic non-code related tasks)

Comments

@binarylogic (Contributor)

As a follow-up to #1761, we'll need a spec for which metrics we want to collect. I would like to inventory all of these metrics so we can reach consensus and document them properly. This can be as simple as a list of metric names and labels.

binarylogic added the domain: observability, type: task, and event type: metric labels on Mar 9, 2020
binarylogic added this to the Improve observability milestone on Mar 9, 2020
@lukesteensen (Member)

Metrics Collection

Philosophy

For inspiration, we'll look at the RED and USE methodologies. Rate and errors are virtually always relevant, and depending on the component, utilization and duration can be as well.

For rate, we should maintain an events counter that increments for each event passing through a component. This will provide a baseline against which to compare other numbers, let us derive event rates per time period, etc. Where relevant, we should also record a total_bytes counter. This will give us a rough idea of total data volume, let us calculate average event size, etc.
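
To make this concrete, here is a minimal, self-contained Rust sketch of the two rate counters described above. It uses plain std atomics rather than Vector's actual metrics machinery, and all names are illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-component rate counters (illustrative only, not Vector's types).
#[derive(Default)]
struct ComponentCounters {
    events: AtomicU64,
    total_bytes: AtomicU64,
}

impl ComponentCounters {
    /// Called once for each event passing through the component.
    fn observe_event(&self, byte_size: u64) {
        self.events.fetch_add(1, Ordering::Relaxed);
        self.total_bytes.fetch_add(byte_size, Ordering::Relaxed);
    }
}

fn main() {
    let counters = ComponentCounters::default();
    // Simulate three events of varying sizes flowing through a component.
    for size in [120_u64, 480, 64] {
        counters.observe_event(size);
    }
    let events = counters.events.load(Ordering::Relaxed);
    let bytes = counters.total_bytes.load(Ordering::Relaxed);
    // Average event size falls out of the two counters, as described above.
    println!("events={} total_bytes={} avg_size={}", events, bytes, bytes / events);
}
```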

Errors are pretty self-explanatory. We don't need metrics for every possible kind, but we should look over error! locations in the codebase and instrument those that have a chance to happen repeatedly over time.

Utilization is important for components that consume a significant amount of some non-CPU system resource. For example, we measure the memory utilization of the Lua transform. Memory utilization of in-flight batches is another good example.

Duration applies to components like HTTP-based sinks, where there's some request-response cycle we want to time. It can also be used around things like the Lua transform where runtime can depend heavily on the configuration.

Implementation

Naming

As much as possible, names for the same type of metric should be consistent across components. They should be namespaced by the category of component (i.e. source, transform, sink, internal) and use common suffixes for the same data (e.g. events and total_bytes).

The example instrumentation so far in #1953 uses a rough {category}.{type}.{name} scheme. We could alternatively break out one or more of the namespacing components into tags. I think this could make sense for type especially. Opinions wanted.
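
For concreteness, under the current scheme the syslog source's event counter might be keyed as `source.syslog.events_processed`, while the tag-based alternative would look more like `events_processed{component_kind="source", component_type="syslog"}`. (These names are illustrative only, not the final spec.)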

Shared components

The naming scheme above runs into some complications with shared subcomponents or those that are simple wrappers around another. Since we don't know the whole runtime context at the callsite, we can't include things like type.

The current examples simply omit that portion of the key and rely on the name. A perhaps better alternative is to make type always a tag (as discussed above) so that we can add it seamlessly later with a tracing-based metrics backend.

Durations

In certain areas of the code, measuring durations is currently very complex due to the pre-async/await style. There are two ongoing pieces of work that should simplify this greatly: refactoring to use async/await and building the tracing-backed metrics backend. Where possible, we should prefer advancing one of those two items over doing the hard work of wiring timestamps through the existing structure.
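
As a rough illustration of why async/await helps here, a request/response cycle collapses into a single expression that can be timed with std::time::Instant. This is a generic sketch (using a tokio runtime as a stand-in), not Vector's actual sink code:

```rust
use std::time::{Duration, Instant};

// Stand-in for an HTTP-based sink's request/response cycle.
async fn send_request() {
    tokio::time::sleep(Duration::from_millis(50)).await;
}

#[tokio::main] // assumes tokio = { version = "1", features = ["full"] }
async fn main() {
    let start = Instant::now();
    send_request().await; // the whole cycle is one awaitable expression
    let elapsed = start.elapsed();
    // In real instrumentation this duration would feed a timing/histogram metric.
    println!("request duration: {:?}", elapsed);
}
```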

Checklist

This is a rough skeleton of an instrumentation plan. Once we settle on a naming scheme we can go through and expand each item into the actual names of the metrics we want to gather. We can also drop the optional Utilization and Duration bits where they're not relevant.

For now, I've checked off the ones added as examples in #1953.

Sources

docker

  • Rate
  • Errors
  • Utilization?
  • Duration?

file

  • Rate
  • Errors
  • Utilization?
  • Duration?

journald

  • Rate
  • Errors
  • Utilization?
  • Duration?

kafka

  • Rate
  • Errors
  • Utilization?
  • Duration?

socket

  • Rate
  • Errors
  • Utilization?
  • Duration?

stdin

  • Rate
  • Errors
  • Utilization?
  • Duration?

syslog

  • Rate
  • Errors
  • Utilization?
  • Duration?

vector

  • Rate
  • Errors
  • Utilization?
  • Duration?

Transforms

add_fields

  • Rate
  • Errors
  • Utilization?
  • Duration?

add_tags

  • Rate
  • Errors
  • Utilization?
  • Duration?

ansi_stripper

  • Rate
  • Errors
  • Utilization?
  • Duration?

aws_ec2_metadata

  • Rate
  • Errors
  • Utilization?
  • Duration?

coercer

  • Rate
  • Errors
  • Utilization?
  • Duration?

concat

  • Rate
  • Errors
  • Utilization?
  • Duration?

field_filter

  • Rate
  • Errors
  • Utilization?
  • Duration?

grok_parser

  • Rate
  • Errors
  • Utilization?
  • Duration?

json_parser

  • Rate
  • Errors
  • Utilization?
  • Duration?

log_to_metric

  • Rate
  • Errors
  • Utilization?
  • Duration?

logfmt_parser

  • Rate
  • Errors
  • Utilization?
  • Duration?

lua

  • Rate
  • Errors
  • Utilization?
  • Duration?

merge

  • Rate
  • Errors
  • Utilization?
  • Duration?

regex_parser

  • Rate
  • Errors
  • Utilization?
  • Duration?

remove_fields

  • Rate
  • Errors
  • Utilization?
  • Duration?

remove_tags

  • Rate
  • Errors
  • Utilization?
  • Duration?

rename_fields

  • Rate
  • Errors
  • Utilization?
  • Duration?

sampler

  • Rate
  • Errors
  • Utilization?
  • Duration?

split

  • Rate
  • Errors
  • Utilization?
  • Duration?

swimlanes

  • Rate
  • Errors
  • Utilization?
  • Duration?

tokenizer

  • Rate
  • Errors
  • Utilization?
  • Duration?

Sinks

aws_cloudwatch_metrics

  • Rate
  • Errors
  • Utilization?
  • Duration?

aws_kinesis_firehose

  • Rate
  • Errors
  • Utilization?
  • Duration?

aws_kinesis_streams

  • Rate
  • Errors
  • Utilization?
  • Duration?

aws_s3

  • Rate
  • Errors
  • Utilization?
  • Duration?

blackhole

  • Rate
  • Errors
  • Utilization?
  • Duration?

clickhouse

  • Rate
  • Errors
  • Utilization?
  • Duration?

console

  • Rate
  • Errors
  • Utilization?
  • Duration?

datadog_metrics

  • Rate
  • Errors
  • Utilization?
  • Duration?

elasticsearch

  • Rate
  • Errors
  • Utilization?
  • Duration?

gcp_cloud_storage

  • Rate
  • Errors
  • Utilization?
  • Duration?

gcp_pubsub

  • Rate
  • Errors
  • Utilization?
  • Duration?

gcp_stackdriver_logging

  • Rate
  • Errors
  • Utilization?
  • Duration?

http

  • Rate
  • Errors
  • Utilization?
  • Duration?

humio_logs

  • Rate
  • Errors
  • Utilization?
  • Duration?

influxdb_metrics

  • Rate
  • Errors
  • Utilization?
  • Duration?

kafka

  • Rate
  • Errors
  • Utilization?
  • Duration?

logdna

  • Rate
  • Errors
  • Utilization?
  • Duration?

loki

  • Rate
  • Errors
  • Utilization?
  • Duration?

new_relic_logs

  • Rate
  • Errors
  • Utilization?
  • Duration?

prometheus

  • Rate
  • Errors
  • Utilization?
  • Duration?

sematext

  • Rate
  • Errors
  • Utilization?
  • Duration?

socket

  • Rate
  • Errors
  • Utilization?
  • Duration?

splunk_hec

  • Rate
  • Errors
  • Utilization?
  • Duration?

statsd

  • Rate
  • Errors
  • Utilization?
  • Duration?

vector

  • Rate
  • Errors
  • Utilization?
  • Duration?

@Hoverbear (Contributor)

@lukesteensen Can you help me understand the status of this ticket? It hasn't seen activity in 3 weeks.

@lukesteensen (Member)

@Hoverbear This is step 3 in the plan of attack as laid out in the RFC. Once #1953 is merged, we will update this issue with a list of events to be added.

@lukesteensen (Member)

The initial implementation has been merged! 🎉

So far, the "spec" has been pretty minimal:

  1. Each component (source, transform, or sink) should add to the events_processed counter for each event it encounters. Where possible, it should use component_kind ("source", "transform", or "sink") and component_type (e.g. "syslog" or "file") labels for that counter (see the sketch at the end of this comment).

  2. Transforms that can drop events in certain circumstances (e.g. missing field, failed regex match) should add to a processing_errors counter with the same labels as above and an additional error_type label.

  3. Other types of errors should increment their own counters, also with the relevant component labels.

These all focus on the metrics that are generated via events, since those are user-facing and need to be consistent. We will expand this moving forward, including more patterns for the events themselves.
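
A rough sketch of what rules 1 and 2 might look like at a callsite, assuming the `metrics` crate's older `counter!(name, value, key => value, ...)` macro form (exact macro signatures vary across `metrics` versions, and the component values here are made up):

```rust
use metrics::counter;

// Hypothetical instrumentation inside a transform, following the rules above.
fn process_event(raw: &str) {
    // Rule 1: count every event the component sees.
    counter!(
        "events_processed", 1,
        "component_kind" => "transform",
        "component_type" => "regex_parser"
    );

    // Rule 2: count dropped events as processing errors with an error_type label.
    if raw.is_empty() {
        counter!(
            "processing_errors", 1,
            "component_kind" => "transform",
            "component_type" => "regex_parser",
            "error_type" => "missing_field"
        );
    }
}

fn main() {
    // With no metrics recorder installed these calls are no-ops, which is enough
    // to show the naming and labeling pattern.
    process_event("");
    process_event("some log line");
}
```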

@Alexx-G (Contributor) commented Apr 21, 2020

Currently I'm using 0.8.2 (hopefully I'll have the bandwidth to upgrade to 0.9.0 in a week or so), and while it's enough for a validation phase, it's good to see that more detailed monitoring is coming.
Recently I had a pretty unpleasant surprise and investigation involving the fluent* stack, and my requirements for metrics mostly derive from that experience.
There was a significant discrepancy between the logs emitted by a source (a huge number of k8s pods) and the logs that actually got indexed by the output (Splunk).
The "high availability config" recommended by fluentd complicates such investigations, since it adds another possible point of failure.
Thus, it's quite important to be able to validate that the number of collected events matches the number of records successfully emitted by the sink (with the possibility of accounting for events dropped by a transform). It's also useful to be able to measure input/output traffic and compare it to the event count.
Yet another problem is that log forwarders usually exclude their own logs (for obvious reasons), so it's quite important to have internal errors (which usually go to the logs) represented as metrics that can be used to define alerts. Normally, once the logging operator is deployed and validated, its logs are checked only when there are observable problems, so it's really important to be able to define alerts for errors that are logged (e.g. sink X returned Y errors during the last 30 min).

IIRC the proposed metrics cover almost everything I would need for a similar investigation with Vector. I'm not sure about topologies other than "distributed", but since all components share the same metrics, I guess it's a question of labels.

@binarylogic (Contributor, Author) commented Apr 21, 2020

Thanks @Alexx-G, that's helpful, and what you described will be possible with our approach. At a minimum, Vector will expose two counters:

  1. events_processed{component_type, component_name}
  2. events_errored{component_type, component_name, type, dropped}

These names and labels will change; that's what we're working through. If you have any other requirements, please let us know. I'll also ping you on the PR that introduces all of this.
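
For example, assuming these counters end up scraped by a Prometheus-compatible backend, a dashboard could compare something like `sum(rate(events_processed{component_name="my_source"}[5m]))` against the same expression for a sink to surface exactly the kind of source/sink discrepancy described above (`my_source` and the metric/label names here are illustrative and may not match the final spec).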

@Alexx-G (Contributor) commented Apr 23, 2020

I've checked exactly which metrics helped a lot with the fluentd investigation:

  • events emitted (at output/sink level)
  • events failed (at output/sink level)
  • event flush retries (also at output level)
  • flush duration (a histogram)

It's quite helpful when these metrics are available not only per sink type (e.g. splunk_hec) but per sink id as well.

There's one metric that might not be strictly required, but it helps a lot with fine-tuning and avoiding hard limits at the events' destination:

  • output/sink queue length (I guess it's a gauge)

Also, the ability to define custom metrics (e.g. an event counter for a specific source/transform/sink) and add them to the built-in metrics is highly valuable.
In my case the problem was a lack of auto-scaling for the "log aggregators"; however, I needed a few custom metrics to confirm it and find a temporary solution.

IIRC all of these, except flush retries and queue length, are covered by this spec. I haven't had a chance to check it yet, but the "log_to_metric" transform seems to cover the "custom metric" use case.
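
As a toy illustration of the flush-duration, flush-retry, and queue-length metrics described above, here is a self-contained sketch in plain Rust; the structure and names are illustrative and do not reflect Vector's actual sink internals:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Toy sink model; field names and behavior are illustrative only.
struct ToySink {
    queue: VecDeque<String>,        // its length would be exported as a gauge
    flush_durations: Vec<Duration>, // samples that would feed a histogram
    flush_retries: u64,             // a plain counter
}

impl ToySink {
    fn flush(&mut self) {
        let start = Instant::now();
        let mut attempts = 0;
        // Pretend the downstream accepts the batch on the second attempt.
        loop {
            attempts += 1;
            if attempts >= 2 {
                self.queue.clear();
                break;
            }
            self.flush_retries += 1;
        }
        self.flush_durations.push(start.elapsed());
    }
}

fn main() {
    let mut sink = ToySink {
        queue: VecDeque::from(vec!["a".to_string(), "b".to_string()]),
        flush_durations: Vec::new(),
        flush_retries: 0,
    };
    println!("queue length (gauge): {}", sink.queue.len());
    sink.flush();
    println!(
        "flush retries (counter): {}, last flush duration (histogram sample): {:?}",
        sink.flush_retries,
        sink.flush_durations.last()
    );
}
```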

@binarylogic (Contributor, Author)

That's very helpful. We'll take all of that into account when defining all of the metrics.

@Alexx-G (Contributor) commented Apr 28, 2020

@binarylogic One quick question: is there a metric (existing or planned) for tracking exceeded rate limits?
I noticed that we're receiving `Apr 28 07:08:36.116 TRACE sink{name=splunk type=splunk_hec}: tower_limit::rate::service: rate limit exceeded, disabling service`, so such a metric would definitely end up on a dashboard.

@binarylogic (Contributor, Author)

Definitely. We're also addressing the higher-level problem of picking rate limits with #2329.

@Alexx-G (Contributor) commented Apr 28, 2020

Oh, I somehow missed that rate is mentioned so many times in the comment with the metrics attack plan. Forgive my ignorance :)
Awesome, thanks! I'm following that RFC; it's quite promising.

@Alexx-G (Contributor) commented May 6, 2020

Hey @binarylogic @lukesteensen,
I can lend a hand with adding rate/error counters to some sinks (splunk, aws_kinesis, prometheus) and maybe to some transforms. Luke has done a great job on the initial implementation, so it should be easy to contribute.
Do you mind if I create a PR for a couple of components?

@lukesteensen (Member)

@Alexx-G that would be wonderful, thank you!

binarylogic removed this from the Vector Observability milestone on Jul 20, 2020
JeanMertz self-assigned this on Jul 22, 2020
JeanMertz added a commit that referenced this issue Jul 24, 2020
ref: #2007
Signed-off-by: Jean Mertz <git@jeanmertz.com>
JeanMertz added a commit that referenced this issue Jul 24, 2020
ref: #2007
Signed-off-by: Jean Mertz <git@jeanmertz.com>
@binarylogic (Contributor, Author)

Closing this since it will be superseded by the work in #3192. We've since switched to an event-driven system, and we need specific issues for implementing the remaining events. We are defining that remaining work now.

lukesteensen pushed a commit that referenced this issue Jul 30, 2020
ref: #2007
Signed-off-by: Jean Mertz <git@jeanmertz.com>
binarylogic added the domain: data model and domain: metrics labels and removed the event type: metric label on Aug 6, 2020