RFC for event driven observability #2064

binarylogic · 2020-03-14T16:39:07Z

We need a simple RFC to explore an event-driven observability strategy. I'll try to lay the foundation for that RFC here.

Motivation

#1953 spikes out an internal_metrics source. This serves as the basis for our metrics collection, but there are still decisions to be made around metric collection philosophy. #2007 begins to address this and is a good start, but I have a few concerns:

Backward compatibility

The metrics we expose are part of our public API and are very sensitive to backward compatibility. This is because users will set up dashboards and alarms on these metrics. Generally, I am a fan of moving fast, learning more, and iterating, but this is one area I want to think through before shipping the first version since it is very difficult to make changes later.

Following our own advice

We're an observability company 😄, and we have opinions on observability practices and tactics. I am a strong believer in event-driven observability. Generally, this is discussed in the context of in-house applications, not black-box services. These events are the actual pieces of data used to observe apps. Engineers have mental maps of these events and how they relate to their code, making it easy to ask questions against the raw event data. This is obviously not the case for Vector users, and therefore metrics become the critical means of monitoring Vector (as is the case for all black-box services).

But there is another large benefit of event-driven observability that still applies to black-box services, and that is deriving other observability data from these events, such as logs, metrics, and traces. The benefits of this are discussed in more detail in the following sections.

Consistency

As the Vector project matures we'll inevitably have more contributors. It'll be very difficult to align contributors behind a metrics collection strategy. Not only will it be difficult to ensure consistent metric naming and labeling, but it'll be even more difficult to ensure a consistent collection methodology, such as those outlined in the RED and USE methods. And let's not forget structured logging. We're already having trouble with this in Vector with just logging.

Events solve this. Events are a single, simple paradigm to think through. "This thing happened, emit an event". Emitting events is much easier, and we can control how we derive other data, such as metrics. Metric naming and labeling achieve consistency through this pattern.

Maintaining quality

As discussed in the previous section, consistency largely contributes to quality, but events will also force us to improve the quality of our logs and metrics. Here's why:

Correlation. Logs and metrics will be correlated. For example, the metric tags will be derived from event attributes. And events are logs. This will make it easier for users to jump from Vector metrics to logs.
Richer data. Unlike metrics, events contain high-cardinality data. I also believe that forcing Vector developers to think through the paradigm of events will produce higher quality data as opposed to thinking through logs and metrics.
Cleaner metric inventory. If we maintain event structs we'll have a bird's-eye view of all events that Vector emits. This allows us to consider new events in the broader context; it helps to ensure a thoughtful, noise-free metrics catalog. This heavily contributes to operator friendliness.
Easier to review. Adding an event will be more obvious and easier to review, versus remembering to review and consider every log and metrics statement.

Long-term maintainability

Peppering events throughout our codebase decouples instrumentation from any specific observability strategy. For example, here are a few scenarios that will be much easier to achieve if we are emitting events:

Deriving new observability data in the future, such as traces or transactions.
Changing anything related to how we collect metrics, such as naming and labeling schemes, libraries or function calls used to capture metrics, etc.

User/Operator experience

This is covered in the previous sections, but the UX of Vector will be vastly improved by all of the above.

Guide-level Proposal

I'll leave this to @lukesteensen, but I'm a fan of starting simple. Instead of building out a subscriber, like what @LucioFranco did in tracing-limit, we can build a layer above all of this by adding a simple emit_event(my_event) function that we call throughout the codebase (or whatever we want to name it). This would simply call the logging and metrics functions inline.
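To make the idea concrete, here is a minimal sketch of what such a function could look like. Everything in it is an assumption for illustration: the `InternalEvent` trait, the `Recorder` type (an in-memory stand-in for the real logging and metrics calls), the `FileOpened` event, and the `file_opened_total` counter name are hypothetical and not an existing Vector API.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the real logging and metrics backends.
#[derive(Default)]
pub struct Recorder {
    pub log_lines: Vec<String>,
    pub counters: HashMap<String, u64>,
}

impl Recorder {
    /// Record one log line.
    pub fn log(&mut self, line: String) {
        self.log_lines.push(line);
    }

    /// Increment the named counter by one, creating it at zero if needed.
    pub fn increment(&mut self, name: &str) {
        *self.counters.entry(name.to_string()).or_insert(0) += 1;
    }
}

/// Hypothetical trait: each event knows how to turn itself into logs and metrics.
pub trait InternalEvent {
    fn emit_logs(&self, recorder: &mut Recorder);
    fn emit_metrics(&self, recorder: &mut Recorder);
}

/// The single function called throughout the codebase.
/// It simply calls the logging and metrics functions inline.
pub fn emit_event(recorder: &mut Recorder, event: &impl InternalEvent) {
    event.emit_logs(recorder);
    event.emit_metrics(recorder);
}

/// Example event (assumed, for illustration only).
pub struct FileOpened {
    pub path: String,
}

impl InternalEvent for FileOpened {
    fn emit_logs(&self, recorder: &mut Recorder) {
        // The log line carries the high-cardinality attribute (the path).
        recorder.log(format!("file_opened path={}", self.path));
    }

    fn emit_metrics(&self, recorder: &mut Recorder) {
        // The metric only counts occurrences; the counter name is an assumption.
        recorder.increment("file_opened_total");
    }
}
```

A call site would then be a single line, for example `emit_event(&mut recorder, &FileOpened { path })`. How the log and the metric are produced stays inside the event's implementation, so it can change later without touching the call sites.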

There are a few things I would like to propose:

This might seem overkill, but I would like to maintain an event catalog. That is, an events folder with a bunch of structs. I believe this will pay dividends and ensure we maintain a quality catalog of events. See all of the points above about why.
Think carefully about the performance of instantiating event structs. I don't know enough about Rust, but I assume we can accomplish this.
Event naming should follow a pattern. I recommend this style (<noun>_<past-tense-verb>).
A single consistent attribute name for durations. I recommend duration_ms.
Suffixing number attributes with their unit. For example, instead of duration we should be using duration_ms.
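As a rough sketch of how a catalog following these conventions might look, consider the module below. The module layout, the struct names, the `NAME` constant, and the `fields` helper are all hypothetical. It only illustrates the `<noun>_<past-tense-verb>` naming, the `duration_ms` attribute, and borrowing data to keep instantiation cheap.

```rust
/// Hypothetical event catalog: one module holding every event struct.
pub mod events {
    use std::time::Duration;

    /// Emitted when a request finishes. Named <noun>_<past-tense-verb>.
    pub struct RequestCompleted {
        pub status: u16,
        /// Numeric attributes carry their unit as a suffix.
        pub duration_ms: u64,
    }

    impl RequestCompleted {
        pub const NAME: &'static str = "request_completed";

        /// Convert the elapsed time to whole milliseconds once, at construction.
        pub fn new(status: u16, elapsed: Duration) -> Self {
            Self {
                status,
                duration_ms: elapsed.as_millis() as u64,
            }
        }

        /// Attributes that logs and metric tags could both be derived from.
        pub fn fields(&self) -> Vec<(&'static str, String)> {
            vec![
                ("status", self.status.to_string()),
                ("duration_ms", self.duration_ms.to_string()),
            ]
        }
    }

    /// Emitted when a file is opened. Borrows the path so that creating
    /// the event does not allocate.
    pub struct FileOpened<'a> {
        pub path: &'a str,
    }

    impl<'a> FileOpened<'a> {
        pub const NAME: &'static str = "file_opened";

        /// Attributes that logs and metric tags could both be derived from.
        pub fn fields(&self) -> Vec<(&'static str, String)> {
            vec![("path", self.path.to_string())]
        }
    }
}
```

Keeping every struct in one place is what would give the bird's-eye view described earlier: a reviewer can check a new event's name and attributes against the existing ones in a single file or folder.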

These are bike-shedding a little bit, but we might as well agree on something. Regardless, I'd be very happy if we just have a catalog of events.


binarylogic added the domain: observability, type: task, and needs: rfc labels Mar 14, 2020

binarylogic assigned lukesteensen Mar 14, 2020

binarylogic added this to the Improve observability milestone Mar 16, 2020

lukesteensen mentioned this issue Mar 18, 2020

chore: add event-driven observability rfc #2093

Merged

Hoverbear self-assigned this Mar 23, 2020

lukesteensen closed this as completed in #2093 Apr 1, 2020

binarylogic removed this from the Vector Observability milestone Jul 20, 2020
