RFC for event driven observability #2064

Closed
binarylogic opened this issue Mar 14, 2020 · 0 comments · Fixed by #2093
Assignees
Labels
domain: observability Anything related to monitoring/observing Vector needs: rfc Needs an RFC before work can begin. type: task Generic non-code related tasks

binarylogic commented Mar 14, 2020

We need a simple RFC to explore an event-driven observability strategy. I'll try to lay the foundation for that RFC here.

Motivation

#1953 spikes out an internal_metrics source. This serves as the basis for our metrics collection, but there are still decisions to be made around metric collection philosophy. #2007 starts to address this and is a good start, but I have a few concerns:

Backward compatibility

The metrics we expose are part of our public API and are very sensitive to backward compatibility, because users will set up dashboards and alarms on them. Generally, I am a fan of moving fast, learning more, and iterating, but this is one area I want to think through before shipping the first version, since it is very difficult to make changes later.

Following our own advice

We're an observability company 😄, and we have opinions on observability practices and tactics. I am a strong believer in event-driven observability. Generally, this is discussed in the context of in-house applications, not black-box services. In that context, events are the actual pieces of data used to observe apps. Engineers have mental maps of these events and how they relate to their code, making it easy to ask questions against the raw event data. This is obviously not the case for Vector users, and therefore metrics become the critical means of monitoring Vector (as is the case for all black-box services).

But there is another large benefit of event-driven observability that still applies to black-box services, and that is deriving other observability data from these events, such as logs, metrics, and traces. The benefits of this are discussed in more detail in the following sections.

Consistency

As the Vector project matures we'll inevitably have more contributors. It'll be very difficult to align contributors behind a metrics collection strategy. Not only will it be difficult to ensure consistent metric naming and labeling, but it'll be even more difficult to ensure a consistent collection methodology, such as the RED and USE methods. And let's not forget structured logging. We're already having trouble with consistency in Vector with just logging.

Events solve this. Events are a single, simple paradigm to think through. "This thing happened, emit an event". Emitting events is much easier, and we can control how we derive other data, such as metrics. Metric naming and labeling achieve consistency through this pattern.

Maintaining quality

As discussed in the previous section, consistency largely contributes to quality, but events will also force us to improve the quality of our logs and metrics. Here's why:

  1. Correlation. Logs and metrics will be correlated. For example, the metric tags will be derived from event attributes. And events are logs. This will make it easier for users to jump from Vector metrics to logs.
  2. Richer data. Unlike metrics, events can carry high-cardinality data. I also believe that forcing Vector developers to think in terms of events will produce higher-quality data than thinking in terms of separate logs and metrics.
  3. Cleaner metric inventory. If we maintain event structs we'll have a bird's-eye view of all events that Vector emits. This allows us to consider new events in the broader context, which helps ensure a thoughtful, noise-free metrics catalog. This heavily contributes to operator friendliness.
  4. Easier to review. Adding an event will be more obvious and easier to review, versus remembering to review and consider every log and metrics statement.
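
To make the correlation point concrete, here is a rough sketch of one event yielding both its log fields and its metric tags. Everything in it is hypothetical: the `HttpRequestFailed` struct, its fields, and the `log_fields` / `metric_tags` methods are made up for illustration and are not existing Vector code. The idea it tries to show is that the tags are derived from the same attributes as the log, minus the high-cardinality ones.

```rust
use std::collections::BTreeMap;

/// Hypothetical event, named in the <noun>_<past-tense-verb> style.
pub struct HttpRequestFailed<'a> {
    pub component: &'a str,
    pub status: u16,
    /// High-cardinality attribute: useful in a log, too expensive as a metric tag.
    pub request_id: &'a str,
}

impl HttpRequestFailed<'_> {
    /// Every attribute of the event ends up in the log.
    pub fn log_fields(&self) -> BTreeMap<&'static str, String> {
        let mut fields = BTreeMap::new();
        fields.insert("component", self.component.to_string());
        fields.insert("status", self.status.to_string());
        fields.insert("request_id", self.request_id.to_string());
        fields
    }

    /// Metric tags are derived from the same attributes, with the
    /// high-cardinality ones removed, so logs and metrics share key names.
    pub fn metric_tags(&self) -> BTreeMap<&'static str, String> {
        let mut tags = self.log_fields();
        tags.remove("request_id");
        tags
    }
}
```

Because both outputs come from one struct, a user who sees a spike on a metric tagged `component` and `status` can search the logs with the same keys and values.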

Long-term maintainability

Peppering events throughout our codebase decouples instrumentation from any specific observability strategy. For example, here are a few scenarios that will be much easier to achieve if we are emitting events:

  1. Deriving new observability data in the future. Such as traces or transactions.
  2. Changing anything related to how we collect metrics, such as naming and labeling schemes, libraries or function calls used to capture metrics, etc.

User/Operator experience

This is covered in the previous sections, but the UX of Vector will be vastly improved by all of the above.

Guide-level Proposal

I'll leave this to @lukesteensen, but I'm a fan of starting simple. Instead of building out a subscriber, like what @LucioFranco did in tracing-limit, we can build a layer above all of this by adding a simple emit_event(my_event) function (or whatever we want to name it) that we call throughout the codebase. This would simply call the logging and metrics functions inline.
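
As a starting point for discussion, a minimal sketch of what that layer might look like. The `InternalEvent` trait, the `Telemetry` struct, and the `FileOpened` event are all hypothetical names, not existing Vector APIs. `Telemetry` is an in-memory stand-in for the real logging and metrics calls, and it is passed explicitly here only to keep the sketch self-contained; the real function would presumably take just the event.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the real logging and metrics backends.
#[derive(Debug, Default)]
pub struct Telemetry {
    pub log_lines: Vec<String>,
    pub counters: BTreeMap<String, u64>,
}

/// Hypothetical trait: each event knows how to express itself
/// as a log line and as metric updates.
pub trait InternalEvent {
    fn emit_logs(&self, telemetry: &mut Telemetry);
    fn emit_metrics(&self, telemetry: &mut Telemetry);
}

/// The single entry point called throughout the codebase.
/// It simply calls the logging and metrics functions inline.
pub fn emit_event(event: &impl InternalEvent, telemetry: &mut Telemetry) {
    event.emit_logs(telemetry);
    event.emit_metrics(telemetry);
}

/// Example event (made up for this sketch).
pub struct FileOpened<'a> {
    pub path: &'a str,
}

impl InternalEvent for FileOpened<'_> {
    fn emit_logs(&self, telemetry: &mut Telemetry) {
        telemetry
            .log_lines
            .push(format!("file_opened path={}", self.path));
    }

    fn emit_metrics(&self, telemetry: &mut Telemetry) {
        // Increment a counter; the metric name is an assumption.
        *telemetry
            .counters
            .entry("files_opened_total".to_string())
            .or_insert(0) += 1;
    }
}
```

Call sites then only say what happened, e.g. `emit_event(&FileOpened { path }, &mut telemetry)`, and the decision about which logs and metrics that produces lives in one place.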

There are a few things I would like to propose:

  1. This might seem overkill, but I would like to maintain an event catalog. That is, an events folder with a bunch of structs. I believe this will pay dividends and ensure we maintain a quality catalog of events. See all of the points above about why.
  2. Think carefully about the performance of instantiating event structs. I don't know enough about Rust, but I assume we can accomplish this.
  3. Event naming should follow a pattern. I recommend this style (<noun>_<past-tense-verb>).
  4. A single consistent attribute name for durations. I recommend duration_ms.
  5. Suffixing number attributes with their unit. For example, instead of duration we should be using duration_ms.
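
A rough sketch of what a catalog entry following these conventions could look like. The `events` module, the `RequestCompleted` struct, and its fields are hypothetical and only illustrate the naming pattern and the `duration_ms` attribute. Borrowing string data instead of allocating is one assumed way to keep instantiation cheap.

```rust
/// Hypothetical event catalog: one module holding one struct per event.
pub mod events {
    use std::time::Duration;

    /// Named in the <noun>_<past-tense-verb> style: "request_completed".
    pub struct RequestCompleted<'a> {
        /// Borrowed rather than owned, so building the event does not allocate.
        pub endpoint: &'a str,
        pub status: u16,
        /// Numeric attributes carry their unit as a suffix.
        pub duration_ms: u64,
    }

    impl<'a> RequestCompleted<'a> {
        /// Event name used for logs and as the basis for metric names.
        pub const NAME: &'static str = "request_completed";

        /// Converts the measured duration to milliseconds once, at construction.
        pub fn new(endpoint: &'a str, status: u16, elapsed: Duration) -> Self {
            Self {
                endpoint,
                status,
                duration_ms: elapsed.as_millis() as u64,
            }
        }
    }
}
```

With all events in one folder, reviewing a new event mostly means reviewing one new struct against the existing ones.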

This is bike-shedding a little, but we might as well agree on something. Regardless, I'd be very happy if we just had a catalog of events.

@binarylogic binarylogic added domain: observability Anything related to monitoring/observing Vector type: task Generic non-code related tasks needs: rfc Needs an RFC before work can begin. labels Mar 14, 2020
@binarylogic binarylogic added this to the Improve observability milestone Mar 16, 2020
@Hoverbear Hoverbear self-assigned this Mar 23, 2020
@binarylogic binarylogic removed this from the Vector Observability milestone Jul 20, 2020