RFC for event driven observability #2064
Labels
domain: observability
Anything related to monitoring/observing Vector
needs: rfc
Needs an RFC before work can begin.
type: task
Generic non-code related tasks
We need a simple RFC to explore an event-driven observability strategy. I'll try to lay the foundation for that RFC here.
Motivation
#1953 spikes out an internal_metrics source. This serves as the basis for our metrics collection, but there are still decisions to be made around metric collection philosophy. #2007 starts to address this and is a good start, but I have a few concerns:
Backward compatibility
The metrics we expose are part of our public API and are very sensitive to backward compatibility, because users will set up dashboards and alarms on these metrics. Generally, I am a fan of moving fast, learning more, and iterating, but this is one area I want to think through before shipping the first version, since it is very difficult to make changes later.
Following our own advice
We're an observability company 😄, and we have opinions on observability practices and tactics. I am a strong believer in event-driven observability. Generally, this is discussed in the context of in-house applications, not black-box services. These events are the actual pieces of data used to observe apps. Engineers have mental maps of these events and how they relate to their code, making it easy to ask questions against the raw event data. This is obviously not the case for Vector users, and therefore metrics become the critical means of monitoring Vector (as is the case for all black-box services).
But there is another large benefit of event-driven observability that still applies to black-box services, and that is deriving other observability data from these events, such as logs, metrics, and traces. The benefits of this are discussed in more detail in the following sections.
Consistency
As the Vector project matures we'll inevitably have more contributors, and it'll be very difficult to align them behind a metrics collection strategy. Not only will it be difficult to ensure consistent metric naming and labeling, but it'll be even more difficult to ensure a consistent collection methodology, such as those outlined in the RED and USE methods. And let's not forget structured logging. We're already having trouble with this in Vector with just logging.
Events solve this. Events are a single, simple paradigm to think through. "This thing happened, emit an event". Emitting events is much easier, and we can control how we derive other data, such as metrics. Metric naming and labeling achieve consistency through this pattern.
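To make the "derive metrics from events" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration, not existing Vector code: the Event trait, the FileOpened struct, and the derived_counter function are hypothetical names. The point is only that metric naming and labeling are decided in one function rather than at every call site.

```rust
use std::collections::BTreeMap;

/// Hypothetical trait: every event knows its own name and labels.
pub trait Event {
    /// Event name, e.g. "file_opened".
    fn name(&self) -> &'static str;
    /// Labels attached to anything derived from this event.
    fn labels(&self) -> BTreeMap<&'static str, String>;
}

/// Hypothetical event: some component opened a file.
pub struct FileOpened {
    pub component: String,
}

impl Event for FileOpened {
    fn name(&self) -> &'static str {
        "file_opened"
    }

    fn labels(&self) -> BTreeMap<&'static str, String> {
        let mut labels = BTreeMap::new();
        labels.insert("component", self.component.clone());
        labels
    }
}

/// The single place where counter naming is decided (assumed convention:
/// "<event name>_total" followed by the event's labels in sorted order).
pub fn derived_counter(event: &dyn Event) -> String {
    let labels: Vec<String> = event
        .labels()
        .iter()
        .map(|(key, value)| format!("{}=\"{}\"", key, value))
        .collect();
    format!("{}_total{{{}}}", event.name(), labels.join(","))
}
```

With this shape, a contributor only writes the event struct; changing the naming convention later means editing derived_counter, not every instrumented call site.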
Maintaining quality
As discussed in the previous section, consistency largely contributes to quality, but events will also force us to improve the quality of our logs and metrics. Here's why:
Long-term maintainability
Peppering events throughout our codebase decouples instrumentation from any specific observability strategy. For example, here are a few scenarios that will be much easier to achieve if we are emitting events:
User/Operator experience
This is covered in the previous sections, but the UX of Vector will be vastly improved by all of the above.
Guide-level Proposal
I'll leave this to @lukesteensen, but I'm a fan of starting simple. Instead of building out a subscriber, like what @LucioFranco did in tracing-limit, we can build a layer above all of this by adding a simple emit_event(my_event) function (or whatever we want to name it) that we call throughout the codebase. This would simply call the logging and metrics functions inline.
There are a few things I would like to propose:
- An events folder with a bunch of structs. I believe this will pay dividends and ensure we maintain a quality catalog of events. See all of the points above about why.
- [...] (<noun>_<past-tense-verb>).
- [...] duration_ms.
- [...] duration, we should be using duration_ms.
These are bike-shedding a little bit, but we might as well agree on something. Regardless, I'd be very happy if we just have a catalog of events.
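As a rough sketch of how the emit_event idea and an event catalog could fit together: the code below is illustrative only. The Sink struct stands in for the real logging and metrics backends, and InternalEvent, RequestCompleted, and the explicit sink parameter are hypothetical (the proposal itself only names emit_event(my_event)).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real logging and metrics backends.
#[derive(Default)]
pub struct Sink {
    pub log_lines: Vec<String>,
    pub counters: HashMap<&'static str, u64>,
}

/// Hypothetical trait implemented by every struct in the events catalog.
pub trait InternalEvent {
    /// Name in <noun>_<past-tense-verb> form.
    fn name(&self) -> &'static str;
    /// Structured fields rendered for the log line.
    fn fields(&self) -> String;
}

/// One catalog entry; in practice it would live in the events folder.
pub struct RequestCompleted {
    /// Duration in milliseconds, named duration_ms rather than duration.
    pub duration_ms: u64,
}

impl InternalEvent for RequestCompleted {
    fn name(&self) -> &'static str {
        "request_completed"
    }

    fn fields(&self) -> String {
        format!("duration_ms={}", self.duration_ms)
    }
}

/// Calls the logging and metrics paths inline, in one place.
pub fn emit_event(sink: &mut Sink, event: &dyn InternalEvent) {
    // Logging path: one structured line per event.
    sink.log_lines
        .push(format!("{} {}", event.name(), event.fields()));
    // Metrics path: one counter per event name.
    *sink.counters.entry(event.name()).or_insert(0) += 1;
}
```

A call site would then look like emit_event(&mut sink, &RequestCompleted { duration_ms: 12 }). Passing the sink explicitly is a choice made here to keep the sketch testable; a global recorder would match the one-argument form more closely.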