
chore: add event-driven observability rfc #2093

Merged 5 commits into master on Apr 1, 2020

Conversation

lukesteensen
Member

Closes #2064

Rendered

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
@github-actions

github-actions bot commented Mar 18, 2020

Great PR! Please pay attention to the following items before merging:

Files matching rfcs/**:

  • Have at least 3 team members approved this RFC?

This is an automatically generated QA checklist based on modified files

@MOZGIII
Contributor

MOZGIII commented Mar 19, 2020

First of all, great RFC! I like the idea of having structured events, as it encourages applying the same thought process we use for API design to log and metrics gathering. It's no longer a write-and-forget thing: you now have to put some thought into how to structure log and metric reporting, try to reuse existing events, and so on.

I have a few questions.

So, do we use a mix of trace!/counter! and emit! with structured events, or only emit!? Is it prohibited to simply use trace!/counter! in favor of emit!? If not, when (and how) do we promote regular trace!/counter! calls to structured events?

What about composition? I'm thinking, what if there's a specific event that has to advance some global counter processed_events and an event-specific counter, for instance merge_transform_merged_events. Would we do two emit!s there, one with a global event and one with a specific one, or would we implement a specific event in such a way that it internally advances both counters? This differs from FileEventReceived in the provided example in that the scope of the metrics is different: processed_events represents all the events across the system, while merge_transform_merged_events is only relevant to a particular transform.

Also, I noticed lifetimes are omitted in the struct definitions, making them way prettier than they're going to be "in real life". I think we should be accurate with those so as not to give people the wrong impression.

@lukesteensen
Member Author

So, do we use a mix of trace!/counter! and emit! with structured events, or only emit!? Is it prohibited to simply use trace!/counter! in favor of emit!? If not, when (and how) do we promote regular trace!/counter! calls to structured events?

Great question! I actually meant to add a section on this. My proposal is that we can still have some standalone logging statements around things like startup, config loading, topology building, etc. That is, things that happen once where the user would obviously want to see output in the terminal and that's it. Anything that you'd even remotely consider collecting metrics about (i.e. things that happen during normal operation) should be events. So logs are ok in certain areas, but metrics only come from emitted events.

What about composition? I'm thinking, what if there's a specific event that has to advance some global counter processed_events and an event-specific counter, for instance merge_transform_merged_events. Would we do two emit!s there, one with a global event and one with a specific one, or would we implement a specific event in such a way that it internally advances both counters? This differs from FileEventReceived in the provided example in that the scope of the metrics is different: processed_events represents all the events across the system, while merge_transform_merged_events is only relevant to a particular transform.

I think this is part of the tradeoff between generic and specific events. If you need to emit two events in one place, that means something is wrong. The events should match one-to-one with actual domain events, so this would be a sign that the generic event is too generic.

I gave some thought to the idea of event composition but didn't come up with an obvious, easy-to-use scheme. It is a very interesting idea, but I wonder if there's a simpler way to go about it (e.g. shared functions in the internal_events module).

Also, I noticed lifetimes are omitted in the struct definitions, making them way prettier than they're going to be "in real life". I think we should be accurate with those so as not to give people the wrong impression.

Fair point! I'll push a fix for that.

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
@MOZGIII
Contributor

MOZGIII commented Mar 19, 2020

What about composition? I'm thinking, what if there's a specific event that has to advance some global counter processed_events and an event-specific counter, for instance merge_transform_merged_events. Would we do two emit!s there, one with a global event and one with a specific one, or would we implement a specific event in such a way that it internally advances both counters? This differs from FileEventReceived in the provided example in that the scope of the metrics is different: processed_events represents all the events across the system, while merge_transform_merged_events is only relevant to a particular transform.

I think this is part of the tradeoff between generic and specific events. If you need to emit two events in one place, that means something is wrong. The events should match one-to-one with actual domain events, so this would be a sign that the generic event is too generic.

In this terminology, the event that manages the processed_events metric would be a generic event and the event that manages merge_transform_merged_events a specific event, right? There seems to be a problem with thinking in terms of events when we need to discuss the individual components that constitute an event... I'm not sure that makes sense yet...

So, if we have a metric like a processed_events counter, does it have to be adjusted via only a certain event (that event being, in a sense, the owning event for that metric), or is it allowed to adjust that counter from multiple events? There are events, in a high-level sense, that can be in different domains from one perspective (different sinks/transforms/etc. each have their own domain) but in the same domain from another perspective (all the things that can process an event are, in that sense, in one domain), like the processed_events counter.
If there has to be a one-to-one correspondence between the emittable event struct and an actual domain event kind, then specifying a domain is a problem. If we want to have a count of all processed events, then this "event" would have a "dedicated" global domain, right? And then we'd have to use emit! on that, and another emit! for an event that would handle the merge_transform_merged_events counter advancement. This would make sense, but it conflicts with what you said:

If you need to emit two events in one place, that means something is wrong.

The only solution I see that satisfies all your criteria is to prohibit some forms of metrics altogether, such as global metrics that can be adjusted from multiple event sources. I'm not sure that's what we want to do, as those limitations are questionable in their own right. To summarize, I'd like to see a more in-depth explanation of this topic.

P.S. In my past experience, attempts to generalize metric and log emitting were always far from ideal, and I always resorted to handling them separately. I do often use the approach of structured entities to track the set of existing metrics, though not so much for logs. The difference between now and my past experience is that I wasn't doing it in Rust. 😄 So I'm willing to try again. That said, at this point I suspect there are fundamental differences between (log) events and metrics, so I also expect a suboptimal solution at best.

@binarylogic
Contributor

binarylogic commented Mar 19, 2020

In this terminology, the event that manages the processed_events metric would be a generic event and the event that manages merge_transform_merged_events a specific event, right? There seems to be a problem with thinking in terms of events when we need to discuss the individual components that constitute an event... I'm not sure that makes sense yet...

I think you're conflating metrics and events. There is not a mutually exclusive relationship between them:

  • A single event can emit multiple metrics.
  • Multiple events can co-manage the same metrics.

To put it more clearly, we should add events to the code base without even thinking about metrics to start. Then, once the events are in place, we can start to derive our metric catalog. And if we can't derive the metrics we want, then that's a strong sign our events are lacking. Either we're missing events, or our events are low quality.

So, if we have a metric like a processed_events counter, does it have to be adjusted via only a certain event (that event being, in a sense, the owning event for that metric), or is it allowed to adjust that counter from multiple events?

Multiple events.

There are events, in a high-level sense,
...
This would make sense, but it conflicts with what you said:

I think all of this is moot given my comments above.

Also, trying to reason through a bunch of hazy conceptual scenarios is going to make this discussion difficult 😄. If we want to get into the weeds with this, I'd recommend that we start to add events and try it out. We'll probably spend less effort and time that way.

The only solution I see that satisfies all your criteria is to prohibit some forms of metrics altogether, such as global metrics that can be adjusted from multiple event sources. I'm not sure that's what we want to do, as those limitations are questionable in their own right.

I don't really understand what you're saying here, but I don't see any reason multiple events couldn't co-manage the same global metric. My two points above are probably applicable here.

In my past experience, attempts to generalize metric and log emitting were always far from ideal, and I always resorted to handling them separately.

I've had a starkly opposite experience. Logs and metrics are notoriously low quality and messy. In my experience, the apps where this data was a joy to use were the ones that derived that data from events.

And there is a fairly strong consensus across the observability industry on this. I plan to write up a guide sharing my thoughts on this, with references, but, spoiler alert, I don't have hard and fast rules. The observability strategy should adjust to the app and the team. As long as events are mostly (not always) the single source of truth, that's what matters. I'm less concerned about event naming, organization, perfect rules, and so on.

That said, at this point I suspect there are fundamental differences between (log) events and metrics, so I also expect a suboptimal solution at best.

Exactly, and conflating them produces suboptimal solutions.


Finally, this is never going to be perfect 😄. There are no perfect rules that will remove all thinking from this process. As we progress with this strategy things will become more clear and we'll evolve -- just like all code. It'll get easier as we go and I'm happy to put together more education to align our discussions on this topic. And, if we find that this solution does not work for Vector, then we can try something else 😄 .

@MOZGIII
Contributor

MOZGIII commented Mar 19, 2020

So, considering all of the above, we probably won't actually have events like EventProcessed, but rather EventProcessedAtFileSink and EventProcessedAtMergeTransform?
I feel like, to resolve my uncertainties, concrete examples would be far more helpful than loosely specifying properties of the design.

In this terminology, the event that manages the processed_events metric would be a generic event and the event that manages merge_transform_merged_events a specific event, right? There seems to be a problem with thinking in terms of events when we need to discuss the individual components that constitute an event... I'm not sure that makes sense yet...

I think you're conflating metrics and events. There is not a mutually exclusive relationship between them:

  • A single event can emit multiple metrics.
  • Multiple events can co-manage the same metrics.

To put it more clearly, we should add events to the code base without even thinking about metrics to start. Then, once the events are in place, we can start to derive our metric catalog. And if we can't derive the metrics we want, then that's a strong sign our events are lacking. Either we're missing events, or our events are low quality.

So, if we have a metric like a processed_events counter, does it have to be adjusted via only a certain event (that event being, in a sense, the owning event for that metric), or is it allowed to adjust that counter from multiple events?

Multiple events. Again, events and metrics are not mutually exclusive.

This is helpful, I think it answers my initial question regarding composition:

What about composition? I'm thinking, what if there's a specific event that has to advance some global counter processed_events and an event-specific counter, for instance merge_transform_merged_events. Would we do two emit!s there, one with a global event and one with a specific one, or would we implement a specific event in such a way that it internally advances both counters? This differs from FileEventReceived in the provided example in that the scope of the metrics is different: processed_events represents all the events across the system, while merge_transform_merged_events is only relevant to a particular transform.

So, to summarize: we do a single emit!, and that emit! makes two counter! calls internally. There may be other event struct kinds that issue counter! with the same labels, and that's how we maintain the global counters.
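In Rust, the summary above might look like this minimal sketch. The counter store is a toy stand-in for the real metrics backend, and the event and counter names are the hypothetical ones from this thread, not from the RFC itself:

```rust
use std::collections::HashMap;

// Toy stand-in for the real metrics backend.
type Counters = HashMap<&'static str, u64>;

struct MergeTransformEventMerged;

impl MergeTransformEventMerged {
    // A single emitted event advances both the global and the
    // transform-specific counter; other event structs can bump
    // `processed_events` using the same label.
    fn emit_metrics(&self, counters: &mut Counters) {
        *counters.entry("processed_events").or_insert(0) += 1;
        *counters.entry("merge_transform_merged_events").or_insert(0) += 1;
    }
}

fn main() {
    let mut counters = Counters::new();
    MergeTransformEventMerged.emit_metrics(&mut counters);
    assert_eq!(counters["processed_events"], 1);
    assert_eq!(counters["merge_transform_merged_events"], 1);
}
```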

The only solution I see that satisfies all your criteria is to prohibit some forms of metrics altogether, such as global metrics that can be adjusted from multiple event sources. I'm not sure that's what we want to do, as those limitations are questionable in their own right.

I don't really understand what you're saying here, but I don't see any reason multiple events couldn't co-manage the same global metric. My two points above are probably applicable here.

The context here is that this is my response to the following:

I think this is part of the tradeoff between generic and specific events. If you need to emit two events in one place, that means something is wrong. The events should match one-to-one with actual domain events, so this would be a sign that the generic event is too generic.

I am trying to understand how to mix events and logs/metrics together. Actually, the part you quoted is part of a bigger paragraph; it ended up being split because I used quoting there myself. I didn't intend it as a standalone point, but rather as a summary of the above.

Here's the full "part":

So, if we have a metric like a processed_events counter, does it have to be adjusted via only a certain event (that event being, in a sense, the owning event for that metric), or is it allowed to adjust that counter from multiple events? There are events, in a high-level sense, that can be in different domains from one perspective (different sinks/transforms/etc. each have their own domain) but in the same domain from another perspective (all the things that can process an event are, in that sense, in one domain), like the processed_events counter.
If there has to be a one-to-one correspondence between the emittable event struct and an actual domain event kind, then specifying a domain is a problem. If we want to have a count of all processed events, then this "event" would have a "dedicated" global domain, right? And then we'd have to use emit! on that, and another emit! for an event that would handle the merge_transform_merged_events counter advancement. This would make sense, but it conflicts with what you said: "If you need to emit two events in one place, that means something is wrong". The only solution I see that satisfies all your criteria is to prohibit some forms of metrics altogether, such as global metrics that can be adjusted from multiple event sources. I'm not sure that's what we want to do, as those limitations are questionable in their own right. To summarize, I'd like to see a more in-depth explanation of this topic.

So, what's wrong here is the presupposition that global events exist. I am realizing now that there simply should be no such thing as global events. The RFC has this notion of EventProcessed and EventReceived in the "Specific vs Reusable Events" section. I think this is what put me off, because the main RFC body specifies the concrete FileEventReceived event. Now, global or shared events are problematic because it becomes hard to determine and specify their domains. I think this is therefore not moot, but a valuable point that lets us conclude we shouldn't have global events. It is further reinforced by the fact that the core idea of the design is to have events defined explicitly, allowing us to simply obtain a concrete list of all possible events in the system.

So, speaking of composition! If we want to share the schematic structure across events, we can just use shared struct values as fields!

This may be useful for things that are like EventProcessed (see the "Specific vs Reusable Events" section; btw, 371f879 didn't add lifetimes to those!) but more complex. For instance, when we create a file (for one reason or another), suppose we always want to specify its name and permissions (just for the sake of example). Then we can have struct FileCreationDatum<'a> { name: &'a str, perms: u8 }, and events like struct FileSinkFileCreatedEvent<'a> { created_file: FileCreationDatum<'a> } and struct DiskBufferFileCreatedEvent<'a> { created_file: FileCreationDatum<'a> }.
In fact, I think I saw this pattern a lot in my previous log/metrics setups, so I guess we'll have to do some form of it anyway.
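For reference, here is a compiling version of that field-composition sketch. The struct and field names come from the comment above, and the perms value is purely illustrative (a real Unix mode would not fit in a u8):

```rust
// Shared schematic structure, reused as a plain field by two distinct events.
struct FileCreationDatum<'a> {
    name: &'a str,
    perms: u8,
}

struct FileSinkFileCreatedEvent<'a> {
    created_file: FileCreationDatum<'a>,
}

struct DiskBufferFileCreatedEvent<'a> {
    created_file: FileCreationDatum<'a>,
}

fn main() {
    // Both events carry the same datum shape without any event inheritance.
    let sink_event = FileSinkFileCreatedEvent {
        created_file: FileCreationDatum { name: "out.log", perms: 0o77 },
    };
    let buffer_event = DiskBufferFileCreatedEvent {
        created_file: FileCreationDatum { name: "buffer.db", perms: 0o77 },
    };
    assert_eq!(sink_event.created_file.name, "out.log");
    assert_eq!(
        sink_event.created_file.perms,
        buffer_event.created_file.perms
    );
}
```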

I've had a starkly opposite experience. Logs and metrics are notoriously low quality and messy. In my experience, the apps where this data was a joy to use were the ones that derived that data from events. Logs and metrics were correlated, high-cardinality dimensions existed, and answering the unknown-unknowns was possible. I have never seen this accomplished from raw logs and metrics.

Maybe my experience was different, but I have had pleasing results with metrics and logs. It's true that this is never achievable with low-quality metrics and logs. It is possible, though, to maintain metrics and log events in good shape, and then using them becomes a breeze. The key to achieving this zen, in my past experience, was to actually use all the data the app emits, and to do it as soon as the code is written. This also helps provide a rationale for every data point the app generates. But, to be honest, my experience wasn't perfect either.

My point was that interfaces for emitting logs and metrics often have lots of tweaks, and while some are only adjusted once (like log shape), others can often be switched at runtime (like the amount of data preserved in the log event). What I'm saying is that it's hard to come up with an interface flexible enough to satisfy all the needs that arise in practice. But, as I said, Rust is different, and maybe this time we'll figure out an interface I'd be happy with. The potential is certainly there.

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
@lukesteensen
Member Author

So, to summarize: we do a single emit!, and that emit! makes two counter! calls internally. There may be other event struct kinds that issue counter! with the same labels, and that's how we maintain the global counters.

Correct.

So, what's wrong here is the presupposition that global events exist. I am realizing now that there simply should be no such thing as global events. The RFC has this notion of EventProcessed and EventReceived in the "Specific vs Reusable Events" section. I think this is what put me off, because the main RFC body specifies the concrete FileEventReceived event.

Sorry for the confusion here, but these are given explicitly in the "Open Questions" section of the RFC and are intended to show the spectrum of specificity that is possible when defining events. The point is not that we should have all of these, but quite the opposite: we need to figure out where exactly on the spectrum we should be. It seems your opinion is that we should be on the very specific end (i.e. in favor of FileEventReceived instead of either of the other two), which is valid! I lean in that direction myself, but I do think it's worth considering some degree of reuse.

My point was that interfaces for emitting logs and metrics often have lots of tweaks, and while some are only adjusted once (like log shape), others can often be switched at runtime (like the amount of data preserved in the log event). What I'm saying is that it's hard to come up with an interface flexible enough to satisfy all the needs that arise in practice.

The goal of this RFC is actually to remove some of the flexibility of normal logging and metrics APIs in order to keep our data more consistent. We still have the full power of that flexibility internally, we're just putting a statically-designed facade in front of it.
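As a rough illustration of such a facade (not the RFC's exact API): application code only constructs typed event structs and hands them to a single emit entry point, while all the logging/metrics flexibility lives inside the event's implementation. The trait shape follows the emit_logs/emit_metrics split discussed in this thread; the log and counter sinks below are toy stand-ins:

```rust
use std::collections::HashMap;

// The narrow, statically-designed surface that application code sees.
trait InternalEvent {
    fn emit_logs(&self, logs: &mut Vec<String>);
    fn emit_metrics(&self, counters: &mut HashMap<&'static str, u64>);
}

fn emit(
    event: &impl InternalEvent,
    logs: &mut Vec<String>,
    counters: &mut HashMap<&'static str, u64>,
) {
    event.emit_logs(logs);
    event.emit_metrics(counters);
}

struct FileEventReceived<'a> {
    file: &'a str,
    byte_size: usize,
}

impl InternalEvent for FileEventReceived<'_> {
    // Full flexibility is retained here, behind the facade.
    fn emit_logs(&self, logs: &mut Vec<String>) {
        logs.push(format!("Received one event. file={}", self.file));
    }
    fn emit_metrics(&self, counters: &mut HashMap<&'static str, u64>) {
        *counters.entry("events_processed").or_insert(0) += 1;
        *counters.entry("bytes_count").or_insert(0) += self.byte_size as u64;
    }
}

fn main() {
    let (mut logs, mut counters) = (Vec::new(), HashMap::new());
    let event = FileEventReceived { file: "app.log", byte_size: 42 };
    emit(&event, &mut logs, &mut counters);
    assert_eq!(counters["events_processed"], 1);
    assert_eq!(counters["bytes_count"], 42);
    assert_eq!(logs.len(), 1);
}
```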

@ghost

ghost commented Mar 19, 2020

This looks exciting! I like the idea of statically typed internal events.

The part about the internal_events module is slightly unclear to me:

Go to the internal_events module and define a new struct with fields for whatever relevant data you have.

Would it be global internal_events or a separate internal_events module for each component? As an alternative, could these structures reside in the same modules as components which use them, similarly to how snafu enums describing errors are defined locally inside component modules?

The example above uses an emit! macro, but doesn't currently do anything that requires it to be a macro. This may provide some flexibility for the future, or could be considered an overcomplication.

Does it make sense to have an optional compile-time feature which would disable internal logging to improve performance? If so, then the macro approach can make it possible to implement such a feature, so that even the event structures would not be created in the first place if internal observability is disabled at compile time.
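As a sketch of how the macro could enable that: the cfg attribute can gate the whole argument expression, so the event struct literal is never built when observability is compiled out. The feature name here is hypothetical, and the atomic counter is a toy stand-in for real log/metric emission:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static EVENTS_EMITTED: AtomicU64 = AtomicU64::new(0);

struct FileEventReceived<'a> {
    file: &'a str,
}

impl FileEventReceived<'_> {
    fn emit(&self) {
        // A real implementation would call `trace!` / `counter!` here.
        let _ = self.file;
        EVENTS_EMITTED.fetch_add(1, Ordering::Relaxed);
    }
}

// With the (hypothetical) feature enabled at compile time, the statement,
// including construction of the event struct, disappears entirely.
macro_rules! emit {
    ($event:expr) => {{
        #[cfg(not(feature = "disable-internal-observability"))]
        $event.emit();
    }};
}

fn main() {
    emit!(FileEventReceived { file: "/var/log/app.log" });
    assert_eq!(EVENTS_EMITTED.load(Ordering::Relaxed), 1);
}
```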

In addition, although I'm not sure how much implementation complexity it would introduce, it seems like in simple cases the implementation of emit_logs and emit_metrics can be facilitated by custom derive macros, for example like this:

#[derive(Debug, InternalEvent)]
#[internal_event::metric(type = "counter", name = "events_processed", value = 1, source = "file")]
struct FileEventReceived<'a> {
    #[internal_event::log(type = "trace", rate_limit_secs = 10, message = "Received one event.")]
    pub file: &'a str,
    #[internal_event::metric(type = "counter", name = "bytes_count", source = "file")]
    pub byte_size: usize,
}

or this

#[derive(Debug, InternalEvent)]
#[internal_event(source = "file")]
#[internal_event::metric(type = "counter", name = "events_processed", value = 1)]
struct FileEventReceived<'a> {
    #[internal_event::log(type = "trace", rate_limit_secs = 10, message = "Received one event.")]
    pub file: &'a str,
    #[internal_event::metric(type = "counter", name = "bytes_count")]
    pub byte_size: usize,
}

@lukesteensen
Member Author

Would it be global internal_events or a separate internal_events module for each component? As an alternative, could these structures reside in the same modules as components which use them, similarly to how snafu enums describing errors are defined locally inside component modules?

My intention was a single top-level internal_events module. This would concentrate all of the event definitions into one source directory for easy scanning.

Does it make sense to have an optional compile-time feature which would disable internal logging? If so, then the macro approach can make it possible to implement such a feature, so that even the event structures would not be created in the first place if the logging is disabled at compile time.

I'm not sure there's a need for this right now, but it is a good example of the kind of flexibility the macro gives us.

In addition, although I'm not sure how much implementation complexity it would introduce, it seems like in simple cases the implementation of emit_logs and emit_metrics can be facilitated by custom derive macros, for example like this:

This is something we could do! Again, it might be a bit more work than it's worth right now, but by having all the events in one place we'd make it easy to transition to something like this in the future, totally isolated from normal application code. I definitely expect the internals of these events to evolve as the Rust metrics and logging ecosystem matures.

Contributor

@LucioFranco LucioFranco left a comment


Overall, this seems like a nice improvement! I think what @a-rodin proposed with the derive macro seems ideal and shouldn't be that difficult with the new proc macros, and it could really ease the difficulty for newer users. The bigger issue with the macros is that they will increase compile times, though by how much I'm not sure.

@Hoverbear Hoverbear added the domain: observability Anything related to monitoring/observing Vector label Mar 24, 2020
Contributor

@Hoverbear Hoverbear left a comment


Looks good; I wonder if we can use generics to help prevent re-implementation.

rfcs/2020-03-17-2064-event-driven-observability.md (resolved review threads)
}
```

On one hand, more specific events like `FileEventReceived` can be tightly
Contributor


Could we use generics? EventReceived<S: Sink> or something? It could save us some reimplementation.

Member Author


That's a good question. I'm not sure exactly what it'd look like. My gut feeling is to start simple for now and explore stuff like this as we get more experience.

uniformly on spans at the topology-building layer (e.g. `component_kind`,
`component_type`, `component_name`).

2. Rely on the existing `tracing` implementation to output that context in logs.
Contributor


Agreed!

Contributor


Agreed +1


Metrics context work:

* [ ] Coordinate with `metrics` maintainers to determine best path forward for
Contributor


@tobz :)

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
information. Extending them to allow for more dynamic data would defeat much of
the purpose of using structs in the first place.

Finding the right middle ground here is something that will likely come with
Contributor


Exactly, this will be something that we'll refine over time. It'll never be perfect, but it should start to feel more automatic as we head down this road. Speaking from experience, it took a couple of passes to settle on the right middle ground; afterward, it was mostly automatic. I plan to help with this process in a separate PR where we can all discuss the tradeoffs.

Contributor

@binarylogic binarylogic left a comment


This is an excellent RFC and a good start. I think we'll find a good middle ground for event naming when we start to implement that. But I'm a 👍 on this RFC as it stands.

@lukesteensen lukesteensen merged commit 45f3c09 into master Apr 1, 2020