-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: add event-driven observability rfc #2093
Conversation
Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
Great PR! Please pay attention to the following items before merging: Files matching
This is an automatically generated QA checklist based on modified files |
First of all, great RFC! I like the idea of having structured events, as it encourages the similar thought process we apply to API design, but to the log and metrics gathering. No more it's just a write-and-forget thing, you now have to put in some thought into how to structure log and metrics reporting, attempt reusing existing events and so on. I have a few questions. So, do we use a mix of What about composition? I'm thinking, what if there's a specific event, that has to advance some global counter Also, I noticed lifetimes are omitted in struct definitions, making them way prettier than it's going to be "in real life". I think we should be accurate with those to not give people the wrong impression. |
Great question! I actually meant to add a section on this. My proposal is that we can still have some standalone logging statements around things like startup, config loading, topology building, etc. That is, things that happen once where the user would obviously want to see output in the terminal and that's it. Anything that you'd even remotely consider collecting metrics about (i.e. things that happen during normal operation) should be events. So logs are ok in certain areas, but metrics only come from emitted events.
I think this is part of the tradeoff between generic and specific events. If you need to emit two events in one place, that means something is wrong. The events should match one-to-one with actual domain events, so this would be a sign that the generic event is too generic. I gave some thought to the idea of event composition but didn't come up with an obvious, easy-to-use scheme. It is a very interesting idea, but I wonder if there's a simpler way to go about it (e.g. shared functions in the
Fair point! I'll push a fix for that. |
Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
In this terminology, the event that manages So, if we have a metric, like a
The only solution I see to satisfy all your criteria is to prohibit some forms of metrics altogether, like global metrics, that can be adjusted from multiple event sources. Not sure if that's what we want to do - as those limitations are on their own questionable. To summarize, I'd like to see a more in-depth explanation of this topic. P.S. In my past experience attempts to generalize metrics and log emitting were always far from ideal, and I always resorted back to handling them separately. I do often use the approach of structured entities to track the set of existing metrics, not so much for logs though. The difference between now and my past experience is that I wasn't doing that in |
I think you're conflating metrics and events. There is not a mutually exclusive relationship between them:
To put it more clearly, we should add events to the code base without even thinking about metrics to start. Then, once the events are in place, we can start to derive our metric catalog. And if we can't derive the metrics we want, then that's a strong sign our events are lacking. Either we're missing events, or our events are low quality.
Multiple events.
I think all of this moot given my comments above. Also, trying to reason through a bunch of hazy conceptual scenarios is going to make this discussion difficult 😄. If we want to get into the weeds with this, I'd recommend that we start to add events and try it out. We'll probably spend less effort and time that way.
I don't really understand what you're saying here, but I don't see any reason multiple events couldn't co-manage the same global metric. My two points above are probably applicable here.
I've had a starkly opposite experience. Logs and metrics are notoriously low quality and messy. In my experience, apps, where this data was a joy to use, were apps that derived this data from events. And there is a fairly strong consensus across the observability industry on this. I plan to write up a guide sharing my thoughts on this, with references, but, spoiler alert, I don't have hard and fast rules. The observability strategy should adjust to the app and the team. As long as events are mostly (not always) the single source of truth then that's what matters. I'm less concerned about event naming, organization, perfect rules, and so on.
Exactly, and conflating them produces suboptimal solutions. Finally, this is never going to be perfect 😄. There are no perfect rules that will remove all thinking from this process. As we progress with this strategy things will become more clear and we'll evolve -- just like all code. It'll get easier as we go and I'm happy to put together more education to align our discussions on this topic. And, if we find that this solution does not work for Vector, then we can try something else 😄 . |
So, considering all of the above, we probably won't actually have events like
This is helpful, I think it answers my initial question regarding composition:
So, to summarize - we do a single
The context here is that this is my response to the following:
I am trying to understand how to mix events and logs/metrics together. Actually the part you quoted is a part of the bigger paragraph, it ended up being split cause I used quoting there myself. I didn't intend it to be a standalone point, but rather a summary of the above. Here's the full "part":
So, what's wrong here is the presupposition that global events exist. I am realizing now that there simply should be no such thing as global events. The RFC has this notion of So, speaking of composition! If we want to share the schematic structure across events, we can just use shared struct values as fields! This may be useful for things that are like
Maybe my experience was different, but I have had pleasing results with metrics and logs. It is true that it's never achievable with low-quality metrics and logs. It is possible though to maintain metrics and log events in good shape, and then using them becomes a breeze. The key to achieving this zen in my past experience was to actually use all the data the app emits, and do it as soon as code is written. This also helps to provide a rationale for every data point that's generated by the app. But, to be honest, my experience wasn't perfect either. My point was that interfaces to emit logs and metrics are often have lots of tweaks, and while some tweaks are only to adjusted once (like log shape), others can often be switched at runtime (like the amount of data preserved in the log event). What I'm saying is it's hard to come up with a flexible enough interface that would satisfy all the needs that arise from practice. But, as I said, rust is different, and maybe this time we'll figure out a way to the interface that I'd be happy with. The potential sure is there. |
Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
Correct.
Sorry for the confusion here, but these are given explicitly in the "Open Questions" section of the RFC and are intended to show the spectrum of specificity that is possible when defining events. The point is not that we should have all of these, but quite the opposite. We need to figure out where exactly on the spectrum we should be. It seems your opinion is that we should be on the very specific end (i.e. in favor of
The goal of this RFC is actually to remove some of the flexibility of normal logging and metrics APIs in order to keep our data more consistent. We still have the full power of that flexibility internally, we're just putting a statically-designed facade in front of it. |
This looks exciting! I like the idea of statically typed internal events. The part about
Would it be global
Does it make sense to have an optional compile-time feature which would disable internal logging to improve performance? If so, then the macro approach can make it possible to implement such a feature, so that even the event structures would not be created in the first place if internal observability is disabled at compile time. In addition, although I'm not sure how much implementation complexity would it introduce, but it seems like in simple cases implementation of #[derive(Debug, InternalEvent)]
#[internal_event::metric(type = "counter", name = "events_processed", value = 1, source = "file")]
struct FileEventReceived<'a> {
#[internal_event::log(type = "trace", rate_limit_secs = 10, message = "Received one event.")]
pub file: &'a str,
#[internal_event::metric(type = "counter", name = "bytes_count", source = "file")]
pub byte_size: usize,
} or this #[derive(Debug, InternalEvent)]
#[internal_event(source = "file")]
#[internal_event::metric(type = "counter", name = "events_processed", value = 1)]
struct FileEventReceived<'a> {
#[internal_event::log(type = "trace", rate_limit_secs = 10, message = "Received one event.")]
pub file: &'a str,
#[internal_event::metric(type = "counter", name = "bytes_count")]
pub byte_size: usize,
} |
My intention was a single top-level
I'm not sure there's a need for this right now, but it is a good example of the kind of flexibility the macro gives us.
This is something we could do! Again, it might be a bit more work than it's worth right now, but by having all the events in one place we'd make it easy to transition to something like this in the future, totally isolated from normal application code. I definitely expect the internals of these events to evolve as the Rust metrics and logging ecosystem matures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, seems like a nice improvement! I think what @a-rodin proposed with the derive macro seems ideal and shouldn't be that difficult with the new proc macros but it could really ease the difficulty for newer users. The bigger issue with the macros is that it will increase compile times but by how much I am not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, wonder if we can use generics to help prevent re-implementation.
} | ||
``` | ||
|
||
On one hand, more specific events like `FileEventReceived` can be tightly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use generics? EventRecieved<S: Sink>
or something? It could save us some reimplementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. I'm not sure exactly what it'd look like. My gut feeling is to start simple for now and explore stuff like this as we get more experience.
uniformly on spans at the topology-building layer (e.g. `component_kind`, | ||
`component_type`, `component_name`). | ||
|
||
2. Rely on the existing `tracing` implementation to output that context in logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed +1
|
||
Metrics context work: | ||
|
||
* [ ] Coordinate with `metrics` maintainers to determine best path forward for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tobz :)
Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
information. Extending them to allow for more dynamic data would defeat much of | ||
the purpose of using structs in the first place. | ||
|
||
Finding the right middle ground here is something that will likely come with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, this will be something that we'll refine over time. And it'll never be perfect, but it should start to feel more automatic as head down this road. Speaking from experience, it took a couple of passes to settle on the right middle ground, afterward, it was mostly automatic. I plan to help help with this process in a separate PR where we can all discuss the tradeoffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent RFC and a good start. I think we'll find a good middle ground for event naming when we start to implement that. But I'm a 👍 on this RFC as it stands.
Closes #2064
Rendered