Specification for metrics collection #2007
## Metrics Collection

### Philosophy

For inspiration, we'll look at the RED and USE methodologies. Rate and errors are virtually always relevant, and depending on the component, utilization and duration can be as well.

For rate, we should record an […]. Errors are pretty self-explanatory. We don't need metrics for every possible kind, but we should look over […]. Utilization is important for components that take a significant amount of some non-CPU system resources. For example, we measure the memory utilization of the Lua transform. Memory utilization of in-flight batches is another good example. Duration applies to components like HTTP-based sinks, where there's some request-response cycle we want to time. It can also be used around things like the Lua transform, where runtime can depend heavily on the configuration.

### Implementation

#### Naming

As much as possible, names for the same type of metrics should be consistent across components. They should be namespaced by the category of component (i.e. source, transform, sink, internal) and use common suffixes for the same data (e.g. […]). The example instrumentation so far in #1953 uses a rough […].

#### Shared components

The naming scheme above runs into some complications with shared subcomponents or those that are simple wrappers around another. Since we don't know the whole runtime context at the callsite, we can't include things like […]. The current examples simply omit that portion of the key and rely on the name. A perhaps better alternative is to make […].

#### Durations

In certain areas of the code, measuring durations is currently very complex due to the pre-async/await style. There are two ongoing pieces of work that should simplify them greatly: refactoring to use async/await and building the […].

#### Checklist

This is a rough skeleton of an instrumentation plan. Once we settle on a naming scheme, we can go through and expand each item into the actual names of the metrics we want to gather. We can also drop the optional […]. For now, I've checked off the ones added as examples in #1953.

Sources: […]
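The naming scheme above is easiest to see in code. Below is a minimal, hypothetical Rust sketch of composing component-namespaced metric keys; the key format and helper names are assumptions for illustration, not the scheme #1953 actually settled on.

```rust
/// Hypothetical sketch of component-namespaced metric keys. The
/// `{category}_{component}_{suffix}` format is assumed for illustration.

#[derive(Clone, Copy)]
enum ComponentCategory {
    Source,
    Transform,
    Sink,
    Internal,
}

impl ComponentCategory {
    fn as_str(self) -> &'static str {
        match self {
            ComponentCategory::Source => "source",
            ComponentCategory::Transform => "transform",
            ComponentCategory::Sink => "sink",
            ComponentCategory::Internal => "internal",
        }
    }
}

/// Compose a metric key namespaced by component category, with a
/// common suffix shared across components so that the same data has
/// the same name everywhere (e.g. an events-processed counter).
fn metric_key(category: ComponentCategory, component: &str, suffix: &str) -> String {
    format!("{}_{}_{}", category.as_str(), component, suffix)
}

fn main() {
    // A file source and an HTTP sink report the same suffix, so
    // dashboards can aggregate the same measurement across components.
    let a = metric_key(ComponentCategory::Source, "file", "events_processed");
    let b = metric_key(ComponentCategory::Sink, "http", "request_duration_ms");
    println!("{}", a); // => "source_file_events_processed"
    println!("{}", b); // => "sink_http_request_duration_ms"
}
```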
@lukesteensen Can you help me understand the status of this ticket? It hasn't seen activity in 3 weeks.
@Hoverbear This is step 3 in the plan of attack as laid out in the RFC. Once #1953 is merged, we will update this issue with a list of events to be added.
The initial implementation has been merged! 🎉 So far, the "spec" has been pretty minimal: […]

These all focus on the metrics that are generated via events, since those are user-facing and need to be consistent. We will expand this moving forward, including more patterns for the events themselves.
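As a rough illustration of the event-driven pattern described here, the sketch below models components emitting typed events that know how to record their own metrics. The trait and type names are invented for illustration and deliberately simplified; they are not Vector's actual internal API.

```rust
/// Minimal sketch of event-driven instrumentation: components emit
/// typed events, and each event records its own metrics, so metric
/// names stay consistent in one place. Names here are hypothetical.
trait InternalEvent {
    /// Record any metrics associated with this event.
    fn emit_metrics(&self) {}
}

struct EventProcessed {
    byte_size: usize,
}

impl InternalEvent for EventProcessed {
    fn emit_metrics(&self) {
        // A real implementation would increment counters in a metrics
        // registry; printing stands in for that side effect here.
        println!("counter events_processed_total += 1");
        println!("counter bytes_processed_total += {}", self.byte_size);
    }
}

/// Central emit point: components call this instead of touching
/// metrics directly.
fn emit(event: impl InternalEvent) {
    event.emit_metrics();
}

fn main() {
    emit(EventProcessed { byte_size: 128 });
}
```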
Currently I'm using 0.8.2 (hopefully I'll have bandwidth to upgrade to 0.9.0 in a week or so), and while it's enough for a validation phase, it's good to see there's going to be more detailed monitoring. IIRC the proposed metrics cover almost all the things I would need for a similar investigation with Vector. I'm not sure about topologies other than "distributed", but since all components share the same metrics, I guess it's a question of labels.
Thanks @Alexx-G, that's helpful. And what you described will be possible with our approach. At a minimum, Vector will expose two counts: […]

These names and labels will change; that's what we're working through. If you have any other requirements, please let us know. I'll also ping you on the PR that introduces all of this.
I've checked exactly which metrics helped a lot with the fluentd investigation: […]

There's also a metric that might not be required, but helps a lot with fine-tuning and avoiding hard limits on the events destination: […]

Also, the ability to define custom metrics (e.g. an event counter for a specific source/transform/sink) and add them to the built-in metrics is highly valuable. IIRC all of these, except flush retries and queue length, are covered by this spec. I haven't had a chance to check it yet, but the "log_to_metric" transform seems to cover the "custom metric" use case.
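For reference, the "custom metric" use case mentioned above looks roughly like the TOML sketch below with the log_to_metric transform. This is a hedged example: the component names (`my_source`, `status_counter`) and field values are invented for illustration, and exact option names may differ between Vector versions.

```toml
# Hypothetical pipeline: derive a counter from log events, tagged by
# the event's status and host fields.
[transforms.status_counter]
type = "log_to_metric"
inputs = ["my_source"]

[[transforms.status_counter.metrics]]
type = "counter"
field = "status"       # increment only when this field is present
name = "status_total"  # name of the emitted metric
tags = {status = "{{status}}", host = "{{host}}"}
```

Paired with a metrics sink (e.g. the prometheus sink), this turns ad-hoc log fields into counters without any code changes.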
That's very helpful. We'll take all of that into account when defining all of the metrics.
@binarylogic One quick question: is there a metric (existing or planned) to track exceeded rate limits?
Definitely. We're also addressing the higher-level problem of picking rate limits with #2329.
Oh, I somehow missed rate being mentioned so many times in that comment with the metrics attack plan. Forgive my ignorance :)
Hey @binarylogic @lukesteensen, […]
@Alexx-G that would be wonderful, thank you!
Closing this since it will be superseded by the work in #3192. We've since switched to an event-driven system, and we need specific issues for implementing the remaining events. We are defining the remaining work now.
As a follow-up to #1761, we'll need a spec on which metrics we want to collect. I would like to inventory all of these metrics so we can reach consensus and document them properly. This can be as simple as a list of metric names and labels.