Add the ability to expire internal metrics. #11995

Closed
2 of 3 tasks
tobz opened this issue Mar 25, 2022 · 12 comments
Labels: domain: metrics, domain: observability, Epic, source: internal_metrics

Comments

@tobz
Contributor

tobz commented Mar 25, 2022

This epic represents the work required to complete the ability to control the cardinality of our internal metrics through expiring old ones.

Context

Currently, internal metrics are captured via the metrics crate. We use a customized Recorder implementation that forwards them to a bare-bones Registry for storage, and then, on an interval, we collect the registered metrics and forward them into the topology.
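
As a rough illustration of that flow, here is a minimal sketch using simplified placeholder types (not Vector's or the metrics crate's actual APIs): a recorder-style write path feeding a shared registry, plus a collection pass that would run on an interval.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical, simplified stand-ins for Vector's actual recorder/registry types.
#[derive(Default)]
struct SimpleRegistry {
    counters: Mutex<HashMap<String, u64>>,
}

impl SimpleRegistry {
    // Write path: every counter update from the recorder lands in the registry.
    fn increment_counter(&self, name: &str, value: u64) {
        *self
            .counters
            .lock()
            .unwrap()
            .entry(name.to_string())
            .or_insert(0) += value;
    }

    // Collection path: snapshot everything currently registered and forward it
    // into the topology as internal metric events.
    fn collect(&self) -> Vec<(String, u64)> {
        self.counters
            .lock()
            .unwrap()
            .iter()
            .map(|(name, value)| (name.clone(), *value))
            .collect()
    }
}

fn main() {
    let registry = Arc::new(SimpleRegistry::default());
    registry.increment_counter("component_received_events_total", 10);

    // In Vector this collection runs periodically; note that nothing is ever
    // removed from the map, which is the cardinality problem described next.
    for (name, value) in registry.collect() {
        println!("{} = {}", name, value);
    }
}
```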

While this works well from a performance/simplicity standpoint, we currently do not expire metrics that are no longer used. This can lead to issues, such as the one described in #11821, where high-cardinality label values generate many unique metrics in the registry over time, which manifests as a slow memory leak that requires restarting Vector itself.

Additionally, this can result in ever-growing outputs from metrics sinks, where stale, no-longer-updated metrics continue to be emitted needlessly. We've addressed this in the past for specific sinks, such as the prometheus_exporter sink in #9769, but there should be a more general solution.

Proposal

We should upgrade our usage of the metrics crate to provide the ability to expire internal metrics, such that old/stale metrics eventually get removed from the registry entirely and are no longer reported.

To do so involves a few steps that should be achievable in an incremental fashion:

  • upgrade to metrics-0.18.x/metrics-util-0.12.x to bring in support for metric handles and a more customizable Registry
  • bolt on the Recency module (part of metrics-util) so that we can begin tracking the generation of each metric to properly understand when a metric has or hasn't been updated since a particular observation of it
  • create our own storage implementation that melds the required generational tracking of Recency with the ability to consume the value of counters (gauges and histograms need no change here) such that we can switch to emitting incremental counter metrics rather than absolute ones

Upgrading metrics/metrics-util

This is required due to the introduction of "metric handles" and the more flexible Registry/Storage primitives, which will allow us to interact with the metric values such that we can mutate them rather than simply observe them.
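
As a rough sketch of what a handle buys us (the names below are invented, not the actual metrics/metrics-util 0.18/0.12 types): a handle is shared ownership of the underlying storage, so both the code emitting a metric and the code collecting it can mutate the same value in place rather than working from a copied snapshot.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Illustrative only: the names here are invented, but the idea matches what
// metric handles enable -- shared access to the storage itself.
#[derive(Clone)]
struct CounterHandle(Arc<AtomicU64>);

impl CounterHandle {
    fn increment(&self, value: u64) {
        self.0.fetch_add(value, Ordering::Relaxed);
    }

    fn load(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let handle = CounterHandle(Arc::new(AtomicU64::new(0)));
    let writer = handle.clone();

    writer.increment(5);
    // Because both clones point at the same atomic, a collector holding this
    // handle could later swap/reset the value in place (see the custom storage
    // section below) instead of only observing it.
    assert_eq!(handle.load(), 5);
}
```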

Bolting on Recency support

This one is fairly self-explanatory: Recency provides the metric storage wrapper and the logic to detect metrics that have genuinely not been updated within a given time period. This is required because a gauge could read the same value at two different observations even though it was changed in between. Recency deals with this by tracking a "generation" for each metric: any operation on the metric increments an internal counter (the generation), so for those two hypothetical observations you can tell either "the generation is identical, so it hasn't been updated" or "the generations differ, so it's definitely still being updated".
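
The sketch below illustrates the generation idea only; it is not the actual metrics-util Recency API. Writes bump a generation counter, and an expiry check watches for the generation to stop changing for longer than some idle timeout.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::time::Instant;

// Simplified illustration of generation tracking (not the real metrics-util
// `Recency` type). Every write bumps the generation, so two observations with
// the same generation mean the metric was untouched in between, even if the
// observed value happens to be identical (e.g. a gauge that went up and back down).
struct GenerationalGauge {
    value: AtomicU64,
    generation: AtomicUsize,
}

impl GenerationalGauge {
    fn set(&self, value: u64) {
        self.value.store(value, Ordering::Relaxed);
        self.generation.fetch_add(1, Ordering::Relaxed);
    }

    fn observe(&self) -> (u64, usize) {
        (
            self.value.load(Ordering::Relaxed),
            self.generation.load(Ordering::Relaxed),
        )
    }
}

struct ExpiryTracker {
    last_generation: usize,
    last_change: Instant,
}

impl ExpiryTracker {
    // Stale means: no generation change for at least `idle_timeout_secs`.
    fn is_stale(&mut self, current_generation: usize, idle_timeout_secs: u64) -> bool {
        if current_generation != self.last_generation {
            self.last_generation = current_generation;
            self.last_change = Instant::now();
        }
        self.last_change.elapsed().as_secs() >= idle_timeout_secs
    }
}

fn main() {
    let gauge = GenerationalGauge {
        value: AtomicU64::new(0),
        generation: AtomicUsize::new(0),
    };
    gauge.set(42);

    let (_, generation) = gauge.observe();
    let mut tracker = ExpiryTracker {
        last_generation: generation,
        last_change: Instant::now(),
    };
    // Immediately after an update the metric is not considered stale.
    assert!(!tracker.is_stale(generation, 300));
}
```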

Creating customized storage for the metrics

This one feeds into being able to emit incremental counters.

Currently, we emit absolute metrics for anything we scrape from the internal metrics registry. This is fine for gauges, and even for histograms we end up merging the observations together, but it is problematic for counters.

It's problematic because many metrics sinks only emit incremental metrics downstream, and to incrementalize an absolute metric you must observe it at least twice before you can start tracking the delta between observations. This means that if we expired a counter and it later came back to life -- maybe it's updated very infrequently -- we could lose some of the increments that happened between the first and second time we observe it after it reappears.

By writing our own customized storage, we could provide a hook to consume a counter's value -- essentially, swap the value to 0 and take the previous value -- such that we could emit the consumed value as an incremental update. This would ensure that even if a metric is expired but comes back, we're always consuming its entire value so we never miss an update.
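
A minimal sketch of that consume operation, with illustrative names rather than Vector's actual storage types: the counter's accumulated value is atomically swapped out for zero and emitted as an incremental delta, so nothing is dropped or double-counted.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the "consume" hook described above: atomically take the counter's
// accumulated value and reset it to zero, so the caller can emit it as an
// incremental update without ever losing increments.
struct ConsumableCounter {
    value: AtomicU64,
}

impl ConsumableCounter {
    fn increment(&self, amount: u64) {
        self.value.fetch_add(amount, Ordering::Relaxed);
    }

    // Swap in 0 and return whatever accumulated since the last consume.
    fn consume(&self) -> u64 {
        self.value.swap(0, Ordering::Relaxed)
    }
}

fn main() {
    let counter = ConsumableCounter {
        value: AtomicU64::new(0),
    };
    counter.increment(3);
    counter.increment(4);

    // The first scrape emits an incremental delta of 7...
    assert_eq!(counter.consume(), 7);
    // ...and a later scrape only emits what accumulated afterwards.
    counter.increment(2);
    assert_eq!(counter.consume(), 2);
}
```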

This would work, specifically, because we would only do expiration when iterating over the metrics and observing them, not automatically in the background, so we would always have a chance to observe the metric (and potentially consume its value if it's a counter) before removing it from the registry.
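
Putting those pieces together, a single scrape pass could look roughly like the following sketch (again with invented, simplified types, and with staleness reduced to a last-update timestamp for brevity): each metric is consumed and emitted first, and only then is its registry entry kept or dropped, so nothing is removed without being observed one last time.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Illustrative scrape-and-expire pass, not Vector's actual code.
struct Entry {
    pending_value: u64,   // counter value accumulated since the last scrape
    last_update: Instant, // refreshed whenever the metric is written to
}

fn scrape_and_expire(
    registry: &mut HashMap<String, Entry>,
    idle_timeout: Duration,
) -> Vec<(String, u64)> {
    let mut emitted = Vec::new();
    let now = Instant::now();

    registry.retain(|name, entry| {
        // Consume whatever accumulated and emit it as an incremental update.
        let delta = std::mem::take(&mut entry.pending_value);
        if delta > 0 {
            emitted.push((name.clone(), delta));
        }
        // Keep the entry only if it has been updated recently enough.
        now.duration_since(entry.last_update) < idle_timeout
    });

    emitted
}

fn main() {
    let mut registry = HashMap::new();
    registry.insert(
        "events_out_total".to_string(),
        Entry {
            pending_value: 12,
            last_update: Instant::now(),
        },
    );

    let emitted = scrape_and_expire(&mut registry, Duration::from_secs(300));
    assert_eq!(emitted, vec![("events_out_total".to_string(), 12)]);
    assert_eq!(registry.len(), 1); // still fresh, so it stays registered
}
```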

Caveats

One main thing we'll want to pay attention to is any change in performance. The changes in metrics-0.18.x were designed primarily to benefit static metric keys and callers that can use metric handles directly, neither of which currently applies to us.

This would likely mean some potential performance regressions in areas where we emit many metrics, or where we emit metrics tied to the rate of events flowing through the topology or a specific component.

We've handled these types of issues before by doing local aggregation or making some other change to reduce the impact, but this change sits at a high level across all metrics that we currently emit, so we'd want to pay attention to regressions, e.g. as surfaced by soak tests.

Tasks

Future Work

@tobz added the domain: observability, domain: metrics, and source: internal_metrics labels Mar 25, 2022
@nabokihms
Contributor

@tobz, I'd like to help execute this plan step by step, because our Kubernetes cluster installation suffers from this unbounded growth in the number of metrics.

I have experience doing the same thing by customizing metrics storage for the Prometheus client_go, and I believe we'll cope with this issue under your guidance. What do you think? Is it ok for me to put the plan into action?

@tobz
Contributor Author

tobz commented May 23, 2022

Yeah, we're happy to have someone interested in working on this, and I'll be happy to mentor/guide you where I can.

Happy to chat on Vector's Discord (text or voice, alias is tobz) to explain things more in depth, etc.

@leandrojmp

Hello,

Does anyone know of a workaround to deal with this issue? I'm hitting the memory leak reported in another issue, which was closed and referenced here, and it is blocking the move of our data pipeline from Logstash to Vector.

Currently I'm dealing with it by restarting the vector process every week, but I'm worried that if I add more pipelines to Vector it will leak memory faster and require more frequent restarts, so this is blocking further adoption of Vector.

I'm not using the internal metrics for anything; is there any way to disable them and avoid this memory leak?

@fpytloun
Contributor

Yes, this issue is really painful 😞

@jszwedko has a branch with the problematic metrics removed, which I am using now: https://github.com/vectordotdev/vector/tree/custom-build-without-high-cardinality-metrics

@nabokihms
Contributor

I'm about to open a PR, but apparently it will only be a part of the future v0.24 release.

@bruceg self-assigned this Jul 19, 2022
@bruceg added the Epic label Jul 19, 2022
@bruceg
Member

bruceg commented Jul 21, 2022

@nabokihms This issue is on our roadmap to pursue in the near future. You have indicated that you have some changes in hand to help with this effort; before we duplicate your work, could you let us know what approach you are taking on this problem?

@jszwedko
Member

Hi @nabokihms! Apologies for pestering. We just wanted to check in to see if you still have plans to work on this, or have already completed some work towards it, so we can decide whether to move forward with it ourselves; we want to include it in the upcoming 0.24.0 release if we can.

@nabokihms
Contributor

@jszwedko @bruceg I did not manage to get far with it and got stuck on how to detect staleness (and I also lost time updating dependencies, which Bruce has already committed).

Sorry for being a snail 😅 It looks like I cannot keep up the pace here. I hope I will be able to help the project in another way.

@jszwedko
Member

Hey @nabokihms! No worries 😄 We are happy to pick it up; we just didn't want to step on your toes. We appreciate all of your contributions to Vector thus far.

@leandrojmp

Hello @jszwedko,

I saw that a new option, expire_metrics_secs, was implemented; would it help solve the memory leak from this issue? Currently I need to restart Vector twice a week to avoid it being killed by an OOM error.

Also, expire_metrics_secs has this note:

Be careful to set this value high enough to avoid expiring critical but infrequently updated internal counters.

What would be the impact? What are those internal metrics used for? I do not have any monitoring on the Vector process; I just use it to ship data to Kafka.

@jszwedko
Member

jszwedko commented Oct 4, 2022

Be careful to set this value high enough to avoid expiring critical but infrequently updated internal counters.

What would be the impact? What are those internal metrics used for? I do not have any monitoring on the Vector process; I just use it to ship data to Kafka.

Hi @leandrojmp! The internal metrics are exposed by the internal_metrics source and used to drive vector tap (via Vector's API). If you don't use either of those, you can safely set that number pretty low. It will resolve the issue mentioned in #11025.

As you note, this issue is resolved so I'll close it. Thanks for bringing my attention to it.

@leandrojmp

leandrojmp commented Oct 8, 2022

Hello @jszwedko,

Unfortunately, this change in 0.24.1 did not solve the memory leak issue from #11025.

If anything, it seems worse: I now have to restart the service daily to avoid an OutOfMemory kill of the process.
