
Slow memory leak #783

Closed
ollie-nye opened this issue Mar 19, 2024 · 15 comments · Fixed by #808

@ollie-nye

Describe the bug

We're seeing a slow memory leak over the course of a few days that results in the operator pod being OOMKilled. Logs look fairly normal except for a few recurring errors, all following these patterns:

  • failed to delete finalizer: Operation cannot be fulfilled on exchanges.rabbitmq.com "<exchange_name>": the object has been modified; please apply your changes to the latest version and try again
  • failed to delete finalizer: Operation cannot be fulfilled on bindings.rabbitmq.com "<binding>": StorageError: invalid object, Code: 4, Key: /registry/rabbitmq.com/bindings/<binding>, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: <some uid>, UID in object meta:

It's not much of a leak, just enough to start throwing alerts and then kill the pod every few days. (Usage screenshot below)

I couldn't see any existing issues related to this, so I'm opening a new one.

To Reproduce

Our manifests are managed via Argo, but that shouldn't be touching the resources the operator looks after. Outside of that, we're running the standard manifest: https://github.com/rabbitmq/messaging-topology-operator/releases/download/v1.13.0/messaging-topology-operator-with-certmanager.yaml

We have a persisted CA being patched in with Kustomize, but that doesn't appear to affect finalizers.

Expected behavior
Operator memory usage to remain stable over longer periods of time

Screenshots


[Screenshot, 2024-03-13: operator pod memory usage]

Version and environment information

  • Messaging Topology Operator: 1.13.0
  • RabbitMQ: 3.13.0
  • RabbitMQ Cluster Operator: 2.0.0
  • Kubernetes: v1.27.8
  • Cloud provider or hardware configuration: GKE

Additional context

This has been happening for at least the last 30 days; I can't say exactly when it started because we don't keep metrics for longer than that. We jumped from 1.10.3 to 1.13.0 and the alerts started shortly afterwards.

ollie-nye added the bug (Something isn't working) label on Mar 19, 2024
@Zerpet (Contributor) commented Apr 8, 2024

@mkuratczyk FYI, this may be similar to rabbitmq/cluster-operator#1549.

@ollie-nye how many Secret type objects do you have in your cluster? Does this number vary over time?

Zerpet self-assigned this on Apr 8, 2024
@ollie-nye (Author)

That sounds like it might be similar! The total number of resources doesn't change very much, but new deployments of services get added and old ones dropped fairly regularly. Could it be that old references to those resources are still hanging around?
Each of our deployments is ~150 k8s objects, and that happens maybe 10-20 times a day; over 10 days or so, that sounds like it could start causing problems with the cache if the old entries don't get cleaned up.

@mkuratczyk (Contributor)

I've tested this operator for an issue similar to rabbitmq/cluster-operator#1549 and could not trigger anything like it. I believe that's because the root cause of that issue was that the Cluster Operator watches "generic" resources such as ConfigMaps, Secrets and StatefulSets, whereas the Topology Operator only watches RabbitMQ-specific resources (correct me if I'm wrong).

@ollie-nye (Author)

@mkuratczyk That 150 count is just on the topology operator side, split across roughly 60 bindings, 30 exchanges, 30 policies and 30 queues. We set up per-PR environments with their own queues and bindings for testing our main app, so there's quite a lot of churn on a regular basis!

@mkuratczyk (Contributor)

@ollie-nye if you could set ENABLE_DEBUG_PPROF=true in the Operator's pod environment, set up port forwarding, and then connect with `go tool pprof -web http://localhost:8080/debug/pprof/heap` (I don't remember if it's 8080 or 9782) to see what's taking a lot of memory, that would be very helpful.

@Zerpet (Contributor) commented Apr 8, 2024

> I've tested this operator for an issue similar to rabbitmq/cluster-operator#1549 and could not trigger anything like it. I believe that's because the root cause of that issue was that the Cluster Operator watches "generic" resources such as ConfigMaps, Secrets and StatefulSets, whereas the Topology Operator only watches RabbitMQ-specific resources (correct me if I'm wrong).

We also watch Secret objects. Those are required for User credentials and for the connectionSecret feature.
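(For context on why this matters for memory: in controller-runtime, watching a built-in type such as Secret starts an informer that lists and caches every object of that kind cluster-wide, not just the ones the operator created. The sketch below shows roughly how such a watch is wired. It is a generic illustration against controller-runtime v0.15+, with a made-up UserReconciler, and is not the operator's actual code.)

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	topology "github.com/rabbitmq/messaging-topology-operator/api/v1beta1"
)

// UserReconciler is a stand-in reconciler for this sketch; the real operator's
// reconcilers are structured differently.
type UserReconciler struct {
	client.Client
}

func (r *UserReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Reconcile logic omitted; only the watch wiring matters for this example.
	return ctrl.Result{}, nil
}

// SetupWithManager registers the controller. The Watches() call on
// corev1.Secret makes controller-runtime start an informer that lists and
// caches every Secret in the cluster, not just the operator-managed ones,
// which is how Secret churn can show up as operator memory growth.
func (r *UserReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&topology.User{}).
		Watches(&corev1.Secret{}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}
```

With wiring like this, Secrets created by unrelated workloads also end up in the operator's cache, which would be consistent with the per-PR environment churn described above.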

@ollie-nye (Author)

@mkuratczyk Here's that output after a few hours of running since the restart. If nothing stands out yet, I can grab a new capture in a couple of days once it's grown a bit more?

[Screenshot, 2024-04-08: pprof heap profile]

@mkuratczyk (Contributor)

Certainly the CertPool size is suspiciously large (do you have a lot of certificates?), but the key question is where the usage is growing, so yes, a second capture after some time would be great. Thanks!

@ollie-nye (Author)

Not a huge number, fewer than 50 across the whole cluster, but sure thing; I'll get another capture after it's sat for a bit. Thanks so much for your help so far!

@ollie-nye (Author)

New capture from this morning: the same pod has gone from 64MiB of memory to 180MiB over the last couple of days. That CertPool usage only looks to account for a small chunk of that, though?

[Screenshot, 2024-04-10: pprof heap profile]

@mkuratczyk (Contributor)

Strange. In the upper-left corner it says "81.86MB total"; previously it was 56.21MB. So growth, yes, with certs responsible for a third of that growth. But where is the other 100MB? :)

@ollie-nye (Author)

Ah, nice spot! I've got no clue where the extra usage is coming from, though; I'll grab another capture in a few days and go from there?

@ollie-nye (Author)

One more capture from today. Total use had a sharp jump yesterday morning, but the profiler output seems to be diverging further from the pod's memory use, with certs still taking up a decent chunk of it. Our total certificate count hasn't changed during this either, but a lot of certificates have been created and destroyed with environment changes; could it be that the process is hanging on to certificates that are no longer in the cluster?

[Screenshot, 2024-04-11 16:17]
[Screenshot, 2024-04-11 16:18]

@mkuratczyk (Contributor)

Indeed, I can see that creating a lot of Secrets, even ones completely unrelated to RabbitMQ, increases the memory usage. I will try to fix that, although we do need to watch some of the Secrets, so the exact solution is not yet clear to me.

mkuratczyk self-assigned this on Apr 23, 2024
mkuratczyk added a commit that referenced this issue May 7, 2024
Watching the secrets can lead to high memory usage if there's a lot of
them. It seems like we don't actually rely on this, so we can turn that
off completely.

Fixes #783
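
(For anyone running into the same issue before the fix ships: the commit above drops the Secret watch entirely. If you maintain a similar controller and still need to read Secrets occasionally without mirroring all of them in memory, controller-runtime v0.15+ also exposes a per-type cache bypass. The snippet below is a minimal sketch of that option, with made-up surrounding wiring, and is not the actual change in #808.)

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Sketch only: reads of Secrets through mgr.GetClient() go straight to the
	// API server instead of being served from an informer cache that would
	// hold every Secret in the cluster.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&corev1.Secret{}},
			},
		},
	})
	if err != nil {
		panic(err)
	}

	// Controller registration omitted; the point is only the cache knob above.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

Note that this only affects reads through the manager's client; an explicit Watches() registration on Secrets would still start an informer, so it has to be removed or narrowed (for example with a label selector on the cache) as well.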
Zerpet added this to the v1.14 milestone on May 7, 2024
@ollie-nye (Author)

We've seen the pod usage flatten out over the last week or so. Thanks so much for getting this resolved!
