
Slow memory leak #783

Closed
ollie-nye opened this issue Mar 19, 2024 · 15 comments · Fixed by #808

@ollie-nye

Describe the bug

We're seeing a slow memory leak over the course of a few days that results in the operator pod being OOMKilled. Logs look fairly normal except for a few recurring errors, all following these patterns:

  • failed to delete finalizer: Operation cannot be fulfilled on exchanges.rabbitmq.com "<exchange_name>": the object has been modified; please apply your changes to the latest version and try again
  • failed to delete finalizer: Operation cannot be fulfilled on bindings.rabbitmq.com "<binding>": StorageError: invalid object, Code: 4, Key: /registry/rabbitmq.com/bindings/<binding>, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: <some uid>, UID in object meta:

It's not much of a leak, just enough to start throwing alerts and then kill the pod every few days. (Usage screenshot below)

I couldn't see any existing issues related to this, so I'm opening a new one.

To Reproduce

Our manifests are managed via Argo, but that shouldn't be touching the resources the operator looks after. Outside of that, we're running the standard manifest: https://github.com/rabbitmq/messaging-topology-operator/releases/download/v1.13.0/messaging-topology-operator-with-certmanager.yaml

We have a persisted CA being patched in with Kustomize, but that doesn't appear to affect finalizers.

Expected behavior
Operator memory usage to remain stable over longer periods of time

Screenshots


[Screenshot, 2024-03-13: operator pod memory usage]

Version and environment information

  • Messaging Topology Operator: 1.13.0
  • RabbitMQ: 3.13.0
  • RabbitMQ Cluster Operator: 2.0.0
  • Kubernetes: v1.27.8
  • Cloud provider or hardware configuration: GKE

Additional context

This has been happening for at least the last 30 days; I can't say exactly when it started because we don't keep metrics for longer than that. We jumped from 1.10.3 to 1.13.0 and the alerts started shortly afterwards.

ollie-nye added the bug (Something isn't working) label on Mar 19, 2024
@Zerpet (Contributor) commented Apr 8, 2024

@mkuratczyk FYI, this may be similar to rabbitmq/cluster-operator#1549.

@ollie-nye how many Secret type objects do you have in your cluster? Does this number vary over time?

Zerpet self-assigned this on Apr 8, 2024
@ollie-nye (Author)

That sounds like it might be similar! The total number of resources doesn't change very much, but new deployments of services get added and old ones dropped fairly regularly. Could it be that old references to those resources are still hanging around?
Each of our deployments is ~150 k8s objects, and that happens maybe 10-20 times a day; over 10 days or so, that sounds like it could start causing problems with the cache if the old entries don't get cleaned up.

@mkuratczyk (Contributor)

I've tested this operator for an issue similar to rabbitmq/cluster-operator#1549 and could not trigger anything like it. I believe that's because the root cause of that issue was that the Cluster Operator watches "generic" resources such as ConfigMaps, Secrets and StatefulSets, whereas the Topology Operator only watches RabbitMQ-specific resources (correct me if I'm wrong).

@ollie-nye (Author)

@mkuratczyk That 150 count is just on the topology operator side, split across roughly 60 bindings, 30 exchanges, 30 policies and 30 queues. We set up per-PR environments with their own queues and bindings for testing our main app, so there's quite a lot of churn on a regular basis!

@mkuratczyk (Contributor)

@ollie-nye if you could set ENABLE_DEBUG_PPROF=true in the Operator's pod environment, set up port forwarding, and then connect with `go tool pprof -web http://localhost:8080/debug/pprof/heap` (I don't remember if it's 8080 or 9782) to see what's taking a lot of memory, that would be very helpful.

@Zerpet (Contributor) commented Apr 8, 2024

> I've tested this operator for an issue similar to rabbitmq/cluster-operator#1549 and could not trigger anything like it. I believe that's because the root cause of that issue was that the Cluster Operator watches "generic" resources such as ConfigMaps, Secrets and StatefulSets, whereas the Topology Operator only watches RabbitMQ-specific resources (correct me if I'm wrong).

We also watch Secret objects. Those are required for User credentials and for the connectionSecret feature.
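(For context on why this matters for memory: in controller-runtime, watching a built-in type such as Secret starts an informer that lists and caches every object of that kind cluster-wide, not just the ones the operator created. The sketch below shows roughly how such a watch is wired. It is a generic illustration against controller-runtime v0.15+, with a made-up UserReconciler, and is not the operator's actual code.)

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	topology "github.com/rabbitmq/messaging-topology-operator/api/v1beta1"
)

// UserReconciler is a stand-in reconciler for this sketch; the real operator's
// reconcilers are structured differently.
type UserReconciler struct {
	client.Client
}

func (r *UserReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Reconcile logic omitted; only the watch wiring matters for this example.
	return ctrl.Result{}, nil
}

// SetupWithManager registers the controller. The Watches() call on
// corev1.Secret makes controller-runtime start an informer that lists and
// caches every Secret in the cluster, not just the operator-managed ones,
// which is how Secret churn can show up as operator memory growth.
func (r *UserReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&topology.User{}).
		Watches(&corev1.Secret{}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}
```

With wiring like this, Secrets created by unrelated workloads also end up in the operator's cache, which would be consistent with the per-PR environment churn described above.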

@ollie-nye (Author)

@mkuratczyk Here's that output after a few hours of running since the restart. If nothing stands out yet, I can grab a new capture in a couple of days once it's grown a bit more?

[Screenshot, 2024-04-08: pprof heap profile]

@mkuratczyk (Contributor)

Certainly the CertPool size is suspiciously large (do you have a lot of certificates?), but the key question is where the usage is growing, so yes, a second capture after some time would be great. Thanks!

@ollie-nye (Author)

Not a huge number, fewer than 50 across the whole cluster, but sure thing; I'll get another capture after it's sat for a bit. Thanks so much for your help so far!

@ollie-nye (Author)

New capture from this morning: the same pod has gone from 64MiB of memory to 180MiB over the last couple of days. That CertPool usage only looks to account for a small chunk of that, though?

[Screenshot, 2024-04-10: pprof heap profile]

@mkuratczyk (Contributor)

Strange. In the upper-left corner it says "81.86MB total"; previously it was 56.21MB. So growth, yes, with certs responsible for a third of that growth. But where is the other 100MB? :)

@ollie-nye (Author)

Ah, nice spot! I've got no clue where the extra usage is coming from, though; I'll grab another capture in a few days and go from there?

@ollie-nye (Author)

One more capture from today. Total use had a sharp jump yesterday morning, but the profiler output seems to be diverging further from the pod's memory use, with certs still taking up a decent chunk of it. Our total certificate count hasn't changed during this either, but a lot of certificates have been created and destroyed with environment changes; could it be that the process is hanging on to certificates that are no longer in the cluster?

[Screenshot, 2024-04-11 16:17]
[Screenshot, 2024-04-11 16:18]

@mkuratczyk (Contributor)

Indeed, I can see that creating a lot of Secrets, even ones completely unrelated to RabbitMQ, increases the memory usage. I will try to fix that, although we do need to watch some of the Secrets, so the exact solution is not yet clear to me.

mkuratczyk self-assigned this on Apr 23, 2024
mkuratczyk added a commit that referenced this issue May 7, 2024
Watching the secrets can lead to high memory usage if there's a lot of
them. It seems like we don't actually rely on this, so we can turn that
off completely.

Fixes #783
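
(For anyone running into the same issue before the fix ships: the commit above drops the Secret watch entirely. If you maintain a similar controller and still need to read Secrets occasionally without mirroring all of them in memory, controller-runtime v0.15+ also exposes a per-type cache bypass. The snippet below is a minimal sketch of that option, with made-up surrounding wiring, and is not the actual change in #808.)

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Sketch only: reads of Secrets through mgr.GetClient() go straight to the
	// API server instead of being served from an informer cache that would
	// hold every Secret in the cluster.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&corev1.Secret{}},
			},
		},
	})
	if err != nil {
		panic(err)
	}

	// Controller registration omitted; the point is only the cache knob above.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

Note that this only affects reads through the manager's client; an explicit Watches() registration on Secrets would still start an informer, so it has to be removed or narrowed (for example with a label selector on the cache) as well.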
Zerpet added this to the v1.14 milestone on May 7, 2024
@ollie-nye (Author)

We've seen the pod usage flatten out over the last week or so. Thanks so much for getting this resolved!
