Slow memory leak #783
Comments
@mkuratczyk FYI, this may be similar to rabbitmq/cluster-operator#1549. @ollie-nye how many …
That sounds like it might be similar! The total number of resources doesn't change very much, but new deployments of services get added and old ones dropped fairly regularly. Could it be that the old references to those resources are still hanging around?
I've tested this operator for an issue similar to rabbitmq/cluster-operator#1549 and could not trigger anything like it. I believe that's because the root cause of that issue was that the Cluster Operator watches "generic" resources, such as ConfigMaps, Secrets, and StatefulSets. Meanwhile, the Topology Operator only watches RabbitMQ-specific resources (correct me if I'm wrong).
@mkuratczyk That 150 count is just the topology operator side, split over (approximately) 60 bindings, 30 exchanges, 30 policies and 30 queues. We set up per-PR environments with their own queues and bindings for testing our main app and such, so there's quite a lot of shuffling about happening on the regular!
@ollie-nye if you could set …
We also watch …
@mkuratczyk Here's that output after a few hours of running since the restart. If nothing is standing out yet, I can grab a new capture in a couple of days after it's grown a bit more?
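(For anyone following along: a heap capture like this typically comes from the Go pprof endpoint. Below is a minimal sketch of exposing it via controller-runtime, assuming a recent controller-runtime version that has the PprofBindAddress option; the port and wiring are illustrative, not necessarily how this operator is actually configured.)

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Expose the net/http/pprof handlers on a side port so heap profiles
	// can be pulled from the running operator pod.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		PprofBindAddress: ":8082", // illustrative port, not the operator's actual config
	})
	if err != nil {
		os.Exit(1)
	}

	// ... register controllers with mgr as usual ...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

With that enabled, a capture can be taken with `kubectl port-forward <operator-pod> 8082:8082` followed by `go tool pprof http://localhost:8082/debug/pprof/heap`.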
Certainly the CertPool size is suspiciously large (do you have a lot of certificates?), but the key question is where the usage is growing, so yes, a second capture after some time would be great. Thanks!
Not a huge amount, less than 50 across the whole cluster, but sure thing, I'll get another capture after it's sat for a bit. Thanks so much for your help so far!
Strange. In the upper left corner it says "81.86MB total"; previously it was 56.21MB. So growth, yes, with certs responsible for a third of that growth. But where is the other 100MB :)
Ah, nice spot! I've got no clue where the extra usage is coming from though; I might grab another capture in a few days and go from there?
One more capture from today. There was a sharp jump in total usage yesterday morning, but the profiler seems to be diverging further from the pod's reported memory use, with certs still taking up a decent chunk of it. Our total certificate count hasn't changed during this, but a lot of certs have been created and destroyed with environment changes; could it just be that the process is hanging on to certificates that are no longer in the cluster?
Indeed, I can see that creating a lot of Secrets, even ones completely unrelated to RabbitMQ, increases the memory usage. I will try to fix that, although we do need to watch some of the Secrets, so the exact solution is not yet clear to me.
Watching Secrets can lead to high memory usage if there are a lot of them. It seems like we don't actually rely on this, so we can turn that off completely. Fixes #783
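For context on what "turn that off" can look like in practice: with controller-runtime, reading Secrets through the manager's cached client normally starts an informer that watches and stores every Secret in the cluster, which is where memory goes when unrelated Secrets churn. The sketch below shows the general technique of disabling the cache for Secrets so reads go straight to the API server; it assumes a recent controller-runtime and is an illustration, not necessarily the exact change made in the operator.

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Disabling the cache for Secrets means the client no longer runs a
	// cluster-wide Secret informer, so unrelated Secrets stop accumulating
	// in the operator's memory; Secret reads hit the API server directly.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&corev1.Secret{}},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}

	// ... controller setup elided ...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The trade-off is that every Secret read becomes a live API call, which is usually acceptable when Secrets are only read occasionally during reconciliation.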
We've seen the pod usage flatten out over the last week or so. Thanks so much for getting this resolved!
Describe the bug
We're seeing a slow memory leak over the course of a few days that results in the operator pod being OOMKilled. Logs look fairly normal except for a few recurring errors, all following these patterns:
failed to delete finalizer: Operation cannot be fulfilled on exchanges.rabbitmq.com "<exchange_name>": the object has been modified; please apply your changes to the latest version and try again
failed to delete finalizer: Operation cannot be fulfilled on bindings.rabbitmq.com "<binding>": StorageError: invalid object, Code: 4, Key: /registry/rabbitmq.com/bindings/<binding>, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: <some uid>, UID in object meta:
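As an aside on the first error: "the object has been modified; please apply your changes to the latest version" is an ordinary optimistic-concurrency conflict, and the usual way to make finalizer removal robust against it is to re-fetch the object and retry on conflict. A minimal sketch using client-go's retry helper follows; the typed object, import path, and finalizer name are assumptions for illustration, not taken from the operator's code.

```go
package controllers

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	rabbitmqv1beta1 "github.com/rabbitmq/messaging-topology-operator/api/v1beta1" // assumed import path
)

// removeFinalizer re-reads the object on every attempt and retries on
// conflict, so a stale ResourceVersion doesn't abort the cleanup.
func removeFinalizer(ctx context.Context, c client.Client, key client.ObjectKey) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var exchange rabbitmqv1beta1.Exchange
		if err := c.Get(ctx, key, &exchange); err != nil {
			// Nothing to do if the object is already gone.
			return client.IgnoreNotFound(err)
		}
		controllerutil.RemoveFinalizer(&exchange, "deletion.finalizers.exchanges.rabbitmq.com") // hypothetical finalizer name
		return c.Update(ctx, &exchange)
	})
}
```

This doesn't address the leak itself, but it is the common pattern for avoiding the first class of error.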
It's not much of a leak, just enough to start throwing alerts and then kill the pod every few days. (Usage screenshot below)
I couldn't see any existing issues for anything related, so I'm opening a new one.
To Reproduce
Our manifests are managed via Argo, but that shouldn't be touching the resources the operator looks after. Outside of that, we're running the standard manifest: https://github.com/rabbitmq/messaging-topology-operator/releases/download/v1.13.0/messaging-topology-operator-with-certmanager.yaml
We have a persisted CA being patched in with Kustomize, but that doesn't look to be affecting finalizers.
Expected behavior
Operator memory usage to remain stable over longer periods of time
Screenshots
If applicable, add screenshots to help explain your problem.
Version and environment information
Additional context
This has been going on for at least the last 30 days; I can't say exactly when it started because we don't keep metrics for longer than that. We jumped from 1.10.3 to 1.13.0 and the alerts started shortly afterwards.