[BUG] rancher v2.7.2 cattle-cluster-agent memory leak oom kill #41225
Comments
Set an ad hoc memory request/limit on the cattle-cluster-agent pod; when the pod restarts, the memory is reclaimed.
I didn't hit this problem in my setup; I assume it depends on particular resource counts. Could you share the cattle-cluster-agent logs and Rancher diagnostic data?
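If it helps, here is a minimal sketch for collecting those logs, assuming direct kubectl access to the downstream cluster and the default `cattle-system` namespace:

```sh
# Point kubectl at the downstream cluster
export KUBECONFIG=/path/to/your/downstream.yaml

# Recent logs from the running cattle-cluster-agent container
kubectl --namespace cattle-system logs deployment/cattle-cluster-agent --tail=500

# Logs from the previous container instance, if it was OOM-killed and restarted
kubectl --namespace cattle-system logs deployment/cattle-cluster-agent --previous --tail=500
```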
cattle-cluster-agent log:
@taehwanlee It looks like you have three downstream clusters, and they are not heavily loaded clusters. There are some questions here:
Is there any workaround?
Here is some info about one of our clusters that has this issue as well.
@taehwanlee I could not reproduce this memory leak, and I don't have any workaround for now. We can also try to look for clues in the Rancher server logs. There are some built-in controllers that send API requests to downstream clusters, which go through the cattle-cluster-agent.
I have the same problem, and I fixed it temporarily by patching memory requests and limits onto the cattle-cluster-agent deployment.
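For reference, a sketch of that kind of temporary patch (the memory values below are placeholders, not the exact numbers used; it assumes the agent container is the first container in the deployment):

```sh
# Add memory requests/limits to the cattle-cluster-agent deployment so the
# kubelet restarts the pod instead of letting it exhaust node memory.
# Values are examples only; tune them to your cluster.
kubectl --namespace cattle-system patch deployment cattle-cluster-agent --type='json' -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/resources",
   "value": {"requests": {"memory": "1Gi"}, "limits": {"memory": "2Gi"}}}
]'
```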
We are experiencing the exact same issue, including the cattle-cluster-agent log entries mentioning … I thought the issue was due to our …
I used the same patch as @wirwolf, except I upped the requests and limits to 3GB. It seems to have kept the OOM issue under control for now.
I've just had what I assume is the same issue, but with our "local" cluster (the Rancher cluster) and the rancher pod itself.
Indeed, this also happens, just not as often. The last time it happened to us was last night, but it took maybe five days to occur. Now we have limits set on the rancher deployment as well.
I found rough steps to reproduce; I haven't caught the root cause yet, but it should be close. Environment: Rancher v2.7.3 and a clean downstream K3s v1.25.9 cluster.

1. In the downstream cluster, create some resources, close to the real setup. For example, I created 2000 secrets (a sketch for generating them is shown after this list).
2. Open a browser, visit the Secrets page of the downstream cluster in the UI, and then refresh repeatedly, or open multiple windows.
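A rough sketch of generating the test secrets for step 1 (namespace and naming are arbitrary):

```sh
# Create 2000 dummy secrets in the downstream cluster to approximate a real setup
kubectl create namespace leak-test 2>/dev/null || true
for i in $(seq 1 2000); do
  kubectl --namespace leak-test create secret generic "dummy-secret-$i" \
    --from-literal=key="value-$i"
done
```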
I think I found the cause of this problem. The short answer is: … The long answer is: … By following these steps, you can see the memory grow with each request: …

By comparing the two heap profiles, you can see the memory increase, and most of it goes to … You can repeat the steps above with the env var … set.
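For anyone repeating this, one way to diff the two heap snapshots (assuming the agent's pprof endpoint is reachable on localhost:6060, as in the port-forward commands later in this thread, and a local Go toolchain is installed):

```sh
# Baseline heap profile, then a second one after generating UI load
curl -so heap-before.pb.gz http://localhost:6060/debug/pprof/heap
# ... refresh the Secrets page repeatedly here ...
curl -so heap-after.pb.gz http://localhost:6060/debug/pprof/heap

# Show only the allocations that grew between the two snapshots
go tool pprof -top -base=heap-before.pb.gz heap-after.pb.gz
```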
I can confirm that … Another mitigation is …
@taehwanlee can you please share the cattle-cluster-agent deployment YAML of one of the affected clusters, together with its memory profile at the time when memory usage is high? To collect it:

```sh
export KUBECONFIG=/path/to/your/downstream.yaml
export POD=$(kubectl --namespace cattle-system get pod --selector='app=cattle-cluster-agent' --output jsonpath="{.items[0].metadata.name}")
kubectl --namespace cattle-system port-forward pod/$POD 6060:6060 &
wget http://localhost:6060/debug/pprof/heap
```

TIA
@moio
You can try to add the env var …
I think it's because there weren't many memory leaks yet when the heap profile was captured...
Try to set …
This prevents a goroutine leak when item.Object is a `runtime.Object` but not a `metav1.Object`, as in that case `WatchNames`'s `for` loop will quit early and subsequent calls to `returnErr` will remain parked forever. This helps with rancher/rancher#41225. Fuller explanation in rancher#107.
If you are considering trying the image, or want to evaluate whether the fix is likely to work for you, you can follow these steps: … If it is rancher-agent, then run …
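One rough way to check whether the goroutine leak is what you are hitting (assuming the pprof port-forward shown earlier in this thread) is to watch the total goroutine count while refreshing the UI; with the leak present, the count keeps growing and never drops back:

```sh
export POD=$(kubectl --namespace cattle-system get pod --selector='app=cattle-cluster-agent' --output jsonpath="{.items[0].metadata.name}")
kubectl --namespace cattle-system port-forward pod/$POD 6060:6060 &

# The first line of the debug=1 goroutine profile reports the current total,
# e.g. "goroutine profile: total 1234"
watch -n 10 'curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | head -n 1'
```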
Rancher and cattle-cluster-agent Memory Leak Validation check

**Methodology:** Followed the general guidance from Silvio and Ricardo from here and here. There were 3 scenarios tested, all using the same basic test steps.

**Steps:** …

**Memory Leak Screenshots:** Fresh Install / Image Update / Upgrade (screenshots not reproduced here)

**Notes:** …
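For the memory graphs, a simple way to record usage over time without a full monitoring stack (assuming metrics-server is installed in the downstream cluster):

```sh
# Append a timestamped memory sample for the agent pod every minute
while true; do
  echo "$(date -Is) $(kubectl --namespace cattle-system top pod --selector='app=cattle-cluster-agent' --no-headers)"
  sleep 60
done >> cattle-cluster-agent-memory.log
```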
Additional internal reference: SURE-6430
I believe this issue could be related to the problem described in #41684.
We noticed exactly the same behaviour... we built a v1.24.10 cluster with Rancher 2.7.4. We then added a resource limit in its deployment. Since then things have calmed down a bit: no more sudden controller node crashes. But of course, the container now gets OOM-killed when it runs into the memory-eating situation. This happens quite often, as can be seen in monitoring (the last 24 hours at the time of writing). Doesn't look very healthy IMHO...
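To confirm that the restarts seen in monitoring really are OOM kills, the last termination reason can be read from the pod status, for example:

```sh
# Print restart counts and why the previous container instance terminated
# (expect "OOMKilled" when the memory limit is being hit)
kubectl --namespace cattle-system get pod --selector='app=cattle-cluster-agent' \
  -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason'
```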
@markusleitoldZBS what did you expect? This issue has been marked as fixed in 2.7.5, so of course the bug would still be present in 2.7.4.
@hoerup Well, how is 2.7.5, is it stable? I thought 2.7.3 was good, but all of a sudden I got the memory leak as well, which caught me by surprise.
@Richardswe An additional issue involving goroutines was found and resolved. 2.7.5 has both issues resolved.
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
A memory leak occurs when using the Rancher UI (Cluster Explorer).
To Reproduce
Result
Expected Result
Screenshots
Additional context