RKE2 v1.28.10+rke2r1 cluster stops working when critical pods are OOM killed #6249

Hi. I've been running an RKE2 cluster for a few months, and recently I updated to v1.28.10+rke2r1. I now notice that about every week my cluster stops responding and all workloads stop working. My server nodes (control-plane and etcd) show high CPU/memory usage with heavy disk IO. After this I disabled swap, but it did not help.
My VMs are running under Proxmox, and for the server nodes I allocated 10 CPUs + 10GB of RAM. Should I increase memory on these machines, or is there some bug? I had never faced this issue before, and it worked fine for a few months.
I'm running 5 agents, so the VM size for the servers should be more than enough based on this table.

Comments
Also on rke2-server nodes I see this kind of message. Maybe it can be related to it: `Snapshot ConfigMap is too large, attempting to elide 4 of 664 entries to reduce size`
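For reference, the size of that ConfigMap can be inspected directly. A rough sketch, assuming rke2 tracks snapshots in a `rke2-etcd-snapshots` ConfigMap in `kube-system` (the name is an assumption, analogous to k3s's `k3s-etcd-snapshots`):

```sh
# Assumed ConfigMap name; adjust if your cluster uses a different one
kubectl -n kube-system get configmap rke2-etcd-snapshots -o yaml | wc -c                 # total size in bytes
kubectl -n kube-system get configmap rke2-etcd-snapshots -o json | jq '.data | length'   # tracked snapshot entries
```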
Do you actually have sufficient memory on your proxmox cluster to allocate 10GB of RAM to these nodes, or are you oversubscribed such that the proxmox balloon driver is forcing nodes to free memory? You should DEFINITELY not be enabling dynamic memory management on your Kubernetes node VMs: https://pve.proxmox.com/wiki/Dynamic_Memory_Management
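If ballooning is enabled on a node VM, it can be disabled per VM from the Proxmox host. A minimal sketch, where `<vmid>` is the VM's numeric ID:

```sh
# Disable the balloon device so the VM keeps its full, fixed memory allocation
qm set <vmid> --balloon 0
```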
Hi @brandond, thank you for your response. Screenshots attached.
It was also strange that rke2 did not clear old S3 snapshots; I saw about 600 snapshots uploaded to S3. One cause of this issue, and of the rke2 memory leak that was triggering the OOM kills, is that I found Longhorn snapshot-controller resources for Velero volume snapshots where most of the volumes no longer exist. I cleared all of these resources and restarted all nodes. After these changes it looks like #6265 was also resolved and the agent load balancer switched over to the other server nodes. Or was it only fixed by the reboot, so it may still happen in the future, and to resolve it completely I have to update to 1.28.11?
If you're using S3 you might also be running into #6162, which is fixed in v1.28.11 as well.
Yes, it looks like exactly the same issue. Is there any way I can update my cluster to 1.28.11 from the Rancher UI?
no. the release is not available yet. |
@brandond going back to the original issue: I noticed that after the reboot I still have high resource usage from the rke2 process. It is using 1.1GB of RAM; I checked the other server nodes, where rke2 uses only about 250MB. I noticed this problem only on the default server node, the one that was installed first during cluster installation. From the logs I don't see anything unusual, just an info-level log every 10 minutes.
Let me observe it some more to see how fast memory usage is growing. Also, can you please tell me how to enable the debug log level from Rancher, so that it may show more information about why memory usage for the rke2 process is growing?
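One way to enable debug logging is on the node itself rather than through the Rancher UI. A minimal sketch, assuming the default rke2 config path:

```sh
# Enable debug-level logging (equivalent to the --debug flag) and restart
echo "debug: true" >> /etc/rancher/rke2/config.yaml
systemctl restart rke2-server

# Follow the server logs
journalctl -u rke2-server -f
```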
Ran it in debug mode; now I'm seeing these messages, and after them I notice the rke2 process starts taking more memory.
You mentioned you have 664 snapshots on S3, due to the issue with S3 snapshots not being pruned. Have you considered manually cleaning those up until you can upgrade to a release where that issue is fixed?
I removed maybe 100 snapshots. Do you want me to remove all S3 snapshots and see whether the memory increase stops? If yes, after removing all snapshots from S3, should I restart all server nodes, or leave them as-is and it will automatically remove the snapshots from the ConfigMap?
The more you clean up, the less it will have to keep track of... I would probably restart rke2 after doing the cleanup if you want to see an immediate decrease in memory utilization.
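For the cleanup itself, the etcd-snapshot subcommands can be run on a server node. A sketch; the snapshot name below is a placeholder, and it assumes your S3 settings are already in the rke2 config so S3 snapshots are visible:

```sh
# List every snapshot rke2 knows about (local and S3)
rke2 etcd-snapshot list

# Delete one snapshot by name (placeholder name)
rke2 etcd-snapshot delete etcd-snapshot-server1-1719500000

# Restart to see memory utilization drop right away
systemctl restart rke2-server
```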
@brandond after cleaning up all snapshots over the weekend, I don't see a significant increase in memory usage for the rke2 process on the first server/etcd node, but I still see it increased to 338MB, while on the other server nodes it takes 160MB and 190MB.
That is not how golang works, and there is not any leak within rke2. Golang does not aggressively release memory back to the host operating system unless it is under external memory pressure. It is normal to see a golang process's memory utilization stay slightly higher after an allocation. |
So the rke2 process may start using a few GB of RAM? Also, in my case the system was under pressure and had to kill some processes to free memory.
Depending on workload, the memory utilization of different components may fluctuate, yes. For the rke2 service itself, you can always put an upper limit on it via MemoryHigh/MemoryMax in the rke2 systemd unit: |
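A minimal sketch of such a limit as a systemd drop-in (the values are illustrative, not recommendations; MemoryHigh requires cgroup v2):

```sh
mkdir -p /etc/systemd/system/rke2-server.service.d
cat > /etc/systemd/system/rke2-server.service.d/memory.conf <<'EOF'
[Service]
# Throttle the service's allocations above this threshold (cgroup v2 only)
MemoryHigh=2G
# Hard cap; the kernel OOM-kills the service if it exceeds this
MemoryMax=3G
EOF

systemctl daemon-reload
systemctl restart rke2-server
```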
The release notes on https://github.com/rancher/rke2/releases/tag/v1.28.11%2Brke2r1 don't say :-(. I'd love to learn what has actually been fixed in 1.28.11.
@harridu rke2 is based on k3s. Some changes are described in more detail in the k3s release notes: https://github.com/k3s-io/k3s/releases/tag/v1.28.11%2Bk3s1 In the rke2 release notes the Rancher team just reports that they updated the k3s version.