
RKE2 v1.28.10+rke2r1 cluster stops working when critical pods are OOM killed #6249

Closed
serhiynovos opened this issue Jun 25, 2024 · 22 comments

@serhiynovos

serhiynovos commented Jun 25, 2024

Hi. I have been running RKE2 for a few months and recently updated to v1.28.10+rke2r1. I noticed that about every week my cluster stops responding and all workloads stop working, and that my server nodes (control-plane and etcd) show high CPU/memory usage along with heavy disk I/O. I then disabled swap, but it did not help.

My VMs run under Proxmox, and each server node is allocated 10 CPUs and 10 GB of RAM. Should I increase memory on these machines, or is there some bug? I have never faced this issue before; the cluster worked fine for a few months.

I'm running 5 agents, so the server VM size should be more than enough based on this table:
[image: server sizing table]

@serhiynovos

BTW, I logged in to one of the server nodes:
[screenshot from one of the server nodes]

@serhiynovos

[Screenshot 2024-06-26 at 22:52:37]

I noticed the RKE2 process has started taking more memory. It is now at 2.7 GB, but about 10 minutes ago it was 2.6 GB.

@serhiynovos

[Screenshot 2024-06-26 at 23:02:43]

I also have another rke2 cluster with a single node where I don't see this problem.

@serhiynovos

[Screenshot 2024-06-27 at 10:07:19]

Now the RKE2 process is taking 3.5 GB; 11 hours ago it was 2.7 GB.

@serhiynovos

serhiynovos commented Jun 27, 2024

I also see this kind of message on the rke2-server nodes. Maybe it is related:

Snapshot ConfigMap is too large, attempting to elide 4 of 664 entries to reduce size

@brandond

brandond commented Jun 27, 2024

top shows that you have ~7 GB of memory free, so what is causing the Kubernetes and etcd pods to get OOM killed? These pods should not have any resource limits by default; have you customized the resource requests/limits such that they are getting killed by the kubelet, or do you have something else on your node that is triggering this?

Do you actually have sufficient memory on your Proxmox cluster to allocate 10 GB of RAM to these nodes, or are you oversubscribed such that the Proxmox balloon driver is forcing nodes to free memory? You should DEFINITELY not be enabling dynamic memory management on your Kubernetes node VMs: https://pve.proxmox.com/wiki/Dynamic_Memory_Management

@brandond changed the title from "RKE2 v1.28.10+rke2r1 cluster stops working about in one week" to "RKE2 v1.28.10+rke2r1 cluster stops working when critical pods are OOM killed" on Jun 27, 2024
@serhiynovos

serhiynovos commented Jun 27, 2024

> top shows that you have ~7 GB of memory free, so what is causing the Kubernetes and etcd pods to get OOM killed? These pods should not have any resource limits by default; have you customized the resource requests/limits such that they are getting killed by the kubelet, or do you have something else on your node that is triggering this?
>
> Do you actually have sufficient memory on your Proxmox cluster to allocate 10 GB of RAM to these nodes, or are you oversubscribed such that the Proxmox balloon driver is forcing nodes to free memory? You should DEFINITELY not be enabling dynamic memory management on your Kubernetes node VMs: https://pve.proxmox.com/wiki/Dynamic_Memory_Management

Hi @brandond, thank you for your response. The top screenshots were taken after I restarted all nodes and expanded them from 10 GB to 16 GB of RAM, to show that the RKE2 process keeps taking more memory over time.

@serhiynovos

serhiynovos commented Jun 27, 2024

> Snapshot ConfigMap is too large, attempting to elide 4 of 664 entries to reduce size

It also seemed strange that rke2 does not clear old S3 snapshots; I see about 600 snapshots uploaded to S3.

One possible cause of this issue, and of the rke2 memory growth that was triggering OOM kills, is that I found a lot of Longhorn volume snapshot resources created for Velero, and most of the volumes they referenced no longer exist. I cleared all of those resources and restarted all nodes. After these changes it also looks like #6265 was resolved and the agent load balancer connected to the other server nodes, or maybe that was only fixed by the reboot and may still happen in the future. So to resolve it completely, do I have to update to 1.28.11?

@brandond

If you're using S3 you might also be running into #6162 - which is fixed in v1.28.11 as well.

@serhiynovos

> If you're using S3 you might also be running into #6162 - which is fixed in v1.28.11 as well.

Yes, it looks like exactly the same issue. Is there any way I can update my cluster to 1.28.11 from the Rancher UI?

@brandond

No, the release is not available yet.

@serhiynovos

serhiynovos commented Jun 28, 2024

@brandond going back to the original issue, I noticed that after the reboot I still see high resource usage from the rke2 process. It is using 1.1 GB of RAM, while on the other server nodes rke2 uses only about 250 MB. I see this problem only on the default server node, the one that was installed first during cluster installation.

From the logs I don't see anything unusual; I just see this info-level message every 10 minutes:

Jun 28 08:49:06 cluster-master-1 rke2[808]: time="2024-06-28T08:49:06Z" level=info msg="Checking if S3 bucket k8s-backups exists"
Jun 28 08:49:07 cluster-master-1 rke2[808]: time="2024-06-28T08:49:07Z" level=info msg="S3 bucket k8s-backups exists"

Let me observe it for a while to see how fast memory usage grows. Also, can you please tell me how to enable the debug log level from Rancher? Maybe it will show more information about why memory usage of the RKE2 process is growing.
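
For anyone landing here later: a minimal sketch of turning debug logging on directly on a server node, assuming the standard rke2 config path (doing it through the Rancher UI may work differently):

# Enable debug logging for the rke2 server on the node itself
echo "debug: true" >> /etc/rancher/rke2/config.yaml
systemctl restart rke2-server.service
# Follow the service logs to watch for new messages
journalctl -u rke2-server -f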

@serhiynovos


I ran it in debug mode, and now I'm seeing these messages; after they appear, I notice the rke2 process starts taking more memory:

Jun 28 16:51:04 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:04Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718859601"
Jun 28 16:51:04 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:04Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718877601"
Jun 28 16:51:04 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:04Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718895603"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718913601"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718928004"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718946003"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718964001"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718982004"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719000003"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719014401"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719032403"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719050402"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719068405"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719086403"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719100805"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719118801"

@brandond

You mentioned you have 664 snapshots on S3, due to the issue with s3 snapshots not being pruned. Have you considered manually cleaning those up, until you can upgrade to a release where that issue is fixed?

@serhiynovos

I removed maybe 100 snapshots. Do you want me to remove all S3 snapshots and see whether the memory increase goes away?

If yes, after removing all snapshots from S3, should I restart all server nodes, or leave them as-is and the snapshots will be removed from the ConfigMap automatically?

@brandond

The more you clean up the less it will have to keep track of...

I would probably restart rke2 after doing the cleanup, if you want to see an immediate decrease in memory utilization.
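
A rough sketch of that cleanup flow with the rke2 etcd-snapshot subcommands, assuming they behave like their k3s counterparts on this version (the snapshot name is the one from the debug log above; verify the exact flags with rke2 etcd-snapshot --help):

# List the snapshots rke2 currently tracks (local and, if configured, S3)
rke2 etcd-snapshot ls
# Delete a specific snapshot by name; with S3 configured on the server this
# should also remove the corresponding object from the bucket
rke2 etcd-snapshot delete etcd-snapshot-cluster-master-2-1718859601
# Restart the server service afterwards to see memory drop right away
systemctl restart rke2-server.service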

@serhiynovos

@brandond after cleaning up all the snapshots, over the weekend I did not see a significant increase in memory usage for the rke2 process on the first server/etcd node, but it still grew to 338 MB, while on the other server nodes it takes 160 and 190 MB.
It looks like memory usage will keep growing as the snapshot count increases. There may be a memory leak, since the GC should clear all resources from the heap after each snapshot check. Even if snapshot eviction works correctly, it is a valid scenario for somebody to set the keep-last-snapshots-per-node parameter to 100, and that would cause memory usage to increase.
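
For reference, the retention knob being described is the server's etcd-snapshot-retention option; a sketch of setting it on a server node, with 100 as the hypothetical value from the comment above (the default is 5):

# Set etcd snapshot retention in the server config and restart to apply
echo "etcd-snapshot-retention: 100" >> /etc/rancher/rke2/config.yaml
systemctl restart rke2-server.service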

@brandond

brandond commented Jul 1, 2024

> There may be a memory leak, since the GC should clear all resources from the heap after each snapshot check.

That is not how golang works, and there is not any leak within rke2. Golang does not aggressively release memory back to the host operating system unless it is under external memory pressure. It is normal to see a golang process's memory utilization stay slightly higher after an allocation.

@brandond closed this as completed on Jul 1, 2024
@serhiynovos

So the RKE2 process may end up using a few GB of RAM? Also, in my case the system was under memory pressure and had to kill some processes to free memory.

@brandond

brandond commented Jul 1, 2024

Depending on workload, the memory utilization of different components may fluctuate, yes. For the rke2 service itself, you can always put an upper limit on it via MemoryHigh/MemoryMax in the rke2 systemd unit:
https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html#MemoryHigh=bytes
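
A minimal sketch of such a limit as a systemd drop-in for the rke2-server unit; the values here are purely illustrative, not recommendations:

# Create a drop-in with memory limits for the rke2 server service
mkdir -p /etc/systemd/system/rke2-server.service.d
cat <<'EOF' > /etc/systemd/system/rke2-server.service.d/50-memory.conf
[Service]
MemoryHigh=4G
MemoryMax=6G
EOF
# Reload systemd and restart the service to apply the limits
systemctl daemon-reload
systemctl restart rke2-server.service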

@harridu

harridu commented Jul 18, 2024

> If you're using S3 you might also be running into #6162 - which is fixed in v1.28.11 as well.

The release notes at https://github.com/rancher/rke2/releases/tag/v1.28.11%2Brke2r1 don't say :-(. I'd love to learn what has actually been fixed in 1.28.11.

@serhiynovos

@harridu rke2 is based on k3s, and some changes are described in more detail in the k3s release notes: https://github.com/k3s-io/k3s/releases/tag/v1.28.11%2Bk3s1

In the rke2 release notes, the Rancher team just reports that the k3s version was updated.
