
RKE2 v1.28.10+rke2r1 cluster stops working when critical pods are OOM killed #6249

Closed
serhiynovos opened this issue Jun 25, 2024 · 22 comments

@serhiynovos

serhiynovos commented Jun 25, 2024

Hi. I have been running RKE2 for a few months and recently updated to v1.28.10+rke2r1. I noticed that about every week my cluster stops responding and all workloads stop working, and that my server nodes (control-plane and etcd) show high CPU/memory usage along with heavy disk I/O. I then disabled swap, but it did not help.

My VMs run under Proxmox, and each server node is allocated 10 CPUs and 10 GB of RAM. Should I increase memory on these machines, or is there some bug? I have never faced this issue before; the cluster worked fine for a few months.

I'm running 5 agents, so the server VM size should be more than enough based on this table:
[image: server sizing table]

@serhiynovos

BTW, I logged in to one of the server nodes:
[screenshot from one of the server nodes]

@serhiynovos

[Screenshot 2024-06-26 at 22:52:37]

I noticed the RKE2 process has started taking more memory. It is now at 2.7 GB, but about 10 minutes ago it was 2.6 GB.

@serhiynovos

[Screenshot 2024-06-26 at 23:02:43]

I also have another rke2 cluster with a single node where I don't see this problem.

@serhiynovos

[Screenshot 2024-06-27 at 10:07:19]

Now the RKE2 process is taking 3.5 GB; 11 hours ago it was 2.7 GB.

@serhiynovos

serhiynovos commented Jun 27, 2024

I also see this kind of message on the rke2-server nodes. Maybe it is related:

Snapshot ConfigMap is too large, attempting to elide 4 of 664 entries to reduce size

@brandond

brandond commented Jun 27, 2024

top shows that you have ~7 GB of memory free, so what is causing the Kubernetes and etcd pods to get OOM killed? These pods should not have any resource limits by default; have you customized the resource requests/limits such that they are getting killed by the kubelet, or do you have something else on your node that is triggering this?

Do you actually have sufficient memory on your Proxmox cluster to allocate 10 GB of RAM to these nodes, or are you oversubscribed such that the Proxmox balloon driver is forcing nodes to free memory? You should DEFINITELY not be enabling dynamic memory management on your Kubernetes node VMs: https://pve.proxmox.com/wiki/Dynamic_Memory_Management

@brandond changed the title from "RKE2 v1.28.10+rke2r1 cluster stops working about in one week" to "RKE2 v1.28.10+rke2r1 cluster stops working when critical pods are OOM killed" on Jun 27, 2024
@serhiynovos

serhiynovos commented Jun 27, 2024

> top shows that you have ~7 GB of memory free, so what is causing the Kubernetes and etcd pods to get OOM killed? These pods should not have any resource limits by default; have you customized the resource requests/limits such that they are getting killed by the kubelet, or do you have something else on your node that is triggering this?
>
> Do you actually have sufficient memory on your Proxmox cluster to allocate 10 GB of RAM to these nodes, or are you oversubscribed such that the Proxmox balloon driver is forcing nodes to free memory? You should DEFINITELY not be enabling dynamic memory management on your Kubernetes node VMs: https://pve.proxmox.com/wiki/Dynamic_Memory_Management

Hi @brandond, thank you for your response. The top screenshots were taken after I restarted all nodes and expanded them from 10 GB to 16 GB of RAM, to show that the RKE2 process keeps taking more memory over time.

@serhiynovos

serhiynovos commented Jun 27, 2024

> Snapshot ConfigMap is too large, attempting to elide 4 of 664 entries to reduce size

It also seemed strange that rke2 does not clear old S3 snapshots; I see about 600 snapshots uploaded to S3.

One possible cause of this issue, and of the rke2 memory growth that was triggering OOM kills, is that I found a lot of Longhorn volume snapshot resources created for Velero, and most of the volumes they referenced no longer exist. I cleared all of those resources and restarted all nodes. After these changes it also looks like #6265 was resolved and the agent load balancer connected to the other server nodes, or maybe that was only fixed by the reboot and may still happen in the future. So to resolve it completely, do I have to update to 1.28.11?

@brandond

If you're using S3 you might also be running into #6162 - which is fixed in v1.28.11 as well.

@serhiynovos

> If you're using S3 you might also be running into #6162 - which is fixed in v1.28.11 as well.

Yes, it looks like exactly the same issue. Is there any way I can update my cluster to 1.28.11 from the Rancher UI?

@brandond

No, the release is not available yet.

@serhiynovos

serhiynovos commented Jun 28, 2024

@brandond going back to the original issue, I noticed that after the reboot I still see high resource usage from the rke2 process. It is using 1.1 GB of RAM, while on the other server nodes rke2 uses only about 250 MB. I see this problem only on the default server node, the one that was installed first during cluster installation.

From the logs I don't see anything unusual; I just see this info-level message every 10 minutes:

Jun 28 08:49:06 cluster-master-1 rke2[808]: time="2024-06-28T08:49:06Z" level=info msg="Checking if S3 bucket k8s-backups exists"
Jun 28 08:49:07 cluster-master-1 rke2[808]: time="2024-06-28T08:49:07Z" level=info msg="S3 bucket k8s-backups exists"

Let me observe it for a while to see how fast memory usage grows. Also, can you please tell me how to enable the debug log level from Rancher? Maybe it will show more information about why memory usage of the RKE2 process is growing.
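
For anyone landing here later: a minimal sketch of turning debug logging on directly on a server node, assuming the standard rke2 config path (doing it through the Rancher UI may work differently):

# Enable debug logging for the rke2 server on the node itself
echo "debug: true" >> /etc/rancher/rke2/config.yaml
systemctl restart rke2-server.service
# Follow the service logs to watch for new messages
journalctl -u rke2-server -f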

@serhiynovos


I ran it in debug mode, and now I'm seeing these messages; after they appear, I notice the rke2 process starts taking more memory:

Jun 28 16:51:04 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:04Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718859601"
Jun 28 16:51:04 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:04Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718877601"
Jun 28 16:51:04 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:04Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718895603"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718913601"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718928004"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718946003"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718964001"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1718982004"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719000003"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719014401"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719032403"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719050402"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719068405"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719086403"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719100805"
Jun 28 16:51:05 cluster-master-1 rke2[657173]: time="2024-06-28T16:51:05Z" level=debug msg="Loading snapshot metadata from s3://k8s-backups/prod-etcd/.metadata/etcd-snapshot-cluster-master-2-1719118801"

@brandond

You mentioned you have 664 snapshots on S3, due to the issue with s3 snapshots not being pruned. Have you considered manually cleaning those up, until you can upgrade to a release where that issue is fixed?

@serhiynovos

I removed maybe 100 snapshots. Do you want me to remove all S3 snapshots and see whether the memory increase goes away?

If yes, after removing all snapshots from S3, should I restart all server nodes, or leave them as-is and the snapshots will be removed from the ConfigMap automatically?

@brandond

The more you clean up the less it will have to keep track of...

I would probably restart rke2 after doing the cleanup, if you want to see an immediate decrease in memory utilization.
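
A rough sketch of that cleanup flow with the rke2 etcd-snapshot subcommands, assuming they behave like their k3s counterparts on this version (the snapshot name is the one from the debug log above; verify the exact flags with rke2 etcd-snapshot --help):

# List the snapshots rke2 currently tracks (local and, if configured, S3)
rke2 etcd-snapshot ls
# Delete a specific snapshot by name; with S3 configured on the server this
# should also remove the corresponding object from the bucket
rke2 etcd-snapshot delete etcd-snapshot-cluster-master-2-1718859601
# Restart the server service afterwards to see memory drop right away
systemctl restart rke2-server.service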

@serhiynovos

@brandond after cleaning up all the snapshots, over the weekend I did not see a significant increase in memory usage for the rke2 process on the first server/etcd node, but it still grew to 338 MB, while on the other server nodes it takes 160 and 190 MB.
It looks like memory usage will keep growing as the snapshot count increases. There may be a memory leak, since the GC should clear all resources from the heap after each snapshot check. Even if snapshot eviction works correctly, it is a valid scenario for somebody to set the keep-last-snapshots-per-node parameter to 100, and that would cause memory usage to increase.
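
For reference, the retention knob being described is the server's etcd-snapshot-retention option; a sketch of setting it on a server node, with 100 as the hypothetical value from the comment above (the default is 5):

# Set etcd snapshot retention in the server config and restart to apply
echo "etcd-snapshot-retention: 100" >> /etc/rancher/rke2/config.yaml
systemctl restart rke2-server.service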

@brandond

brandond commented Jul 1, 2024

> There may be a memory leak, since the GC should clear all resources from the heap after each snapshot check.

That is not how golang works, and there is not any leak within rke2. Golang does not aggressively release memory back to the host operating system unless it is under external memory pressure. It is normal to see a golang process's memory utilization stay slightly higher after an allocation.

@brandond closed this as completed on Jul 1, 2024
@serhiynovos

So the RKE2 process may end up using a few GB of RAM? Also, in my case the system was under memory pressure and had to kill some processes to free memory.

@brandond

brandond commented Jul 1, 2024

Depending on workload, the memory utilization of different components may fluctuate, yes. For the rke2 service itself, you can always put an upper limit on it via MemoryHigh/MemoryMax in the rke2 systemd unit:
https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html#MemoryHigh=bytes
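
A minimal sketch of such a limit as a systemd drop-in for the rke2-server unit; the values here are purely illustrative, not recommendations:

# Create a drop-in with memory limits for the rke2 server service
mkdir -p /etc/systemd/system/rke2-server.service.d
cat <<'EOF' > /etc/systemd/system/rke2-server.service.d/50-memory.conf
[Service]
MemoryHigh=4G
MemoryMax=6G
EOF
# Reload systemd and restart the service to apply the limits
systemctl daemon-reload
systemctl restart rke2-server.service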

@harridu

harridu commented Jul 18, 2024

> If you're using S3 you might also be running into #6162 - which is fixed in v1.28.11 as well.

The release notes at https://github.com/rancher/rke2/releases/tag/v1.28.11%2Brke2r1 don't say :-(. I'd love to learn what has actually been fixed in 1.28.11.

@serhiynovos

@harridu rke2 is based on k3s, and some changes are described in more detail in the k3s release notes: https://github.com/k3s-io/k3s/releases/tag/v1.28.11%2Bk3s1

In the rke2 release notes, the Rancher team just reports that the k3s version was updated.
