Upgrade to Rancher 2.5.2 and 2.4.10: cattle-cluster-agent hammers apiserver, very high CPU usage #30048

Closed
horihel opened this issue Nov 12, 2020 · 14 comments

@horihel

horihel commented Nov 12, 2020

What kind of request is this (question/bug/enhancement/feature request):
possible bug

Steps to reproduce (least amount of steps as possible):
upgrade existing rancher from 2.4.8 to 2.5.2

Result:
cattle-agent CPU utilization skyrockets
[screenshot]

The upgrade took place at 08:00; since then cattle-agent has been pegging the CPU. It also seems to affect kube-apiserver (also taking >100%).

I see this on 2 clusters: one using vSphere and one custom bare-metal cluster provisioned by Rancher.

Requests to kube-apiserver also went up at the same time:
[screenshot]

Judging by further Prometheus metrics, it seems to write a lot to "apps" resources:
[screenshot]

Other details that may be helpful:

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI):
    v2.5.2
  • Installation option (single install/HA):
    single install

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported):
    2 clusters show the problem: one Rancher-provisioned on vSphere and one Custom/bare metal, also Rancher-provisioned.
  • Machine type (cloud/VM/metal) and specifications (CPU/memory):

vSphere cluster:
VMs (3 mgmt nodes, 4 CPUs, 16GB RAM each; 6 workers, 4 CPUs, 16GB RAM each), based on RancherOS

Custom cluster:
Metal, 5 mixed nodes (4-8 CPUs, 8-64GB memory).

  • Kubernetes version (use kubectl version):
    1.18.10
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:43:56Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

  • Docker version (use docker version):
metal cluster: 19.3.8
VM cluster: 19.3.8
@lucky4ever2

I'm not sure if it's the same problem, but since the update from 2.5.1 to 2.5.2 I've also had problems with high CPU usage.
I did the update at 9:30 am, and since then my monitoring system (checkmk) has shown a much higher CPU load and CPU utilization on every node. RAM is not affected.

[screenshot]

[screenshot]

I have a Rancher cluster based on VMs (Hyper-V) with 3 nodes and a bare-metal cluster with 3 nodes, both created with Rancher.
It affects both clusters (Kubernetes version 1.19.3).

kube-apiserver in the top display:
[screenshot]

@horihel
Author

horihel commented Nov 13, 2020

I rolled back Rancher to 2.4.8 and the problem is gone.

@lucky4ever2

How did you roll back Rancher?

@horihel
Author

horihel commented Nov 13, 2020

@lucky4ever2

I have now restored the VM snapshots. Now it looks good again:
[screenshot]
[screenshot]
[screenshot]

@mrajashree
Contributor

@horihel @lucky4ever2 Do you have Project Network Isolation enabled on these clusters? And do you see the logs reported in this issue?

@lucky4ever2

Yes, I have Network Isolation enabled. I don't have any logs because I wanted to downgrade again quickly.

@mrajashree
Contributor

@lucky4ever2 Okay, it's a known issue with 2.5.2; we're tracking it with this issue. The workaround is to turn off Project Network Isolation. But I see that you have already successfully rolled back to 2.4.x, so you don't need to turn off network isolation on 2.4.x.
Once @horihel confirms whether he has the same issue, we can close this issue and use the other one for tracking the fix.

@horihel
Author

horihel commented Nov 16, 2020

I can confirm that network isolation was enabled on both clusters on my side too.

As I already rolled back, I cannot immediately confirm whether disabling it solves the problem. I'll try to set up a new test environment.

@maggieliu

Closing in favor of #30045

@maggieliu maggieliu added this to the v2.5.3 milestone Nov 17, 2020
@kinarashah kinarashah changed the title Upgrade to Rancher 2.5.2: cattle-cluster-agent hammers apiserver, very high CPU usage Upgrade to Rancher 2.5.2 and 2.4.10: cattle-cluster-agent hammers apiserver, very high CPU usage Nov 30, 2020
@horihel
Author

horihel commented Dec 2, 2020

Is it possible to re-open this issue?
Upgrading to 2.5.3 did not solve the problem.
Disabling network isolation (after the upgrade) did not solve the problem.
(In relation to #30045:) redeploying the ingress controller did not solve the problem.

Cluster operation still seems to work, although slower than usual.

Please advise; otherwise I need to downgrade again.

@horihel
Author

horihel commented Dec 3, 2020

Tried the upgrade to 2.5.3 again today, this time turning PNI (Project Network Isolation) off before upgrading. Still misbehaving.

Downgraded and upgraded to 2.4.11 instead. This time the behaviour of cattle-cluster-agent seems to be normal.

So 2.4.11 seems to be fine, while 2.5.3 (stable) still seems to be misbehaving.

@danielschlegel

danielschlegel commented Dec 4, 2020

I also see high CPU usage on 2.5.3 when I try to open the System project. The server uses less than 10% CPU while idle; when I try to access the System project, CPU goes to 300% and the request fails after a timeout.

@nickvth

nickvth commented Dec 9, 2020

Please reopen, still issues with 2.5.3.
