[SURE-6990] fleet-controller consuming high cpu #1842

Closed
kkaempf opened this issue Oct 6, 2023 · 4 comments

@kkaempf
Collaborator

kkaempf commented Oct 6, 2023

SURE-6990

Issue description:

fleet-controller is consuming more than 5 GHz of CPU. The customer is using Azure DevOps with 300+ Git repositories.

Business impact:

Fleet becomes unstable, and in some cases so does the node running Rancher management.

Troubleshooting steps:

Scaled the nodes 4x.

Repro steps:

No repro steps yet.

Workaround:

Is a workaround available and implemented? Yes
What is the workaround: scaling the nodes vertically

Actual behavior:

fleet-controller consumes more than 5 GHz of CPU.

@skaven81

skaven81 commented Dec 11, 2023

We are observing the same problem. It seems to occur whenever any Bundles are out of sync. Once all the Bundles are back in sync, the CPU utilization calms back down. But if anything goes out of sync, fleet-controller starts using all the CPU it can get its hands on.

When any Bundle is out of sync:

$ kubectl -n cattle-fleet-system top pod
NAME                               CPU(cores)   MEMORY(bytes)
fleet-controller-5649d6545-fql99   4024m        695Mi
gitjob-5b8c4fdc67-tns7f            12m          152Mi

When all Bundles are back in sync:

$ kubectl -n cattle-fleet-system top pod
NAME                               CPU(cores)   MEMORY(bytes)
fleet-controller-5649d6545-fql99   7m           548Mi
gitjob-5b8c4fdc67-tns7f            12m          151Mi
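
To watch for the spike described above without reading the table by eye, the `kubectl top pod` output can be filtered with a small helper. This is a minimal sketch, not part of Fleet: the function name `hot_pods` and the 1000m default threshold are assumptions for illustration, and it only handles CPU values reported in millicores (the `m` suffix shown above).

```shell
# Print pods whose CPU usage exceeds a threshold, given in millicores.
# Reads `kubectl top pod` output on stdin. The function name and the
# default threshold are hypothetical choices for this sketch.
hot_pods() {
  threshold="${1:-1000}"
  awk -v t="$threshold" '
    # Skip the header row; only handle values such as "4024m".
    NR > 1 && $2 ~ /^[0-9]+m$/ {
      cpu = $2
      sub(/m$/, "", cpu)
      if (cpu + 0 > t + 0) print $1, cpu
    }'
}

# Usage (requires metrics-server in the cluster):
#   kubectl -n cattle-fleet-system top pod | hot_pods 1000
```

Run against the first table above, this would list only the fleet-controller pod, since gitjob stays well under the threshold.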

@skaven81

I was able to capture a core file from the fleet-controller process using Delve. If somebody can tell me what to do with such a file (or if there are other Delve commands I can run to extract the useful information for debugging this) please let me know. I cannot attach the core file itself.
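
For reference, one way such a core file can usually be inspected is with `dlv core`, which takes the executable and the core file, and can run commands from a script passed with `--init`. The sketch below is an assumption about a workable setup, not something confirmed in this thread: the helper name `write_dlv_init`, the file names, and the availability of a `fleetcontroller` binary that exactly matches the one that produced the core are all hypothetical.

```shell
# Write a Delve init script that prints every goroutine with its stack
# trace and then exits. The function and file names are hypothetical.
write_dlv_init() {
  cat > "$1" <<'EOF'
goroutines -t
exit
EOF
}

# Usage, assuming ./fleetcontroller is the exact binary that produced ./core:
#   write_dlv_init dump.dlv
#   dlv core ./fleetcontroller ./core --init dump.dlv > goroutines.txt
```

The resulting text file contains only stack traces rather than process memory, so it may be easier to share than the core file itself, though it should still be reviewed before posting.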

@skaven81

skaven81 commented Dec 11, 2023

If anybody else sees this and wants to do the same, here's how I attached to it:

First, get the container ID of fleet-controller (8021a57ab5de in this example):

$ docker ps | grep fleet-controller
8021a57ab5de   fdcac788e741                                                       "fleetcontroller --d…"   3 minutes ago   Up 3 minutes             k8s_fleet-controller_fleet-controller-5649d6545-fql99_cattle-fleet-system_31ae6828-7972-4893-b278-d7cf4113b245_0
663317e93319   docker-registry.qualcomm.com/rancher/mirrored-pause:3.6            "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_fleet-controller-5649d6545-fql99_cattle-fleet-system_31ae6828-7972-4893-b278-d7cf4113b245_0

Then attach to it with a container that has the dlv binary in it:

$ docker run -it --rm --privileged --pid=container:8021a57ab5de your-registry.example.com/org/delve:latest

Once inside, the fleet-controller is running as PID 1 and you can attach to it.

If you cannot attach, you may need to drop out of the container and temporarily set this on the host to get the attach to work:

$ echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
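
The steps above can be sketched as a small script. This is an illustration under assumptions rather than a tested procedure: the helper name `fleet_controller_id` is hypothetical, the image name is the placeholder used above, and the sketch relies on the `k8s_fleet-controller_` container name prefix seen in the `docker ps` output to skip the pause container.

```shell
# Read `docker ps` output on stdin and print the ID of the fleet-controller
# container, skipping the pause ("k8s_POD_") sandbox container.
# The function name is a hypothetical choice for this sketch.
fleet_controller_id() {
  awk '$NF ~ /^k8s_fleet-controller_/ { print $1; exit }'
}

# Usage:
#   cid=$(docker ps | fleet_controller_id)
#   docker run -it --rm --privileged --pid="container:$cid" \
#     your-registry.example.com/org/delve:latest
#   # then, inside that container, attach to PID 1:
#   dlv attach 1
```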

kkaempf changed the title from "fleet-controller consuming high cpu" to "[SURE-6990] fleet-controller consuming high cpu" on Dec 12, 2023
@aruiz14
Contributor

aruiz14 commented Jan 10, 2024

/backport v2.7.x-next
