spire-server too high CPU usage #2827

Closed
szvincze opened this issue Mar 7, 2022 · 36 comments · Fixed by #2857

szvincze (Contributor) commented Mar 7, 2022

  • Version: 1.2.0
  • Platform: Linux 5.13.0-30-generic - 20.04.1-Ubuntu SMP x86_64
  • Subsystem: server

spire-server is running with no agent and without any load on it. CPU consumption is under 1% for ~50-60 minutes, then it suddenly starts consuming ~150-160% CPU, and this persists until the spire-server is shut down. The same happens with all replicas.
Different environments show different results, but the CPU consumption always jumps much higher after a similar amount of time (~50-60 mins).
I observed the above-mentioned ~150-160% on a kind cluster, 95-100% on minikube, 100-110% on kvm/qemu, and 45-50% on a non-virtual k8s environment.

I tried the same with spire-server 1.0.1, 1.1.0 and 1.2.0 images and got the same results.
I also tried a configuration without the k8s-workload-registrar, but it did not help.
It most probably depends on the configuration I use, since I have not managed to reproduce it with the reference configuration.

Can you please tell me what is wrong in this configuration? What is the culprit and how can I fix it?

Thanks in advance,
Szilard

azdagron (Member) commented Mar 7, 2022

That is suspicious, particularly with no load! If you turn on debug logs, is there anything in the logs that might indicate what the server is doing?

Another option here is to capture a CPU profile when this is reproduced. You can do so by setting the following configurables:

server {
    ... other stuff ...
    profiling_enabled = true
    profiling_names = ["cpu"]
    profiling_port = 1234
}

Then you can use something like the following to capture the profile:

curl -s http://localhost:1234/debug/pprof/profile > cpu.out

You can then attach that to the issue or analyze it yourself using go tool pprof.

szvincze (Contributor, Author) commented Mar 8, 2022

@azdagron I reproduced it again. I linked a log file and two capture files that show a very similar state:
spire-server.log
cpu1.out
cpu2.out

azdagron (Member) commented Mar 8, 2022

Interesting. Looks like the k8sbundle notifier plugin is spinning trying to update the bundle after a server CA rotation. We need to add some more logging to figure out what's happening. I'll see if I can't replicate this locally.

szvincze (Contributor, Author) commented Mar 8, 2022

If you use the config I referred to, you will most probably be able to reproduce it; I managed to do so in several environments.
If you have a patch with more logging that you can give me, I can use it to reproduce the issue.

azdagron (Member) commented Mar 8, 2022

Awesome, I can send a patch easily. Also, we can speed up time-to-reproduce by turning down the CA TTL, since this happens on CA rotation. Maybe set the ca_ttl configurable to something low, like 5m. I'll get a patch ready.

azdagron (Member) commented Mar 8, 2022

I'm still working on reproducing in my environment, but here is a branch on my fork with extra logging:

https://github.com/azdagron/spire/tree/add-k8sbundle-logging

azdagron (Member) commented Mar 8, 2022

It's on top of the v1.1.0 release, so you don't have to worry about other changes.

szvincze (Contributor, Author) commented Mar 8, 2022

I set ca_ttl to "5m" but it did not speed things up...

azdagron added this to the 1.2.1 milestone Mar 8, 2022
azdagron (Member) commented Mar 8, 2022

Oh, was this with an existing deployment? If so, your CA is probably still valid.

evan2645 (Member) commented Mar 8, 2022

> Oh, was this with an existing deployment? If so, your CA is probably still valid.

Small clarification: CA TTL changes don't take effect until SPIRE rotates into a new CA.

szvincze (Contributor, Author) commented Mar 8, 2022

As I see it, the CA TTL has a default value of 24h. I don't understand why this happens within an hour if it depends on CA rotation.

azdagron (Member) commented Mar 8, 2022

Oh, yes, you are right. How long was the cluster alive before the repro? Is it possible it had been running for a few days, enough that it would have stale CAs in the bundle that needed to be purged?

I have a local environment up with the patched image and am just waiting for a repro myself.

azdagron (Member) commented Mar 8, 2022

I reproduced the problem, and it seems like the watch channel for the validation webhook is being closed unexpectedly, causing an infinite loop on the select. I need to figure out how the channel is being closed.
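
For context, here is a minimal Go sketch (illustrative only, not SPIRE's actual code) of why a closed watch channel makes a select loop spin: a receive from a closed channel never blocks, so every pass through the loop completes immediately.

package main

import (
    "fmt"
    "time"
)

func main() {
    // Stand-in for watch.Interface.ResultChan(); closing it simulates the
    // API server terminating the watch request.
    events := make(chan string)
    close(events)

    deadline := time.After(100 * time.Millisecond)
    iterations := 0
    for {
        select {
        case _, ok := <-events:
            if !ok {
                // A receive on a closed channel returns immediately with
                // ok == false. If the loop neither re-establishes the watch
                // nor stops, it busy-spins right here.
                iterations++
                continue
            }
            // handle the event ...
        case <-deadline:
            fmt.Printf("selected the closed channel %d times in 100ms\n", iterations)
            return
        }
    }
}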

szvincze (Contributor, Author) commented Mar 8, 2022

Good news that you have also seen it.

I usually tried it on freshly created kind clusters that were definitely not running for days. As I wrote in the first comment, it usually happens after 50-60 mins; however, today I observed it after 34 and 35 minutes.

Did you use the configuration I linked or your own?
How long did it take to reproduce it?

azdagron (Member) commented Mar 8, 2022

I used your config. It took somewhere between 30-60 minutes, but I wasn't paying complete attention when it started happening because I was in a meeting, and eventually noticed that my fans were exploding 😆

azdagron (Member) commented Mar 8, 2022

Hmm, we never explicitly stop the watcher, so it must be that an error has occurred on the watcher. Unfortunately my logs were too noisy and I lost the reason for the failure. I'll repro again with some slight changes to help capture it.

szvincze (Contributor, Author) commented Mar 8, 2022

> eventually noticed that my fans were exploding 😆

😆 Same here. The loud fans and the draining battery with no load caught my attention.

azdagron (Member) commented Mar 8, 2022

The watch on the spiffe.io/webhook ValidatingWebhookConfiguration just closes, without any sort of error being sent or anything. Guess I need to dig into the apimachinery repo to figure out why this would happen.

azdagron (Member) commented Mar 9, 2022

I suspect what is happening is that the k8sbundle notifier uses the low-level apimachinery clients to do the watch. The API server has a request timeout of 30m by default, after which the request is closed and the watch channel is closed on the client side. Higher-level clients implement retries, so we should probably switch to them. Alternatively, we need to implement our own retries.
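
To illustrate the "implement our own retries" option, here is a rough sketch using the standard typed client-go clientset (assumed wiring; not the change that was ultimately made): when the API server closes the watch, re-establish it instead of continuing to select on the closed channel.

package main

import (
    "context"
    "log"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// watchWebhookConfig re-creates the watch whenever the server closes it,
// e.g. after the API server's default request timeout expires.
func watchWebhookConfig(ctx context.Context, client kubernetes.Interface) {
    for ctx.Err() == nil {
        // A real implementation would narrow this to the webhook
        // configuration it cares about.
        w, err := client.AdmissionregistrationV1().ValidatingWebhookConfigurations().Watch(ctx, metav1.ListOptions{})
        if err != nil {
            log.Printf("watch failed, retrying: %v", err)
            time.Sleep(5 * time.Second)
            continue
        }
        for event := range w.ResultChan() {
            // react to changes, e.g. re-apply the bundle to the webhook ...
            _ = event
        }
        // ResultChan was closed by the server; loop around and watch again.
        log.Print("watch closed, re-establishing")
    }
}

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }
    watchWebhookConfig(context.Background(), client)
}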

azdagron (Member) commented Mar 9, 2022

I think we need to refactor the k8sbundle notifier to leverage informers from client-go instead of the raw watches. Informers handle retries along with other transient conditions and keep things in sync. They incur a small overhead by caching the resources in memory, but I suspect that is negligible in normal circumstances.
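
As a rough sketch of what the informer-based approach could look like (illustrative only, with placeholder handler bodies; see #2857 for the actual change):

package main

import (
    "log"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    // The informer re-establishes its underlying watch whenever the API
    // server closes it, so the 30m request timeout no longer leaves a
    // closed channel behind. The in-memory cache it maintains is the small
    // overhead mentioned above.
    factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
    informer := factory.Admissionregistration().V1().ValidatingWebhookConfigurations().Informer()

    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            // re-apply the trust bundle to the webhook's caBundle ...
        },
        UpdateFunc: func(oldObj, newObj interface{}) {
            // re-apply the bundle if someone else overwrote it ...
        },
    })

    stop := make(chan struct{})
    factory.Start(stop)
    cache.WaitForCacheSync(stop, informer.HasSynced)

    select {} // block forever; real code would tie this to server shutdown
}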

faisal-memon (Contributor) commented

@szvincze Can you test out this image with the fix described by @azdagron? I was able to reproduce the high CPU issue and I don't see it with this image: fmemon/spire-server:2827

szvincze (Contributor, Author) commented

@faisal-memon Sure, I will do so and come back with my feedback today or tomorrow at the latest.

evan2645 modified the milestones: 1.2.1, 1.2.2 Mar 15, 2022
szvincze (Contributor, Author) commented

@faisal-memon Something is wrong with the image. I get this error when I try to switch to it in my test environment:
standard_init_linux.go:228: exec user process caused: exec format error

What architecture was the image created for?

faisal-memon (Contributor) commented

> What architecture was the image created for?

ARM, I have an M1 Mac. Let me see if I can get you an x86 build.

azdagron (Member) commented

If @faisal-memon submits a PR, the CI/CD pipeline will build a container that you can download from the archived artifacts on the action and import into docker.

faisal-memon (Contributor) commented

> If @faisal-memon submits a PR, the CI/CD pipeline will build a container that you can download from the archived artifacts on the action and import into docker.

Ok, WIP PR opened. That's a lot easier than trying to emulate x86 on ARM.

azdagron (Member) commented

Unfortunately, there is a GitHub Actions outage right now 😰

edwarnicke commented

@azdagron @faisal-memon We are seeing this in NSM... and are hoping to get our NSM v1.3 release out (release candidate 2022-03-28, release 2022-04-04). What's the ETA on getting this bug fixed?

faisal-memon (Contributor) commented

@edwarnicke @szvincze I pushed an x86 image, same tag fmemon/spire-server:2827. Can you try that out?

faisal-memon (Contributor) commented Mar 17, 2022 via email

szvincze (Contributor, Author) commented

@faisal-memon Now my test is running. I will come back with the results during the day.

szvincze (Contributor, Author) commented

@faisal-memon It has been running for 3 hours: the fans are silent, the CPU is cool, and spire-server doesn't show up in top.

edwarnicke commented

@faisal-memon Sounds like the fix works! When can we expect a release containing it?

azdagron (Member) commented

The PR is under active review, but it should be merged well in advance of (and included in) our 1.2.2 release, which will go out early to mid April.

edwarnicke commented

@azdagron Got it, so we will need to look at intermediate alternatives for our Apr 4 v1.3 release :) (like spinning a custom container). We'll be back on a release version as soon as we can manage :)

azdagron (Member) commented

I suspect the PR will be cleanly mergeable, so you may be able to get away with cherry-picking the resulting commit onto 1.2.1 :)

GerardoGR added commits to katulu-io/fl-suite that referenced this issue (May 31 and Jun 1, 2022): Upgraded spire components to 1.3.0 to avoid spire-server too high CPU usage spiffe/spire#2827