spire-server too high CPU usage #2827

Closed
szvincze opened this issue Mar 7, 2022 · 36 comments · Fixed by #2857

szvincze (Contributor) commented Mar 7, 2022

  • Version: 1.2.0
  • Platform: Linux 5.13.0-30-generic - 20.04.1-Ubuntu SMP x86_64
  • Subsystem: server

spire-server is running with no agent and without any load on it. CPU consumption is under 1% for ~50-60 minutes, then it suddenly starts consuming ~150-160% CPU, and this persists until the spire-server is shut down. The same happens with all replicas.
Different environments show different results, but the CPU consumption always jumps much higher after a similar amount of time (~50-60 mins).
I observed the above-mentioned ~150-160% on a kind cluster, 95-100% on minikube, 100-110% on kvm/qemu, and 45-50% on a non-virtual k8s environment.

I tried the same with spire-server 1.0.1, 1.1.0 and 1.2.0 images and got the same results.
I also tried a configuration without the k8s-workload-registrar, but it did not help.
It most probably depends on the configuration I use, since I have not managed to reproduce it with the reference configuration.

Can you please tell me what is wrong in this configuration? What is the culprit and how can I fix it?

Thanks in advance,
Szilard

azdagron (Member) commented Mar 7, 2022

That is suspicious, particularly with no load! If you turn on debug logs, is there anything in the logs that might indicate what the server is doing?

Another option here is to capture a CPU profile when this is reproduced. You can do so by setting the following configurables:

server {
    ... other stuff ...
    profiling_enabled = true
    profiling_names = ["cpu"]
    profiling_port = 1234
}

Then you can use something like the following to capture the profile:

curl -s http://localhost:1234/debug/pprof/profile > cpu.out

You can then attach that to the issue or analyze it yourself using go tool pprof.

szvincze (Contributor, Author) commented Mar 8, 2022

@azdagron I reproduced it again. I linked a log file and two capture files that show a very similar state:
spire-server.log
cpu1.out
cpu2.out

azdagron (Member) commented Mar 8, 2022

Interesting. Looks like the k8sbundle notifier plugin is spinning trying to update the bundle after a server CA rotation. We need to add some more logging to figure out what's happening. I'll see if I can't replicate this locally.

szvincze (Contributor, Author) commented Mar 8, 2022

If you use the config I referred to, you will most probably be able to reproduce it; I managed to do so in several environments.
If you have a patch with more logging that you can give me, I can use it to reproduce the issue.

azdagron (Member) commented Mar 8, 2022

Awesome, I can send a patch easily. Also, we can speed up time-to-reproduce by turning down the CA TTL, since this happens on CA rotation. Maybe set the ca_ttl configurable to something low, like 5m. I'll get a patch ready.

azdagron (Member) commented Mar 8, 2022

I'm still working on reproducing in my environment, but here is a branch on my fork with extra logging:

https://github.com/azdagron/spire/tree/add-k8sbundle-logging

azdagron (Member) commented Mar 8, 2022

It's on top of the v1.1.0 release, so you don't have to worry about other changes.

szvincze (Contributor, Author) commented Mar 8, 2022

I set ca_ttl to "5m" but it did not speed things up...

azdagron added this to the 1.2.1 milestone Mar 8, 2022
azdagron (Member) commented Mar 8, 2022

Oh, was this with an existing deployment? If so, your CA is probably still valid.

evan2645 (Member) commented Mar 8, 2022

> Oh, was this with an existing deployment? If so, your CA is probably still valid.

Small clarification: CA TTL changes don't take effect until SPIRE rotates into a new CA.

szvincze (Contributor, Author) commented Mar 8, 2022

As I see it, the CA TTL has a default value of 24h. I don't understand why this happens within an hour if it depends on CA rotation.

azdagron (Member) commented Mar 8, 2022

Oh, yes, you are right. How long was the cluster alive before the repro? Is it possible it had been running for a few days, enough that it would have stale CAs in the bundle that needed to be purged?

I have a local environment up with the patched image and am just waiting for a repro myself.

azdagron (Member) commented Mar 8, 2022

I reproduced the problem, and it seems like the watch channel for the validation webhook is being closed unexpectedly, causing an infinite loop on the select. I need to figure out how the channel is being closed.
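
For context, here is a minimal Go sketch (illustrative only, not SPIRE's actual code) of why a closed watch channel makes a select loop spin: a receive from a closed channel never blocks, so every pass through the loop completes immediately.

package main

import (
    "fmt"
    "time"
)

func main() {
    // Stand-in for watch.Interface.ResultChan(); closing it simulates the
    // API server terminating the watch request.
    events := make(chan string)
    close(events)

    deadline := time.After(100 * time.Millisecond)
    iterations := 0
    for {
        select {
        case _, ok := <-events:
            if !ok {
                // A receive on a closed channel returns immediately with
                // ok == false. If the loop neither re-establishes the watch
                // nor stops, it busy-spins right here.
                iterations++
                continue
            }
            // handle the event ...
        case <-deadline:
            fmt.Printf("selected the closed channel %d times in 100ms\n", iterations)
            return
        }
    }
}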

szvincze (Contributor, Author) commented Mar 8, 2022

Good news that you have also seen it.

I usually tried it on freshly created kind clusters that were definitely not running for days. As I wrote in the first comment, it usually happens after 50-60 mins; however, today I observed it after 34 and 35 minutes.

Did you use the configuration I linked or your own?
How long did it take to reproduce it?

azdagron (Member) commented Mar 8, 2022

I used your config. It took somewhere between 30-60 minutes, but I wasn't paying complete attention when it started happening because I was in a meeting, and eventually noticed that my fans were exploding 😆

azdagron (Member) commented Mar 8, 2022

Hmm, we never explicitly stop the watcher, so it must be that an error has occurred on the watcher. Unfortunately my logs were too noisy and I lost the reason for the failure. I'll repro again with some slight changes to help capture it.

szvincze (Contributor, Author) commented Mar 8, 2022

> eventually noticed that my fans were exploding 😆

😆 Same here. The loud fans and the draining battery with no load caught my attention.

azdagron (Member) commented Mar 8, 2022

The watch on the spiffe.io/webhook ValidatingWebhookConfiguration just closes, without any sort of error being sent or anything. Guess I need to dig into the apimachinery repo to figure out why this would happen.

azdagron (Member) commented Mar 9, 2022

I suspect what is happening is that the k8sbundle notifier uses the low-level apimachinery clients to do the watch. The API server has a request timeout of 30m by default, after which the request is closed and the watch channel is closed on the client side. Higher-level clients implement retries, so we should probably switch to them. Alternatively, we need to implement our own retries.
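
To illustrate the "implement our own retries" option, here is a rough sketch using the standard typed client-go clientset (assumed wiring; not the change that was ultimately made): when the API server closes the watch, re-establish it instead of continuing to select on the closed channel.

package main

import (
    "context"
    "log"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// watchWebhookConfig re-creates the watch whenever the server closes it,
// e.g. after the API server's default request timeout expires.
func watchWebhookConfig(ctx context.Context, client kubernetes.Interface) {
    for ctx.Err() == nil {
        // A real implementation would narrow this to the webhook
        // configuration it cares about.
        w, err := client.AdmissionregistrationV1().ValidatingWebhookConfigurations().Watch(ctx, metav1.ListOptions{})
        if err != nil {
            log.Printf("watch failed, retrying: %v", err)
            time.Sleep(5 * time.Second)
            continue
        }
        for event := range w.ResultChan() {
            // react to changes, e.g. re-apply the bundle to the webhook ...
            _ = event
        }
        // ResultChan was closed by the server; loop around and watch again.
        log.Print("watch closed, re-establishing")
    }
}

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }
    watchWebhookConfig(context.Background(), client)
}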

azdagron (Member) commented Mar 9, 2022

I think we need to refactor the k8sbundle notifier to leverage informers from client-go instead of the raw watches. Informers handle retries along with other transient conditions and keep things in sync. They incur a small overhead by caching the resources in memory, but I suspect that is negligible in normal circumstances.
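
As a rough sketch of what the informer-based approach could look like (illustrative only, with placeholder handler bodies; see #2857 for the actual change):

package main

import (
    "log"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    // The informer re-establishes its underlying watch whenever the API
    // server closes it, so the 30m request timeout no longer leaves a
    // closed channel behind. The in-memory cache it maintains is the small
    // overhead mentioned above.
    factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
    informer := factory.Admissionregistration().V1().ValidatingWebhookConfigurations().Informer()

    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            // re-apply the trust bundle to the webhook's caBundle ...
        },
        UpdateFunc: func(oldObj, newObj interface{}) {
            // re-apply the bundle if someone else overwrote it ...
        },
    })

    stop := make(chan struct{})
    factory.Start(stop)
    cache.WaitForCacheSync(stop, informer.HasSynced)

    select {} // block forever; real code would tie this to server shutdown
}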

faisal-memon (Contributor) commented

@szvincze Can you test out this image with the fix described by @azdagron? I was able to reproduce the high CPU issue and I don't see it with this image: fmemon/spire-server:2827

szvincze (Contributor, Author) commented

@faisal-memon Sure, I will do so and come back with my feedback today or tomorrow at the latest.

evan2645 modified the milestones: 1.2.1, 1.2.2 Mar 15, 2022
szvincze (Contributor, Author) commented

@faisal-memon Something is wrong with the image. I get this error when I try to switch to it in my test environment:
standard_init_linux.go:228: exec user process caused: exec format error

What architecture was the image created for?

faisal-memon (Contributor) commented

> What architecture was the image created for?

ARM, I have an M1 Mac. Let me see if I can get you an x86 build.

azdagron (Member) commented

If @faisal-memon submits a PR, the CI/CD pipeline will build a container that you can download from the archived artifacts on the action and import into docker.

faisal-memon (Contributor) commented

> If @faisal-memon submits a PR, the CI/CD pipeline will build a container that you can download from the archived artifacts on the action and import into docker.

Ok, WIP PR opened. That's a lot easier than trying to emulate x86 on ARM.

azdagron (Member) commented

Unfortunately, there is a GitHub Actions outage right now 😰

edwarnicke commented

@azdagron @faisal-memon We are seeing this in NSM... and are hoping to get our NSM v1.3 release out (release candidate 2022-03-28, release 2022-04-04). What's the ETA on getting this bug fixed?

faisal-memon (Contributor) commented

@edwarnicke @szvincze I pushed an x86 image, same tag fmemon/spire-server:2827. Can you try that out?

faisal-memon (Contributor) commented Mar 17, 2022 via email

szvincze (Contributor, Author) commented

@faisal-memon Now my test is running. I will come back with the results during the day.

szvincze (Contributor, Author) commented

@faisal-memon It has been running for 3 hours: the fans are silent, the CPU is cool, and spire-server doesn't show up in top.

edwarnicke commented

@faisal-memon Sounds like the fix works! When can we expect a release containing it?

azdagron (Member) commented

The PR is under active review, but it should be merged well in advance of (and included in) our 1.2.2 release, which will go out early to mid April.

edwarnicke commented

@azdagron Got it, so we will need to look at intermediate alternatives for our Apr 4 v1.3 release :) (like spinning a custom container). We'll be back on a release version as soon as we can manage :)

azdagron (Member) commented

I suspect the PR will be cleanly mergeable, so you may be able to get away with cherry-picking the resulting commit onto 1.2.1 :)

GerardoGR added commits to katulu-io/fl-suite that referenced this issue (May 31 and Jun 1, 2022): Upgraded spire components to 1.3.0 to avoid spire-server too high CPU usage spiffe/spire#2827