[BUG] redis sentinel 100% cpu usage #9956
Comments
Same problem here.
can you run
Due to a reboot the sentinel was restarted and is currently not using 100% CPU anymore. I will have to wait until it happens again.
Faced the same issue today on a 3-node cluster. Restarting a single pod didn't help; only scaling to 0 and back to 3 nodes solved the issue (see the sketch below).
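For anyone wanting to apply the same workaround, here is a minimal sketch. The StatefulSet name `redis-ha-server` and the `app=redis-ha` label are assumptions and depend on your Helm release:

```sh
# Scale the sentinel StatefulSet down to zero, wait for the pods
# to terminate, then scale back up; restarting a single pod was
# reportedly not enough to clear the CPU spin.
kubectl scale statefulset redis-ha-server --replicas=0
kubectl wait --for=delete pod -l app=redis-ha --timeout=120s
kubectl scale statefulset redis-ha-server --replicas=3
```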
@yevhen-harmonizehr please specify which version you're using.
redis:7.0.5-alpine
so it can't be the above-mentioned fix, which was part of 6.2.
It's reproducing with this Helm chart: https://artifacthub.io/packages/helm/dandydev-charts/redis-ha. Tried multiple versions of Redis (6.x and 7.x). It doesn't reproduce every time. Some info:
Inside the sentinel container:
It's 24% because the VM has 4 cores (so it's 100% of 1 core). lsof from the sentinel container:
i'd like a stack trace to see what it's doing.
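A minimal sketch of one way to capture such a trace from inside the container, assuming gdb is installed and the process command line matches `redis-sentinel`:

```sh
# Attach non-interactively, dump a backtrace of every thread,
# then detach; running this a few times in a row shows whether
# the process is stuck in the same call path.
gdb -p "$(pgrep -f redis-sentinel)" -batch -ex "thread apply all bt"
```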
Running as 'root', and it doesn't show the file and line numbers.
Tried with the bullseye Docker image and it gives more info:
The stack traces from the sentinel that uses high CPU for no reason:
Running multiple times:
thanks. this is useful. @michael-grunder can you take a look? any reason why redisNetRead would eat CPU, or do a blocking read (following an epoll indication that it's readable)?
No, he’s busy fixing macOS bugs.
From the call stack, it seems like maybe it's getting into a loop where ... I'll take a closer look tomorrow. Edit: It would also be interesting to see if there was anything of note in the sentinel logs.
Logs of the sentinel that is using 100% CPU:
I can see the same lines for gdb
and for the sentinel logs
Is there some workaround? Edit:
This also happens on GKE using containerd, so it is not Docker-specific. We try to manage this by setting up a liveness check on Sentinel CPU usage so that it gets restarted automatically if it uses too much CPU (a sketch follows below).
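A sketch of what such a CPU-based check could look like as an exec liveness-probe script; the 90% threshold and the `redis-sentinel` process match are assumptions, not necessarily what the commenter used:

```sh
#!/bin/sh
# Exec-probe sketch: exit non-zero when the sentinel's CPU usage
# exceeds a threshold, so Kubernetes restarts the container.
# Note: ps reports CPU averaged over the process lifetime, so a
# sustained spin will trip this, but short spikes may not.
THRESHOLD=90
CPU=$(ps -o %cpu= -p "$(pgrep -f redis-sentinel)" | cut -d. -f1 | tr -d ' ')
[ "${CPU:-0}" -le "$THRESHOLD" ]
```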
We observe this issue on GKE as well.
@oranagra - any update on this issue?
@michael-grunder can you please take a closer look?
Is there any workaround or hint on how to prevent/solve this? What additionally confuses me:
thanks in advance! edit:
Maybe a Redis expert can see something here - this is the "info" command output of the "bad" sentinel container with 100% CPU consumption ("sentinel pending-scripts" returns nothing, failover works fine):
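For reference, a sketch of how this diagnostic output can be collected from the suspect sentinel, assuming the default sentinel port 26379:

```sh
# Server-level stats from the misbehaving sentinel
redis-cli -p 26379 info
# Any pending notification/reconfiguration scripts (returned nothing here)
redis-cli -p 26379 sentinel pending-scripts
```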
I found a workaround (for Docker/k8s): I played a bit with the sentinel CLI and found out that a "sentinel reset *" always solves the problem; CPU consumption drops back to normal (see the sketch below). So far the problem occurs only after a rolling update or container restarts.
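A minimal sketch of that workaround, assuming the default sentinel port:

```sh
# Reset the sentinel's state for every monitored master (pattern "*");
# this reportedly drops the CPU usage back to normal.
redis-cli -p 26379 sentinel reset "*"
```

Note that SENTINEL RESET discards the sentinel's current state for the matched masters (known replicas, other sentinels, any failover in progress), which is then re-discovered.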
Thank you for the proposed workaround @Sven1410. Did the issue return for you over the last few days (as it is intermittent)?
A root cause analysis is (still) preferred though 😊.
So far the problem did not come back. Thanks for testing + sharing the results!
Unfortunately, the workaround does not (consistently) work on my deployments. I have 2 sets of versions and configurations, and I still observe the CPU drain, at least on the 6.2.5 version.
Note that I do not observe the
"sentinel pending-scripts" was already the next command I typed in - it returns nothing, only an empty line... |
Ah OK my bad, should have checked the sentinel cli directives. I'll verify this the next time I observe the issue.
We have experienced this in prod today and
Following up after observing this for a while (on a number of independent redis-ha deployments), findings:
Now I'll stop posting behavior patterns for this issue, in the hope of a root cause analysis. 🙏
After debugging with gdb I've got exactly the same stack traces as others posted here. Is there an option to fall back (for Linux Docker images) to "ae_kqueue.c" (#define HAVE_EPOLL 0)?
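As an aside on that question: kqueue is a BSD/macOS facility, so on Linux the non-epoll fallback chosen in Redis's ae.c is the select-based backend (ae_select.c), not ae_kqueue.c. A rough, untested sketch of forcing that fallback at build time by commenting out the epoll define in src/config.h:

```sh
# Untested: comment out "#define HAVE_EPOLL 1" so ae.c falls through
# to the select-based backend on Linux, then rebuild. For experiments only.
sed -i 's|#define HAVE_EPOLL 1|/* & */|' src/config.h
make
```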
We are seeing the same issue in our environment. Our environment is a GKE cluster running on k8s version 1.26. The perplexing thing is, this CPU issue is only happening in one namespace in the cluster. If we try deploying redis-sentinel in another namespace, we are not seeing the issue! When I read it may be a networking issue, I thought maybe the network policies in the namespace where the CPU ramp up was observed were the issue. So I added a couple policies that allowed all egress/ingress traffic. But that didn't resolve the issue either. So still stumped as to what's causing this issue and why it's only happening in one namespace!
We figured out what the issue was on our end. When I monitored the processes on the redis and sentinel containers, I noticed that the processes that get created for the health checks (liveness, readiness probes) were taking up a lot of CPU (around 40% - 50%). This didn't seem right. In the namespaces where the overall CPU usage was low, these health check processes took up a negligible amount of CPU.

One key difference between the two namespaces was that the pod in the namespace where we saw high CPU usage had a very large number of environment variables set by k8s service links. When we disabled the service links (a sketch of the change is below), we no longer observed the high CPU usage! It looks like the large number of environment variables causes the liveness/readiness probes to take up a lot of CPU.

But then again, I'm not sure if the high number of environment variables is the root cause of the CPU usage, because we have other redis bitnami deployments running in standalone mode where the same health checks are run, and we don't see the health checks using up the CPU like in the redis-sentinel pods. Perhaps a combination of having a large number of environment variables and multiple containers in a pod is causing the health check processes to take up a lot of CPU? I also found an article discussing how exec probes can result in large CPU usage, but I'm not sure if it is related to the health check problems in redis-sentinel. And another GitHub thread discussing the same high CPU issue for exec probes.
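For anyone wanting to try the same fix, a sketch of disabling service links on the workload. The StatefulSet name is hypothetical; `enableServiceLinks: false` is the pod-spec field that stops Kubernetes from injecting one set of environment variables per Service in the namespace:

```sh
# Stop Kubernetes from injecting service-link environment
# variables into the sentinel pods.
kubectl patch statefulset redis-ha-server --type merge \
  -p '{"spec":{"template":{"spec":{"enableServiceLinks":false}}}}'
```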
Describe the bug
Redis Sentinel is having 100% CPU usage. We are using the redis-ha Helm chart version 4.12.15 under the default configuration, but we have increased resource requests and limits by a significant amount. The Redis version used by us is 6.0.7.
To reproduce
I'm not sure if it can be reproduced by anyone else.
Even after a Redis restart, one or two sentinels reach 100% CPU utilization.
Other Redis metrics seem fine.
Expected behavior
Sentinel should not be taking so much CPU.
Additional information
I ran strace on one of the sentinels and found this:
This seems like a repeating pattern where Redis is trying to read from two sockets, failing continuously, followed by an epoll_wait. It happens more than 100 times a second.
Sockets 11 and 10 seem to point to a connection with the Redis process for this sentinel. To be exact, they point to the ClusterIP of the redis-cluster-announce pod on which the sentinel was running, on port 6379.
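For context, a sketch of an strace invocation that surfaces this kind of pattern; the syscall filter is an assumption, adjust as needed:

```sh
# Follow all threads (-f), timestamp each call (-tt), and trace only
# the syscalls involved in the suspected busy loop.
strace -f -tt -e trace=read,epoll_wait -p "$(pgrep -f redis-sentinel)"
```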
redis
sentinel
Any tips for debugging this?