frequent leader changes for components: kube-controller-manager, kube-scheduler, etcd #2295
Comments
Please share logs from the following containers:
And some more info on the setup:
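On an RKE node these all run as plain Docker containers, so something like the following should collect them (the container names assume a default RKE install; adjust `--tail` as needed):

```sh
# Grab recent logs from each control-plane container on a master node.
for c in etcd kube-apiserver kube-controller-manager kube-scheduler; do
  docker logs --tail 1000 "$c" > "/tmp/$c.log" 2>&1
done
```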
masters: 4 CPU cores, 8 GB RAM, HDD. In the output of top, wa (I/O wait) is sometimes high for all cores (wa > 1).

kube-scheduler:

kube-controller-manager:

etcd:

kube-apiserver: (| grep error)
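To put a number on that I/O wait beyond what top shows, something like iostat (from the sysstat package) could be run on the masters while the cluster is busy:

```sh
# Extended per-device stats, 1-second samples. Sustained high await/%util on
# the disk backing /var/lib/etcd would confirm the wa seen in top.
iostat -x 1 5
```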
What exactly should I pay attention to here? Output from the fio test:

mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
Run status group 0 (all jobs):
Disk stats (read/write):

etcd metrics:
I don't think this is fast enough. I guess you used the command from https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd, where it states what the 99th percentile of the fdatasync durations needs to stay under. Your example shows a 99th percentile well above that. Before you do anything to the current cluster, I would start with doing some tests on other machines on the VM storage, and check if you can get better storage attached to them.
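For reference, the command from that article is roughly the following (it matches the rw=write/bs=2300/ioengine=sync parameters visible in the output above; the target directory is a placeholder and should live on the disk that backs etcd):

```sh
# Simulate etcd's write pattern: small sequential writes, each followed by an
# fdatasync. etcd's hardware guide wants the 99th-percentile fdatasync latency
# in the single-digit millisecond range.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-fio-test --size=22m --bs=2300 --name=mytest
```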
So, if I understand correctly, these problems can have a direct impact on other components like kube-scheduler and kube-controller-manager (frequent leader elections)? Moreover, after I repeated the fio test the outcomes were better, for example: write:
There is more information here: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#prerequisites. Fluctuating results are also bad; 42630 usec is still not good enough, although the IOPS seem a bit better. I would advise improving storage performance to see if that clears up your issues.
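One way to watch this on the running cluster is etcd's own latency histograms on its metrics endpoint. A sketch, assuming a default RKE etcd listening on 2379 (the certificate paths are placeholders for whatever client certs your etcd container uses):

```sh
# wal_fsync_duration_seconds and backend_commit_duration_seconds are the two
# histograms that reflect disk latency as etcd experiences it.
curl -s --cacert /etc/kubernetes/ssl/kube-ca.pem \
     --cert /etc/kubernetes/ssl/client.pem \
     --key /etc/kubernetes/ssl/client-key.pem \
     https://127.0.0.1:2379/metrics \
  | grep -E 'wal_fsync_duration_seconds|backend_commit_duration_seconds'
```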
I understand, but I am not sure it is possible in my situation. I will try.
To be honest, even with the highest values possible (which only exist to support high-latency setups, not really to lower the IO requirements), this will keep hurting you. As linked before (https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#disks):

and the example configurations: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#example-hardware-configurations
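For context, the "highest values possible" refers to etcd's election tuning knobs. A sketch of what that looks like (illustrative values only; with RKE these would go under services.etcd.extra_args in cluster.yml rather than on the command line):

```sh
# Heartbeat every 500 ms, elect a new leader after 5 s of silence (defaults
# are 100 ms / 1000 ms). This only tolerates latency; the WAL fsyncs still
# have to keep up, so a slow disk keeps hurting.
etcd --heartbeat-interval=500 --election-timeout=5000
```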
Ok, I get it. But I'd like to come back to kube-scheduler and kube-controller-manager: can very frequent leader changes also be caused by etcd? Thank you very much for the answers.

I have errors in the calico-kube-controllers pod:

I noticed that when these errors show up, my kube-scheduler and kube-controller-manager change leadership. I have no idea why this is happening. Generally the cluster works, but this is a concern. I would be very grateful for your help.
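For what it's worth, the current leader and the transition count can be inspected directly; on Kubernetes versions of this era the control-plane components record their election state in an annotation on endpoints objects in kube-system:

```sh
# Look for the control-plane.alpha.kubernetes.io/leader annotation; its JSON
# value includes holderIdentity and the leaderTransitions counter.
kubectl -n kube-system get endpoints kube-scheduler kube-controller-manager -o yaml
```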
Pretty sure this is also caused by etcd, both the leader changes (not being able to hold a lock) and the errors in calico-kube-controllers.
Hi there, |
RKE version: v1.1.3
Docker version: (`docker version`, `docker info` preferred) 19.03.11
Operating system and kernel: (`cat /etc/os-release`, `uname -r` preferred) CentOS Linux 7, 5.7.4-1.el7.elrepo.x86_64
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Bare-metal
cluster.yml file:
Steps to Reproduce:
Results:
I have noticed very frequent leader election changes for these components: kube-controller-manager, kube-scheduler, etcd.
It happens as often as every few minutes, sometimes every few hours. When a component loses leadership, it is then restarted. Is that normal?
In most cases this situation doesn't impact the cluster status, but sometimes it does, because switching to a new leader takes too much time (only sometimes). After 100 days I have over 1000 leaderTransitions...
I have noticed a repeated error inside the calico-kube-controllers pods.
Perhaps it has something to do with the problem:

What could be causing this?
Thanks in advance.