Troubleshooting common Rancher issues #80

Open
wmenjoy opened this issue Jan 11, 2021 · 6 comments

wmenjoy commented Jan 11, 2021

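etcd component

1. A strange Kubernetes rolling-update failure

After a redeploy, the deployment stays in the "deploying" state: the available count is 0 while 2 new pods have been created. The service itself deployed successfully and kubelet is healthy, but kube-controller-manager reports that the object is not the latest version.

Symptoms

a. kube-controller-manager log: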

I0111 17:41:09.923836       1 deployment_controller.go:484] Error syncing deployment footstone-common/bjca-deployer: Operation cannot be fulfilled on deployments.apps "bjca-deployer": the object has been modified; please apply your changes to the latest version and try again

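b. Status from describe pod:
MinAvailablePodn is false
c. Within the last day, a Kubernetes host that runs etcd had gone into a failed state.
Analysis: etcd on one host had just crashed and the host was re-registered through Rancher, so the etcd members may hold inconsistent data. After kube-controller-manager was stopped, the work automatically moved to another machine and the deployment status recovered.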
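
One way to test the inconsistency hypothesis is to compare the revision each etcd member reports. The following is a minimal sketch, not the procedure used in this thread: revisions_agree is a hypothetical helper, and the usage line assumes etcdctl v3 with its endpoints and TLS settings supplied through ETCDCTL_* environment variables, plus jq. On a busy cluster the revisions can differ briefly, so only a lasting gap is a meaningful signal.

```shell
#!/bin/sh
# Hypothetical helper: reads "endpoint revision" pairs on stdin and
# succeeds only when every member reports the same revision.
revisions_agree() {
    awk 'NF >= 2 { seen[$2] = 1 }
         END { n = 0; for (r in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Usage sketch (assumes ETCDCTL_* variables are set and jq is installed):
#   etcdctl endpoint status --cluster -w json \
#     | jq -r '.[] | "\(.Endpoint) \(.Status.header.revision)"' \
#     | revisions_agree || echo "members report different revisions"
```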

References

  1. Analysis of a three-year-old etcd3 data inconsistency bug - 腾讯云原生 - 博客园

wmenjoy commented Mar 15, 2021

etcd installation problems

  1. The following error is reported:
2021-03-15 07:25:06.006959 I | embed: rejected connection from "192.168.214.32:32642" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")

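Cause: different clients generate different configurations, so the CA certificate may no longer be valid. Delete the configuration under /etc/kubernetes, remove the corresponding Docker containers, and regenerate them.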
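
Before deleting anything, it can help to confirm that the node certificate really was signed by a different CA. This is a sketch under assumptions: cert_signed_by is a hypothetical helper, and the paths in the usage line follow the layout RKE commonly uses (/etc/kubernetes/ssl), which may differ on your hosts.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the certificate chains to the CA file.
# It matches openssl's "OK" line instead of relying on the exit status,
# which older openssl releases do not set reliably.
# $1: CA certificate (PEM); $2: certificate to check (PEM).
cert_signed_by() {
    ca="$1"
    cert="$2"
    openssl verify -CAfile "$ca" "$cert" 2>/dev/null | grep -q ': OK$'
}

# Usage (both paths are assumptions, adjust them to your node):
#   cert_signed_by /etc/kubernetes/ssl/kube-ca.pem /etc/kubernetes/ssl/kube-node.pem \
#     || echo "certificate was signed by a different CA"
```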


wmenjoy commented Mar 24, 2021


wmenjoy commented Jul 5, 2022

Cattle-System

  1. After removing cluster monitoring (Rancher) from a k3s cluster, the cattle-system namespace stays stuck in Terminating. Workaround:
# Namespaces are cluster-scoped, so no -n flag is needed when patching them.
kubectl patch namespace cattle-system -p '{"metadata":{"finalizers":[]}}' --type='merge'
kubectl delete namespace cattle-system --grace-period=0 --force

kubectl patch namespace cattle-global-data -p '{"metadata":{"finalizers":[]}}' --type='merge'
kubectl delete namespace cattle-global-data --grace-period=0 --force

kubectl patch namespace local -p '{"metadata":{"finalizers":[]}}' --type='merge'

# Clear the finalizers on every namespaced resource left in "local", then delete the namespace.
for resource in $(kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get -o name -n local); do kubectl patch "$resource" -p '{"metadata": {"finalizers": []}}' --type='merge' -n local; done

kubectl delete namespace local --grace-period=0 --force
  2. Delete the namespace directly through the API server
    1. Start a proxy
    # on any machine with kubectl access to the cluster
    kubectl proxy --port=8081

    2. Export the namespace as JSON
    ns=cattle-fleet-system
    kubectl get ns "$ns" -o json > tmp.json
    3. Edit the JSON
      Change spec to an empty object:
    "spec": {}
    4. Call the finalize endpoint
    ns=cattle-fleet-system
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json "http://127.0.0.1:8081/api/v1/namespaces/$ns/finalize"
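
The export, edit, and finalize steps can also be chained so that no proxy or hand-edited file is needed. A minimal sketch, assuming jq is installed and kubectl has cluster-admin access; clear_spec_finalizers and finalize_namespace are hypothetical helper names, and this version empties spec.finalizers rather than the whole spec.

```shell
#!/bin/sh
# Hypothetical helper: empty spec.finalizers in a namespace manifest read from stdin.
clear_spec_finalizers() {
    jq -c '.spec.finalizers = []'
}

# Hypothetical helper: send the edited manifest to the namespace's finalize subresource.
# $1: namespace name.
finalize_namespace() {
    ns="$1"
    kubectl get namespace "$ns" -o json \
        | clear_spec_finalizers \
        | kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f -
}

# Usage (cattle-fleet-system is the namespace used in the steps above):
#   finalize_namespace cattle-fleet-system
```

Emptying the finalizers this way skips whatever cleanup they were waiting for, so it is best kept for namespaces whose owning controller is already gone.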
    

References

  1. Workaround for the cattle-system namespace staying in Terminating after removing k3s cluster monitoring (Rancher), 龍尐的博客, CSDN博客
  2. Kubernetes cannot delete a namespace and reports Terminating, 吕楚王的博客, CSDN博客


wmenjoy commented Jul 8, 2022

Rancher cleanup

#!/bin/sh
# Remove all containers and volumes, then the Kubernetes and CNI state directories.
docker rm -f $(docker ps -qa)
docker volume rm $(docker volume ls -q)
cleanupdirs="/var/lib/etcd /etc/kubernetes /etc/cni /opt/cni /var/lib/cni /var/run/calico"
for dir in $cleanupdirs; do
  echo "Removing $dir"
  rm -rf "$dir"
done

Second method

# Unmount any kubelet mounts still listed by df
df -h|grep kubelet |awk -F % '{print $2}'|xargs umount
# Remove all containers
sudo docker rm -f $(sudo docker ps -qa)

# Remove the /var/etcd directory
sudo rm -rf /var/etcd

# Remove /var/lib/kubelet/, unmounting everything under it first
for m in $(sudo tac /proc/mounts | sudo awk '{print $2}'|sudo grep /var/lib/kubelet);do
sudo umount $m||true
done
sudo rm -rf /var/lib/kubelet/

# Remove /var/lib/rancher/, unmounting everything under it first
for m in $(sudo tac /proc/mounts | sudo awk '{print $2}'|sudo grep /var/lib/rancher);do
sudo umount $m||true
done
sudo rm -rf /var/lib/rancher/

# Remove the /run/kubernetes/ directory
sudo rm -rf /run/kubernetes/

# Remove all volumes
sudo docker volume rm $(sudo docker volume ls -q)

# List containers and volumes again to confirm nothing is left
sudo docker ps -a
sudo docker volume ls

rm -rf /var/lib/kubelet/*

rm -rf /etc/kubernetes/*

rm -rf /var/lib/rancher/*

rm -rf /var/lib/etcd/*

rm -rf /var/lib/cni/*

# Flush the filter and nat tables
iptables -F && iptables -t nat -F

# Remove the flannel VXLAN interface
ip link del flannel.1

# -q prints only IDs and names, so the header row is not passed to xargs
docker ps -aq|xargs -r docker rm -f

docker volume ls -q|xargs -r docker volume rm
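
The two unmount loops in the second method differ only in the directory, so they can be factored into one helper. A sketch, with mounts_under and unmount_and_remove as hypothetical names; unlike the grep in the loops, it matches only the directory itself and paths below it, and it reverses the mount table in awk instead of calling tac.

```shell
#!/bin/sh
# Hypothetical helper: print mount points at or below a directory,
# most recently mounted first, so nested mounts are released before their parents.
# $1: directory; $2: mount table to read (defaults to /proc/mounts).
mounts_under() {
    awk -v p="$1" '$2 == p || index($2, p "/") == 1 { m[++n] = $2 }
                   END { for (i = n; i >= 1; i--) print m[i] }' "${2:-/proc/mounts}"
}

# Hypothetical helper: unmount everything under a directory, then delete it.
unmount_and_remove() {
    for m in $(mounts_under "$1"); do
        sudo umount "$m" || true
    done
    sudo rm -rf "$1"
}

# Usage:
#   unmount_and_remove /var/lib/kubelet
#   unmount_and_remove /var/lib/rancher
```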


wmenjoy commented Feb 20, 2024

Conflicts

1. IPv6 not supported

canal fails to start

rancher Streaming server stopped unexpectedly: listen tcp [::1]:0: bind: cannot assign requested address

It turned out that /etc/hosts.conf had the localhost entry set to #::1. Older versions handle IPv6 poorly; removing it fixed the problem.
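
To see which addresses localhost currently maps to, the active entries can be listed. A sketch that assumes the file in question is the standard /etc/hosts; localhost_entries is a hypothetical helper, and the extra commands in the comments only show whether ::1 is usable on the loopback interface.

```shell
#!/bin/sh
# Hypothetical helper: print the address of every active (uncommented)
# hosts entry that names localhost.
# $1: hosts file (defaults to /etc/hosts).
localhost_entries() {
    awk '!/^[[:space:]]*#/ {
             for (i = 2; i <= NF; i++) {
                 if ($i ~ /^#/) break                      # rest of the line is a comment
                 if ($i == "localhost") { print $1; break }
             }
         }' "${1:-/etc/hosts}"
}

# Usage:
#   localhost_entries                          # addresses that localhost resolves to
#   ip -6 addr show dev lo                     # is ::1 assigned to the loopback interface?
#   sysctl net.ipv6.conf.lo.disable_ipv6       # 1 means IPv6 is disabled on lo
```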

2. Network conflict: Calico node '192.168.126.16' is already using the IPv4 address 172.18.0.1.

Error from server (BadRequest): a container name must be specified for pod canal-4dj9f, choose one of: [install-cni flexvol-driver calico-node kube-flannel]
[rke@fs01-192-168-131-240 ~]$ kubectl -n kube-system logs canal-4dj9f calico-node
2024-02-20 09:18:56.465 [INFO][9] startup/startup.go 379: Early log level set to info
2024-02-20 09:18:56.465 [INFO][9] startup/startup.go 395: Using NODENAME environment for node name
2024-02-20 09:18:56.466 [INFO][9] startup/startup.go 407: Determined node name: 192.168.126.5
2024-02-20 09:18:56.467 [INFO][9] startup/startup.go 439: Checking datastore connection
2024-02-20 09:18:56.483 [INFO][9] startup/startup.go 463: Datastore connection verified
2024-02-20 09:18:56.484 [INFO][9] startup/startup.go 112: Datastore is ready
2024-02-20 09:18:56.510 [INFO][9] startup/startup.go 759: Using autodetected IPv4 address on interface br-daa07946aef5: 172.18.0.1/16
2024-02-20 09:18:56.510 [INFO][9] startup/startup.go 576: Node IPv4 changed, will check for conflicts
2024-02-20 09:18:56.518 [WARNING][9] startup/startup.go 1119: Calico node '192.168.126.16' is already using the IPv4 address 172.18.0.1.
2024-02-20 09:18:56.518 [WARNING][9] startup/startup.go 1331: Terminating
Calico node failed to start

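Do not run other services directly on the Kubernetes cluster nodes. In the log above, Calico autodetected its address from the bridge interface br-daa07946aef5 (172.18.0.1/16), the kind of interface Docker creates for user-defined networks, and the same address was already claimed by another node.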
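
To find which interface Calico picked on a node, the startup log can be filtered for the autodetection line. A small sketch; autodetected_interface is a hypothetical helper, and the pod name in the usage line is the one from the log above.

```shell
#!/bin/sh
# Hypothetical helper: extract "interface address" from calico-node
# startup log lines read on stdin.
autodetected_interface() {
    sed -n 's|.*Using autodetected IPv4 address on interface \([^:]*\): \([0-9./]*\).*|\1 \2|p'
}

# Usage:
#   kubectl -n kube-system logs canal-4dj9f -c calico-node | autodetected_interface
#   docker network ls --filter driver=bridge   # user-defined bridges appear on the host as br-<id>
```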

3. inotify_add_watch -- failed: "No space left on device"

The inotify watch limit was exceeded.
Reference: https://askubuntu.com/questions/1088272/inotify-add-watch-failed-no-space-left-on-device
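
The current limit can be inspected before raising it. A sketch under assumptions: inotify_limit and limit_below are hypothetical helpers, and 524288 is only an example value that does not come from this thread.

```shell
#!/bin/sh
# Hypothetical helper: read one inotify limit, for example max_user_watches.
# $1: limit name; $2: directory holding the limit files (defaults to /proc/sys/fs/inotify).
inotify_limit() {
    cat "${2:-/proc/sys/fs/inotify}/$1"
}

# Hypothetical helper: succeeds when the current limit is below the wanted value.
# $1: limit name; $2: wanted value; $3: optional directory override.
limit_below() {
    [ "$(inotify_limit "$1" "$3")" -lt "$2" ]
}

# Usage (524288 is an example value, an assumption):
#   limit_below max_user_watches 524288 && sudo sysctl -w fs.inotify.max_user_watches=524288
#   echo 'fs.inotify.max_user_watches=524288' | sudo tee /etc/sysctl.d/99-inotify.conf   # persist across reboots
```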


wmenjoy commented Feb 23, 2024

Rancher issues

1. Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system
The network plugin failed to deploy.
Fix: flush iptables with iptables -F && iptables -F -t nat
2. x509: cannot validate certificate for x because it doesn't contain any IP SANs seen when using custom certificates
Fix: restart Docker.

References

  1. Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system  rancher/rke#2730
  2. x509: cannot validate certificate for x because it doesn't contain any IP SANs seen when using custom certificates rancher/rke#2216
  3. metrics-server error because it doesn't contain any IP SANs kubernetes-sigs/metrics-server#196
