Network between hosts is broken after force upgrade of ipsec in k8s setup - k8s cannot become healthy #9508
From rancher-kubernetes-auth to rancher-kubernetes-agent:
From rancher-kubernetes-auth to rancher-ingress-controller:
Both checks succeed, which shows networking is fine.
@joshwget We have to fix the template so the k8s api and auth services land on the same host. That will make communication between them less prone to ipsec failures: if communication between them is broken, no one, including all k8s system services, will be able to authorize. We should also enable a healthcheck on the auth service. @leodotcloud @galal-hussein we still have to debug why the 10.42.x.x network between hosts is unavailable for a long period of time after the upgrade. We can follow up on that either in this ticket or open a new one, up to you.
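One way to make two services land on the same host in Rancher 1.x is a scheduling affinity label in the compose file. This is an illustrative sketch only, not the actual template fix: the service name and the target label value here are hypothetical.

```yaml
# Hypothetical rancher-compose fragment: schedule the auth service onto
# whichever host already runs a container of the target service, using
# Rancher 1.x scheduling affinity labels.
kubernetes-auth:
  labels:
    io.rancher.scheduler.affinity:container_label: io.rancher.stack_service.name=kubernetes/kubernetes
```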
Let's keep this issue open, as I've re-titled it to reflect the actual issue.
As part of the fix for #8684 (related PRs: rancher/cattle#2760, rancher/cattle#2766), the metadata started including information about stopped containers. When the arpsync goroutine kicks in, the MAC address for the ARP entry can flap between the two MAC addresses mentioned above, causing connectivity issues.
The fix would be to consider only running/starting containers in the arpsync goroutine.
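A minimal sketch of the proposed fix (not the actual rancher-net code; the function and field names here are assumptions for illustration): when rebuilding the IP-to-MAC view from metadata, skip any container that is not running or starting, so a stopped container's stale MAC can no longer flap with the live one for the same IP.

```python
# Sketch only: filter containers by state before deriving ARP entries.
# A stopped container and its running replacement may report the same IP
# with different MACs; considering only running/starting containers
# keeps the ARP table stable.
VALID_STATES = {"running", "starting"}

def build_arp_table(containers):
    """containers: list of dicts with 'ip', 'mac', and 'state' keys.
    Returns an IP -> MAC map built only from running/starting containers."""
    table = {}
    for c in containers:
        if c.get("state") in VALID_STATES:
            table[c["ip"]] = c["mac"]
    return table
```

With this filter, the order in which metadata lists the stopped and running containers no longer matters: the stopped entry is ignored entirely.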
This problem is very easy to trigger: just upgrade a k8s environment and don't click "Finish Upgrade". These are the logs for an etcd container:
Fixed in
Rancher server: v1.6.6-rc6. I ran stress testing with 10 loops on Kubernetes, which includes force upgrading ipsec in each loop iteration and checking whether Kubernetes is healthy. The test succeeded and I don't see the issue anymore.
Rancher versions:
rancher/server: v1.6.6-rc5
kubernetes (if applicable): rancher/k8s:v1.7.2-rancher5
Docker version: (docker version, docker info preferred) 1.12
Operating system and kernel: (cat /etc/os-release, uname -r preferred) Ubuntu 16.04
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) GCE
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB) Single Node Rancher
Environment Template: (Cattle/Kubernetes/Swarm/Mesos) Kubernetes
Steps to Reproduce:
Results:
Kubernetes stack is in an unhealthy state; the kube api shows the following error:
The Kubernetes agent and ingress controller keep restarting due to the failures of the Kubernetes api.