Network between hosts is broken after force upgrade of ipsec in k8s setup - k8s cannot become healthy #9508

galal-hussein · 2017-07-28T16:45:22Z

Rancher versions:
rancher/server: v1.6.6-rc5
kubernetes (if applicable): rancher/k8s:v1.7.2-rancher5

Docker version: (docker version,docker info preferred)
1.12
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Ubuntu 16.04
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
GCE
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single Node Rancher
Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
Kubernetes
Steps to Reproduce:

Install Kubernetes 1.7
Force upgrade the ipsec stack

Results:

The Kubernetes stack is in an unhealthy state, and the kube api container logs the following error:

docker logs f71c08caf66c 2>&1 | tail
E0728 13:22:19.820247       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:21.306333       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:21.715033       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:21.923309       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:25.328453       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:25.716978       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:25.923947       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:26.304622       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:26.768880       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:26.867564       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]

The Kubernetes agent and ingress controller keep restarting because of the Kubernetes API failures.


leodotcloud · 2017-07-28T18:03:15Z

From rancher-kubernetes-auth to rancher-kubernetes-agent:

root@b61d3ee80f59:/# curl rancher-kubernetes-agent:10240/healthcheck
ok
root@b61d3ee80f59:/#

From rancher-kubernetes-auth to rancher-ingress-controller:

root@b61d3ee80f59:/# curl rancher-ingress-controller:10241/healthz
OK
root@b61d3ee80f59:/#

This shows that networking from the rancher-kubernetes-auth container to these two services is fine.

alena1108 · 2017-07-28T20:34:29Z

@joshwget We have to fix the template so that the k8s api and auth services land on the same host. That will make communication between them less prone to ipsec failures. If communication between them is broken, no one, including all k8s system services, will be able to authorize. We should also enable a healthcheck on the auth service.

@leodotcloud @galal-hussein We still have to debug why the 10.42.x.x network between hosts is unavailable for a long period after the upgrade. We can follow up on that either in this ticket or in a new one, up to you.

deniseschannon · 2017-07-28T21:59:02Z

Let's keep this issue open as I've re-titled it to the actual issue.

leodotcloud · 2017-07-29T22:09:36Z

As part of the fix for #8684 (related PRs: rancher/cattle#2760, rancher/cattle#2766), the metadata started including information about stopped containers. For services started with retainIp: True, the stopped and running containers have the same IP address but different MAC addresses in metadata. As a result, the ARP table map, which is keyed by IP address, can end up pointing to either of these MAC addresses, depending on the random order of the entries in metadata.

When the arpsync goroutine kicks in, the MAC address for the ARP entry can flap between these two MAC addresses, causing connectivity issues.

This can be seen in the network-manager logs:

time="2017-07-29T00:42:11Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:42:23Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:42:28Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:43:02Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:43:07Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:43:12Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:43:19Z" level=info msg="Setting up resolv.conf for ContainerId [70c1128a44064837e8816c5b6a414b4889f39e3fbb36e563d5539beaf0c60c61]"
time="2017-07-29T00:43:19Z" level=info msg="CNI up" cid=70c1128a44064837e8816c5b6a414b4889f39e3fbb36e563d5539beaf0c60c61 networkMode=ipsec
time="2017-07-29T00:43:19Z" level=info msg="CNI up done" cid=70c1128a44064837e8816c5b6a414b4889f39e3fbb36e563d5539beaf0c60c61 networkMode=ipsec result=IP4:{IP:{IP:10.42.143.63 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:169.254.169.250 Mask:ffffffff} GW:<nil>} {Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]}
time="2017-07-29T00:43:38Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:43:43Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:8 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:43:48Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:43:50Z" level=info msg="CNI down" cid=2a33272b1004dd63c6f32469d10aaa6fb839b8dfa0574c642b44f1d574208e61 networkMode=ipsec
time="2017-07-29T00:43:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:44:08Z" level=info msg="Setting up resolv.conf for ContainerId [c6322bd9e718eb09d0d506ea29330e141764ca136d1e72556c5c4a37111d6dfb]"
time="2017-07-29T00:44:08Z" level=info msg="CNI up" cid=c6322bd9e718eb09d0d506ea29330e141764ca136d1e72556c5c4a37111d6dfb networkMode=ipsec
time="2017-07-29T00:44:08Z" level=info msg="CNI up done" cid=c6322bd9e718eb09d0d506ea29330e141764ca136d1e72556c5c4a37111d6dfb networkMode=ipsec result=IP4:{IP:{IP:10.42.244.63 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:169.254.169.250 Mask:ffffffff} GW:<nil>} {Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]}
time="2017-07-29T00:44:08Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:44:39Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:44:40Z" level=info msg="Setting up resolv.conf for ContainerId [940c8fb0bdd82f43a9fce89ef2f68e55bbc56d32c634480640b3b1d0ec1c3a53]"
time="2017-07-29T00:44:40Z" level=info msg="CNI up" cid=940c8fb0bdd82f43a9fce89ef2f68e55bbc56d32c634480640b3b1d0ec1c3a53 networkMode=ipsec
time="2017-07-29T00:44:40Z" level=info msg="CNI up done" cid=940c8fb0bdd82f43a9fce89ef2f68e55bbc56d32c634480640b3b1d0ec1c3a53 networkMode=ipsec result=IP4:{IP:{IP:10.42.125.229 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:169.254.169.250 Mask:ffffffff} GW:<nil>} {Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]}
time="2017-07-29T00:44:49Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:45:52Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:45:57Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:46:47Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:46:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:47:13Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:48:03Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:48:08Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:48:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:49:11Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:49:22Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:50:08Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:50:28Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:53:38Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:54:35Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:54:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:55:46Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:57:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:57:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:58:43Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:59:30Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:59:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:00:10Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:00:23Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:00:39Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:00:59Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:01:38Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:03:48Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:06:55Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:07:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:8 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:08:10Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:08:15Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:08:43Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:09:27Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:09:32Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:09:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:09:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:10:13Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:11:38Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:11:42Z" level=info msg="Setting up resolv.conf for ContainerId [c872bfe3f43f5005ca3dca2b3545433e30135137a144841e613c0af01c131304]"
time="2017-07-29T01:12:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:13:00Z" level=info msg="Setting up resolv.conf for ContainerId [c9deceb1db79c15b660af92446dcd947614b3262480d7d054120b800e0cae131]"
time="2017-07-29T01:15:57Z" level=info msg="CNI down" cid=70c1128a44064837e8816c5b6a414b4889f39e3fbb36e563d5539beaf0c60c61 networkMode=ipsec
time="2017-07-29T01:16:00Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:8 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"

The fix would be to consider only running/starting containers in the arpsync goroutine.
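A minimal sketch of that idea, not the actual network-manager code: the `Container` type, its fields, and `buildARPMap` are hypothetical names made up for illustration, and which of the two MAC addresses belongs to the running container is assumed here, since the logs above don't say. The point is only that filtering on state before keying by IP leaves a single candidate MAC per IP.

```go
package main

import "fmt"

// Container is a hypothetical, simplified view of one metadata entry.
// The real metadata structure is not shown in this issue.
type Container struct {
	Name  string
	IP    string
	MAC   string
	State string
}

// buildARPMap returns the expected IP -> MAC mapping. Containers that are
// not running or starting are skipped, so a stopped container that retains
// the same IP can no longer overwrite the entry of the live one.
func buildARPMap(containers []Container) map[string]string {
	expected := make(map[string]string)
	for _, c := range containers {
		if c.State != "running" && c.State != "starting" {
			continue
		}
		expected[c.IP] = c.MAC
	}
	return expected
}

func main() {
	// Assumption: the assignment of MACs to the stopped/running
	// containers is arbitrary and chosen only for this example.
	containers := []Container{
		{Name: "auth-old", IP: "10.42.143.63", MAC: "02:a4:55:b4:89:fb", State: "stopped"},
		{Name: "auth-new", IP: "10.42.143.63", MAC: "02:a4:55:49:5d:d4", State: "running"},
	}
	fmt.Println(buildARPMap(containers)["10.42.143.63"])
}
```

Without the state check, the entry for 10.42.143.63 would depend on the order in which the two containers are iterated, which matches the flapping seen in the logs.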

leodotcloud · 2017-07-29T22:30:57Z

This problem is very easy to trigger. Just upgrade a k8s environment and don't click "Finish Upgrade".

These are the logs for an etcd container:

time="2017-07-29T22:09:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
time="2017-07-29T22:09:34Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"
time="2017-07-29T22:10:01Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
time="2017-07-29T22:10:24Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"
time="2017-07-29T22:10:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
time="2017-07-29T22:11:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"
time="2017-07-29T22:11:46Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
time="2017-07-29T22:12:11Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"
time="2017-07-29T22:12:27Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
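To check whether a host is hitting this, the arpsync messages can be scanned for IPs that show up with more than one MAC address. The following is a rough diagnostic sketch, not part of network-manager; `macsPerIP` and `flappingIPs` are hypothetical helper names, and the regular expression assumes the message format seen in the logs above.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// arpFound extracts the IP and the MAC that arpsync found in the ARP table.
// The pattern is an assumption based on the log format in this issue.
var arpFound = regexp.MustCompile(`IP:([0-9.]+) HardwareAddr:([0-9a-f:]{17})\}`)

// macsPerIP collects the distinct MAC addresses reported for each IP.
func macsPerIP(lines []string) map[string]map[string]bool {
	seen := make(map[string]map[string]bool)
	for _, line := range lines {
		m := arpFound.FindStringSubmatch(line)
		if m == nil {
			continue // not an arpsync "wrong ARP entry" line
		}
		ip, mac := m[1], m[2]
		if seen[ip] == nil {
			seen[ip] = make(map[string]bool)
		}
		seen[ip][mac] = true
	}
	return seen
}

// flappingIPs returns, in sorted order, the IPs seen with more than one MAC.
func flappingIPs(lines []string) []string {
	var out []string
	for ip, macs := range macsPerIP(lines) {
		if len(macs) > 1 {
			out = append(out, ip)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Two messages taken from the etcd container logs above.
	lines := []string{
		`arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it`,
		`arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it`,
	}
	fmt.Println(flappingIPs(lines))
}
```

A single "wrong ARP entry" message on its own is not conclusive; it is the same IP alternating between two MAC addresses that points at this bug.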

leodotcloud · 2017-07-30T04:46:42Z

Fixed in network-services:v0.2.4 and rancher/network-manager:v0.7.6

galal-hussein · 2017-07-30T18:43:00Z

Rancher server: v1.6.6-rc6
Network-manager: v0.7.6
Kubernetes: v1.7

I ran stress testing with 10 loops on Kubernetes, where each loop iteration force upgrades ipsec and then checks whether Kubernetes is healthy. The test succeeded and I don't see the issue anymore.

galal-hussein added area/kubernetes kind/bug labels Jul 28, 2017

galal-hussein added this to the July 2017 milestone Jul 28, 2017

galal-hussein assigned alena1108 and leodotcloud Jul 28, 2017

leodotcloud mentioned this issue Jul 28, 2017

[Token has been invalidated, an error on the server ("illegal base64 data at input byte 36") has prevented the request from succeeding] seen in Kubernetes container after K8s upgrade #9509

Closed

alena1108 assigned joshwget Jul 28, 2017

deniseschannon changed the title ~~Kubernetes stack won't start correctly after force upgrade ipsec stack~~ Network between hosts is broken after force upgrade of ipsec in k8s setup - k8s cannot become healthy Jul 28, 2017

leodotcloud mentioned this issue Jul 29, 2017

Not able to launch Dashboard UI after force upgrading ipsec. #9511

Closed

This was referenced Jul 29, 2017

considering only running/starting containers for arpsync rancher/plugin-manager#80

Merged

Updating network-manager to v0.7.6 for arpsync fixes rancher/rancher-catalog#792

Merged

leodotcloud added status/resolved labels Jul 30, 2017

leodotcloud assigned sangeethah and unassigned alena1108 and joshwget Jul 30, 2017

sangeethah assigned galal-hussein and unassigned sangeethah Jul 30, 2017

galal-hussein closed this as completed Jul 30, 2017

leodotcloud mentioned this issue Aug 3, 2017

investigate the impact on "stopping" containers #9561

Closed
