Network between hosts is broken after force upgrade of ipsec in k8s setup - k8s cannot become healthy #9508

Closed
galal-hussein opened this issue Jul 28, 2017 · 7 comments
Labels
area/kubernetes kind/bug Issues that are defects reported by users or that we know have reached a real release

@galal-hussein
Contributor

Rancher versions:
rancher/server: v1.6.6-rc5
kubernetes (if applicable): rancher/k8s:v1.7.2-rancher5

Docker version: (docker version,docker info preferred)
1.12
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Ubuntu 16.04
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
GCE
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single Node Rancher
Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
Kubernetes
Steps to Reproduce:

  • Install Kubernetes 1.7
  • Force upgrade the ipsec stack

Results:

The Kubernetes stack is in an unhealthy state, and the Kubernetes API logs show the following error:

docker logs f71c08caf66c 2>&1 | tail
E0728 13:22:19.820247       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:21.306333       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:21.715033       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:21.923309       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:25.328453       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:25.716978       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:25.923947       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:26.304622       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:26.768880       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]
E0728 13:22:26.867564       1 authentication.go:58] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, [invalid bearer token, Post http://rancher-kubernetes-auth/: dial tcp 10.42.71.192:80: i/o timeout]]]

The Kubernetes agent and ingress controller keep restarting because of the Kubernetes API failures.
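The `dial tcp 10.42.71.192:80: i/o timeout` part of the error means the TCP connection to the auth service never completed. A minimal Python sketch for checking that kind of reachability from another container follows; the function name `tcp_reachable` and the 3-second default timeout are assumptions for illustration, not part of Rancher.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, using the address from the log above:
# tcp_reachable("10.42.71.192", 80)
```

A `False` result only says the connection could not be established in time; it does not distinguish an ipsec problem from a stopped container or a wrong ARP entry.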

@galal-hussein galal-hussein added area/kubernetes kind/bug Issues that are defects reported by users or that we know have reached a real release labels Jul 28, 2017
@galal-hussein galal-hussein added this to the July 2017 milestone Jul 28, 2017
@leodotcloud
Collaborator

From rancher-kubernetes-auth to rancher-kubernetes-agent:

root@b61d3ee80f59:/# curl rancher-kubernetes-agent:10240/healthcheck
ok
root@b61d3ee80f59:/#

From rancher-kubernetes-auth to rancher-ingress-controller:

root@b61d3ee80f59:/# curl rancher-ingress-controller:10241/healthz
OK
root@b61d3ee80f59:/#

This shows that networking is fine.

@alena1108

alena1108 commented Jul 28, 2017

@joshwget We have to fix the template so that the k8s api and auth services land on the same host. That will make communication between them less vulnerable to ipsec failures. If communication between them is broken, no one, including all k8s system services, will be able to authorize. We should also enable a healthcheck on the auth service.

@leodotcloud @galal-hussein we still have to debug why the 10.42.x.x network between hosts is unavailable for a long period of time after the upgrade. We can follow up on that either in this ticket or in a new one, up to you.

@deniseschannon deniseschannon changed the title Kubernetes stack won't start correctly after force upgrade ipsec stack Network between hosts is broken after force upgrade of ipsec in k8s setup - k8s cannot become healthy Jul 28, 2017
@deniseschannon

Let's keep this issue open as I've re-titled it to the actual issue.

@leodotcloud
Collaborator

As part of the fix for #8684 (related PRs: rancher/cattle#2760, rancher/cattle#2766), the metadata started including information about stopped containers. For services started with retainIp: True, the stopped and running containers have the same IP address but different MAC addresses in metadata. Because the ARP table map is keyed by IP address, an entry can end up pointing to either of these MAC addresses, depending on the random order of the containers in metadata.

When the arpsync goroutine kicks in, the MAC address for the ARP entry can flap between the two MAC addresses mentioned above, causing connectivity issues.
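To make the collision concrete, here is a minimal Python sketch of the idea. It is not the network-manager code, which this thread does not show. The metadata field names, the container names, and the choice of which MAC belongs to the running container are assumptions for illustration; only the IP address and the two MAC addresses come from the network-manager logs in this thread.

```python
from itertools import permutations

# Hypothetical metadata entries: a stopped and a running container that share
# one IP (retainIp: True) but have different MAC addresses.
containers = [
    {"name": "auth-old", "state": "stopped", "ip": "10.42.143.63", "mac": "02:a4:55:b4:89:fb"},
    {"name": "auth-new", "state": "running", "ip": "10.42.143.63", "mac": "02:a4:55:49:5d:d4"},
]


def build_arp_map(entries):
    """Build an IP -> MAC map; a later entry silently overwrites an earlier one."""
    arp = {}
    for c in entries:
        arp[c["ip"]] = c["mac"]
    return arp


def possible_macs(entries, ip):
    """Collect every MAC the map can hold for `ip` across all metadata orderings."""
    return {build_arp_map(order)[ip] for order in permutations(entries)}
```

With these two entries, `possible_macs(containers, "10.42.143.63")` contains both MAC addresses, so the "expected" value depends only on iteration order.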

This can be seen in the network-manager logs:

time="2017-07-29T00:42:11Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:42:23Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:42:28Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:43:02Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:43:07Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:43:12Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:43:19Z" level=info msg="Setting up resolv.conf for ContainerId [70c1128a44064837e8816c5b6a414b4889f39e3fbb36e563d5539beaf0c60c61]"
time="2017-07-29T00:43:19Z" level=info msg="CNI up" cid=70c1128a44064837e8816c5b6a414b4889f39e3fbb36e563d5539beaf0c60c61 networkMode=ipsec
time="2017-07-29T00:43:19Z" level=info msg="CNI up done" cid=70c1128a44064837e8816c5b6a414b4889f39e3fbb36e563d5539beaf0c60c61 networkMode=ipsec result=IP4:{IP:{IP:10.42.143.63 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:169.254.169.250 Mask:ffffffff} GW:<nil>} {Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]}
time="2017-07-29T00:43:38Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:43:43Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:8 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:43:48Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:43:50Z" level=info msg="CNI down" cid=2a33272b1004dd63c6f32469d10aaa6fb839b8dfa0574c642b44f1d574208e61 networkMode=ipsec
time="2017-07-29T00:43:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:44:08Z" level=info msg="Setting up resolv.conf for ContainerId [c6322bd9e718eb09d0d506ea29330e141764ca136d1e72556c5c4a37111d6dfb]"
time="2017-07-29T00:44:08Z" level=info msg="CNI up" cid=c6322bd9e718eb09d0d506ea29330e141764ca136d1e72556c5c4a37111d6dfb networkMode=ipsec
time="2017-07-29T00:44:08Z" level=info msg="CNI up done" cid=c6322bd9e718eb09d0d506ea29330e141764ca136d1e72556c5c4a37111d6dfb networkMode=ipsec result=IP4:{IP:{IP:10.42.244.63 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:169.254.169.250 Mask:ffffffff} GW:<nil>} {Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]}
time="2017-07-29T00:44:08Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:44:39Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:44:40Z" level=info msg="Setting up resolv.conf for ContainerId [940c8fb0bdd82f43a9fce89ef2f68e55bbc56d32c634480640b3b1d0ec1c3a53]"
time="2017-07-29T00:44:40Z" level=info msg="CNI up" cid=940c8fb0bdd82f43a9fce89ef2f68e55bbc56d32c634480640b3b1d0ec1c3a53 networkMode=ipsec
time="2017-07-29T00:44:40Z" level=info msg="CNI up done" cid=940c8fb0bdd82f43a9fce89ef2f68e55bbc56d32c634480640b3b1d0ec1c3a53 networkMode=ipsec result=IP4:{IP:{IP:10.42.125.229 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:169.254.169.250 Mask:ffffffff} GW:<nil>} {Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]}
time="2017-07-29T00:44:49Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:45:52Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:45:57Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:46:47Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:46:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:47:13Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:48:03Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:48:08Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:48:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:49:11Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:49:22Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:50:08Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:50:28Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:53:38Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:54:35Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:54:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:55:46Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:57:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:57:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:58:43Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T00:59:30Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T00:59:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:00:10Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:00:23Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:00:39Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:00:59Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:01:38Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:03:48Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:06:55Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:07:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:8 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:08:10Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:08:15Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:08:43Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:09:27Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:09:32Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:09:53Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:09:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:10:13Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:11:38Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:b4:89:fb}(expected: 02:a4:55:49:5d:d4) for local container, fixing it"
time="2017-07-29T01:11:42Z" level=info msg="Setting up resolv.conf for ContainerId [c872bfe3f43f5005ca3dca2b3545433e30135137a144841e613c0af01c131304]"
time="2017-07-29T01:12:58Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"
time="2017-07-29T01:13:00Z" level=info msg="Setting up resolv.conf for ContainerId [c9deceb1db79c15b660af92446dcd947614b3262480d7d054120b800e0cae131]"
time="2017-07-29T01:15:57Z" level=info msg="CNI down" cid=70c1128a44064837e8816c5b6a414b4889f39e3fbb36e563d5539beaf0c60c61 networkMode=ipsec
time="2017-07-29T01:16:00Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:8 Type:1 Flags:0 IP:10.42.143.63 HardwareAddr:02:a4:55:49:5d:d4}(expected: 02:a4:55:b4:89:fb) for local container, fixing it"

The fix would be to consider only running/starting containers in the arpsync goroutine.
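A rough Python sketch of that filter, under the same assumptions as before: the field names, container names, and the state strings `running`/`starting`/`stopped` are illustrative, and the actual fix shipped in network-manager may look different.

```python
from itertools import permutations

# Container states that should contribute to the ARP map (assumed names).
ACTIVE_STATES = {"running", "starting"}

# Hypothetical metadata entries sharing one IP, as with retainIp: True.
containers = [
    {"name": "auth-old", "state": "stopped", "ip": "10.42.143.63", "mac": "02:a4:55:b4:89:fb"},
    {"name": "auth-new", "state": "running", "ip": "10.42.143.63", "mac": "02:a4:55:49:5d:d4"},
]


def build_arp_map_active_only(entries):
    """Build an IP -> MAC map, skipping containers that are not running or starting."""
    arp = {}
    for c in entries:
        if c["state"] not in ACTIVE_STATES:
            continue  # stopped containers must not claim the shared IP
        arp[c["ip"]] = c["mac"]
    return arp


# The result no longer depends on the order of entries in metadata.
results = {
    build_arp_map_active_only(order)["10.42.143.63"]
    for order in permutations(containers)
}
```

Here `results` holds a single MAC address, the one of the running container, regardless of ordering.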

@leodotcloud
Collaborator

This problem is very easy to trigger. Just upgrade a k8s environment and don't click "Finish Upgrade".

These are the logs for an etcd container:

7/29/2017 3:09:29 PM time="2017-07-29T22:09:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
7/29/2017 3:09:34 PM time="2017-07-29T22:09:34Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"
7/29/2017 3:10:01 PM time="2017-07-29T22:10:01Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
7/29/2017 3:10:24 PM time="2017-07-29T22:10:24Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"
7/29/2017 3:10:29 PM time="2017-07-29T22:10:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
7/29/2017 3:11:29 PM time="2017-07-29T22:11:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"
7/29/2017 3:11:46 PM time="2017-07-29T22:11:46Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
7/29/2017 3:12:11 PM time="2017-07-29T22:12:11Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"
7/29/2017 3:12:27 PM time="2017-07-29T22:12:27Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"
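For anyone checking their own network-manager logs for the same symptom, a small Python sketch like the one below can flag IPs whose "expected" MAC keeps changing. The regular expression is written against the log format shown in this thread and is an assumption about its stability; the helper name `flapping_ips` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the arpsync lines shown in this thread and captures the IP,
# the MAC that was found, and the MAC that arpsync expected.
ARPSYNC_RE = re.compile(
    r"arpsync: .*?IP:(?P<ip>[0-9.]+) HardwareAddr:(?P<found>[0-9a-f:]{17})\}"
    r"\(expected: (?P<expected>[0-9a-f:]{17})\)"
)


def flapping_ips(lines):
    """Return {ip: sorted expected MACs} for IPs with more than one expected MAC."""
    expected_by_ip = defaultdict(set)
    for line in lines:
        m = ARPSYNC_RE.search(line)
        if m:
            expected_by_ip[m.group("ip")].add(m.group("expected"))
    return {ip: sorted(macs) for ip, macs in expected_by_ip.items() if len(macs) > 1}


# Two lines copied from the etcd example above.
sample = [
    'time="2017-07-29T22:09:29Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:4 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:21:2e:9a}(expected: 02:4d:57:53:c7:73) for local container, fixing it"',
    'time="2017-07-29T22:09:34Z" level=info msg="arpsync: (host) wrong ARP entry found={LinkIndex:3 Family:2 State:2 Type:1 Flags:0 IP:10.42.62.48 HardwareAddr:02:4d:57:53:c7:73}(expected: 02:4d:57:21:2e:9a) for local container, fixing it"',
]
```

An IP that shows up in the result has had at least two different expected MAC addresses, which is the flapping pattern described in this issue.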

@leodotcloud
Collaborator

Fixed in network-services:v0.2.4 and rancher/network-manager:v0.7.6

@galal-hussein
Contributor Author

Rancher server: v1.6.6-rc6
Network-manager: v0.7.6
Kubernetes: v1.7

I ran a stress test with 10 loops on Kubernetes; each loop iteration force upgrades ipsec and then checks whether Kubernetes is healthy. The test succeeded, and I don't see the issue anymore.
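The shape of such a verification loop can be sketched in Python as follows. This is not the actual test that was run: `force_upgrade_ipsec` and `k8s_is_healthy` are hypothetical callables that the caller would have to implement against their own Rancher setup.

```python
def run_stress_test(loops, force_upgrade_ipsec, k8s_is_healthy):
    """Run `loops` upgrade/check cycles and return the 1-based iterations that failed."""
    failures = []
    for i in range(1, loops + 1):
        force_upgrade_ipsec()        # hypothetical hook: trigger the forced ipsec upgrade
        if not k8s_is_healthy():     # hypothetical hook: report Kubernetes stack health
            failures.append(i)
    return failures
```

An empty list from `run_stress_test(10, ...)` corresponds to the "test succeeded" outcome reported here.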

6 participants