
Flannel fails after upgrade because of lock: iptables: Resource temporarily unavailable. #18637

Closed
dhrp opened this issue Mar 6, 2019 · 6 comments
Labels: area/networking, area/rke, kind/bug

dhrp commented Mar 6, 2019

What kind of request is this (question/bug/enhancement/feature request):
Bug

Steps to reproduce (fewest steps possible):
Setup where this occurred:

  • Start with Rancher 2.1.1
  • Create a Kubernetes 1.10 cluster with the Flannel CNI
  • Use Rancher to edit the cluster and upgrade Kubernetes to 1.11.3
  • Notice that Rancher does this nicely and without any downtime of the hosts (I rebooted one of them)
  • Now (possibly after some time), notice that networking between (some) pods has stopped working

Result:
Networking between (some) pods stops working. In my case specifically, I could ping some containers from one of the host OSes, but not others.

When checking the logs of kube-flannel, I saw several lines like:

I0227 14:22:31.917220 1 vxlan_network.go:60] watching for new subnet leases
E0304 05:39:31.131114 1 iptables.go:97] Failed to ensure iptables rules: Error checking rule existence: failed to check rule existence: running [/sbin/iptables -t nat -C POSTROUTING ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE --wait]: exit status 4: iptables: Resource temporarily unavailable.
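To see whether a node is currently affected, the check from the log above can be re-run by hand. This is only a sketch; the 10.42.0.0/16 subnet is just the value from my log, so adjust it to your cluster:

  # Re-run the rule check that flannel performs, waiting for the xtables lock (-w)
  sudo iptables -w -t nat -C POSTROUTING ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE
  echo "exit status: $?"   # 0 = rule present, 4 = resource problem (what the flannel log shows); 1 usually means the rule is missing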

Other details that may be helpful:

Here is what I believe caused this, and why I'm opening this issue on the Rancher repository:

  • During the upgrade, some service touched or modified iptables on each host but did not fully release the lock.
  • I only restarted one of the hosts.
  • On the hosts that were not restarted, Flannel could no longer get the iptables lock, because something else was already holding it.

A reboot of each node resolved the issue. (But I lost a day figuring this out).
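For anyone hitting this before rebooting, one way to test this theory is to see which processes on the host have the xtables lock file open (flock holders keep an open file descriptor). A sketch, assuming the standard lock path used by iptables >= 1.4.21:

  # List processes holding /run/xtables.lock open on the host
  sudo fuser -v /run/xtables.lock
  # or
  sudo lsof /run/xtables.lock

Note that this only shows holders using the host's copy of the lock file; containers that don't have it mounted from the host won't appear here.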

I believe it has to do with:

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI):
    Rancher 2.1.1

All nodes are running Ubuntu 16.04.5 LTS (GNU/Linux 4.4.0-1075-aws x86_64)

  • Installation option (single install/HA):
    Single install

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported):
    Rancher created AWS EC2 cluster

  • Machine type (cloud/VM/metal) and specifications (CPU/memory):
    AWS t3 Medium instances

  • Kubernetes version (use kubectl version):
    Kubernetes 1.11.3

Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
janeczku (Contributor) commented:

@dhrp Double-checking: this cluster was configured to use the proper Flannel CNI, i.e. not Canal, correct?

janeczku (Contributor) commented Apr 10, 2019

This seems to be a combination of a race condition in iptables access between flannel and kube-proxy, due to the latter not mounting the common flock file from the host (PR), and an upstream Flannel issue where it fails to recover from a failed attempt to ensure iptables rules: flannel-io/flannel#988.
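One way to check a node for this (a sketch; the container names and the docker ps filter are assumptions, so look up the real names with docker ps on an RKE node) is to compare the lock file inside the kube-proxy and kube-flannel containers with the host's copy. A different device:inode pair, or a missing file, means the lock is not shared from the host:

  # On the host
  stat -c 'host: %d:%i' /run/xtables.lock
  # Inside the kube-proxy container (RKE runs it as a plain docker container)
  docker exec kube-proxy stat -c 'kube-proxy: %d:%i' /run/xtables.lock
  # Inside the kube-flannel pod's container
  docker exec "$(docker ps -qf name=flannel | head -n1)" stat -c 'flannel: %d:%i' /run/xtables.lock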

janeczku added the kind/bug, area/rke, and area/networking labels Apr 10, 2019

janeczku commented Apr 10, 2019

We should also update Flannel to >= v0.11 to allow configuring the iptables update interval, which is currently hardcoded to 5 seconds and might lead to contention. See flannel-io/flannel#933 and flannel-io/flannel#935.
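For reference, on Flannel >= v0.11 the resync period is exposed as a flanneld flag, so it can be raised where the 5 second default causes contention. A sketch of the container command as it would appear in a kube-flannel DaemonSet (the binary path and the other flags mirror the stock upstream manifest and may differ in the RKE-deployed one):

  # Flannel >= v0.11: iptables rules are re-ensured every --iptables-resync seconds (default 5)
  /opt/bin/flanneld --ip-masq --kube-subnet-mgr --iptables-resync=15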

deniseschannon modified the milestones: v2.3, v2.2.x Apr 10, 2019
alena1108 modified the milestones: v2.2.3, v2.3 Apr 18, 2019
jira-sync-svc changed the title to "Flannel fails after upgrade because of lock: iptables: Resource temporarily unavailable." Apr 24, 2019
jira-sync-svc assigned JacieChao and unassigned janeczku Apr 24, 2019
jiaqiluo self-assigned this Apr 25, 2019

➤ Jack Luo commented:

The bug fix is validated on Rancher master (d0d79b0).

Steps:

  • add a cluster with k8s version 1.11.9 and Flannel as the network provider
  • run automation to deploy about 200 workloads with cluster IPs and test their cross-node and cross-workload accessibility
  • check the logs of the kube-flannel workload for errors
  • upgrade k8s to 1.12.7
  • check the logs of the kube-flannel workload for errors
  • test randomly picked cluster IPs
  • upgrade k8s to 1.13.5
  • check the logs of the kube-flannel workload for errors
  • test randomly picked cluster IPs (example commands below)
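The log and cluster-IP checks can be scripted roughly like this (a sketch; the app=flannel label comes from the stock kube-flannel manifest, and the test image, cluster IP, and port are placeholders):

  # Scan all kube-flannel pods for the iptables errors from this issue
  kubectl -n kube-system get pods -l app=flannel -o name \
      | xargs -I{} kubectl -n kube-system logs {} \
      | grep -i "failed to ensure iptables" || echo "no iptables errors found"
  # Hit a randomly picked cluster IP from inside the cluster
  kubectl run cluster-ip-test --rm -it --restart=Never --image=busybox -- \
      wget -qO- http://<cluster-ip>:<port>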


➤ Jack Luo commented:

[~jan] Is the test in my previous comment sufficient to validate the bug fix? Are there any other cases that need to be covered?

janeczku (Contributor) commented:

@JacieChao Yes, what you tested should cover the scenarios in which the race is most likely to manifest itself.
