
Flannel fails after upgrade because of lock: iptables: Resource temporarily unavailable. #18637

Closed
dhrp opened this issue Mar 6, 2019 · 6 comments
Labels: area/networking, area/rke, kind/bug

dhrp commented Mar 6, 2019

What kind of request is this (question/bug/enhancement/feature request):
Bug

Steps to reproduce (fewest steps possible):
Setup where this occurred:

  • Start with Rancher 2.1.1
  • Create a Kubernetes 1.10 cluster with the Flannel CNI
  • Use Rancher to edit the cluster and upgrade Kubernetes to 1.11.3
  • Notice that Rancher does this nicely and without any downtime of the hosts (I rebooted one of them)
  • Now (possibly after some time), notice that networking between (some) pods has stopped working

Result:
Networking between (some) pods stops working. In my case specifically, I could ping some containers from one of the host OSes, but not others.

When checking the logs of kube-flannel, I saw several lines like:

I0227 14:22:31.917220 1 vxlan_network.go:60] watching for new subnet leases
E0304 05:39:31.131114 1 iptables.go:97] Failed to ensure iptables rules: Error checking rule existence: failed to check rule existence: running [/sbin/iptables -t nat -C POSTROUTING ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE --wait]: exit status 4: iptables: Resource temporarily unavailable.
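To see whether a node is currently affected, the check from the log above can be re-run by hand. This is only a sketch; the 10.42.0.0/16 subnet is just the value from my log, so adjust it to your cluster:

  # Re-run the rule check that flannel performs, waiting for the xtables lock (-w)
  sudo iptables -w -t nat -C POSTROUTING ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE
  echo "exit status: $?"   # 0 = rule present, 4 = resource problem (what the flannel log shows); 1 usually means the rule is missing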

Other details that may be helpful:

Here is what I believe caused this, and why I'm opening this issue on the Rancher repository:

  • During the upgrade, some service touched or modified iptables on each host but did not fully release the lock.
  • I only restarted one of the hosts.
  • On the hosts that were not restarted, Flannel could no longer get the iptables lock, because something else was already holding it.

A reboot of each node resolved the issue. (But I lost a day figuring this out).
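For anyone hitting this before rebooting, one way to test this theory is to see which processes on the host have the xtables lock file open (flock holders keep an open file descriptor). A sketch, assuming the standard lock path used by iptables >= 1.4.21:

  # List processes holding /run/xtables.lock open on the host
  sudo fuser -v /run/xtables.lock
  # or
  sudo lsof /run/xtables.lock

Note that this only shows holders using the host's copy of the lock file; containers that don't have it mounted from the host won't appear here.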

I believe it has to do with:

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI):
    Rancher 2.1.1

All nodes are running Ubuntu 16.04.5 LTS (GNU/Linux 4.4.0-1075-aws x86_64)

  • Installation option (single install/HA):
    Single install

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported):
    Rancher created AWS EC2 cluster

  • Machine type (cloud/VM/metal) and specifications (CPU/memory):
    AWS t3 Medium instances

  • Kubernetes version (use kubectl version):
    Kubernetes 1.11.3

Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
janeczku (Contributor) commented:

@dhrp Double-checking: this cluster was configured to use the proper Flannel CNI, i.e. not Canal, correct?

janeczku (Contributor) commented Apr 10, 2019

This seems to be a combination of a race condition in iptables access between flannel and kube-proxy, due to the latter not mounting the common flock file from the host (PR), and an upstream Flannel issue where it fails to recover from a failed attempt to ensure iptables rules: flannel-io/flannel#988.
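One way to check a node for this (a sketch; the container names and the docker ps filter are assumptions, so look up the real names with docker ps on an RKE node) is to compare the lock file inside the kube-proxy and kube-flannel containers with the host's copy. A different device:inode pair, or a missing file, means the lock is not shared from the host:

  # On the host
  stat -c 'host: %d:%i' /run/xtables.lock
  # Inside the kube-proxy container (RKE runs it as a plain docker container)
  docker exec kube-proxy stat -c 'kube-proxy: %d:%i' /run/xtables.lock
  # Inside the kube-flannel pod's container
  docker exec "$(docker ps -qf name=flannel | head -n1)" stat -c 'flannel: %d:%i' /run/xtables.lock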

janeczku added the kind/bug, area/rke, and area/networking labels Apr 10, 2019

janeczku commented Apr 10, 2019

We should also update Flannel to >= v0.11 to allow configuring the iptables update interval, which is currently hardcoded to 5 seconds and might lead to contention. See flannel-io/flannel#933 and flannel-io/flannel#935.
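For reference, on Flannel >= v0.11 the resync period is exposed as a flanneld flag, so it can be raised where the 5 second default causes contention. A sketch of the container command as it would appear in a kube-flannel DaemonSet (the binary path and the other flags mirror the stock upstream manifest and may differ in the RKE-deployed one):

  # Flannel >= v0.11: iptables rules are re-ensured every --iptables-resync seconds (default 5)
  /opt/bin/flanneld --ip-masq --kube-subnet-mgr --iptables-resync=15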

deniseschannon modified the milestones: v2.3, v2.2.x Apr 10, 2019
alena1108 modified the milestones: v2.2.3, v2.3 Apr 18, 2019
jira-sync-svc changed the title to "Flannel fails after upgrade because of lock: iptables: Resource temporarily unavailable." Apr 24, 2019
jira-sync-svc assigned JacieChao and unassigned janeczku Apr 24, 2019
jiaqiluo self-assigned this Apr 25, 2019

➤ Jack Luo commented:

The bug fix is validated on Rancher master (d0d79b0).

Steps:

  • add a cluster with k8s version 1.11.9 and Flannel as the network provider
  • run automation to deploy about 200 workloads with cluster IPs and test their cross-node and cross-workload accessibility
  • check the logs of the kube-flannel workload for errors
  • upgrade k8s to 1.12.7
  • check the logs of the kube-flannel workload for errors
  • test randomly picked cluster IPs
  • upgrade k8s to 1.13.5
  • check the logs of the kube-flannel workload for errors
  • test randomly picked cluster IPs (example commands below)
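The log and cluster-IP checks can be scripted roughly like this (a sketch; the app=flannel label comes from the stock kube-flannel manifest, and the test image, cluster IP, and port are placeholders):

  # Scan all kube-flannel pods for the iptables errors from this issue
  kubectl -n kube-system get pods -l app=flannel -o name \
      | xargs -I{} kubectl -n kube-system logs {} \
      | grep -i "failed to ensure iptables" || echo "no iptables errors found"
  # Hit a randomly picked cluster IP from inside the cluster
  kubectl run cluster-ip-test --rm -it --restart=Never --image=busybox -- \
      wget -qO- http://<cluster-ip>:<port>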


➤ Jack Luo commented:

[~jan] Is the test in my previous comment sufficient to validate the bug fix? Are there any other cases that need to be covered?

janeczku (Contributor) commented:

@JacieChao Yes, what you tested should cover the scenarios in which the race is most likely to manifest itself.
