Improper iptables configuration in case of concurrent iptables access #2998

yannrouillard opened this issue Jun 1, 2017 · 10 comments

@yannrouillard

We currently use weave on our Kubernetes cluster to provide the networking layer, and from time to time we encounter networking issues with weave at container startup time.

Our non-production clusters are automatically stopped at night and restarted in the morning.
On several occasions, we noticed that the Kubernetes network stack did not work correctly in the morning.

The symptoms were that containers were not able to access resources outside of the cluster.
This generally also impacted internal access, as kube-dns was not working properly because its container was not able to reach external DNS servers.

After investigation, we noticed the following things:

  • the problem appeared at the node level rather than the cluster level: some of the nodes were not impacted, and pods hosted on those nodes could access external resources,
  • restarting the weave pods on a network-unhealthy node didn't solve the issue,
  • the packets sent outside of the cluster weren't properly masqueraded by iptables,
  • and indeed the WEAVE rule set was not present in the iptables nat table (a quick check for this is sketched below),
  • the weave container initially failed its health check at startup and was restarted at least once on the affected nodes.
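For reference, a quick way to check for this broken state on a node is to look for the WEAVE chain in the nat table directly; a minimal diagnostic sketch, using only the chain and table names that appear later in this thread:

# The WEAVE chain should exist in the nat table on a healthy node;
# on an impacted node this listing fails because the chain is missing.
iptables -w -t nat -nL WEAVE

# Alternatively, dump the nat table and look for weave-related rules.
iptables-save -t nat | grep -i weave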
@yannrouillard
Author

After having a look at the weave shell script that is launched at container start time (by launch.sh),
we noticed that the iptables WEAVE ruleset configuration is performed by try_create_bridge, but only if the bridge is not already present.

Our theory is that the weave container was restarted after the bridge was created but before all the iptables rules were in place. Upon subsequent weave restarts, the rules were not added again as the bridge was already present, and hence the iptables configuration was left in an improper state.
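A simplified sketch of the pattern we believe is at play (hypothetical helper names, not the actual weave code; the bridge is assumed to be the interface named weave):

# Sketch only: iptables setup happens on the same path that creates the bridge.
if ! ip link show weave >/dev/null 2>&1; then
    create_bridge            # hypothetical helper
    setup_iptables_rules     # hypothetical helper: WEAVE chain + POSTROUTING jump
fi
# If the container is restarted after create_bridge succeeds but before
# setup_iptables_rules completes, the next run sees the bridge and skips
# the iptables setup entirely, leaving the nat table half-configured.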

We don't know why this situation happens only from time to time, nor why it usually impacts a lot of nodes at the same time. There may be an external condition that slows down our weave container startup.

@yannrouillard yannrouillard changed the title Weave container Improper iptables configuration when weave container is restarted before full initialization Jun 1, 2017
@yannrouillard
Author

The best solution would be to make the weave start script more resilient in case of failure.
It might also be that the weave liveness probe threshold defined in the default daemon set yaml file is too low and causes unnecessary restarts.
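If the probe threshold does turn out to be the trigger, it can be relaxed without replacing the whole manifest; a hedged sketch, assuming the daemon set is called weave-net in kube-system and the weave container is the first one in the pod spec:

# Assumed names and paths; adjust to match the actual daemon set manifest.
kubectl -n kube-system patch daemonset weave-net --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60}]'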

@bboreham bboreham added the bug label Jun 1, 2017
@bboreham
Contributor

bboreham commented Jun 1, 2017

Thanks @yannrouillard; I think your analysis of the situation is very good.

Currently the code starts from scratch and does actions A, B, C, D, E to achieve the target state.
It would be better to compare the actual state to the target state and decide that only actions C and D (say) are needed to get there. We recently moved all that code from shell script to Go, which makes it far easier to contemplate such a change.
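For the iptables part of the setup, this roughly amounts to making each step idempotent; a sketch of the equivalent shell logic (not the actual Go code; create_bridge is a hypothetical placeholder):

# Reconcile-style sketch: test each piece of the target state and only
# perform the missing steps, so a partially-initialised node converges.
ip link show weave >/dev/null 2>&1 || create_bridge   # hypothetical helper
iptables -w -t nat -nL WEAVE >/dev/null 2>&1 || iptables -w -t nat -N WEAVE
iptables -w -t nat -C POSTROUTING -j WEAVE 2>/dev/null || iptables -w -t nat -A POSTROUTING -j WEAVE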

Re the liveness probe, it is configured to allow 30 seconds, and the network set-up typically takes less than one second. So I'm very interested in any clues you can give as to what would stretch it out that much.

@yannrouillard
Author

OK, some news about this issue.
This might not be caused by a restart at the wrong moment (or else there are several ways to trigger the issue).

We had a similar issue again, but this time we got more information as we had enabled debug logging.
We saw that the WEAVE target creation failed because of a 'Resource temporarily unavailable' error.

This caused an improper iptables configuration that is never repaired afterwards.
Here is the log snippet showing the issue:

+ run_iptables -t nat -N WEAVE
+ [ -z 1 ]
+ iptables -w -t nat -N WEAVE
iptables: Resource temporarily unavailable.
+ true
+ add_iptables_rule nat POSTROUTING -j WEAVE
+ IPTABLES_TABLE=nat
+ shift 1
+ run_iptables -t nat -C POSTROUTING -j WEAVE
+ true
+ run_iptables -t nat -A POSTROUTING -j WEAVE
+ [ -z 1 ]
+ iptables -w -t nat -A POSTROUTING -j WEAVE
iptables v1.6.0: Couldn't load target `WEAVE':No such file or directory

Try `iptables -h' or 'iptables --help' for more information.
+ [ 2 != 4 ]
+ return 1

Currently looking at how this could happen, but one question: any failure in the WEAVE target creation is ignored and error messages are redirected to /dev/null; what was the reason for that?
run_iptables -t nat -N WEAVE >/dev/null 2>&1 || true

We had to remove the >/dev/null 2>&1 to be able to see the proper error message.
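One way to keep the convenience of tolerating "chain already exists" without hiding real failures such as the 'Resource temporarily unavailable' one above would be to check why -N failed; a sketch, not a proposed patch:

# Tolerate "chain already exists", but surface any other failure instead of
# discarding everything with "|| true".
if ! iptables -w -t nat -N WEAVE 2>/dev/null; then
    # -N failed: acceptable only if the chain actually exists.
    iptables -w -t nat -nL WEAVE >/dev/null 2>&1 || {
        echo "failed to create WEAVE chain in nat table" >&2
        exit 1
    }
fi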

@yannrouillard
Author

yannrouillard commented Jun 5, 2017

@bboreham I didn't understand why the -w option didn't prevent this issue, but I wonder if we are running into the problem mentioned in moby/moby#30379:

Iptables binaries on the host have a lock that they try to get (/run/xtables.lock or a unix socket)
and will wait until it's grabbed. However, inside of a container that lock will be different, so
iptables on the host and the container will both attempt to run at the same time, causing this
issue.

From what I see, iptables on my host and inside the weave container are indeed both using /run/xtables.lock, and unless I'm mistaken, /run/xtables.lock is not mounted from the host into the container.

Shouldn't we mount /run/xtables.lock into the weave container?
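For what it's worth, one way to check from the host whether a running container already shares the host's lock file is to inspect its mounts; a small sketch (the plain container name weave is an assumption, under Kubernetes the generated docker container name will differ):

# Does the container have the host's /run/xtables.lock bind-mounted?
docker inspect weave | grep -q xtables.lock \
    && echo "host /run/xtables.lock is mounted into the container" \
    || echo "lock not shared: host and container iptables can still race"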

@yannrouillard yannrouillard changed the title Improper iptables configuration when weave container is restarted before full initialization Improper iptables configuration in case of concurrent iptables access Jun 5, 2017
@bboreham
Contributor

bboreham commented Jun 5, 2017

@yannrouillard yes; that is under discussion at #2980. Note we have to ensure the file exists on the host before running a container that mounts it.

That moby issue you linked to is closed as a duplicate, but I updated the open one, moby/moby#12547.

@yannrouillard
Author

For now we mounted /run/xtables.lock into the weave container, as we know this file will be present by the time the weave container is started on our hosts.

So far the problem hasn't appeared again, but we are waiting a while longer before being sure.

I will update this ticket with the outcome.

@chrislovecnm

@bboreham can you provide more information, and possibly an example manifest? Should we add the /run directory to the weave pods?

@bboreham
Contributor

@chrislovecnm the problem with just doing a mount is that, on a freshly-booted machine where the lock file doesn't exist, Docker will create a directory of the same name, which will then break everything.
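This is easy to reproduce outside of weave; a small sketch of the failure mode (alpine is used purely as a throwaway image):

# On a freshly-booted host where /run/xtables.lock does not exist yet,
# bind-mounting the path makes Docker create a directory there:
docker run --rm -v /run/xtables.lock:/run/xtables.lock alpine true
ls -ld /run/xtables.lock    # now a directory, so iptables' locking is broken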

Mounting the parent directory, /run, is problematic because Docker's container trees are under there, which means every volume mount is now recursive, and that makes things break inside the kernel.

There is an upcoming feature, kubernetes/kubernetes#46597, which will allow you to say you want a file and not a directory, so we could safely mount /run/xtables.lock. Sadly we can't rely on that until some future version of Kubernetes (1.8, probably).

Failing that, you need to arrange on the host that the file exists before starting the Weave pod, which may be straightforward for kops. @yannrouillard could you share your manifest change as an example?
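The host-side preparation amounts to creating the lock file before the pod starts; a minimal sketch (where to hook it in, e.g. node provisioning or a boot-time unit, depends on the setup):

# Make sure the lock exists as a regular file before the weave pod starts,
# so a mount of /run/xtables.lock picks up a file rather than creating a directory.
[ -e /run/xtables.lock ] || touch /run/xtables.lock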

@bboreham
Contributor

bboreham commented Oct 5, 2017

Fixed by #3134

@bboreham bboreham closed this as completed Oct 5, 2017