Skip to content
This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

'weave ps' fails with "Cannot find weave bridge: Link not found" #2388

Closed
2opremio opened this issue Jun 20, 2016 · 8 comments
Closed

'weave ps' fails with "Cannot find weave bridge: Link not found" #2388

2opremio opened this issue Jun 20, 2016 · 8 comments
Assignees
Labels
Milestone

Comments

@2opremio
Copy link
Contributor

2opremio commented Jun 20, 2016

... even if the weave bridge exists:

weave     Link encap:Ethernet  HWaddr 06:c2:01:12:41:d5  
          inet addr:10.32.0.1  Bcast:0.0.0.0  Mask:255.240.0.0
          inet6 addr: fe80::4c2:1ff:fe12:41d5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1410  Metric:1
          RX packets:732925 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1016150 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:597366797 (597.3 MB)  TX bytes:2877166348 (2.8 GB)

This happened when running a local instance of the scope service (kubernetes anywhere).

Running with bash -x I get:

+ docker run --rm --privileged --net=host -v /var/run/docker.sock:/var/run/docker.sock --pid=host -e DOCKERHUB_USER=weaveworks -e WEAVE_VERSION -e WEAVE_DEBUG -e WEAVE_DOCKER_ARGS -e WEAVEPROXY_DOCKER_ARGS -e WEAVEPLUGIN_DOCKER_ARGS -e WEAVE_PASSWORD -e WEAVE_PORT -e WEAVE_HTTP_ADDR -e WEAVE_CONTAINER_NAME -e WEAVE_MTU -e WEAVE_NO_FASTDP -e WEAVE_NO_BRIDGED_FASTDP -e WEAVE_NO_PLUGIN -e DOCKER_BRIDGE -e DOCKER_CLIENT_HOST= -e DOCKER_CLIENT_ARGS -e PROXY_HOST=127.0.0.1 -e COVERAGE -e CHECKPOINT_DISABLE weaveworks/weaveexec:1.5.2 --local ps
Cannot find weave bridge: Link not found

Output when running the docker command above with WEAVE_DEBUG=1: https://gist.github.com/2opremio/cfe5c7fa236854e9963dda0f8afc5754

Some details about my system:

$ uname -a
Linux vagrant-ubuntu-wily-64 4.2.0-36-generic #42-Ubuntu SMP Thu May 12 22:05:35 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ docker version
Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:40:35 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:40:35 2016
 OS/Arch:      linux/amd64
@rade
Copy link
Member

rade commented Jun 20, 2016

I looked at this the other day. The WEAVE_DEBUG output shows the error half way through iterating over the containers, which is very mysterious indeed.

@brb brb self-assigned this Jun 20, 2016
@brb
Copy link
Contributor

brb commented Jun 20, 2016

@2opremio Are you able to reproduce the issue? Maybe you have an output of dmesg?

From inspecting the netlink library, I can see that "Link not found" might be returned only in two cases and by the same function. The second case is more interesting, because it might not necessary mean that a link does not exist.

@2opremio
Copy link
Contributor Author

2opremio commented Jun 21, 2016

I can't reproduce with my computer in Spain (different Docker/Weave versions). I can reproduce systematically with my laptop in London but it would have to wait until Monday.

In the meantime, I think @jml can (he shows exactly the same error in weaveworks/scope#1599 )

@brb
Copy link
Contributor

brb commented Jun 24, 2016

As I've already mentioned on Slack, the root cause is the WithNetNS function.

During execution of work() or before defer netns.Set(oldNs), due to e.g. blocking syscall, the Go runtime might create an OS thread (child) via clone(2) w/o CLONE_NEWNET, which means that the child is going to reside in the same network namespace as its parent. The child is used to schedule go-routines and thus, any go-routine run by the child will be in the "wrong" network namespace.

In the Linux kernel, each netdev is associated only with exactly one network namespace. Therefore, the "weave" bridge cannot be seen from a container which network namespace != host network namespace.

The error you get is due to https://github.com/weaveworks/weave/blob/v1.6.0/common/utils.go#L108 being executed on a thread which entered container network namespace which is not associated with the "weave" bridge netdev.

As an evidence, please check the following strace output (full log):

<...>
[pid  3361] openat(AT_FDCWD, "/proc/17526/ns/net", O_RDONLY) = 61
[pid  3361] getpid()                    = 3357
[pid  3361] gettid()                    = 3361
[pid  3361] openat(AT_FDCWD, "/proc/3357/task/3361/ns/net", O_RDONLY) = 62
[pid  3361] setns(61, CLONE_NEWNET)     = 0
<...>
[pid  3361] <... clone resumed> child_stack=0x7f19efffee70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f19effff9d0, tls=0x7f19effff700, child_tidptr=0x7f19effff9d0) = 3365
<...>
[pid  3365] sendto(65, "\x2c\x00\x00\x00\x12\x00\x05\x00\x75\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0a\x00\x03\x00\x77\x65\x61\x76\x65\x00\x00\x00", 44, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 44
[pid  3365] getsockname(65, {sa_family=AF_NETLINK, pid=3357, groups=00000000}, [12]) = 0
[pid  3365] recvfrom(65, "\x40\x00\x00\x00\x02\x00\x00\x00\x75\x00\x00\x00\x1d\x0d\x00\x00\xed\xff\xff\xff\x2c\x00\x00\x00\x12\x00\x05\x00\x75\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0a\x00\x03\x00\x77\x65\x61\x76\x65\x00\x00\x00", 4096, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 64
[pid  3365] close(65)                   = 0
[pid  3358] <... select resumed> )      = 0 (Timeout)
[pid  3365] write(2, "Cannot find weave bridge: Link not found\n", 41

The 3365 process does the RTM_GETLINK rtnetlink request for the "weave" bridge interface which is not found. The 3365 process is created by the 3361 process which entered the /proc/17526/ns/net network namespace (of some container) before cloning, therefore the 3365 process is running in the /proc/17526/ns/net netns as well.

@brb
Copy link
Contributor

brb commented Jun 24, 2016

To reproduce the issue (on my machine - 4.6 / 1.11.2), I start 32 containers (for i in $(seq 1 32); do docker $(weave config) run --name "c$i" -td alpine /bin/sh; done) and runweave ps multiple times (for i in $(seq 1 100); do weave ps; done).

@rade rade modified the milestone: 1.6.1 Jun 25, 2016
@bboreham
Copy link
Contributor

This is basically the same as moby/libnetwork#1113, which any Go program calling Syscall(SYS_SETNS) will (eventually) fall victim to.

brb added a commit that referenced this issue Jul 4, 2016
brb added a commit that referenced this issue Jul 4, 2016
brb added a commit that referenced this issue Jul 4, 2016
brb added a commit that referenced this issue Jul 4, 2016
This was referenced Jul 4, 2016
brb added a commit that referenced this issue Jul 5, 2016
@bcmills
Copy link

bcmills commented Jun 2, 2017

I found this via https://www.weave.works/blog/linux-namespaces-and-go-don-t-mix.

This snippet:

     runtime.LockOSThread()
     defer runtime.UnlockOSThread()
     hostNetNs, _ := netns.Get()
     netns.Set(ns)

     links, _ := netlink.LinkList()
     fmt.Println(filterAttachedLinks(links, indexes))

     // Return to the host network namespace
     netns.Set(hostNetNs)

is a perfect illustration of erroneous usage of runtime.UnlockOSThread (as described in golang/go#20458).

netlink.LinkList is most likely unlocking your thread before it returns, so the second netns.Set may be switching threads when you don't expect it.

@bboreham
Copy link
Contributor

bboreham commented Jun 3, 2017

@bcmills we think the most likely explanation is the one we observed: a new OS thread is started inside the locked region, which inherits the changed namespace, and when other goroutines are then scheduled onto that thread they are in the that other namespace.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants