
Failed to evaluate networking: failed to find plugin \"rancher-bridge\" when metadata container's bridge IP is equal to another hosts docker0 IP (bip) #8383

Closed
ciokan opened this issue Apr 1, 2017 · 102 comments
Assignees
Labels
area/metadata, internal, kind/bug, version/1.6

Comments

@ciokan

ciokan commented Apr 1, 2017

Rancher Versions:
Server: 1.5.2
healthcheck: 0.2.3
ipsec: rancher/net:holder
network-services: rancher/network-manager:v0.5.3
scheduler: rancher/scheduler:v0.7.5
kubernetes (if applicable):

Docker Version:
1.12.6

OS and where are the hosts located? (cloud, bare metal, etc):
Ubuntu 16.04/Bare Metal/Multiple locations

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single node rancher, external DB

Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle

Steps to Reproduce:
I have about 100 hosts running 4 containers. On 5-6 of them ipsec runs OK; on every other host the ipsec stack fails (and so does healthcheck) with "Timeout getting ip". In the network-manager logs I can see:

nsenter: cannot open /proc/26034/ns/ipc: No such file or directory
time="2017-04-01T02:31:01Z" level=error msg="Failed to evaluate network state for 059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input" 
time="2017-04-01T02:31:01Z" level=error msg="Error processing event &docker.APIEvents{Action:\"start\", Type:\"container\", Actor:docker.APIActor{ID:\"059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769\", Attributes:map[string]string{\"io.rancher.container.agent_id\":\"58465\", \"io.rancher.project_service.name\":\"healthcheck/healthcheck\", \"io.rancher.service.hash\":\"d0a8fd4061d3b2a8c5782f5563db6df0b25655cb\", \"io.rancher.stack_service.name\":\"healthcheck/healthcheck\", \"io.rancher.cni.network\":\"ipsec\", \"io.rancher.scheduler.global\":\"true\", \"io.rancher.service.requested.host.id\":\"209\", \"io.rancher.stack.name\":\"healthcheck\", \"name\":\"r-healthcheck-healthcheck-28-2576d08e\", \"image\":\"rancher/healthcheck:v0.2.3\", \"io.rancher.container.create_agent\":\"true\", \"io.rancher.container.ip\":\"10.42.112.234/16\", \"io.rancher.container.mac_address\":\"02:4f:71:44:58:1a\", \"io.rancher.service.launch.config\":\"io.rancher.service.primary.launch.config\", \"io.rancher.cni.wait\":\"true\", \"io.rancher.container.system\":\"true\", \"io.rancher.container.uuid\":\"2576d08e-3100-4935-aa34-3ecb73da302a\", \"io.rancher.project.name\":\"healthcheck\", \"io.rancher.service.deployment.unit\":\"8a360a36-5ff6-497f-a7db-92193ecf2415\", \"io.rancher.container.name\":\"healthcheck-healthcheck-28\"}}, Status:\"start\", ID:\"059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769\", From:\"rancher/healthcheck:v0.2.3\", Time:1491013861, TimeNano:1491013861288487613}. Error: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input" 
@snipking

snipking commented Apr 1, 2017

Can you provide the ipsec log, or can you try http://rancher-metadata/2015-12-19/self from a container that uses the Managed network on the same host where ipsec failed?
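For example, something along these lines should do it (untested sketch; it assumes the io.rancher.container.network label still attaches a plain docker run container to the managed network, and that busybox's wget is enough for the check):

docker run --rm --label io.rancher.container.network=true busybox \
    wget -qO- http://rancher-metadata/2015-12-19/self

If that can't resolve rancher-metadata or times out, the managed network on that host is already broken before ipsec comes into play.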

Have you tried deleting the network stack, re-running it, and then restarting the failed ipsec container?

Maybe it's related to #8368

@ciokan
Author

ciokan commented Apr 1, 2017

No container uses the managed network. All 4 are running with network_mode: host. I just deactivated all my hosts in Rancher because it was causing one host (where the scheduler is apparently running) to run at 400Mb/s and my Rancher server to stay at 100% at all times with 8 cores and 16GB of RAM. One of the 4 containers is a healthchecker of mine designed to keep the other 3 alive, so my services keep running even with the hosts deactivated in Rancher.

I just wish I could get rid of this ipsec thing and clustering features as I'm not interested in them.

I only picked Rancher because it was the only one supporting the "run one container on each host" feature and allowed you NOT to treat all your machines as a single block. Things seem to be moving in the other direction, and it's giving me all sorts of issues. I can't even disable ipsec at all.

ipsec does not produce any logs, it doesn't even start.

@ciokan
Author

ciokan commented Apr 3, 2017

I tried adding a dummy Ubuntu container (with the managed network) that just sleeps and prints something to the console. It fails to start, with the same error saying it failed to bring up the network.

Also, I found this error in the logs of network-manager container:

Error (Couldn't bring up network: failed to find plugin "rancher-bridge" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin])
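To check whether the plugin binary actually exists anywhere network-manager looks for it, something like this should do (the name=network-manager filter is just a guess, adjust it to however your network-manager container is named):

# on the affected host
ls -la /var/lib/docker/volumes/rancher-cni-driver/_data/bin/
# and inside the network-manager container
docker exec $(docker ps -q --filter name=network-manager) ls -la /opt/cni/bin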

@spawnthink

I have the same issue, reproducible with 1.5.2, 1.5.3 and 1.5.4:

Error (Couldn't bring up network: failed to find plugin "rancher-bridge" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin])

@asg1612

asg1612 commented Apr 11, 2017

I had the same issue because I was running a VirtualBox Rancher host inside another Rancher host :-S. When I removed the VirtualBox machine from the cluster the error disappeared.

Are you using a similar config?

@OneB1t

OneB1t commented Apr 11, 2017

same issue happens to us on physical hardware :-(

@ciokan
Author

ciokan commented Apr 11, 2017

@asg1612 you can read the ticket and see it's not the same config - at all. No virtualbox involved.

I have about 100 hosts running 4 containers.

@OneB1t

OneB1t commented Apr 12, 2017

Has anyone managed some sort of workaround? This is a really annoying issue, as we cannot use more than 1 host for now (out of the 4 we normally use).

@maltyxx

maltyxx commented Apr 13, 2017

me too

@ciokan
Author

ciokan commented Apr 13, 2017

Has anyone managed some sort of workaround? This is a really annoying issue, as we cannot use more than 1 host for now (out of the 4 we normally use).

Nope... and no one from the devs is responding either. I upgraded to 1.5.4 and 1.5.5 without luck. The bug is still present and it affects 90% of all our servers. I can't pinpoint the fault at all. I checked, re-installed, deleted, upgraded, downgraded... you name it, nothing pops out.

We may have to roll our own custom solution as this is not a production ready setup for us unfortunately.

@snipking

I've reproduced and solved a similar issue, #8276. Hope this helps.

@ciokan
Author

ciokan commented Apr 14, 2017

@snipking I believe that's required when you run the agent on the same host with the server. Can you confirm?

@fedya

fedya commented Apr 16, 2017

same issue

@fedya

fedya commented Apr 16, 2017

@snipking
I'm investigating a bit here.
There are no binaries in the /opt/cni/bin dir in rancher/network-manager:v0.6.6.

And today all of my servers show messages like the ones in the first comment.

@tpgxyz

tpgxyz commented Apr 16, 2017

OK, this is not funny anymore. It looks like healthcheck and ipsec are constantly failing:
http://i.share.pho.to/dda8108c_o.png

Any docker container executed from Rancher fails. Doing the same thing from the console works without issues.

@Emil983

Emil983 commented Apr 16, 2017

Same issue on a fresh, clean machine with Ubuntu 16.

@snipking

@ciokan Actually, I've reproduced #8276 in two environments, which means it happens both when the agent and server are on the same host and when they are not. Even one host with the wrong IP in a Rancher environment will cause this issue on the other hosts in that environment. It may not happen immediately, because you may add a host with the wrong IP to an environment that already has hosts working fine, but it can show up after you reboot some of your working hosts.

I'm not sure your problem is the same as mine, but it's worth a try.

@snipking

@fedya I'm using rancher/network-manager:v0.6.6 and there's only rancher-bridge in /opt/cni/bin. In my case I didn't see any error message related to it.

@fedya

fedya commented Apr 20, 2017

Still not working
docker 17.03.1

time="2017-04-20T11:12:02Z" level=info msg="Evaluating state from retry" cid=eac88e13f505ba0362c49601f269e758a8cf4bb4d2cec434eccdbce0ca908d6e count=14 
time="2017-04-20T11:12:02Z" level=info msg="CNI up" cid=eac88e13f505ba0362c49601f269e758a8cf4bb4d2cec434eccdbce0ca908d6e networkMode=ipsec 
time="2017-04-20T11:12:02Z" level=error msg="Failed to evaluate networking: failed to find plugin \"rancher-bridge\" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]"

Well, let's look into rancher/network-manager:v0.6.6

# docker exec -ti 5b4670639597 bash

# root@bongie:/# ls -la /opt/cni/bin/
total 8
drwx------ 2 root root 4096 Apr 20 11:11 .
drwxr-xr-x 4 root root 4096 Apr 20 11:11 ..

The dir is empty; the binaries Rancher expects are not there.

Let's find them:

root@bongie:/# find / -name "rancher-bridge"
find: '/proc/11335': No such file or directory
/var/lib/docker/overlay/9a46a21caa97b45e05beb39bbcdfa699c6e149e929673ef083e051f429eb93a7/root/opt/cni/bin/rancher-bridge
/var/lib/docker/overlay/7e3f77562c357050381c8bffb4e0b03c3d52b3a2cb1e40da94f9f566b5258a37/root/opt/cni/bin/rancher-bridge
/var/lib/docker/overlay/e7d3d2d6ffe968aaa64c9c7ba2b8277672832aeae646af4d8a6b1321d6e0f45d/root/opt/cni/bin/rancher-bridge
/var/lib/docker/overlay/357c145376e656c58432794cdb28adc6ff9bd6848edbde10b03fa11ab3f2779c/root/opt/cni/bin/rancher-bridge

Well, it's found, but in odd places.

Let's copy the binaries to the right place, /opt/cni/bin.
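Roughly like this (sketch only; the source is one of the overlay paths found above, and the name=network-manager filter is a guess, adjust to your container name):

NM=$(docker ps -q --filter name=network-manager)
docker cp /var/lib/docker/overlay/<one-of-the-paths-above>/root/opt/cni/bin/rancher-bridge "$NM":/opt/cni/bin/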

And now in the logs I see this:

687210, TimeNano:1492687210508494434}. Error: Get http://rancher-metadata/2015-12-19/version: dial tcp: lookup rancher-metadata on 192.168.1.1:53: no such host" 
time="2017-04-20T11:20:43Z" level=info msg="Evaluating state from retry" cid=8d63ff41f5cabb760b6666ee89579e6dbce012c6ac2141220ea77bbe0e1b982d count=1 
time="2017-04-20T11:20:43Z" level=info msg="CNI up" cid=8d63ff41f5cabb760b6666ee89579e6dbce012c6ac2141220ea77bbe0e1b982d networkMode=ipsec 
time="2017-04-20T11:20:47Z" level=error msg="macsync: error syncing MAC addresses for the first tiime: inspecting container: 5b85bfbcad07ca3856f8d6e7747265243898dfc0068fed5664cda577ef5c5d0e: Error: No such container: 5b85bfbcad07ca3856f8d6e7747265243898dfc0068fed5664cda577ef5c5d0e" 

Culprit, from the logs of network-services-network-manager-2:

 dial tcp: lookup rancher-metadata on **192.168.1.1:53: no such host"**
20.04.2017 14:46:47time="2017-04-20T11:46:47Z" level=info msg="Evaluating state from retry" cid=05eee3c01ca8d0563e50be9bed82e42703b1594fa51a5c017c1492cae99e5d71 count=1

Obviously 192.168.1.1 is my router, and of course port 53 is not available there.
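A quick way to see where the lookup should be going (untested; as far as I understand, containers on the Rancher managed network are supposed to get the Rancher DNS at 169.254.169.250, and rancher-metadata only resolves through that):

docker exec <some-managed-network-container> cat /etc/resolv.conf
cat /etc/resolv.conf   # on the host, for comparison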

time="2017-04-20T12:12:10Z" level=error msg="Failed to apply cni conf: Error: No such container: f9f200742352744f2f58ea1f70283962d63ed09ddcddf97fbb8940c968fa0550"

Probably related to the sidekick container; it's not working the way Rancher expects.

@penguinxr2

penguinxr2 commented Apr 24, 2017

Something similar is happening to me after updating to 1.5.5. I've even tried creating a whole new environment to try and get the ipsec stack to start, but it still exhibits the same error.

"Failed to evaluate networking: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"

@fedya

fedya commented Apr 24, 2017

@penguinxr2 yes, same issue; there is no network interface in the container.

@thdxr

thdxr commented Apr 25, 2017

Same issue here, anyone have any updates?

@thdxr

thdxr commented Apr 26, 2017

My issues went away once I switched to Ubuntu 16.04 instead of Debian

@kaos

kaos commented Apr 28, 2017

We've been struggling all day with ipsec issues on one of the nodes after upgrading a rancher environment running 1.4 to 1.5.6 (in other words, we have several nodes where ipsec was working).

After a lot of headscratching and cussing, we got ipsec working following these simple steps:

  • copy /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge from one of the working nodes (we were missing this file on the node with ipsec issues).
  • the file is tiny, a one-line script calling nsenter, and it needs the PID of the target process it should enter. This needs to be adjusted for each host and over time.
  • so, replace the PID passed to nsenter's -t option with the PID from ps ax | grep rancher-cni (it should be the sidekick tailing the log file):
[root@xxx bin]# ps ax | grep rancher-cni
2754 ?        Ss     0:00 tail ---disable-inotify -F /var/log/rancher-cni.log
  • contents of /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge after patching:
[root@xxx bin]# cat rancher-bridge 
#!/bin/sh
exec /usr/bin/nsenter -m -u -i -n -p -t 2754 -- $0 "$@"
  • now, restart network-manager and things start rolling!! ;)

So, in conclusion, it seems like the process responsible for updating this file is failing, for some reason that is still unknown to us (haven't had time to look into that yet).

@aemneina

tagging @leodotcloud into this ticket. Expert sleuthing @kaos

@ciokan
Author

ciokan commented Apr 28, 2017

@kaos that seems to work. I have to play "catch", since Rancher keeps changing the pid as it restarts the cni service, but once I'm fast enough setting the pid, ipsec starts rolling. With so many servers, though (and with many of them being restarted through the day), that's quite a lot to manage, so I'll try to deploy a container that constantly monitors the pids and does just that... at least until the Rancher devs fix this.

@kaos

kaos commented Apr 28, 2017

Indeed, if I was digging in the right place, the plugin-manager updates this stuff every 5 minutes, or when triggered by a change in the metadata. I'm guessing something in there is broken.

@ciokan
Author

ciokan commented Apr 28, 2017

Here's my "working" code to fix this:

import time
import subprocess

# Path to the nsenter wrapper that the CNI plugins are resolved through.
BRIDGE_PATH = '/var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge'

# Wrapper template; {} gets the host PID of the ipsec-cni sidekick so that
# nsenter joins its namespaces before exec'ing the real plugin.
BRIDGE_TPL = """#!/bin/sh
exec /usr/bin/nsenter -m -u -i -n -p -t {} -- $0 "$@"
"""


def cmd(command):
    try:
        return subprocess.check_output(command, shell=True).decode('utf-8').strip()
    except subprocess.CalledProcessError:
        return ''


def watch():
    # Host PID of the running ipsec-cni sidekick container.
    ipsec_pid = cmd("docker ps --filter \"name=ipsec-cni\" --format '{{.ID}}' | xargs docker inspect --format '{{.State.Pid}}'")

    if ipsec_pid:
        try:
            with open(BRIDGE_PATH) as f:
                bridge_contents = f.read()
        except IOError:
            bridge_contents = ''

        # Rewrite the wrapper only when the PID it references is stale.
        if ipsec_pid not in bridge_contents:
            with open(BRIDGE_PATH, 'w') as f:
                f.write(BRIDGE_TPL.format(int(ipsec_pid)))


if __name__ == "__main__":
    while True:
        watch()
        time.sleep(1)

This one runs inside a container that needs the following volumes:

/usr/bin/docker:/usr/bin/docker
/var/run/docker.sock:/var/run/docker.sock

It constantly monitors the pid and the nsenter wrapper file for a mismatch. Not the prettiest solution, but it works for now.

Also, I made it public if anyone else wants to try it out before official fix:

docker pull quay.io/drsoft/ipsec-fix:latest
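Running it would look roughly like this (hypothetical command, not the exact one I use; note that the script also reads and writes the rancher-cni-driver volume directly, so that path has to be visible inside the container at the same location):

docker run -d --restart=always \
  -v /usr/bin/docker:/usr/bin/docker \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/docker/volumes/rancher-cni-driver:/var/lib/docker/volumes/rancher-cni-driver \
  quay.io/drsoft/ipsec-fix:latest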

@ibuildthecloud
Contributor

@ciokan can you ping me on our users slack? I'd gladly troubleshoot this with you

@leodotcloud
Contributor

leodotcloud commented Jul 13, 2017

Just got off a call with @fedya ... The reason for his problem was different from the ones above.

Summary:
Host 1: Running both rancher/server and rancher/agent ... This caused the IP detected for this host in Cattle to be 172.17.0.1.
Host 2: This is the host having the problem ("Error .... ").

When running both server and agent on the same host, you need to pass CATTLE_AGENT_IP set to the public IP address of the host while registering it.

Fix:

  • Delete rancher-agent container on Host1
  • Start rancher-agent again on Host1, but this time pass CATTLE_AGENT_IP=PUBLIC_IP_OF_HOST_1
  • Things started fine on both Host 1 and Host 2.
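For reference, re-registering with the agent IP pinned looks roughly like this (sketch only; take the exact agent version and registration URL/token from your own "Add Host" screen):

sudo docker run --rm --privileged \
  -e CATTLE_AGENT_IP="<PUBLIC_IP_OF_HOST_1>" \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:<version> http://<rancher-server>:8080/v1/scripts/<registration-token>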

@fedya

fedya commented Jul 13, 2017

Tonight we fixed this issue on my servers.
Thanks to @leodotcloud for his time.

The main culprit was the rancher agent running on the rancher server; in that case you need to pass CATTLE_AGENT_IP when you deploy the agent on the server.

@leodotcloud
Contributor

One more reason: #9367

@superseb superseb changed the title Failed to evaluate networking: failed to find plugin \"rancher-bridge\" due to metadata returning inconsistent results for /self Failed to evaluate networking: failed to find plugin \"rancher-bridge\" when metadata container's bridge IP is equal to another hosts docker0 IP (bip) Jul 18, 2017
@cwash

cwash commented Jul 22, 2017

My devops team approached me with this behavior yesterday when applying an upgrade to rancher-server 1.6.5

I've double checked CATTLE_AGENT_IP is properly set when adding all of the hosts.

After doing some digging, I found that two of the hosts had docker0 as the 'old' 172.17.42.1

Applying this fix from @leodotcloud seemed to clear up the problems for them.
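In case it helps someone, this is roughly how we checked the bridge address on each host; if you need to pin it, the "bip" setting in /etc/docker/daemon.json is the usual way (the value below is just an example, pick a subnet that doesn't collide with anything else in your environment):

ip addr show docker0
# /etc/docker/daemon.json, then restart docker:
# { "bip": "172.17.0.1/16" }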

But there is still one other host that is in a bad state: its IPsec containers will not start up, and it complains about not finding rancher-bridge on the path.

I have tried the workaround mentioned by @kaos where we write the file ourselves. I just wrote a script to pull the right PID in for me as the stack appears.

Here's what the script looks like if anyone else is interested:

root@o1:/var/lib/docker/volumes/rancher-cni-driver/_data/bin # ls .
patch*  rancher-bridge*

root@o1:/var/lib/docker/volumes/rancher-cni-driver/_data/bin # cat patch
echo "#!/bin/sh" > rancher-bridge
echo "exec /usr/bin/nsenter -m -u -i -n -p -t $(pgrep -f rancher-cni.log) -- \$0 \"\$@\"" >> rancher-bridge
chmod +x rancher-bridge

This seems to change the message away from any mention of rancher-bridge - but it still fails to start up.

I tried the whole process a few times, and then tried to restart network-manager right after running my patch script as well, but I still can't get IPsec to start on this host.

What I can see in the logs leads me to believe there is something screwed up more deeply with the networking (like maybe iptables) on this box. Here is a log snippet from network-manager, and it causes a problem installing the stack properly.

Connectivity isn't an issue, though: I was able to deactivate and delete the host / remove rancher-agent before it could delete the containers in the IPsec stack, and they DID start up in standalone mode when I stopped and restarted them. I could see it get its 10.42.x.x address and communicate with everything else. BUT it wasn't part of the ipsec stack, so Rancher would list it as Standalone when I added the host back, and it would try to start new IPsec containers. I've since removed it, but I swear that did work.

I think the root cause is likely with having crufty values somewhere in the metadata as @ibuildthecloud described. I can see that there is a message still getting spammed in the network-manager log for that host:

level=error msg="macsync: error syncing MAC addresses for the first tiime: inspecting container: 4c65b32cb3ee2a4f1d7c0fd9bc27103e9bd6955765db713d084057df5751ad36: Error: No such container: 4c65b32cb3ee2a4f1d7c0fd9bc27103e9bd6955765db713d084057df5751ad36"

How do I flush that out? I've gone so far as to remove all containers (including infrastructure and rancher-agents), run docker system prune -a on everything, and bring it all back up, but I still see this host complaining about the 4c65b... container.

@leodotcloud leodotcloud added this to the August 2017 milestone Jul 25, 2017
@cwash

cwash commented Jul 25, 2017

@leodotcloud helped me troubleshoot our setup. We discovered a veth interface showing up on the host in question that was being returned when you ran ip addr.

After verifying this particular one looked off and should not be there, we tried removing it. It seems that it was not being cleaned up for whatever reason, and had some values for the host from the "old" setup before the upgrade.

Removing it manually via ip link delete veth... seemed to fix things on the next go-round: IPSec started up correctly, then healthcheck, etc.
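If you need to hunt for a stale one yourself, it was more or less this (vethXXXX is a placeholder; make sure the interface really isn't paired with a running container before deleting it):

ip -o link show type veth
ip link delete vethXXXX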

Thanks, @leodotcloud and other users on this thread.

@kaos

kaos commented Jul 26, 2017

I can pitch in another case of network interface issues related to this that we found.

It was something with the docker0 interface that was apparently causing issues for us on some occasions. It got resolved by stopping docker, wiping all docker state (/var/lib/docker, /var/run/docker, and recreating the storage volumes), and, on top of that, actually removing the docker0 interface before starting docker again. It was that last piece, removing the docker0 interface, that fixed it for us.
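Roughly, the sequence was something like the below (destructive: it wipes all local images, containers and volumes on that host, so only as a last resort; the storage volume recreation was specific to our setup and is left out):

systemctl stop docker
rm -rf /var/lib/docker /var/run/docker
ip link delete docker0
systemctl start docker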

@aemneina

@kaos do you happen to know what the docker bridge link address was? Just ran into this today where metadata was stealing docker's bridge ip of '172.17.0.1', possibly because post upgrade of docker it came up with a '172.42.x.x' address...

@kaos

kaos commented Aug 31, 2017

@aemneina IIRC it was 172.17.0.2. It had just incremented the last octet by one.

@deniseschannon deniseschannon modified the milestones: September 2017, October 2017 Sep 27, 2017
@calexandre

Guys, I'm being overwhelmed with this error.
We started the upgrade from version 1.5.5 to Rancher 1.6.10.
We followed the recommendations to upgrade network-services before ipsec. After successfully upgrading network-services we started to upgrade the ipsec stack... it started failing on the first host with the following error:

The Rancher managed network does not work, with the error 'Couldn't bring up network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input'

@fedya

fedya commented Oct 2, 2017

@calexandre are the rancher server and the rancher node on the same PC?

@calexandre

calexandre commented Oct 2, 2017

Nope, separate hosts.

I've created a forum post with more detail here:
https://forums.rancher.com/t/rancher-1-6-10-upgrade-issues-ipsec-upgrade-error/7476

Let me know if you want me to bring the issue to GitHub.

@guilhermefernandes1

I had the same issue yesterday. I have two hosts, one of them on the same machine as my Rancher server. The node I created on the same PC as the server worked well, but the other one was giving me the rancher-bridge error.
I resolved this by removing both hosts and reinstalling them, but first I installed the host from the other machine and then the one on the same machine as the server.

@cashlalala

I did the same steps @guilhermefernandes1 described above and it solved the issue.

@VinceBT

VinceBT commented Oct 20, 2017

I had the same problem, and it came from the fact that a machine was used both as a server and as an agent, so in the Hosts tab it was detected with a wrong IP, because the IP was resolved locally instead of externally (e.g. 172.17.0.1). That made the machine inaccessible from anywhere.

I stopped the agent where the IP was 172.17.0.1 and ran the agent again with docker run, adding:

-e CATTLE_AGENT_IP="$(ifconfig eth0 | grep 'inet ' | cut -d ' ' -f10 | awk '{ print $1}')"

You may have to modify the line a bit if your configuration is different.
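For example, if ifconfig isn't available or eth0 isn't your interface, an equivalent using ip would be something like (untested, adjust the interface name):

-e CATTLE_AGENT_IP="$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1)"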

Everything works fine now and the right IP shows on the Hosts tab.

LATE EDIT:

Actually, the "Add Host" page now tells you that you'll have a problem installing the server and the host on the same machine.
(screenshot of the "Add Host" page)

You can specify the IP of your server in the input.

@deniseschannon deniseschannon removed this from the v1.6 - Dec 2017 milestone Dec 1, 2017
@deniseschannon

With the release of Rancher 2.0, development on v1.6 is limited to critical bug fixes and security patches.
