Failed to evaluate networking: failed to find plugin "rancher-bridge" when metadata container's bridge IP is equal to another host's docker0 IP (bip) #8383
Can you provide the relevant logs? Have you tried deleting the network stack and re-running it, then restarting the failed containers? Maybe it's related to #8368
No container uses the managed network; all 4 are running without it. I just wish I could get rid of this ipsec thing and the clustering features, as I'm not interested in them. I only picked Rancher because it was the only one supporting what we needed.
Tried adding a dummy Ubuntu container (with the managed network) that just sleeps and prints something to the console. It fails to start with the same error, saying it failed to bring up the network. I also found a related error in the logs of the network-manager container.
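For anyone wanting to repeat that quick test from the docker CLI, a minimal sketch; it assumes the io.rancher.container.network=true label, which Rancher 1.x uses to put CLI-started containers on the managed network:

```bash
# Start a throwaway container on the Rancher managed network.
docker run -d --name net-probe \
  --label io.rancher.container.network=true \
  ubuntu:16.04 \
  bash -c 'while true; do echo alive; sleep 30; done'

# On an affected host the container never comes up; the failure reason
# shows up in the container state rather than in its logs.
docker inspect --format '{{.State.Status}}: {{.State.Error}}' net-probe
```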
I have the same issue, reproducible with 1.5.2, 1.5.3 and 1.5.4
I had the same issue because I was running a VirtualBox Rancher host inside another Rancher host :-S. When I removed the VirtualBox machine from the cluster, the error disappeared. Are you using a similar config?
The same issue happens to us on physical hardware :-(
@asg1612 you can read the ticket and see it's not the same config at all. No VirtualBox involved.
Did someone manage some sort of workaround? This is a really annoying issue, as we cannot use more than 1 host for now (out of the 4 we normally use).
me too
Nope... and no one is responding from the devs either. I upgraded to 1.5.4 and 1.5.5 without luck. The bug is still present and it affects 90% of all our servers. I can't pinpoint the fault at all. I checked, re-installed, deleted, upgraded, downgraded... you name it, nothing pops out. We may have to roll our own custom solution, as this is unfortunately not a production-ready setup for us.
I've reproduced and solved a similar issue: #8276. Hope this helps.
@snipking I believe that's required when you run the agent on the same host as the server. Can you confirm?
same issue
@snipking And today all of my servers show messages like the one in the first post.
OK, this is not funny anymore. It looks like healthcheck and ipsec are constantly failing. Any docker container started from Rancher fails; doing the same thing from the console works without issues.
Same issue on a fresh, clean machine with Ubuntu 16.
@ciokan Actually, I've reproduced #8276 in two environments, which means it happens both when the agent and server are on the same host and when they are not. Even one host with the wrong IP in a Rancher environment will cause this issue on the other hosts in that environment. This may not happen immediately, because you may add a host with the wrong IP to an environment which already has hosts working fine, but it may happen after you reboot some of your working hosts. I'm not sure your problem is the same as mine, but it's worth a try.
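One way to audit this, if it helps anyone: the Rancher metadata service answers on a well-known address from any container on the managed network. The paths below are from memory of the 1.x metadata API, so treat them as a sketch rather than gospel:

```bash
# Ask the metadata service (well-known address 169.254.169.250 on the
# managed network) which IP this host registered with.
curl -s http://169.254.169.250/latest/self/host/agent_ip

# List every host the environment knows about, to spot one with a wrong IP.
curl -s http://169.254.169.250/latest/hosts/
```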
@fedya I'm using rancher/network-manager:v0.6.6 and there's only
Still not working.
Well, let's look into rancher/network-manager:v0.6.6.
Empty dir: there are no such binaries as Rancher expects. Let's find them.
Well, they're found, but in odd places. Let's copy the binaries to the right place, /opt/cni/bin, and now in the logs I see the culprit:
Obviously 192.168.1.1 is my router, and of course port 53 is not available there.
Probably related to the sidekick container; it's not working the way Rancher expects.
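To repeat that inspection on your own host, something along these lines should work (the container-name lookup is an assumption; adjust the grep to match whatever `docker ps` shows on your box):

```bash
# Find the network-manager container (exact name varies per host).
NM=$(docker ps --format '{{.Names}}' | grep network-manager | head -n1)

# What does the CNI plugin directory actually contain?
docker exec "$NM" ls -l /opt/cni/bin

# Hunt for the rancher-bridge binary elsewhere in the image...
docker exec "$NM" find / -name rancher-bridge 2>/dev/null

# ...and, if it lives somewhere odd, copy everything next to it into the
# directory the plugin loader expects (path from the comment above).
docker exec "$NM" sh -c 'cp /odd/location/* /opt/cni/bin/'   # substitute the real path
```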
Something similar is happening to me after updating to 1.5.5. I've even tried creating a whole new environment to try and get the ipsec stack to start, but it still exhibits the same error.
@penguinxr2 yes, same issue; there is no network interface in the container.
Same issue here, anyone have any updates?
My issues went away once I switched to Ubuntu 16.04 instead of Debian.
We've been struggling all day with ipsec issues on one of the nodes after upgrading a rancher environment running 1.4 to 1.5.6 (in other words, we have several nodes where ipsec was working). After a lot of headscratching and cussing, we got ipsec working following these simple steps:
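The step list itself isn't reproduced above, but going by the follow-up comments the fix boils down to writing the PID of the CNI service into the file that plugin-manager normally maintains. A rough one-shot sketch of that idea; the state-file path is a placeholder, since the real file name isn't shown in this thread:

```bash
# Find the CNI/network-services sidekick container on this host
# (exact name varies; check `docker ps`).
CID=$(docker ps --format '{{.Names}}' | grep network-services | head -n1)

# Grab the PID of its main process...
PID=$(docker inspect --format '{{.State.Pid}}' "$CID")

# ...and write it into the state file rancher-bridge reads.
# PLACEHOLDER PATH: the real file name was not captured in this thread.
echo "$PID" | sudo tee /var/lib/rancher/state/CNI_STATE_FILE
```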
So, in conclusion, it seems like the process responsible for updating this file is failing, for some reason as yet unknown to us (we haven't had time to look into that).
tagging @leodotcloud into this ticket. Expert sleuthing @kaos |
@kaos that seems to work. I have to play "catch", since Rancher is constantly changing the PID as it restarts the CNI service; once I'm fast enough with setting the PID, it comes up.
Indeed, if I was digging in the right place: the plugin-manager updates this every 5 minutes, or when triggered by a change in the metadata. I'm guessing it's somewhere in there that something's broken.
Here's my "working" code to fix this. It runs inside a container that needs a couple of volumes from the host, and it constantly monitors the PID and rewrites the file whenever it changes. Also, I made it public if anyone else wants to try it out before an official fix.
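Neither the code nor the volume list is shown above; for a watcher like this to inspect containers and rewrite a host-side state file, it would plausibly need at least the Docker socket and Rancher's state directory, along these lines (a sketch, not @ciokan's actual command; the image name is a placeholder):

```bash
docker run -d --name pid-watcher \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  ciokan/pid-watcher   # placeholder image name
```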
@ciokan can you ping me on our users slack? I'd gladly troubleshoot this with you |
Just got off a call with @fedya... The reason for his problem was different from the above. Summary: when running both server and agent on the same host, you need to pass CATTLE_AGENT_IP when registering the agent. Fix:
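A sketch of the standard 1.x agent registration command with the IP pinned; the agent tag, server URL, and token below are placeholders (copy the real command from the Add Host screen and just add the -e flag):

```bash
sudo docker run --rm --privileged \
  -e CATTLE_AGENT_IP="<public-ip-of-this-host>" \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.2.2 \
  http://<rancher-server>:8080/v1/scripts/<registration-token>
```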
Tonight we fixed this issue on my servers. The main culprit was the rancher agent running on the rancher server; in that case you need to pass CATTLE_AGENT_IP when you deploy the agent on the server.
One more reason: #9367
My devops team approached me with this behavior yesterday when applying an upgrade to rancher-server 1.6.5. I've double-checked that CATTLE_AGENT_IP is properly set when adding all of the hosts. After doing some digging, I found that two of the hosts had docker0 on the 'old' 172.17.42.1 address. Applying the fix from @leodotcloud seemed to clear up the problems for them. But there is still one other host in a bad state, where its IPsec containers will not start up and it complains about not finding rancher-bridge on the path. I have tried the workaround mentioned by @kaos where we write the file ourselves; I just wrote a script to get it to pull the right PID in for me as the stack appears. Here's what the script looks like, if anyone else is interested:
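The script itself isn't shown above; a reconstruction of the shape described (poll until the network sidekick appears, then write its PID before Rancher restarts the service and changes it again). The state-file path and the container-name filter are placeholders:

```bash
#!/bin/bash
# Poll for the CNI/network sidekick and write its PID into the state
# file as soon as the container appears, racing Rancher's restarts.
STATE_FILE="/var/lib/rancher/state/CNI_STATE_FILE"   # placeholder path

while true; do
  CID=$(docker ps --filter name=network-services --format '{{.ID}}' | head -n1)
  if [ -n "$CID" ]; then
    PID=$(docker inspect --format '{{.State.Pid}}' "$CID")
    if [ "$PID" != "0" ] && [ "$PID" != "$(cat "$STATE_FILE" 2>/dev/null)" ]; then
      echo "$PID" > "$STATE_FILE"
      echo "$(date) wrote pid $PID for $CID"
    fi
  fi
  sleep 2
done
```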
This seems to change the message away from any mention of rancher-bridge, but it still fails to start up. I tried the whole process a few times, and then tried restarting network-manager right after running my patch script as well, but I still can't get IPsec to start on this host. What I can see in the logs leads me to believe something is screwed up more deeply with the networking on this box (like maybe iptables); there is a log snippet from network-manager showing it hitting a problem installing the stack properly. Connectivity isn't an issue, though: I was able to deactivate and delete the host/remove rancher-agent before it could delete the containers in the IPsec stack, and they DID start up in standalone mode when I stopped and restarted them. I could see the host get its 10.42.x.x address and communicate with everything else. BUT it wasn't part of the ipsec stack, so Rancher would list it as Standalone when I added the host back, and it would try to start new IPsec containers. I've since removed it, but I swear that did work. I think the root cause is likely some crufty values somewhere in the metadata, as @ibuildthecloud described. I can see that there is a message still getting spammed in the network-manager log for that host:
How do I flush that out? I've gone so far as to remove all containers (including infrastructure and rancher-agent) and docker system prune -a on everything and bring it all back up, but I still see this host complaining about the 4c65b... container.
@leodotcloud helped me troubleshoot our setup. We discovered a veth on the host in question that was being returned when listing the interfaces. After verifying that this particular one looked off and should not be there, we tried removing it. It seems that it was not being cleaned up for whatever reason, and had some values for the host from the "old" setup before the upgrade. Removing it manually (see the sketch below) cleared things up. Thanks, @leodotcloud and the other users on this thread.
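The exact commands were cut off above; the usual way to do this with iproute2 looks like the following (double-check the interface name before deleting anything):

```bash
# List veth interfaces on the host; an orphaned one left over from the
# pre-upgrade setup will have no live container peer behind it.
ip -d link show type veth

# Once you are sure the interface is stale, remove it by hand.
sudo ip link delete <stale-veth-name>
```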
I could pitch in another case of network-interface issues related to this that we found out. It was something with the
@kaos do you happen to know what the docker bridge link address was? Just ran into this today, where metadata was stealing docker's bridge IP of 172.17.0.1, possibly because, after a docker upgrade, it came up with a 172.42.x.x address...
@aemneina IIRC it was
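Whatever the exact address was, checking for this collision and pinning the bridge is straightforward; the daemon.json approach below assumes a systemd-based host:

```bash
# What address did docker0 actually come up with on this host?
ip addr show docker0

# Pin the bridge address explicitly in /etc/docker/daemon.json so an
# upgrade can't shift it under Rancher's feet, then restart docker:
#   { "bip": "172.17.0.1/16" }
sudo systemctl restart docker
```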
Guys, I'm being overwhelmed with this error.
@calexandre are the rancher server and the rancher node the same PC?
Nope, separate hosts. I've created a forum post with more detail here. Let me know if you want me to bring the issue over to GitHub.
I had the same issue yesterday. I have two hosts, and one of them is on the same machine as my Rancher server. The node I created on the same PC as the server worked well, but the other one was giving me the rancher-bridge error.
I did the same steps as @guilhermefernandes1 described above and it solved the issue.
With the release of Rancher 2.0, development on v1.6 is limited to critical bug fixes and security patches.
Rancher Versions:
Server: 1.5.2
healthcheck: 0.2.3
ipsec: rancher/net:holder
network-services: rancher/network-manager:v0.5.3
scheduler: rancher/scheduler:v0.7.5
kubernetes (if applicable):
Docker Version:
1.12.6
OS and where are the hosts located? (cloud, bare metal, etc):
Ubuntu 16.04/Bare Metal/Multiple locations
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single node rancher, external DB
Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle
Steps to Reproduce:
I have about 100 hosts running 4 containers. On 5-6 of them ipsec runs OK; on every other host the `ipsec` stack fails (also the `healthcheck`) with `Timeout getting ip`. Inside the network-manager logs I can see: