IPSec tunnel connections missing on hosts in a 9-host cluster #9863
Comments
@leodotcloud do you want to take a look at this?
@chkelly Can you please collect logs using:
Any progress here? We are probably facing the same issue.
@leodotcloud Can you confirm this will be fixed in #9971?
I am on Rancher 1.6.10. rancher/net is v0.11.9. I do see a lot of failed multi-host communication. I would like to upgrade to 0.13.2 if that fixes the communication, but the Rancher UI tells me the IPsec stack is up to date. Is it possible to upgrade rancher/net to a later version?
In a supported/released version, no. It will be in v1.6.11, which should be ready soon. If you have a playground or test environment and want to check whether your case is solved, you can try one of the release candidates or manually upgrade the service to the new version of the container. None of this is supported (because it is unreleased/untested) and it can break during future upgrades, so don't do this in any production or otherwise important environment.
Thanks, will wait for 1.6.11.
Rancher versions:
rancher/server: 1.6.7
rancher/agent: 1.2.5
Infrastructure Stack versions:
healthcheck: 0.3.1
ipsec: 0.11.7
network-services: 0.7.7
scheduler: v0.8.2
kubernetes (if applicable): N/A
Docker version: (docker version, docker info preferred)
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
AWS
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single node Rancher server backed by an RDS instance for the DB.
Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
Cattle
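For completeness, the host details requested in the template above can be gathered with standard commands. A minimal sketch (nothing here is specific to this cluster; each command falls back to a note if it is unavailable on the machine):

```shell
# Gather the diagnostics the Rancher issue template asks for.
collect() {
  docker version 2>/dev/null || echo "docker version: unavailable"
  docker info 2>/dev/null || echo "docker info: unavailable"
  cat /etc/os-release 2>/dev/null || echo "/etc/os-release: unavailable"
  uname -r
}
collect
```

Attaching this output for each affected host makes cross-host issues like this one much easier to triage.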
We saw cross-host network connectivity issues between some containers (we have about 225 containers across 9 hosts). We initially isolated the connectivity issues to a single Docker host, docker-4. While investigating a specific service issue, we discovered the router container was missing the connection between two of the hosts. Further investigation revealed that connections between other hosts were also broken, but for simplicity we will focus on one case here.
Docker-1, which can communicate with docker-4 (though we discovered it has broken connections to other hosts).
Docker-3, which cannot communicate with docker-4.
Swanctl output from docker-4.
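As a hedged sketch of how that swanctl output can be pulled on a host: the "ipsec" name filter and the docker exec path are assumptions (confirm the actual container name with docker ps), while swanctl --list-sas and --list-conns are standard strongSwan subcommands:

```shell
# Inspect IPsec state from inside the Rancher ipsec container, if one is running.
inspect_ipsec() {
  cid=$(docker ps --filter "name=ipsec" --format '{{.ID}}' 2>/dev/null | head -n 1)
  if [ -n "$cid" ]; then
    docker exec "$cid" swanctl --list-sas    # established security associations (tunnels)
    docker exec "$cid" swanctl --list-conns  # loaded connection definitions per peer
  else
    echo "no ipsec container found on this host"
  fi
}
inspect_ipsec
```

A peer that appears in --list-conns but has no corresponding entry in --list-sas matches the "tunnel connection missing" symptom described in this report.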
On docker-4 I found this in the logs and it seems to be repeating every few seconds.
Note that 10.202.33.32 is the IP of docker-3.
On docker-3 I see thousands of these entries:
Note that while investigating further we seem to have discovered more hosts within this cluster that cannot communicate with each other. They display the same symptoms as above.
This is for our QA environment. We checked multiple production environments and could not find instances of this happening outside of QA; production is on all of the same versions, however. We did upgrade 2 days ago and didn't immediately notice any issues, so I can't say for sure if it was broken from the upgrade.
Restarting the ipsec container on each host seemed to resolve the issue, but we are unsure what triggered it and why it didn't or wouldn't self-recover.
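The restart workaround described above, as a per-host sketch (the container name filter is an assumption; confirm the actual name with docker ps before relying on it):

```shell
# Restart the ipsec container on this host; run once per affected host.
restart_ipsec() {
  cid=$(docker ps --filter "name=ipsec" --format '{{.ID}}' 2>/dev/null | head -n 1)
  if [ -n "$cid" ]; then
    docker restart "$cid"
  else
    echo "no ipsec container running on $(hostname)"
  fi
}
restart_ipsec
```

Note this only clears the symptom; it does not explain the underlying trigger, which is what the linked fix (#9971) is meant to address.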