IPSec tunnel connections missing on hosts in 9 host cluster. #9863

Closed
chkelly opened this issue Sep 12, 2017 · 8 comments

Labels: area/networking, kind/bug (issues that are defects reported by users or that we know have reached a real release)

@chkelly

chkelly commented Sep 12, 2017

Rancher versions:
rancher/server: 1.6.7
rancher/agent: 1.2.5

Infrastructure Stack versions:
healthcheck: 0.3.1
ipsec: 0.11.7
network-services: 0.7.7
scheduler: v0.8.2
kubernetes (if applicable): N/A

Docker version: (docker version,docker info preferred)

Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:16 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:17:13 2017
 OS/Arch:      linux/amd64
 Experimental: false
Containers: 27
 Running: 26
 Paused: 0
 Stopped: 1
Images: 98
Server Version: 17.06.0-ce
Storage Driver: aufs
 Root Dir: /data/docker/aufs
 Backing Filesystem: extfs
 Dirs: 290
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 apparmor
Kernel Version: 3.13.0-125-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 14.94GiB
Name: docker-1
ID: 37LV:4CD5:PSKZ:YDVI:IJ5Z:UXOU:7GXH:VJMN:F3UN:M45H:QX57:BUYI
Docker Root Dir: /data/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
3.13.0-125-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
AWS
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single node Rancher server backed by an RDS instance for the DB.

Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
Cattle

We saw cross-host network connectivity issues between some containers (we have about 225 containers across 9 hosts). We initially isolated the connectivity issues to a single Docker host, docker-4. While investigating a specific service issue, we discovered that the router container was missing between two of the hosts. Further investigation showed that connections between other hosts were also broken, but for simplicity we will focus on one case here.

Docker-1, which can communicate with docker-4 (although we discovered that it has broken connections to other hosts):

conn-10.202.32.33: #255, ESTABLISHED, IKEv2, 18fcc0c77b4331be_i* d7de1ce07100c003_r
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.45.95' @ 10.202.32.33[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 907s ago, rekeying in 13252s
  child-10.202.32.33: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256599s ago
    in  c243e2ee,  16439 bytes,   296 packets
    out ca29e80e,  22030 bytes,   311 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.33.35: #254, ESTABLISHED, IKEv2, e7f040c47574a9e3_i* 3a0bbd1d26894935_r
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.22.83' @ 10.202.33.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1396s ago, rekeying in 12792s
  child-10.202.33.35: #13, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256547s ago
    in  c9824913, 1099502000 bytes, 17709239 packets
    out ce340ab5, 1181079586 bytes, 16427310 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.33.32: #253, ESTABLISHED, IKEv2, c9c7161c2b7f5ab7_i* a153fa4882c16607_r
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.154.214' @ 10.202.33.32[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1801s ago, rekeying in 12342s
  child-10.202.33.32: #2, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256599s ago
    in  cde3c319,  28421 bytes,   349 packets
    out c54e6002,  25788 bytes,   390 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.33.35: #252, ESTABLISHED, IKEv2, 1cc12e58b2255b5c_i a23e8fb2267490b8_r*
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.22.83' @ 10.202.33.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1828s ago, rekeying in 11683s
  child-10.202.33.35: #12, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256564s ago
    in  c028a829,  17442 bytes,   280 packets
    out c98d4187,  17177 bytes,   272 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.31.33: #251, ESTABLISHED, IKEv2, 39b54a424066b064_i* d863d994ff6352a9_r
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.90.53' @ 10.202.31.33[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 2002s ago, rekeying in 11732s
  child-10.202.31.33: #17, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256405s ago, rekeying in 29681237s, expires in 34433195s
    in  c412f8fb, 1021674172 bytes, 16948402 packets
    out cea37eef, 1145463693 bytes, 15499105 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.34: #250, ESTABLISHED, IKEv2, d7dfb100d2ef7507_i* 30c769199cd6da8b_r
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.170.201' @ 10.202.32.34[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 2125s ago, rekeying in 10862s
  child-10.202.32.34: #15, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256512s ago
    in  c1104fa1, 1332141570 bytes, 20795740 packets
    out c4ddb0d8, 1402156214 bytes, 20208511 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.33.32: #249, ESTABLISHED, IKEv2, a75f0807d26dc89e_i 101a2f24caa3cf56_r*
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.154.214' @ 10.202.33.32[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 2164s ago, rekeying in 11701s
  child-10.202.33.32: #11, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256582s ago
    in  cd4a1657, 1418489354 bytes, 19772956 packets
    out c00f2a33, 1353882757 bytes, 20158133 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.35: #248, ESTABLISHED, IKEv2, a15a5c94a6d963ac_i f9c419702f77a9b6_r*
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.253.65' @ 10.202.32.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 2517s ago, rekeying in 11312s
  child-10.202.32.35: #10, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256582s ago
    in  cb27fbed, 1319118770 bytes, 20901139 packets
    out cd66c69c, 1413867357 bytes, 20031844 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.35: #247, ESTABLISHED, IKEv2, 075897addd943acd_i 790ead9da94d2894_r*
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.253.65' @ 10.202.32.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 2810s ago, rekeying in 11182s
  child-10.202.32.35: #8, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256598s ago
    in  c4ef93b6,  20911 bytes,   349 packets
    out c99ed6f7,  24094 bytes,   349 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.33: #246, ESTABLISHED, IKEv2, 11d8929d9b70c3e2_i* 25abd149a6372172_r
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.45.95' @ 10.202.32.33[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 12295s ago, rekeying in 857s
  child-10.202.32.33: #9, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256583s ago
    in  cef20976, 1406991581 bytes, 21600172 packets
    out cfbfa9f2, 1454068738 bytes, 21285962 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.34: #245, ESTABLISHED, IKEv2, 564fdbe95bfcf3b4_i 872da9b237f470ce_r*
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.170.201' @ 10.202.32.34[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 13355s ago, rekeying in 861s
  child-10.202.32.34: #14, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256521s ago
    in  c0c9dca3,  11774 bytes,   182 packets
    out cc59ad24,  10961 bytes,   183 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.31.35: #244, ESTABLISHED, IKEv2, 1d42da364683f304_i 6c62831ca77618bf_r*
  local  '10.42.77.126' @ 10.42.77.126[4500]
  remote '10.42.161.172' @ 10.202.31.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 13597s ago, rekeying in 201s
  child-10.202.31.35: #6, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256599s ago
    in  cf6ec080, 1417678116 bytes, 20924898 packets
    out c998a96b, 1405732954 bytes, 20656944 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0

Docker-3, which cannot communicate with docker-4:

conn-10.202.31.35: #260, ESTABLISHED, IKEv2, 01ffa96d5405dce8_i c90144c4652924e8_r*
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.161.172' @ 10.202.31.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 86s ago, rekeying in 13226s
  child-10.202.31.35: #8, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256549s ago, rekeying in 30588845s, expires in 34433051s
    in  c9f74e1a,  76162 bytes,   492 packets
    out ccf6fd53,  62918 bytes,   467 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.34: #259, ESTABLISHED, IKEv2, 7af7512383cc9f59_i* 7c7ac72d7409763e_r
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.170.201' @ 10.202.32.34[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1245s ago, rekeying in 11983s
  child-10.202.32.34: #18, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256379s ago, rekeying in 28719129s, expires in 34433221s
    in  cd384db1, 1411871068 bytes, 21572765 packets
    out c1705f92, 1519499013 bytes, 20882258 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.31.35: #258, ESTABLISHED, IKEv2, 3fd2ba122b0a86fb_i* 7a4290624fe3220a_r
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.161.172' @ 10.202.31.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1532s ago, rekeying in 11692s
  child-10.202.31.35: #11, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256535s ago, rekeying in 28941247s, expires in 34433065s
    in  c54bfcb5, 2572263454 bytes, 25753923 packets
    out c81f192a, 2412362895 bytes, 25062157 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.33.35: #257, ESTABLISHED, IKEv2, 9c2c597e493599a1_i b5c31b431c7d08c1_r*
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.22.83' @ 10.202.33.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1626s ago, rekeying in 11560s
  child-10.202.33.35: #16, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256414s ago
    in  c132179d, 1066929359 bytes, 16876827 packets
    out c266d33e, 1236208286 bytes, 15329467 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.31.34: #256, ESTABLISHED, IKEv2, c9c7161c2b7f5ab7_i a153fa4882c16607_r*
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.77.126' @ 10.202.31.34[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1667s ago, rekeying in 11584s
  child-10.202.31.34: #13, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256466s ago, rekeying in 28568530s, expires in 34433134s
    in  c54e6002,  25788 bytes,   390 packets
    out cde3c319,  28421 bytes,   349 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.34: #255, ESTABLISHED, IKEv2, 1b7dd8882e078916_i* c829b3a7c6e5aeca_r
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.170.201' @ 10.202.32.34[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1671s ago, rekeying in 12557s
  child-10.202.32.34: #17, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256388s ago, rekeying in 29790035s, expires in 34433212s
    in  c129073f,  15985 bytes,   215 packets
    out c68defee,  18243 bytes,   200 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.33: #254, ESTABLISHED, IKEv2, 4d8db54849516599_i e91ba98a25fc82b8_r*
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.45.95' @ 10.202.32.33[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 1865s ago, rekeying in 11310s
  child-10.202.32.33: #10, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256535s ago, rekeying in 29850804s, expires in 34433065s
    in  c9478254, 1304933448 bytes, 19609747 packets
    out c3e6c221, 1389089164 bytes, 19239364 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.31.34: #253, ESTABLISHED, IKEv2, a75f0807d26dc89e_i* 101a2f24caa3cf56_r
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.77.126' @ 10.202.31.34[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 2031s ago, rekeying in 12089s
  child-10.202.31.34: #14, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256449s ago, rekeying in 28547956s, expires in 34433151s
    in  c00f2a33, 1353241636 bytes, 20148793 packets
    out cd4a1657, 1417830970 bytes, 19763654 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.35: #252, ESTABLISHED, IKEv2, 422a57d5646ef2e2_i* 4875c82684cc44fe_r
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.253.65' @ 10.202.32.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 2682s ago, rekeying in 10328s
  child-10.202.32.35: #12, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256499s ago, rekeying in 28194364s, expires in 34433101s
    in  cf1d150a, 1384778600 bytes, 20813356 packets
    out c21bf9b6, 1452199998 bytes, 20154801 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.33.34: #251, ESTABLISHED, IKEv2, bb18f7d75117d810_i* 1c52034046705813_r
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.144.96' @ 10.202.33.34[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 3739s ago, rekeying in 10382s
  child-10.202.33.34: #19, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256337s ago
    in  c21f9e3c, 1233478408 bytes, 19293667 packets
    out c2fce985, 1428896536 bytes, 17565883 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.32.33: #250, ESTABLISHED, IKEv2, a82140b807cd85d3_i* 920818be1d2d3e28_r
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.45.95' @ 10.202.32.33[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 12950s ago, rekeying in 590s
  child-10.202.32.33: #2, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256550s ago, rekeying in 28267257s, expires in 34433050s
    in  c3f4873d,  19409 bytes,   317 packets
    out cf0b8f2a,  26943 bytes,   307 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0
conn-10.202.33.35: #249, ESTABLISHED, IKEv2, eac0d55ba133706a_i* da16aabc729afbd9_r
  local  '10.42.154.214' @ 10.42.154.214[4500]
  remote '10.42.22.83' @ 10.202.33.35[4500]
  AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
  established 13060s ago, rekeying in 810s
  child-10.202.33.35: #15, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
    installed 256431s ago
    in  ca8557f5,  24179 bytes,   386 packets
    out cc392231,  27739 bytes,   333 packets
    local  0.0.0.0/0
    remote 0.0.0.0/0

Swanctl output from docker-4.

conn-10.202.33.34: #168, ESTABLISHED, IKEv2, 32e32de9712e4445_i* 13c7f525f2fc670c_r
 local  '10.42.90.53' @ 10.42.90.53[4500]
 remote '10.42.144.96' @ 10.202.33.34[4500]
 AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
 established 352s ago, rekeying in 13557s
 child-10.202.33.34: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
   installed 256433s ago
   in  c4cc36ea,  14710 bytes,   270 packets
   out cddef87a,  19120 bytes,   320 packets
   local  0.0.0.0/0
   remote 0.0.0.0/0
conn-10.202.31.35: #167, ESTABLISHED, IKEv2, 0a9215e8871dc8e9_i 1c4f245789d51a34_r*
 local  '10.42.90.53' @ 10.42.90.53[4500]
 remote '10.42.161.172' @ 10.202.31.35[4500]
 AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
 established 1212s ago, rekeying in 11863s
 child-10.202.31.35: #8, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
   installed 256432s ago
   in  cd5946aa, 1253200949 bytes, 16093449 packets
   out c77e4ea0, 1077355017 bytes, 17326355 packets
   local  0.0.0.0/0
   remote 0.0.0.0/0
conn-10.202.32.34: #166, ESTABLISHED, IKEv2, e7aab62d0b46470f_i* 0cb9689f39f7cef5_r
 local  '10.42.90.53' @ 10.42.90.53[4500]
 remote '10.42.170.201' @ 10.202.32.34[4500]
 AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
 established 1245s ago, rekeying in 12335s
 child-10.202.32.34: #5, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
   installed 256433s ago
   in  c20c30b7, 753157069 bytes, 11548813 packets
   out cecc3ac2, 768758949 bytes, 11546908 packets
   local  0.0.0.0/0
   remote 0.0.0.0/0
conn-10.202.33.34: #165, ESTABLISHED, IKEv2, 263cd3072afeca12_i 5b8345ac8f2240bf_r*
 local  '10.42.90.53' @ 10.42.90.53[4500]
 remote '10.42.144.96' @ 10.202.33.34[4500]
 AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
 established 1656s ago, rekeying in 12164s
 child-10.202.33.34: #9, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
   installed 256398s ago
   in  cf19a29a, 614523200 bytes, 9189134 packets
   out c051bd58, 624818824 bytes, 9716488 packets
   local  0.0.0.0/0
   remote 0.0.0.0/0
conn-10.202.31.34: #164, ESTABLISHED, IKEv2, 39b54a424066b064_i d863d994ff6352a9_r*
 local  '10.42.90.53' @ 10.42.90.53[4500]
 remote '10.42.77.126' @ 10.202.31.34[4500]
 AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
 established 2029s ago, rekeying in 10996s
 child-10.202.31.34: #6, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
   installed 256433s ago
   in  cea37eef, 1145593428 bytes, 15500902 packets
   out c412f8fb, 1021806953 bytes, 16950561 packets
   local  0.0.0.0/0
   remote 0.0.0.0/0
conn-10.202.32.33: #163, ESTABLISHED, IKEv2, fd7693df9350b3ce_i* 3fae8fcac640967a_r
 local  '10.42.90.53' @ 10.42.90.53[4500]
 remote '10.42.45.95' @ 10.202.32.33[4500]
 AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
 established 3093s ago, rekeying in 10513s
 child-10.202.32.33: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
   installed 256433s ago
   in  cb5ee52d, 1091945001 bytes, 15903100 packets
   out c22cee6a, 1053431374 bytes, 16453742 packets
   local  0.0.0.0/0
   remote 0.0.0.0/0
conn-10.202.32.35: #162, ESTABLISHED, IKEv2, 56ecebece644632e_i 1821928814b27f44_r*
 local  '10.42.90.53' @ 10.42.90.53[4500]
 remote '10.42.253.65' @ 10.202.32.35[4500]
 AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
 established 12596s ago, rekeying in 1680s
 child-10.202.32.35: #7, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
   installed 256433s ago
   in  c8c73920, 967542277 bytes, 13856808 packets
   out cd0ea36d, 920273974 bytes, 14690917 packets
   local  0.0.0.0/0
   remote 0.0.0.0/0
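
The gap is easier to see when each listing is reduced to its peer addresses. A minimal sketch, assuming the text above is `swanctl --list-sas` output and that the nine host addresses seen across the three listings are the full set of hosts; `peers_with_established_sa`, `missing_peers`, `ALL_HOSTS` and `DOCKER4_HEADERS` are illustrative names, not part of Rancher or strongSwan:

```python
import re

# Matches IKE SA header lines such as:
#   conn-10.202.33.34: #168, ESTABLISHED, IKEv2, ...
IKE_SA_RE = re.compile(r"^conn-(\d+\.\d+\.\d+\.\d+): #\d+, (\w+),")

# Assumption: the nine host IPs that appear in the listings in this report.
ALL_HOSTS = [
    "10.202.31.33", "10.202.31.34", "10.202.31.35",
    "10.202.32.33", "10.202.32.34", "10.202.32.35",
    "10.202.33.32", "10.202.33.34", "10.202.33.35",
]

# The IKE SA header lines copied from the docker-4 listing.
DOCKER4_HEADERS = """\
conn-10.202.33.34: #168, ESTABLISHED, IKEv2, 32e32de9712e4445_i* 13c7f525f2fc670c_r
conn-10.202.31.35: #167, ESTABLISHED, IKEv2, 0a9215e8871dc8e9_i 1c4f245789d51a34_r*
conn-10.202.32.34: #166, ESTABLISHED, IKEv2, e7aab62d0b46470f_i* 0cb9689f39f7cef5_r
conn-10.202.33.34: #165, ESTABLISHED, IKEv2, 263cd3072afeca12_i 5b8345ac8f2240bf_r*
conn-10.202.31.34: #164, ESTABLISHED, IKEv2, 39b54a424066b064_i d863d994ff6352a9_r*
conn-10.202.32.33: #163, ESTABLISHED, IKEv2, fd7693df9350b3ce_i* 3fae8fcac640967a_r
conn-10.202.32.35: #162, ESTABLISHED, IKEv2, 56ecebece644632e_i 1821928814b27f44_r*
"""


def peers_with_established_sa(listing):
    """Return the peer host IPs that have at least one ESTABLISHED IKE SA."""
    peers = set()
    for line in listing.splitlines():
        match = IKE_SA_RE.match(line.strip())
        if match and match.group(2) == "ESTABLISHED":
            peers.add(match.group(1))
    return peers


def missing_peers(listing, all_hosts, local_host):
    """Return hosts (other than local_host) with no ESTABLISHED IKE SA."""
    expected = set(all_hosts) - {local_host}
    return sorted(expected - peers_with_established_sa(listing))


if __name__ == "__main__":
    # 10.202.31.33 is docker-4's own address in this report.
    print(missing_peers(DOCKER4_HEADERS, ALL_HOSTS, "10.202.31.33"))
```

With the header lines from the docker-4 listing, this reports 10.202.33.32 (docker-3) and 10.202.33.35 as peers without an IKE SA.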

On docker-4 I found this in the logs and it seems to be repeating every few seconds.

16[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
12[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
05[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
05[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
08[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
08[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
15[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
15[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
14[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
14[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
09[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
12[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
time="2017-09-12T15:57:23Z" level=info msg="Metadata OnChange received, version: 121231-4f3c5c96fb170da7fa8781d4ac55192c"
time="2017-09-12T15:57:24Z" level=error msg="couldn't find subnetPrefixSize in network ipam config"
time="2017-09-12T15:57:24Z" level=info msg=Reconfiguring
time="2017-09-12T15:57:24Z" level=info msg="Deleted policy: {Dst: 10.42.0.0/16, Src: 10.42.62.66/32, Proto: 0, DstPort: 0, SrcPort: 0, Dir: dir in, Priority: 10000, Index: 120448, Mark: <nil>, Tmpls: [{Dst: 10.42.90.53, Src: 10.202.32.33, Proto: esp, Mode: tunnel, Spi: 0x0, Reqid: 0x4d2}]}"
time="2017-09-12T15:57:24Z" level=info msg="Deleted policy: {Dst: 10.42.62.66/32, Src: 10.42.0.0/16, Proto: 0, DstPort: 0, SrcPort: 0, Dir: dir out, Priority: 10000, Index: 120441, Mark: <nil>, Tmpls: [{Dst: 10.202.32.33, Src: 10.42.90.53, Proto: esp, Mode: tunnel, Spi: 0x0, Reqid: 0x4d2}]}"
time="2017-09-12T15:57:24Z" level=info msg="Deleted policy: {Dst: 10.42.0.0/16, Src: 10.42.62.66/32, Proto: 0, DstPort: 0, SrcPort: 0, Dir: dir fwd, Priority: 10000, Index: 120434, Mark: <nil>, Tmpls: [{Dst: 10.42.90.53, Src: 10.202.32.33, Proto: esp, Mode: tunnel, Spi: 0x0, Reqid: 0x4d2}]}"
time="2017-09-12T15:57:24Z" level=info msg="Added policy: {Dst: 10.42.47.55/32, Src: 10.42.0.0/16, Proto: 0, DstPort: 0, SrcPort: 0, Dir: dir out, Priority: 10000, Index: 0, Mark: <nil>, Tmpls: [{Dst: 10.202.33.35, Src: 10.42.90.53, Proto: esp, Mode: tunnel, Spi: 0x0, Reqid: 0x4d2}]}"
time="2017-09-12T15:57:24Z" level=info msg="Added policy: {Dst: 10.42.0.0/16, Src: 10.42.47.55/32, Proto: 0, DstPort: 0, SrcPort: 0, Dir: dir in, Priority: 10000, Index: 0, Mark: <nil>, Tmpls: [{Dst: 10.42.90.53, Src: 10.202.33.35, Proto: esp, Mode: tunnel, Spi: 0x0, Reqid: 0x4d2}]}"
time="2017-09-12T15:57:24Z" level=info msg="Added policy: {Dst: 10.42.0.0/16, Src: 10.42.47.55/32, Proto: 0, DstPort: 0, SrcPort: 0, Dir: dir fwd, Priority: 10000, Index: 0, Mark: <nil>, Tmpls: [{Dst: 10.42.90.53, Src: 10.202.33.35, Proto: esp, Mode: tunnel, Spi: 0x0, Reqid: 0x4d2}]}"
time="2017-09-12T15:57:24Z" level=info msg="Metadata OnChange received, version: 121233-4f3c5c96fb170da7fa8781d4ac55192c"
time="2017-09-12T15:57:25Z" level=error msg="couldn't find subnetPrefixSize in network ipam config"
time="2017-09-12T15:57:25Z" level=info msg=Reconfiguring
time="2017-09-12T15:57:26Z" level=info msg="Metadata OnChange received, version: 121234-4f3c5c96fb170da7fa8781d4ac55192c"
time="2017-09-12T15:57:26Z" level=error msg="couldn't find subnetPrefixSize in network ipam config"
time="2017-09-12T15:57:26Z" level=info msg=Reconfiguring
08[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
08[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
time="2017-09-12T15:57:28Z" level=info msg="Metadata OnChange received, version: 121235-4f3c5c96fb170da7fa8781d4ac55192c"
time="2017-09-12T15:57:29Z" level=error msg="couldn't find subnetPrefixSize in network ipam config"
time="2017-09-12T15:57:29Z" level=info msg=Reconfiguring
time="2017-09-12T15:57:30Z" level=info msg="Metadata OnChange received, version: 121237-4f3c5c96fb170da7fa8781d4ac55192c"
time="2017-09-12T15:57:30Z" level=error msg="couldn't find subnetPrefixSize in network ipam config"
time="2017-09-12T15:57:30Z" level=info msg=Reconfiguring
time="2017-09-12T15:57:31Z" level=info msg="Metadata OnChange received, version: 121241-4f3c5c96fb170da7fa8781d4ac55192c"
time="2017-09-12T15:57:32Z" level=error msg="couldn't find subnetPrefixSize in network ipam config"
time="2017-09-12T15:57:32Z" level=info msg=Reconfiguring
time="2017-09-12T15:57:36Z" level=info msg="Metadata OnChange received, version: 121242-4f3c5c96fb170da7fa8781d4ac55192c"
time="2017-09-12T15:57:36Z" level=error msg="couldn't find subnetPrefixSize in network ipam config"
time="2017-09-12T15:57:36Z" level=info msg=Reconfiguring
14[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
14[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
08[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
08[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
05[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
05[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
07[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
09[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
10[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
10[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
05[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
06[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
06[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
05[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
12[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
12[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
10[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
10[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
05[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
05[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
11[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
11[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
14[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
14[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
16[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
16[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
06[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
06[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
09[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
09[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
08[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
08[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
12[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
12[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
13[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
13[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
06[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
06[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete
15[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
15[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete




14[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.33.32
14[JOB] CHILD_SA ESP/0x00000000/10.202.33.32 not found for delete

Note that 10.202.33.32 is the IP of docker-3.

On docker-3 I see thousands of these entries:

16[JOB] CHILD_SA ESP/0x00000000/10.202.31.33 not found for delete
06[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.31.33
06[JOB] CHILD_SA ESP/0x00000000/10.202.31.33 not found for delete
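
To see which peers these messages point at, the log lines can be tallied per address. A minimal sketch, assuming the log format shown in the excerpts above; `count_failed_deletes` and `DOCKER3_SAMPLE` are illustrative names:

```python
import re
from collections import Counter

# Matches charon lines such as:
#   06[JOB] CHILD_SA ESP/0x00000000/10.202.31.33 not found for delete
NOT_FOUND_RE = re.compile(
    r"\[JOB\] CHILD_SA ESP/0x[0-9a-fA-F]+/(\d+\.\d+\.\d+\.\d+) not found for delete"
)

# The three log lines quoted from docker-3 in this report.
DOCKER3_SAMPLE = """\
16[JOB] CHILD_SA ESP/0x00000000/10.202.31.33 not found for delete
06[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.202.31.33
06[JOB] CHILD_SA ESP/0x00000000/10.202.31.33 not found for delete
"""


def count_failed_deletes(log_text):
    """Count 'not found for delete' messages per peer IP."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NOT_FOUND_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for peer, count in count_failed_deletes(DOCKER3_SAMPLE).most_common():
        print(peer, count)
```

Lines that do not match the pattern, such as the `[KNL]` lines and the `time=...` entries, are ignored.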

Note that while investigating further we seem to have discovered more hosts within this cluster that cannot communicate with each other. They display the same symptoms as above.

This is our QA environment. We checked multiple production environments and could not find instances of this happening outside of QA, even though production is on all of the same versions. We did upgrade 2 days ago and didn't immediately notice any issues, so I can't say for sure if it was broken from

Restarting the ipsec container on each host seemed to resolve the issue, but we are unsure what triggered it and why it didn't or wouldn't recover on its own.

@cjellick

@leodotcloud do you want to take a look at this?

@leodotcloud self-assigned this Sep 19, 2017
@leodotcloud
Collaborator

@chkelly Can you please collect logs using: https://github.com/rancher/rancher-logs-collector and share them with me? (https://slack.rancher.io)

@tholcman

Any progress here? We are probably facing the same issue.

@superseb
Contributor

@leodotcloud Can you confirm this will be fixed in #9971?

@superseb added the kind/bug label Oct 17, 2017
@leodotcloud
Collaborator

@tholcman please file a new issue with relevant logs using the logs collector.
@chkelly Since you are not hitting the issue anymore I am closing this one. Please feel free to file a new one if you are still experiencing problems.

#9971 will add retry logic for deleted SAs.
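
The change in #9971 is not shown in this thread. Purely as a generic illustration of what retry logic can look like, the sketch below retries a caller-supplied action with exponential backoff. `retry_with_backoff`, its parameters and the delay values are hypothetical and are not taken from rancher/net or strongSwan:

```python
import time


def retry_with_backoff(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. The delay doubles after each failed attempt.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            sleep(delay)
            delay *= 2
    return None


if __name__ == "__main__":
    # Hypothetical action: fails twice, then succeeds.
    results = iter([False, False, True])
    print(retry_with_backoff(lambda: next(results), base_delay=0.01))
```

The `sleep` parameter is injectable so that the backoff schedule can be checked without actually waiting.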

@leodotcloud added this to the v1.6 - October 2017 milestone Oct 21, 2017
@AndreSteenbergen

I am on Docker 1.6.10. rancher/net is v0.11.9. I see a lot of failed multi-host communication. I would like to upgrade to 0.13.2 if that fixes the communication, but the Rancher UI tells me that ipsec is up to date. Is it possible to upgrade rancher/net to a later version?

@superseb
Contributor

superseb commented Nov 8, 2017

In a supported/released version, no. It will be in v1.6.11, which should be ready soon.

If you have a playground or test environment and want to check whether your case is solved, you can try one of the release candidates or manually upgrade the service to the new version of the container. None of this is supported (because it is unreleased and untested) and it can break during future upgrades, so don't do this in production or any other important environment.

@AndreSteenbergen

Thanks, will wait for 1.6.11.

7 participants