🐛 BUG: host won't reconnect to lighthouse successfully behind an IP-changed NAT #889
Can you share the lighthouse logs? What should be happening in this case is: when the host's NATed IP address changes, traffic that it sends to the lighthouse carries the new IP address. The lighthouse receives these packets, authenticates them, and updates its discovered IP address for that peer. That doesn't seem to be happening in your case, so something interesting is going on with your lighthouse. I'm hoping the logs give a clue.
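For illustration, the roaming behavior described above can be sketched as a tiny discovered-address table. This is not Nebula's actual code; the names `LighthouseMap` and `on_authenticated_packet` are hypothetical:

```python
class LighthouseMap:
    """Toy model of a lighthouse's discovered-address table.

    When an authenticated packet arrives from a peer, the lighthouse
    records the packet's source IP:port as that peer's current address.
    If the peer's NAT mapping changes, the next authenticated packet
    overwrites the stale entry.
    """

    def __init__(self):
        self.discovered = {}  # vpn_ip -> (public_ip, public_port)

    def on_authenticated_packet(self, vpn_ip, src_addr):
        # Record the source address; report whether the peer roamed.
        changed = self.discovered.get(vpn_ip) != src_addr
        self.discovered[vpn_ip] = src_addr
        return changed
```

In this toy model the bug report's scenario is exactly the case where the source address the lighthouse sees never changes (same IP, and the NAT happens to reuse the same port), so no update is triggered.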
Thanks for your reply. I've checked the lighthouse logs and it seems that the host's NATed public IP did not change.

Jun 02 10:33:48 ICL nebula[349]: level=info msg="Handshake message received" certName=SY fingerprint=9976ea029ee4b5a0916f79aefa760154847eac3448af92cb7d314ca66f195cff handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2307497560 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=0 udpAddr="ip1:36740" vpnIp=192.168.100.2
Jun 02 10:33:48 ICL nebula[349]: level=info msg="Handshake message sent" certName=SY fingerprint=9976ea029ee4b5a0916f79aefa760154847eac3448af92cb7d314ca66f195cff handshake="map[stage:2 style:ix_psk0]" initiatorIndex=2307497560 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=127293448 sentCachedPackets=0 udpAddr="ip1:36740" vpnIp=192.168.100.2
Jun 03 03:34:44 ICL nebula[349]: level=info msg="Tunnel status" certName=SY localIndex=127293448 remoteIndex=2307497560 tunnelCheck="map[method:active state:dead]" vpnIp=192.168.100.2
Jun 03 09:32:59 ICL nebula[349]: level=info msg="Handshake message received" certName=SY fingerprint=9976ea029ee4b5a0916f79aefa760154847eac3448af92cb7d314ca66f195cff handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1175697647 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=0 udpAddr="ip1:44710" vpnIp=192.168.100.2
Jun 03 09:32:59 ICL nebula[349]: level=info msg="Handshake message sent" certName=SY fingerprint=9976ea029ee4b5a0916f79aefa760154847eac3448af92cb7d314ca66f195cff handshake="map[stage:2 style:ix_psk0]" initiatorIndex=1175697647 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=3925596160 sentCachedPackets=0 udpAddr="ip1:44710" vpnIp=192.168.100.2

However, here's another clue: the router (the NAT device) runs a cron job that restarts the WAN. This time the WAN IP happened to stay the same (ip1), but the NAT table must have been flushed.

Jun 3 03:30:00 cron: restart wan
Jun 3 03:30:00 K2P: Perform WAN manual reconnect
Jun 3 03:30:00 pppd[23884]: Terminating on signal 15
Jun 3 03:30:00 pppd[23884]: Connect time 1440.0 minutes.
Jun 3 03:30:00 pppd[23884]: Sent 3035167251 bytes, received 2273066402 bytes.
Jun 3 03:30:00 K2P: WAN down (ppp0)
Jun 3 03:30:00 PPPoE: Disconnected
Jun 3 03:30:00 pppd[23884]: Connection terminated.
Jun 3 03:30:00 pppd[23884]: Sent PADT
Jun 3 03:30:00 pppd[23884]: Exit.
Jun 3 03:30:01 inadyn[23926]: STOP command received, exiting.
Jun 3 03:30:01 kernel: eth3: ===> VirtualIF_close
Jun 3 03:30:01 kernel: eth3: ===> VirtualIF_open
Jun 3 03:30:01 K2P: Hardware NAT/Routing: Disabled
Jun 3 03:30:01 dnsmasq[19617]: read /etc/hosts - 3 addresses
Jun 3 03:30:01 dnsmasq[19617]: read /etc/storage/dnsmasq/hosts - 0 addresses
Jun 3 03:30:01 dnsmasq-dhcp[19617]: read /etc/dnsmasq/dhcp/dhcp-hosts.rc
Jun 3 03:30:01 dnsmasq-dhcp[19617]: read /etc/storage/dnsmasq/dhcp.conf
Jun 3 03:30:01 pppd[27602]: Plugin rp-pppoe.so loaded.
Jun 3 03:30:01 pppd[27602]: RP-PPPoE plugin version 3.12 compiled against pppd 2.4.7
Jun 3 03:30:01 pppd[27603]: pppd 2.4.7 started by admin, uid 0
Jun 3 03:30:01 pppd[27603]: PPP session is 51130 (0xc7ba)
Jun 3 03:30:01 pppd[27603]: Connected to 4c:09:b4:fc:49:10 via interface eth3
Jun 3 03:30:01 pppd[27603]: Using interface ppp0
Jun 3 03:30:01 pppd[27603]: Connect: ppp0 <--> eth3
Jun 3 03:30:01 pppd[27603]: Remote message: 0;User(07199747900) Authenticate OK, Request Accept by hb.cn
Jun 3 03:30:01 pppd[27603]: PAP authentication succeeded
Jun 3 03:30:01 pppd[27603]: peer from calling number 4C:09:B4:FC:49:10 authorized
Jun 3 03:30:01 pppd[27603]: local IP address ip1
Jun 3 03:30:01 K2P: WAN up (ppp0)

After that, I double-checked the other logs and you are right: when the NATed public IP changed to ip2 after another cron run, it reconnected successfully.

Jun 05 03:31:46 ICL nebula[349]: level=info msg="Handshake message received" certName=SY fingerprint=9976ea029ee4b5a0916f79aefa760154847eac3448af92cb7d314ca66f195cff handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1272481155 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=0 udpAddr="ip2:44710" vpnIp=192.168.100.2
Jun 05 03:31:46 ICL nebula[349]: level=info msg="Taking new handshake" certName=SY localIndex=3159396583 remoteIndex=575398791 vpnIp=192.168.100.2
Jun 05 03:31:46 ICL nebula[349]: level=info msg="Handshake message sent" certName=SY fingerprint=9976ea029ee4b5a0916f79aefa760154847eac3448af92cb7d314ca66f195cff handshake="map[stage:2 style:ix_psk0]" initiatorIndex=1272481155 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=4021870838 sentCachedPackets=0 udpAddr="ip2:44710" vpnIp=192.168.100.2
Jun 05 03:31:58 ICL nebula[349]: level=info msg="Tunnel status" certName=SY localIndex=3159396583 remoteIndex=575398791 tunnelCheck="map[method:active state:dead]" vpnIp=192.168.100.2

So, when the host is behind a NAT device that keeps its public IP but flushes its NAT table, this weird bug happens.
In the broken flow, it appears that Nebula is still sending its UDP packets to the lighthouse (from the same NAT IP, but from a different NAT port number), and the lighthouse is responding to the new IP/port. However, the Nebula client never receives those lighthouse replies. I'm suspicious that the NAT, after that cron job runs, is no longer routing return packets through to the client. If this is what's happening, it would explain why Nebula is unable to reconnect; the solution would be to fix the NAT device so that return traffic is routed correctly. Further debugging steps might include running tcpdump on the Nebula client behind the NAT to verify that return Nebula traffic is not arriving as expected, and then proceeding to debug the NAT device itself.
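Alongside tcpdump, a quick generic return-path check can be scripted: send a UDP datagram and see whether any reply comes back through the NAT. This is a hedged sketch, not a Nebula tool (a lighthouse will not echo arbitrary payloads, so this is only meaningful against an endpoint that is expected to reply, e.g. a test echo server you control on the far side):

```python
import socket

def udp_return_path_ok(host, port, payload=b"probe", timeout=2.0):
    """Send one UDP datagram and report whether any reply arrives.

    A reply proves the NAT forwards return traffic for this source
    port. A timeout does NOT prove the return path is broken (the far
    end may simply not answer), but repeated timeouts against a server
    known to reply point at the NAT.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            s.recvfrom(2048)
            return True
        except (socket.timeout, ConnectionRefusedError):
            return False
```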
Thanks a lot! It's very helpful.
I've used tcpdump on the nebula client and the lighthouse. It shows the nebula client sending packets to the lighthouse from the same port, while the lighthouse never receives them.

# client tcpdump logs
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:19:38.415469 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:38.600421 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:38.901614 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:39.301481 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:39.901343 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:40.500644 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:41.201538 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:42.001439 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:42.901285 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
09:19:43.901181 IP 192.168.2.2.54810 > lighthouse_ip.4242: UDP, length 232
# lighthouse tcpdump logs
# nothing received

After restarting the nebula client, things went back to normal.

# client tcpdump logs
10:45:27.457736 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 232
10:45:27.483897 IP lighthouse_ip.4242 > 192.168.2.2.51713: UDP, length 275
10:45:27.489602 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 66
10:45:27.514848 IP lighthouse_ip.4242 > 192.168.2.2.51713: UDP, length 42
10:45:38.658354 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:45:44.158125 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:45:49.657637 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:45:55.158556 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:46:01.157115 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:46:07.157549 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:46:12.658183 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:46:18.157539 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:46:23.657866 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:46:27.158744 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 66
10:46:27.183994 IP lighthouse_ip.4242 > 192.168.2.2.51713: UDP, length 42
10:46:35.157989 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
10:46:40.657130 IP 192.168.2.2.51713 > lighthouse_ip.4242: UDP, length 1
# lighthouse tcpdump logs
10:45:25.790125 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 232
10:45:25.790881 IP lighthouse_ip.4242 > client_wanip.51713: UDP, length 275
10:45:25.821700 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 66
10:45:25.821880 IP lighthouse_ip.4242 > client_wanip.51713: UDP, length 42
10:45:36.979362 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:45:42.473716 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:45:47.967715 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:45:53.463544 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:45:59.455749 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:46:05.450467 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:46:10.946000 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:46:16.439597 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:46:21.934627 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:46:25.432274 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 66
10:46:25.432437 IP lighthouse_ip.4242 > client_wanip.51713: UDP, length 42
10:46:33.423356 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1
10:46:38.916960 IP client_wanip.51713 > lighthouse_ip.4242: UDP, length 1

It's true that a problematic NAT device can cause this issue. It may be better to handle such corner cases, since iptables NAT (as in my case) is commonly used.
@brad-defined Hi, sorry to bother you again. I've noticed that n2n (a legacy p2p VPN similar to nebula) handles this complex NAT situation. I think it is a necessary feature.
Possible dupe of #515? |
Thank you @johnmaguire. It seems a little different: in that scenario the client's NATed external IP changed and the issue happens randomly, while in my scenario the issue happens when the IP does not change, and after restarting nebula everything comes back to normal. I believe the issue comes from the incoming direction of the NAT, as it is not able to forward packets to the nebula client. (Maybe it loses part of its table after the restart and does not fully rebuild it when the WAN IP is unchanged?) However, this firewall is not under my control, and I think having nebula reload the connection would be very helpful.
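The reload idea could be approximated at the application level: after N consecutive handshake timeouts, close and re-create the UDP socket, which makes the OS pick a fresh ephemeral source port and forces the NAT to build a new mapping. A minimal sketch under those assumptions (the names `RebindingSender` and `record_timeout` are hypothetical; Nebula itself does not expose this hook):

```python
import socket

class RebindingSender:
    """Rebind the UDP socket after too many consecutive timeouts.

    Re-creating the socket yields a new ephemeral source port, so a
    NAT whose old mapping was flushed must create a fresh one for the
    next outbound packet.
    """

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.sock = self._new_socket()

    def _new_socket(self):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("0.0.0.0", 0))  # let the OS pick an ephemeral port
        return s

    def record_timeout(self):
        # Call this each time an expected reply fails to arrive.
        self.failures += 1
        if self.failures >= self.max_failures:
            self.sock.close()
            self.sock = self._new_socket()
            self.failures = 0
            return True  # socket was rebound
        return False
```

Note the trade-off: rebinding changes the source port even when the old mapping was still valid, which costs a handshake round trip, so the failure threshold should be high enough to avoid churning on transient packet loss.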
What version of nebula are you using?
1.7.2
What operating system are you using?
Linux
Describe the Bug
Maybe there should be a counter or trigger that reloads the service when retries fail.
Logs from affected hosts
Config files from affected hosts