
Question: NAT Setup #33

Closed · jatsrt opened this issue Nov 23, 2019 · 86 comments
Labels: bug (Something isn't working)

@jatsrt commented Nov 23, 2019

I seem to be missing something important. If I set up a mesh of hosts that all have direct public IP addresses, it works fine. However, if I have a network with a lighthouse (public IP) and all nodes behind NAT, the nodes will not connect to each other. The lighthouse is able to communicate with all hosts, but hosts are not able to communicate with each other.

Watching the logs, I see connection attempts to both the NAT public IP and the private IPs.

I have enabled punchy and punch back, but it does not seem to help.

Hope it is something simple?

@jatsrt (Author) commented Nov 23, 2019

Also, to note: in this setup all nodes are behind different NATs on different networks. It's hub and spoke, with the hub being the lighthouse and the spokes going to hosts on different networks.

@rawdigits (Collaborator) commented Nov 23, 2019

My best guess (because I just messed this up in a live demo), is that am_lighthouse may be set to "true" on the individual nodes.

Either way, can you post your lighthouse config and one of your node configs?

(feel free to replace any sensitive IP/config bits, just put consistent placeholders in their place)

@nfam commented Nov 23, 2019

Hi, I have the same issue. My lighthouse is on a DigitalOcean droplet with a public IP. My MacBook and Linux laptop at home are on the same network, both connected to the lighthouse. I can ping the lighthouse from both laptops, but I cannot ping from one laptop to the other.

Lighthouse config

pki:
  ca: /data/cert/nebula/ca.crt
  cert: /data/cert/nebula/lighthouse.crt
  key: /data/cert/nebula/lighthouse.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: true
  interval: 60
  hosts:
listen:
  host: 0.0.0.0
  port: 4242
punchy: true
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop

Macbook config

pki:
  ca: /Volumes/code/cert/nebula/ca.crt
  cert: /Volumes/code/cert/nebula/mba.crt
  key: /Volumes/code/cert/nebula/mba.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "LIGHTHOUSE_PUBLIC_IP"
punchy: true
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: debug
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop

Linux laptop config

pki:
  ca: /data/cert/nebula/ca.crt
  cert: /data/cert/nebula/server.crt
  key: /data/cert/nebula/server.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "LIGHTHOUSE_PUBLIC_IP"
punchy: true
listen:
  host: 0.0.0.0
  port: 4242
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop

@rawdigits (Collaborator)

@nfam thanks for sharing the config. My next best guess is that the NAT isn't reflecting and that, for some reason, the nodes also aren't finding each other locally.

Try setting the local_range config option on the two laptops, which can give them a hint about the local network range to use for establishing a direct tunnel.
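
For reference, local_range is a single top-level option in the config; a minimal sketch, where the range is a placeholder for the actual home LAN:

# hint about the local network, so peers behind the same NAT prefer their LAN addresses
local_range: "192.168.1.0/24"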

@jatsrt (Author) commented Nov 23, 2019

@nfam similar setup. Public lighthouse on DigitalOcean, laptop on a home NAT, and server in AWS behind a NAT. Local and AWS are using different private ranges (though overlap should be handled).

@nfam commented Nov 23, 2019

@rawdigits setting local_range does not help.
I stopped nebula on both laptops, set the log level on the lighthouse to debug, cleared the log, and restarted the lighthouse (with no nodes connected). The following is the log I got:

time="2019-11-23T20:05:18Z" level=info msg="Main HostMap created" network=192.168.100.1/24 preferredRanges="[]"
time="2019-11-23T20:05:18Z" level=info msg="UDP hole punching enabled"
time="2019-11-23T20:05:18Z" level=info msg="Nebula interface is active" build=1.0.0 interface=neb0 network=192.168.100.1/24
time="2019-11-23T20:05:18Z" level=debug msg="Error while validating outbound packet: packet is not ipv4, type: 6" packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 183 226 137 252 10 196 21 15 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 27 133 0 0 0 0]"

@jatsrt (Author) commented Nov 23, 2019

My Config:
nebula-cert sign -name "lighthouse" -ip "192.168.100.1/24"
nebula-cert sign -name "laptop" -ip "192.168.100.101/24" -groups "laptop"
nebula-cert sign -name "server" -ip "192.168.100.201/24" -groups "server"

Lighthouse:

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/lighthouse.crt
  key: /etc/nebula/lighthouse.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: true
  interval: 60

listen:
  host: 0.0.0.0
  port: 4242

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

Laptop:

pki:
  # The CAs that are accepted by this node. Must contain one or more certificates created by 'nebula-cert ca'
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/laptop.crt
  key: /etc/nebula/laptop.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"

listen:
  host: 0.0.0.0
  port: 0

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

Server:

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/server.crt
  key: /etc/nebula/server.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"

listen:
  host: 0.0.0.0
  port: 0

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

With this setup, both the server and the laptop can ping the lighthouse, and the lighthouse can ping the server and the laptop, but the laptop cannot ping the server and the server cannot ping the laptop.

I get messages such as this as it's trying to make the connection:

INFO[0006] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0007] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0009] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0011] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0012] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0014] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0016] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201

@jatsrt (Author) commented Nov 23, 2019

@nfam similar error; not sure it's the problem:

DEBU[0066] Error while validating outbound packet: packet is not ipv4, type: 6 packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 139 176 20 9 146 65 14 250 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 60 66 0 0 0 0]"

@rawdigits (Collaborator) commented Nov 23, 2019

@jatsrt

The "Error while validating outbound packet" error can mostly be ignored; it's just some types of packets nebula doesn't support bouncing off (here, type 6 means the packet is IPv6).

As far as the handshakes go, for some reason hole punching isn't working. A few things to try (see the sketch after this list):

  1. Add punch_back: true on the "server" and "laptop" nodes.
  2. Explicitly allow all UDP in to the "server" node from the internet (via AWS security groups, just as a test).
  3. Verify iptables isn't blocking anything.
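
For item 1, both options are top-level booleans in this version of the config; a minimal sketch:

# enable hole punching, and punch back toward hosts that punch at us
punchy: true
punch_back: true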

Also, it appears the logs with the handshake messages are from the laptop? If so, can you also share nebula logs from the server as it tries to reach the laptop?

Thanks!

@rawdigits (Collaborator)

Aha, @nfam I think I spotted the config problem.

instead of

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "LIGHTHOUSE_PUBLIC_IP"

it should be

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "192.168.100.1"

@rawdigits (Collaborator)

Adding #40 to cover the accidental misconfiguration noted above.

@nfam commented Nov 23, 2019

@rawdigits yes, it is. Now both laptops can ping each other.
Thanks!

@jatsrt (Author) commented Nov 24, 2019

@rawdigits

  1. Added punch_back on "server" and "laptop".
  2. The security group for that node is currently wide open for all protocols.
  3. No iptables on any of these nodes; they are base Ubuntu servers for testing.

Server log:

time="2019-11-24T00:25:21Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:22Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:22Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:23Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:24Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:25Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:26Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:27Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:28Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:30Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101

@jatsrt (Author) commented Nov 24, 2019

So, I tried a few more setups, and it comes down to this: if the two hosts trying to communicate with each other are on different networks and both behind NAT, it will not work.
If the lighthouse does not facilitate the communication/tunneling, this would make sense, but is it meant to be a limitation?

@nbrownus (Collaborator)

The dual-NAT scenario is a bit tricky; there is possibly room for improvement from nebula's perspective there. Do you have details on the type of NATs you are dealing with?

@jatsrt (Author) commented Nov 24, 2019

@nbrownus nothing crazy. I've tried multiple AWS VPC NAT gateways with hosts behind them, and they cannot connect. I've also tried a "home" NAT (a Google WiFi router), with no success.

From a networking perspective, I get why it's "tricky"; I was hoping there was some trick nebula was doing.

@nbrownus (Collaborator)

@rawdigits can speak to the punching better than I can. If you are having problems in AWS then we can get a test running and sort out the issues.

@jatsrt (Author) commented Nov 24, 2019

Yeah, so all my tests have had at least one host behind an AWS NAT Gateway

@nbrownus added the bug (Something isn't working) label Nov 24, 2019
@rawdigits (Collaborator)

A long shot, but one more thing to try until I set up an AWS NAT GW:
set the UDP port on all nodes to 4242 and let NAT remap it. One ISP I've dealt with blocks the random ephemeral UDP ports above 32,000, presumably because they think every high UDP port is BitTorrent.

Probably won't work, but it's easy to test.
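
In config terms, that means pinning the listen port on the laptop and server rather than using port: 0 (the same listen block the lighthouse already uses):

listen:
  host: 0.0.0.0
  port: 4242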

@jatsrt (Author) commented Nov 24, 2019

@rawdigits same issue

Network combination:
Lighthouse - DigitalOcean NYC3 - public IP
Server - AWS Oregon - private VPC with AWS NAT gateway (172.31.0.0/16)
Laptop - Verizon FiOS with Google WiFi router NAT (192.168.1.0/24)
Server2 (added later to test) - AWS Ohio - private VPC with AWS NAT gateway (10.200.200.0/24)

I added a second server in a different VPC on AWS to remove the FiOS variable, and had the same results with server and server2 trying to communicate:

INFO[0065] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0066] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
INFO[0067] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0069] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
INFO[0071] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0072] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201

@rawdigits (Collaborator)

@jatsrt I'll stand up a testbed this week to explore what may be the cause of the issue. Thanks!

@iamid0 commented Nov 27, 2019

(Quoting @jatsrt's earlier configs and handshake logs in full.)

I have got the same situation:
node_A <----> lighthouse: OK
node_B <----> lighthouse: OK
node_A <----> node_B: does not work; they cannot ping each other.

But I found that node_A and node_B can communicate with each other ONLY if both are connected to the same router, such as the same WiFi router.

PS: punch_back: true is set on both node_A and node_B.

No firewall on node_A, node_B, or the lighthouse.

@fireapp commented Nov 27, 2019

Hole punching is very difficult and random.

@spencerryan commented Nov 27, 2019

I also can't get nebula to work properly when both nodes are behind a typical NAT (technically PAT), regardless of any port pinning I do in the config. They happily connect to the lighthouse I have in AWS, but it seems like something isn't working properly. I've got punchy and punch_back enabled on everything, and it doesn't seem to help. I've tried setting the port on the nodes to 0, and also tried the same port the lighthouse is listening on.

The nodes have no issues connecting to each other over the MPLS, but we don't want that (for performance reasons).

Edit: To add a bit more detail, even Meraki's AutoVPN can't deal with this. In their situation the "hub" needs to be told its public IP and a fixed port that is open inbound. I'd be fine with that as an option, and it may be the only reliable one if both nodes are behind different NATs.

Another option I had considered: what if we could use the lighthouses to hairpin traffic? I'd much rather pay AWS for the bandwidth than have to deal with unfriendly NATs everywhere.

@rawdigits (Collaborator)

I did a bit more research, and it appears that the AWS NAT Gateway uses symmetric NAT, which isn't friendly to hole punching of any kind. NAT gateways also don't appear to support any type of port forwarding, so fixing this by statically assigning and forwarding a port doesn't appear to be an option.

A NAT instance would probably work, but I realize that's probably not a great option. One thing I recommend considering is giving instances a routable IP address but disallowing all inbound traffic. This wouldn't greatly change the security of your network, since you still aren't allowing any unsolicited packets to reach the hosts, but it would allow hole punching to work properly.

@spencerryan

I don't think NAT as such is the issue so much as PAT (port translation). Unfortunately, with PAT you can't predict what your public port will be, and hole punching becomes impossible if both ends are behind a similar PAT. I'm going to do some testing, but I think that as long as one of the two nodes has a 1:1 NAT (no port translation), a public IP directly on the node isn't a concern.

If I get particularly ambitious, I may attempt to whip up some code in the lighthouse to detect when one or both nodes are behind a PAT and throw a warning saying that this won't work out of the box.

@wadey (Member) commented Nov 28, 2019

If I get particularly ambitious I may attempt to whip up some code in lighthouse to detect when one/both nodes are behind a PAT and throw a warning saying that this won't work out of the box

I've thought about this before. You need at least two lighthouses, and I think it's best to implement it as a flag on the non-lighthouses (when you query the lighthouses for a host, if you get results with the same IP but different ports, then you know the remote is problematic).
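
For context, adding a second lighthouse is just a matter of listing both in each node's config; a sketch with placeholder addresses:

static_host_map:
  "192.168.100.1": ["LIGHTHOUSE1_PUBLIC_IP:4242"]
  "192.168.100.2": ["LIGHTHOUSE2_PUBLIC_IP:4242"]

lighthouse:
  am_lighthouse: false
  hosts:
    - "192.168.100.1"
    - "192.168.100.2"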

@spencerryan

I haven't dug into the handshake code, but if you include the source port in the handshake, the lighthouse can compare it to what it sees on the wire. If they differ, you know something in the middle is doing port translation.

@jocull commented Dec 8, 2019

Aha, @nfam I think I spotted the config problem.

instead of

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "LIGHTHOUSE_PUBLIC_IP"

it should be

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "192.168.100.1"

I bet this is also my issue... will test it soon. That section is confusing 😕

@tarrenj commented Jan 5, 2022

@schuft69 Are your nodes able to connect to the lighthouse? If so, you may just need to statically set a port for each extra node and then open those up on OPNsense.

@stilsch commented Jan 7, 2022

@schuft69 Are your nodes able to connect to the lighthouse? If so, you may just need to statically set a port for each extra node and then open those up on OPNsense.

-> Setting everything up with static ports + DynDNS is working quite well. <-

I was hoping to get rid of static ports with nebula (which I have now with WireGuard). The hole punching (from a lighthouse on a 1€ droplet at strato.de) is working neither on the FritzBox (where I have devices at my parents' home) nor at my home (OPNsense - maybe because disabling UDP port rewriting as described earlier is somehow not working; I'll have to ask in the OPNsense community).
So the main benefit is that I no longer need to bother with iptables (to secure the endpoints) as I did with WireGuard - which at least is also a win.
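
Sketched with placeholders, the static ports + DynDNS entries on the other nodes look something like this (hostname and port are placeholders, and this assumes DNS names are accepted in static_host_map, which nebula resolves when the config loads):

static_host_map:
  "192.168.100.201": ["myhome.dyndns.example:4242"]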

@tarrenj commented Jan 7, 2022

Glad you've got a workable solution!

I'm hearing from a lot of people that the NAT punching isn't as successful as I think the Nebula devs had expected. I remember reading a comment from one of them in an older issue/PR thread about being disappointed that many users aren't able to use IPv6, since it doesn't have any of these NAT issues. I really hope they'll add some support for partial mesh implementations soon, and update the readme to explain that it currently only supports 100% full mesh deployments.

@tcurdt commented Jan 20, 2022

In this workshop video it sounds like NAT-to-NAT traversal is supported, but here it sounds like NAT is still as messy as it always has been. What's the status of UPnP/NAT-PMP support?

@tcurdt commented Jan 27, 2022

I am really confused about what nebula supports or does not support with regard to NAT traversal. From reading the comments, it sounds like this:

Let's say I have a lighthouse and two networks behind NATs.

(network diagram: a lighthouse with a public IP; machine1 and printer1 behind NAT1; machine2 and printer2 behind NAT2)

I assume:

  • the nebula instance on machine1 allows network participants to reach printer1 (unsafe_routes)
  • the nebula instance on machine2 allows network participants to reach printer2 (unsafe_routes)
  • machine1 can reach machine2 (because port forwarding to machine2 is set up for NAT2)
  • machine2 can reach machine1 only with port forwarding also set up for NAT1
  • machine1 can reach printer2 (because port forwarding to machine2 is set up for NAT2)
  • machine2 can reach printer1 only with port forwarding also set up for NAT1
  • forwarding the port to a single nebula instance will make all nebula instances behind the NAT accessible
  • nebula does not support UPnP/NAT-PMP and requires manual port forwarding
  • dyndns is not required as the lighthouse knows about the external IPs of the NATs

Are these assumptions correct?

@tarrenj commented Jan 27, 2022

  • the nebula instance on machine1 allows network participants to reach printer1 (unsafe_routes)

Sort of, but not really. Machine1 needs the network the printer is on specified within its cert (the --subnets argument), and the unsafe_routes entry needs to be made at every OTHER nebula instance that you want to connect to printer1 (machine2 and the lighthouse).

  • machine1 can reach machine2 (because port forwarding to machine2 is setup for NAT2)
  • machine2 can reach machine1 only with port forwarding also setup for NAT1

Yes, that should be the case.

  • forwarding the port to a single nebula instance will make all nebula instances behind the NAT accessible

No, Nebula does not have a "proxy", "routing" or "connection hopping" mechanism built in.

  • nebula does not support UPnP/NAT-PMP and requires manual port forwarding

PMP is not supported (but there's a PR to add it!) and UPnP should work.

  • dyndns is not required as the lighthouse knows about the external IPs of the NATs

Correct

@tcurdt commented Jan 27, 2022

Thanks for the help, @tarrenj

the nebula instance on machine1 allows network participants to reach printer1 (unsafe_routes)

Sort of, but not really. Machine1 needs the network the printer is on specified within its cert (the --subnets argument), and the unsafe_routes entry needs to be made at every OTHER nebula instance that you want to connect to printer1 (machine2 and the lighthouse).

So with --subnets I'd pass in the network that is behind NAT2 for the cert of machine1.
So one would have to re-generate a cert to give access to another network.

Where I am still a bit lost is the "OTHER".
Why would the lighthouse reach the printer with unsafe_routes,
but machine1 needs to have the network as part of the cert?

And machine2 should have a local LAN connection to the printer on another interface.
Shouldn't it be able to reach the printer even without unsafe_routes?

forwarding the port to a single nebula instance will make all nebula instances behind the NAT accessible

No, Nebula does not have a "proxy", "routing" or "connection hopping" mechanism built in.

So that means I would have to open a port for every nebula instance?!

nebula does not support UPnP/NAT-PMP and requires manual port forwarding

PMP is not supported (but there's a PR to add it!) and UPnP should work.

Found it! #148

@tarrenj commented Jan 27, 2022

So with --subnets I'd pass in the network that is behind NAT2 for the cert of machine1. So one would have to re-generate a cert to give access to another network.

No, the opposite. You'd use the network printer1 is on when creating the cert for machine1, and the network printer2 is on when creating the cert for machine2. Certs are all about trust. When you generate a cert for machine1 with subnet n specified, you then have it signed by the CA (which all other nodes trust). This effectively tells all other nodes: "According to the CA (which you already trust), machine1 is allowed to relay traffic to network n." Generating and signing the machine1 cert with the --subnets n argument basically grants machine1 "permission" to route traffic to that unsafe network.
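
Following the nebula-cert commands earlier in this thread, that might look like the following, where 10.1.1.0/24 is a placeholder for printer1's LAN:

nebula-cert sign -name "machine1" -ip "192.168.100.101/24" -subnets "10.1.1.0/24"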

Where I am still a bit lost is the "OTHER". Why would the lighthouse reach the printer with unsafe_routes, but machine1 needs to have the network as part of the cert?

And machine2 should have a local LAN connection to the printer on another interface. Shouldn't it be able to reach the printer even without unsafe_routes?

Adding an unsafe_routes entry to the lighthouse is only required if the lighthouse needs to access the unsafe network.

The unsafe_routes entries tell the local node to accept traffic destined for network n and to send it to machine1 via the overlay network. Again, this goes back to trust: doing it the other way around (configuring unsafe_routes on the node that's doing the routing, the way I believe you expected it to work) would mean that my local node's configuration changes based on the actions of a remote node's admin. What would prevent them from simply saying "Get to the WAN through me!" and then MITMing all traffic from all nodes?
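
A minimal sketch of the corresponding entry on machine2's side, reusing the placeholder addresses from the signing example above:

tun:
  unsafe_routes:
    # printer1's LAN, which must be signed into machine1's cert via -subnets
    - route: 10.1.1.0/24
      via: 192.168.100.101   # machine1's nebula IP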

So that means I would have to open a port to every nebula instance?!

Nebula assumes that each node is able to establish a direct connection with each other node (using NAT hole punching through UPnP). Machine1 would not be able to access machine2 by connecting "through" the lighthouse, in your above example.

@tcurdt commented Jan 27, 2022

So to summarise: the machine1 cert would be signed for NAT1's network and the machine2 cert for NAT2's network - that defines their trust relationship as "exit nodes" into their LANs. And specifying unsafe_routes defines the routability of the traffic through the overlay. The unsafe_routes part I will figure out - I don't want to hijack the issue for these details.

I guess the really important information in the context of this issue is that every nebula instance must be directly reachable through the NAT - so it requires a punched/forwarded port. I didn't expect that. Thanks for clearing this up!

@shantivana commented Jun 21, 2022

Thanks for this conversation. I might have found another way after a bunch of trial and error. I had two laptops inside my regular home network, behind a NAT, that could not connect to a server on another network that is also behind a NAT. The server's network had a lighthouse with a perimeter firewall rule, as a lighthouse should; the laptops could ping the lighthouse, but they could not reach the server endpoint.

Solution/workaround that worked for me: on the laptops, create an entry in the config.yml for the server endpoint as though it were a lighthouse (even though it isn't actually a lighthouse), alongside the real lighthouse "hosts" entry in the config.yml. Put in the external network IP and port for the other endpoint, even if there is no perimeter firewall rule for it. In my case the lighthouse uses port 4242 and the server endpoint a different port (not sure if that is necessary).
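
Sketched with placeholder addresses, the added entries look like this (the second static_host_map entry is the non-lighthouse server endpoint, on its own port):

static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
  "192.168.100.201": ["SERVER_NETWORK_PUBLIC_IP:4243"]

lighthouse:
  am_lighthouse: false
  hosts:
    - "192.168.100.1"
    - "192.168.100.201"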

Note: I did NOT need to do anything with the unsafe_routes discussed here, and I was unable to make progress with the OPNsense outbound NAT rule; I did try to create that rule, since one of the firewalls is pfSense, but ended up removing it.

My theory is that it works because the server endpoint's external network has an actual lighthouse, so the laptop client knows how to reach that network; the laptop client associates the server's Nebula IP address with the external IP of the lighthouse on the same network. The actual lighthouse knows how to get to the server endpoint and provides the path once the Nebula connection is established.

Note: it's possible that some of my other troubleshooting left a temporary route that stuck, but I don't think so, because removing the workaround entry from the config.yml made it stop working again; it is therefore reproducible. Good luck!

@brad-defined (Collaborator) commented Jul 11, 2022

Nebula 1.6.0 is released with a Relay feature, to cover cases like symmetric NAT:
#678

Check out the example config to see how to configure a Nebula node to act as a relay, and how to configure other nodes to identify which relay can be used by peers for access.


(edit to provide some documentation of the feature)
In order to provide 100% connectivity between Nebula peers in all networks, you may now relay Nebula traffic through a third Nebula peer. I encourage everyone to try out this feature and let us know how it goes! The config options are included in the Nebula example config:

# EXPERIMENTAL: relay support for networks that can't establish direct connections.
relay:
  # Relays are a list of Nebula IP's that peers can use to relay packets to me.
  # IPs in this list must have am_relay set to true in their configs, otherwise
  # they will reject relay requests.
  #relays:
    #- 192.168.100.1
    #- <other Nebula VPN IPs of hosts used as relays to access me>
  # Set am_relay to true to permit other hosts to list my IP in their relays config. Default false.
  am_relay: false
  # Set use_relays to false to prevent this instance from attempting to establish connections through relays.
  # default true
  use_relays: true

For most personal users of Nebula, the Lighthouse is the ideal relay. To use Relays on your network, do the following (a minimal sketch of both configs follows this list):

  • Install Nebula 1.6.0 on all Nebula hosts in your network
  • Edit the config.yml of your lighthouse and set relay.am_relay: true
  • Edit the config.yml of your Nebula peers to specify the lighthouse’s Nebula IP as a relay by setting relay.relays: [<lighthouse Nebula IP>].
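
As a minimal sketch, the two sides look like this (192.168.100.1 is a placeholder for the lighthouse's Nebula IP):

# lighthouse config.yml
relay:
  am_relay: true

# peer config.yml
relay:
  relays:
    - 192.168.100.1
  use_relays: true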

Some rules around Relays:

  • Nebula will not act as a Relay unless configured to do so (relay.am_relay: true).
  • Relays do not have to be Lighthouses.
  • Like Lighthouses, Relay nodes should be deployed with a public internet IP and firewall rules that permit Nebula's UDP traffic inbound (default UDP port 4242).
  • Nebula config identifies which hosts may be used by peers as Relays for connectivity (relay.relays: [ip, ip, ip]). Each of the IPs specified must have relay.am_relay: true set in their configs. Note that you can specify more than one Relay, for high availability.
  • Nebula will not attempt to use a Relay to connect to a peer unless configured to do so (relay.use_relays: true).
  • You aren't limited to using a single Relay in your network. Each Nebula node can specify its own list of Relays for access. For instance, if you have some Nebula hosts in a private AWS VPC, you can set up a Relay host dedicated to enabling connectivity to the peers in that VPC.
  • You can't relay to a Relay. Meaning, hosts configured to act as a relay (with relay.am_relay: true set) may not specify other relays (relay.relays:) to be used for access.

@sfxworks commented Jul 12, 2022

Wow, I was running into this issue and this post appeared 15 hours ago! Thanks, Nebula team.

Can confirm it works! Home PC to a remote server (192.168.32.4), via a remote node with a public IP acting as a lighthouse (192.168.32.1):

[root@sam-manjaro ~]# ping 192.168.32.1
PING 192.168.32.1 (192.168.32.1) 56(84) bytes of data.
64 bytes from 192.168.32.1: icmp_seq=1 ttl=64 time=22.2 ms
64 bytes from 192.168.32.1: icmp_seq=2 ttl=64 time=21.3 ms
^C
--- 192.168.32.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 21.310/21.742/22.174/0.432 ms
[root@sam-manjaro ~]# ping 192.168.32.4
PING 192.168.32.4 (192.168.32.4) 56(84) bytes of data.
64 bytes from 192.168.32.4: icmp_seq=1 ttl=64 time=334 ms
64 bytes from 192.168.32.4: icmp_seq=2 ttl=64 time=23.1 ms
64 bytes from 192.168.32.4: icmp_seq=3 ttl=64 time=22.0 ms
^C
--- 192.168.32.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 22.030/126.251/333.623/146.634 ms

@sfxworks commented Jul 13, 2022

Something I've noticed in using this: I've had to lower the MTU a bit from the original 1300. Not sure if it's because one end of mine is doing 1:1 NAT or if it's relay-related. Other than that, no problems.

╭─ ~ ▓▒░──────────────────────────────────────────────────░▒▓ ✔  at 21:20:39 ─╮
╰─ ping 192.168.32.4 -s 1216                                                    ─╯
PING 192.168.32.4 (192.168.32.4) 1216(1244) bytes of data.
1224 bytes from 192.168.32.4: icmp_seq=1 ttl=64 time=26.6 ms
1224 bytes from 192.168.32.4: icmp_seq=2 ttl=64 time=26.0 ms
1224 bytes from 192.168.32.4: icmp_seq=3 ttl=64 time=25.2 ms
^C
--- 192.168.32.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 25.186/25.913/26.592/0.575 ms

╭─ ~ ▓▒░──────────────────────────────────────────────────░▒▓ ✔  at 21:20:43 ─╮
╰─ ping 192.168.32.4 -s 1217                                                    ─╯
PING 192.168.32.4 (192.168.32.4) 1217(1245) bytes of data.
^C
--- 192.168.32.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2039ms

@brad-defined (Collaborator)

@sfxworks thanks for the feedback! You're spot on - when relaying, Nebula sticks additional headers onto the packets, which will impact the MTU.

@noseshimself

Nebula 1.6.0 is released with a Relay feature, to cover cases like a Symmetric NAT.

There is still a special case needing attention (or yet another type of node) which I can't quite get my head wrapped around: gateways between two or more meshes. A server with several instances of nebula running on different addresses and/or ports could act as a relay node between them, permitting segmentation between the equivalent of VLANs. But that would probably also require a separate DNS service that can be shared among the meshes.

@sfxworks

Something I noticed when dealing with more MTU issues:

  mtu: 1200
  # Route based MTU overrides, you have known vpn ip paths that can support larger MTUs you can increase/decrease them here
  routes:
    #- mtu: 8800
    #  route: 10.0.0.0/16
  # Unsafe routes allows you to route traffic over nebula to non-nebula nodes
  # Unsafe routes should be avoided unless you have hosts/services that cannot run nebula
  # NOTE: The nebula certificate of the "via" node *MUST* have the "route" defined as a subnet in its certificate
  # `mtu` will default to tun mtu if this option is not specified
  # `metric` will default to 0 if this option is not specified
  unsafe_routes:
   - route: 192.168.8.0/23
     via: 192.168.32.5
     mtu: 1300

Even with one unsafe route specified as 1300, the entire tunnel was configured to be 1300 instead of 1200. So while I could reach my unsafe route area,

14: nebula1: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1200 qdisc fq_codel state UNKNOWN group default qlen 500
    link/none 
    inet 192.168.32.6/19 scope global nebula1
       valid_lft forever preferred_lft forever

I could not reach my office server, which required an MTU of 1200, even though the default was set to 1200.
Removing that mtu field from the route, or setting it to 1200, worked fine. I didn't test a lower value for that route, as that wasn't needed.
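
In other words, the variant that worked looks like this:

tun:
  mtu: 1200
  unsafe_routes:
    - route: 192.168.8.0/23
      via: 192.168.32.5
      mtu: 1200   # match the tun mtu (or omit this field to inherit it)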

@sfxworks commented Aug 13, 2022

Wait, no - that's not the issue I was having. Logs from the home PC:

sg="Attempt to relay through hosts" relayIps="[192.168.32.1 192.167.32.7]" vpnIp=192.168.32.4
sg="Re-send CreateRelay request" relay=192.168.32.1 vpnIp=192.168.32.4
sg="Establish tunnel to relay target." error="unable to find host" relay=192.167.32.7 vpnIp=192.168.32.4
sg=handleCreateRelayResponse hostInfo=192.168.32.1 initiatorIdx=3445190669 relayFrom=192.168.32.6 relayTarget=192.168.32.4 responderIdx=4235128187
sg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1462476829 udpAddrs="[10.0.0.231:4242 192.168.67.32:4242]" vpnIp=192.168.32.4
sg="Attempt to relay through hosts" relayIps="[192.168.32.1 192.167.32.7]" vpnIp=192.168.32.4
sg="Send handshake via relay" relay=192.168.32.1 vpnIp=192.168.32.4
sg="Establish tunnel to relay target." error="unable to find host" relay=192.167.32.7 vpnIp=192.168.32.4
sg="Handshake message received" certName=office-server-1 durationNs=424617429 fingerprint=4cdb758bbdf9130f18d0be3994fd79c0966ddc9d5364ec63d1afc5c24a6f74df handshake="map[stage>
sg="Tunnel status" certName=office-server-1 tunnelCheck="map[method:active state:dead]" vpnIp=192.168.32.4

I can ping 192.168.32.7 from my home PC just fine, so I am not sure why I am getting "unable to find host".

This occurs on occasion with my config. I have two relay hosts based on two lighthouses with public IPs. These also act as routers for their respective zones.

So I have

Home PC (192.168.32.6) and Office Server (192.168.32.4)

relay:
  relays:
    - 192.168.32.1
    - 192.168.32.7
  am_relay: false
  use_relays: true

With Home Router (192.168.32.7)

relay:
  relays:
    - 192.168.32.1
  am_relay: true
  use_relays: true

And Office Router (192.168.32.1)

relay:
  relays:
    - 192.168.32.7
  am_relay: true
  use_relays: true

The thing is, after either a systemctl restart and/or some time, the issue resolves itself and I can reach my office server again. It's intermittent. Is one of my relays just bad, or is this somehow the wrong way to set this up?


Edit:
I added a third node outside of the other two networks. Things seem to work OK like that. However, this third node is in another network that I wish to add soon, so I'm worried I may run into the same problem.

Edit 2:
Just now seeing "You can't relay to a Relay", so I wonder if this is related.

@brad-defined (Collaborator)

@sfxworks "unable to find host" is a misleading message; it means that the Relay doesn't have a direct connection to that host at that time.
When that happens, the Relay will attempt to establish a direct connection to the target host. If the Relay receives another CreateRelayRequest message after it has successfully established a direct connection to the target host, it will be able to complete the relay connection.
When tunneling through a Relay, Nebula includes an extra header (16 bytes) and an extra AEAD signature (16 bytes), so Relays add 32 bytes in total to your existing Nebula traffic.

@brad-defined (Collaborator)

@noseshimself I think the Nebula way to join two Nebula networks together is to run multiple instances of Nebula on all hosts joined to both networks, rather than on one Gateway host to join the networks.

With direct connections between the peers, you get all the identity fidelity and corresponding firewall rules. If hosts are joined by an intermediary, their identity is lost - you will only have the identity of the gateway host, not the identity of the peer.

That being said, I think the existing unsafe routes feature would accomplish what you described. (It's called unsafe due to the loss of identity information of the connection.)

@noseshimself

I think the Nebula way to join two Nebula networks together is to run multiple instances of Nebula on all hosts joined to both networks, rather than on one Gateway host to join the networks.

I prefer doing packet filtering on dedicated systems. Imagine having a set of server systems that are supposed to be reachable by "accounting" and "thieves", where I don't want the thieves to be able to access the systems in the accounting network, while not trusting the administrators of the servers either (but trusting the networking staff, who are under my control). I could of course trust the Nebula certificates to take care of that, but I don't know whether $asshole-from-thieves would install a modified client that removes the restriction.

@johnmaguire (Collaborator)

Hi all! There's a lot of questions, answers, and information in this thread, but it's gotten a bit hard to follow.

We believe that the relay feature should be sufficient for most tricky NAT scenarios. As such, I'm going to close this issue out as solved. If you're continuing to experience connectivity issues, please feel free to open up a new issue or join us on Slack. Thanks!
