Support Path MTU Discovery #16

Closed · wants to merge 1 commit into from
Conversation

@wadey (Member) commented Nov 21, 2019

Overview
--------

This change allows the discovery and propagation of the path MTU. Until further
testing in the wild is done, you must set `tun.path_mtu_discovery = true` to
enable it.

This change is meant for situations like AWS, where you have a default
interface with MTU 9001 that only supports 1500 to internet destinations. It is
not anticipated to be used to discover the MTU over the internet, only to
the local gateway (although it should work in all cases).

Set `tun.mtu` to the highest MTU supported minus 60 (headroom for Nebula's
per-packet overhead). Nebula will check for DontFragment on outgoing packets
and respond with ICMP "Destination Unreachable, Fragmentation Needed" if the
packet is too large, which allows TCP connections over the tun device to
automatically adjust their MSS (see RFC 1191).

An example config that could be used:

    preferred_ranges:
      - 10.0.0.0/9
    tun:
      path_mtu_discovery: true
      mtu: 8941 # (9001 - 60)

This would advertise a high MTU on the tun device, and use path MTU
discovery to reduce the MTU on any paths that end up over the internet.
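
To make the ICMP behavior above concrete, here is a minimal sketch of building the ICMPv4 "Destination Unreachable, Fragmentation Needed" reply (type 3, code 4) that carries the next-hop MTU per RFC 1191. The helper names and signatures are illustrative, not code from this change:

    package main

    import "encoding/binary"

    // buildFragNeeded builds an ICMPv4 Destination Unreachable /
    // Fragmentation Needed message (type 3, code 4) carrying the
    // next-hop MTU, per RFC 1191. invoking should be the IP header plus
    // the first 8 bytes of the oversized packet.
    func buildFragNeeded(nextHopMTU uint16, invoking []byte) []byte {
        msg := make([]byte, 8+len(invoking))
        msg[0] = 3 // type: destination unreachable
        msg[1] = 4 // code: fragmentation needed and DF set
        binary.BigEndian.PutUint16(msg[6:8], nextHopMTU) // next-hop MTU
        copy(msg[8:], invoking)
        // The checksum is computed with the checksum field zeroed (it
        // still is at this point), then written into bytes 2-3.
        binary.BigEndian.PutUint16(msg[2:4], icmpChecksum(msg))
        return msg
    }

    // icmpChecksum is the standard Internet checksum (RFC 1071).
    func icmpChecksum(b []byte) uint16 {
        var sum uint32
        for i := 0; i+1 < len(b); i += 2 {
            sum += uint32(binary.BigEndian.Uint16(b[i : i+2]))
        }
        if len(b)%2 == 1 {
            sum += uint32(b[len(b)-1]) << 8
        }
        for sum>>16 != 0 {
            sum = (sum & 0xffff) + (sum >> 16)
        }
        return ^uint16(sum)
    }

A TCP stack that receives this reply lowers its MSS for the route, which is what keeps connections over the tun device working without fragmentation.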

Design
------

The initial implementation is only for Linux.

There are multiple ways to implement this, each with its own trade-offs:

1) Use IP_PMTUDISC_WANT on the outside socket.

   - Pros: Simple support for sending packets that want fragmentation.
   - Cons: Needs cache invalidation. There could be a race between checking
     the current MTU and the OS discovering the actual MTU if other
     processes are connecting to the same remote.

2) Use IP_PMTUDISC_DO and manually fragment large packets.

   - Pros: No cache invalidation needed.
   - Cons: Nebula meta packets would need fragmentation support,
     complicating the implementation.

3) Use IP_PMTUDISC_DO and use raw sockets to send large packets.

   - Pros: No fragmentation logic needed in Nebula.
   - Cons: Two code paths for sending, and raw socket support is required.

4) Use IP_PMTUDISC_DO and use SO_REUSEPORT with a second socket to send
   large packets.

   - Pros: Same as (3), but raw sockets are not needed.
   - Cons: SO_REUSEPORT allows weird things to happen if you accidentally
     start two Nebula processes, or something else listens on the same
     port.

After reviewing these options, (1) was selected for the first
implementation because it is fairly easy and the race conditions should
be rare. Switching to (2) or (3) in the future might be a good idea once
we have weighed the trade-offs further.
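
As a rough illustration of option (1), the snippet below shows how the socket mode could be set using golang.org/x/sys/unix; the helper name is an assumption, not code from this change:

    package main

    import "golang.org/x/sys/unix"

    // enablePMTUDWant puts an IPv4 UDP socket into IP_PMTUDISC_WANT mode:
    // the kernel performs per-route path MTU discovery but fragments
    // locally when a packet exceeds the cached path MTU.
    func enablePMTUDWant(fd int) error {
        return unix.SetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_MTU_DISCOVER, unix.IP_PMTUDISC_WANT)
    }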

Implementation
--------------

`IP_RECVERR` is set on the outside UDP socket so that we can receive the
FragmentationNeeded packets that come from upstream. We set up a goroutine
to poll for these events and update our cache of the route MTU.

When a new route is attempted, we first look up `IP_MTU` on the socket to
see what the OS has currently cached as the MTU for the route. Because we
are using `IP_PMTUDISC_WANT`, we will only receive error events when the
OS's cached `IP_MTU` is stale, so we need to check it before sending
packets. We internally cache the MTU for 1 minute before checking
`IP_MTU` again.
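
A minimal sketch of these two kernel interactions, using golang.org/x/sys/unix (function names and buffer sizes are assumptions for illustration):

    package main

    import (
        "unsafe"

        "golang.org/x/sys/unix"
    )

    // routeMTU returns the kernel's currently cached path MTU for the
    // destination the UDP socket is connected to (IP_MTU is per-route,
    // so the socket must be connected first).
    func routeMTU(fd int) (int, error) {
        return unix.GetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_MTU)
    }

    // drainErrQueue reads one event from the socket error queue (enabled
    // by IP_RECVERR) and, if it is an EMSGSIZE report triggered by a
    // FragmentationNeeded message, returns the discovered next-hop MTU.
    func drainErrQueue(fd int) (mtu uint32, ok bool, err error) {
        buf := make([]byte, 2048)
        oob := make([]byte, 1024)
        _, oobn, _, _, err := unix.Recvmsg(fd, buf, oob, unix.MSG_ERRQUEUE)
        if err != nil {
            return 0, false, err
        }
        cmsgs, err := unix.ParseSocketControlMessage(oob[:oobn])
        if err != nil {
            return 0, false, err
        }
        for _, cm := range cmsgs {
            if cm.Header.Level == unix.IPPROTO_IP &&
                cm.Header.Type == unix.IP_RECVERR &&
                len(cm.Data) >= int(unsafe.Sizeof(unix.SockExtendedErr{})) {
                ee := (*unix.SockExtendedErr)(unsafe.Pointer(&cm.Data[0]))
                if ee.Errno == uint32(unix.EMSGSIZE) {
                    return ee.Info, true, nil // ee_info carries the MTU
                }
            }
        }
        return 0, false, nil
    }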

@rawdigits rawdigits added the enhancement New feature or request label Dec 10, 2019
@nbrownus nbrownus added this to the v1.1.0 milestone Dec 11, 2019
@wadey wadey modified the milestones: v1.1.0, v1.2.0 Jan 21, 2020
@nbrownus nbrownus modified the milestones: v1.2.0, future Apr 6, 2020
@wadey wadey modified the milestones: v1.3.0, v1.4.0 Sep 15, 2020
@CLAassistant commented Feb 17, 2021

CLA assistant check
All committers have signed the CLA.

@wadey wadey removed this from the v1.4.0 milestone Apr 12, 2021
@gboddin commented Aug 9, 2021

Hey,

Could this missing feature help protocols like VXLAN (which won't fragment)?

At the moment, running VXLAN with a default MTU of 1500 breaks under Nebula 1.4.0, with mashed-up packets received by the other end.

This might prevent most container solutions (Swarm, k8s) from running their network over Nebula, as they would require developers to specify custom network settings in their stack files.

@nbrownus (Collaborator) left a comment

> Hey,
>
> Could this missing feature help protocols like VXLAN (which won't fragment)?
>
> At the moment, running VXLAN with a default MTU of 1500 breaks under Nebula 1.4.0, with mashed-up packets received by the other end.
>
> This might prevent most container solutions (Swarm, k8s) from running their network over Nebula, as they would require developers to specify custom network settings in their stack files.

Yes, and you should be able to accomplish this today by ensuring your VXLAN MTU is no larger than Nebula's `tun.mtu` setting.

    @@ -73,6 +81,12 @@ func NewListener(ip string, port int, multi bool) (*udpConn, error) {
            return nil, fmt.Errorf("unable to set SO_REUSEPORT: %s", err)
        }

        if pathMTUDiscovery {
            if err = syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_RECVERR, 1); err != nil {
Collaborator: We should switch these to `unix.` instead of `syscall.`
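
For instance (an illustrative fragment, not this PR's diff), the same call via the maintained golang.org/x/sys/unix package:

    package main

    import "golang.org/x/sys/unix"

    // setRecvErr mirrors syscall.SetsockoptInt(fd, syscall.IPPROTO_IP,
    // syscall.IP_RECVERR, 1) using golang.org/x/sys/unix, which exposes
    // the same constants but, unlike syscall, is not frozen.
    func setRecvErr(fd int) error {
        return unix.SetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_RECVERR, 1)
    }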


    var addr syscall.SockaddrInet4
    copy(addr.Addr[:], target.To4())
    if err := syscall.Connect(*ss, &addr); err != nil {
Collaborator: `unix.` instead of `syscall.`

    ipOnly := &udpAddr{IP: ipInt}
    hosts := f.hostMap.QueryRemoteIP(ipOnly)
    for _, host := range hosts {
        host.SetRemoteMTU(ipOnly, int(mtu))
Collaborator: Shouldn't we just be setting the MTU on the IP we got the MTU for? If we roam to a new remote and that remote has a smaller MTU, we will punish the tunnel for a bit?

    @@ -232,6 +404,15 @@ func (u *udpConn) WriteTo(b []byte, addr *udpAddr) error {
        )

        if err != 0 {
            if u.pathMTUDiscovery && (err == syscall.EMSGSIZE || err == syscall.ECONNREFUSED) && tryCount < 10 {
Collaborator: Does this actually result in a failure to send?

@gboddin commented Aug 9, 2021

> > Hey,
> > Could this missing feature help protocols like VXLAN (which won't fragment)?
> > At the moment, running VXLAN with a default MTU of 1500 breaks under Nebula 1.4.0, with mashed-up packets received by the other end.
> > This might prevent most container solutions (Swarm, k8s) from running their network over Nebula, as they would require developers to specify custom network settings in their stack files.
>
> Yes, and you should be able to accomplish this today by ensuring your VXLAN MTU is no larger than Nebula's `tun.mtu` setting.

Yeah, I confirmed it works with an MTU lower than Nebula's interface.

Sadly, Docker Swarm (for instance) doesn't make it easy to fix the issue (moby/moby#34981), which makes re-using code for infra impossible :')

Thanks for the confirmation, I'll keep looking.

@wadey (Member, Author) commented Apr 29, 2024

Closing because I think we want to rework this to just do Path MTU discovery on the overlay.

@wadey wadey closed this Apr 29, 2024