Support Path MTU Discovery #16
Conversation
Overview

This change allows the discovery and propagation of the path MTU. Until further testing in the wild is done, you must set `tun.path_mtu_discovery = true` to enable it.

This change is meant for situations like on AWS, where you have a default interface with MTU 9001 that only supports 1500 to internet destinations. It is not anticipated to be used to discover the MTU over the internet, only to the local gateway (although it should work in all cases).

Set `tun.mtu` to the highest MTU supported, minus 60. Nebula will check for DontFragment on outgoing packets and respond with ICMP "Destination Unreachable, Fragmentation Needed" if the packet is too large, which allows TCP connections over the tun device to automatically set MSS (see RFC 1191).

An example config that could be used:

    preferred_ranges:
      - 10.0.0.0/9
    tun:
      path_mtu_discovery: true
      mtu: 8941 # (9001 - 60)

This would advertise a high MTU on the tun device and use path MTU discovery to reduce the MTU on any paths that end up over the internet.

Design

The initial implementation is only for Linux.

There are multiple ways to implement this, each with their own trade-offs:

1) Use IP_PMTUDISC_WANT on the outside socket.
   - Pros: Simple support for sending packets that want fragmentation.
   - Cons: Needs cache invalidation. There could be a race between checking the current MTU and the OS discovering the actual MTU if other processes are connecting to the same remote.
2) Use IP_PMTUDISC_DO and manually fragment large packets.
   - Pros: No cache invalidation needed.
   - Cons: Nebula meta packets would need fragmentation support, complicating the implementation.
3) Use IP_PMTUDISC_DO and use raw sockets to send large packets.
   - Pros: No fragmentation logic in Nebula.
   - Cons: Two code paths for sending, and raw socket support is required.
4) Use IP_PMTUDISC_DO and use SO_REUSEPORT with a second socket to send large packets.
   - Pros: Same as (3), but raw sockets are not required.
   - Cons: SO_REUSEPORT allows odd behavior if you accidentally start two Nebula processes, or something else listening, on the same port.

After reviewing these options, (1) was selected for the first implementation because it is fairly easy and the race conditions should be rare. Switching to (2) or (3) in the future might be a good idea once we review the trade-offs further.

Implementation

`IP_RECVERR` is set on the outside UDP socket so that we can receive the FragmentationNeeded packets that come from upstream. We set up a goroutine to poll for these events and update our cache of the route MTU.

When a new route is attempted, we first look up IP_MTU on the socket to see what the OS has currently cached as the MTU for the route. Because we are using IP_PMTUDISC_WANT, we will only receive error events when the OS-cached IP_MTU is wrong, so we need to ensure we check it before sending packets. We internally cache the MTU for 1 minute before checking IP_MTU again.
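As a rough illustration of the 1-minute internal cache described above, here is a minimal sketch in Go; the `mtuCache` type and its fields are hypothetical and not part of this change:

    package example // illustrative only

    import (
        "sync"
        "time"
    )

    // mtuCache is a hypothetical per-destination cache of discovered route MTUs.
    // Entries are considered fresh for one minute before IP_MTU is consulted again.
    type mtuCache struct {
        mu      sync.Mutex
        entries map[string]mtuEntry
    }

    type mtuEntry struct {
        mtu     int
        checked time.Time
    }

    // Get returns the cached MTU for dst, calling lookup (e.g. an IP_MTU
    // getsockopt on a connected socket) when the entry is missing or stale.
    func (c *mtuCache) Get(dst string, lookup func(string) (int, error)) (int, error) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if e, ok := c.entries[dst]; ok && time.Since(e.checked) < time.Minute {
            return e.mtu, nil
        }
        mtu, err := lookup(dst)
        if err != nil {
            return 0, err
        }
        if c.entries == nil {
            c.entries = make(map[string]mtuEntry)
        }
        c.entries[dst] = mtuEntry{mtu: mtu, checked: time.Now()}
        return mtu, nil
    }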
Hey,
Could this missing feature help protocols like VXLAN (which won't fragment)?
At the moment, running VXLAN with a default MTU of 1500 breaks under Nebula 1.4.0, with mashed-up packets received by the other end.
It might prevent most container solutions (Swarm, k8s) from running their network over Nebula, as they require developers to specify custom network settings in their stack files.
Yes, and you should be able to accomplish this today by ensuring your VXLAN MTU is no larger than Nebula's tun.mtu setting.
@@ -73,6 +81,12 @@ func NewListener(ip string, port int, multi bool) (*udpConn, error) {
        return nil, fmt.Errorf("unable to set SO_REUSEPORT: %s", err)
    }

    if pathMTUDiscovery {
        if err = syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_RECVERR, 1); err != nil {
We should switch these to the `unix` package instead of `syscall`.
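For reference, a minimal sketch of what the `unix` package form of this setup might look like; `enablePMTUD` is a hypothetical helper, and the IP_MTU_DISCOVER/IP_PMTUDISC_WANT call reflects the design choice in the description rather than this exact hunk (Linux-only):

    package example // illustrative only

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    // enablePMTUD shows the golang.org/x/sys/unix form of the socket options
    // used for path MTU discovery on the outside UDP socket (hypothetical helper).
    func enablePMTUD(fd int) error {
        // Queue ICMP errors (e.g. Fragmentation Needed) on the socket error
        // queue so they can be read back and used to update the MTU cache.
        if err := unix.SetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_RECVERR, 1); err != nil {
            return fmt.Errorf("unable to set IP_RECVERR: %s", err)
        }
        // IP_PMTUDISC_WANT: the kernel tracks the per-route path MTU and
        // fragments locally when needed (design option 1 above).
        if err := unix.SetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_MTU_DISCOVER, unix.IP_PMTUDISC_WANT); err != nil {
            return fmt.Errorf("unable to set IP_MTU_DISCOVER: %s", err)
        }
        return nil
    }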
var addr syscall.SockaddrInet4
copy(addr.Addr[:], target.To4())
if err := syscall.Connect(*ss, &addr); err != nil {
`unix` instead of `syscall`.
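To make the suggestion concrete, a minimal sketch of the connect-and-query step using the `unix` package; `routeMTU` is a hypothetical helper (Linux-only), not code from this PR:

    package example // illustrative only

    import (
        "net"

        "golang.org/x/sys/unix"
    )

    // routeMTU connects a dedicated UDP socket fd to target and asks the kernel
    // for the path MTU it currently has cached for that route (hypothetical helper).
    func routeMTU(fd int, target net.IP, port int) (int, error) {
        addr := unix.SockaddrInet4{Port: port}
        copy(addr.Addr[:], target.To4())
        // Connecting pins the socket to a destination so IP_MTU is well defined.
        if err := unix.Connect(fd, &addr); err != nil {
            return 0, err
        }
        // IP_MTU returns the kernel's currently cached MTU for this route.
        return unix.GetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_MTU)
    }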
ipOnly := &udpAddr{IP: ipInt}
hosts := f.hostMap.QueryRemoteIP(ipOnly)
for _, host := range hosts {
    host.SetRemoteMTU(ipOnly, int(mtu))
Shouldn't we just be setting the MTU on the IP we got the MTU for? If we roam to a new remote and that remote has a smaller MTU, won't we punish the tunnel for a bit?
@@ -232,6 +404,15 @@ func (u *udpConn) WriteTo(b []byte, addr *udpAddr) error {
    )

    if err != 0 {
        if u.pathMTUDiscovery && (err == syscall.EMSGSIZE || err == syscall.ECONNREFUSED) && tryCount < 10 {
Does this actually result in a failure to send?
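For context on the question, a minimal sketch of the retry behaviour in this hunk, written against `unix.Sendto` rather than the raw syscall path; `sendWithRetry` is hypothetical. With IP_RECVERR set, a pending ICMP error can surface as EMSGSIZE or ECONNREFUSED on a later send call, so a bounded retry drains stale errors while still getting the datagram out:

    package example // illustrative only

    import "golang.org/x/sys/unix"

    // sendWithRetry retries a send when the error is likely just a queued ICMP
    // error being delivered (EMSGSIZE / ECONNREFUSED), up to 10 attempts.
    func sendWithRetry(fd int, b []byte, to unix.Sockaddr) error {
        var err error
        for try := 0; try < 10; try++ {
            if err = unix.Sendto(fd, b, 0, to); err == nil {
                return nil
            }
            if err != unix.EMSGSIZE && err != unix.ECONNREFUSED {
                return err // unrelated failure; don't retry
            }
            // The error may just be a pending error-queue entry; retry so the
            // actual datagram still goes out.
        }
        return err
    }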
Yeah, I confirmed it works with an MTU lower than Nebula's interface. Sadly, Docker Swarm (for instance) doesn't make it easy to fix the issue: moby/moby#34981, which makes re-using code for infra impossible :') Thanks for the confirmation, I'll keep looking.
Closing because I think we want to rework this to just do Path MTU discovery on the overlay.