Support Path MTU Discovery #16

Closed · wants to merge 1 commit into from
Conversation

@wadey (Member) commented Nov 21, 2019

Overview
--------

This change allows the discovery and propagation of the path MTU. Until further
testing in the wild is done, you must set `tun.path_mtu_discovery = true` to
enable it.

This change is meant for situations like AWS, where you have a default
interface with MTU 9001 that only supports 1500 to internet destinations. It is
not anticipated to be used to discover the MTU over the internet, only to
the local gateway (although it should work in all cases).

Set `tun.mtu` to the highest MTU supported minus 60 (headroom for Nebula's
per-packet overhead). Nebula will check for DontFragment on outgoing packets
and respond with ICMP "Destination Unreachable, Fragmentation Needed" if the
packet is too large, which allows TCP connections over the tun device to
automatically adjust their MSS (see RFC 1191).

An example config that could be used:

    preferred_ranges:
      - 10.0.0.0/9
    tun:
      path_mtu_discovery: true
      mtu: 8941 # (9001 - 60)

This would advertise a high MTU on the tun device, and use path MTU
discovery to reduce the MTU on any paths that end up over the internet.
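
To make the ICMP behavior above concrete, here is a minimal sketch of building the ICMPv4 "Destination Unreachable, Fragmentation Needed" reply (type 3, code 4) that carries the next-hop MTU per RFC 1191. The helper names and signatures are illustrative, not code from this change:

    package main

    import "encoding/binary"

    // buildFragNeeded builds an ICMPv4 Destination Unreachable /
    // Fragmentation Needed message (type 3, code 4) carrying the
    // next-hop MTU, per RFC 1191. invoking should be the IP header plus
    // the first 8 bytes of the oversized packet.
    func buildFragNeeded(nextHopMTU uint16, invoking []byte) []byte {
        msg := make([]byte, 8+len(invoking))
        msg[0] = 3 // type: destination unreachable
        msg[1] = 4 // code: fragmentation needed and DF set
        binary.BigEndian.PutUint16(msg[6:8], nextHopMTU) // next-hop MTU
        copy(msg[8:], invoking)
        // The checksum is computed with the checksum field zeroed (it
        // still is at this point), then written into bytes 2-3.
        binary.BigEndian.PutUint16(msg[2:4], icmpChecksum(msg))
        return msg
    }

    // icmpChecksum is the standard Internet checksum (RFC 1071).
    func icmpChecksum(b []byte) uint16 {
        var sum uint32
        for i := 0; i+1 < len(b); i += 2 {
            sum += uint32(binary.BigEndian.Uint16(b[i : i+2]))
        }
        if len(b)%2 == 1 {
            sum += uint32(b[len(b)-1]) << 8
        }
        for sum>>16 != 0 {
            sum = (sum & 0xffff) + (sum >> 16)
        }
        return ^uint16(sum)
    }

A TCP stack that receives this reply lowers its MSS for the route, which is what keeps connections over the tun device working without fragmentation.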

Design
------

The initial implementation is only for Linux.

There are multiple ways to implement this, each with its own trade-offs:

1) Use IP_PMTUDISC_WANT on the outside socket.

   - Pros: Simple support for sending packets that want fragmentation.
   - Cons: Needs cache invalidation. There could be a race between checking
     the current MTU and the OS discovering the actual MTU if other
     processes are connecting to the same remote.

2) Use IP_PMTUDISC_DO and manually fragment large packets.

   - Pros: No cache invalidation needed.
   - Cons: Nebula meta packets would need fragmentation support,
     complicating the implementation.

3) Use IP_PMTUDISC_DO and use raw sockets to send large packets.

   - Pros: No fragmentation logic needed in Nebula.
   - Cons: Two code paths for sending, and raw socket support is required.

4) Use IP_PMTUDISC_DO and use SO_REUSEPORT with a second socket to send
   large packets.

   - Pros: Same as (3), but raw sockets are not needed.
   - Cons: SO_REUSEPORT allows weird things to happen if you accidentally
     start two Nebula processes, or something else listens on the same
     port.

After reviewing these options, (1) was selected for the first
implementation because it is fairly easy and the race conditions should
be rare. Switching to (2) or (3) in the future might be a good idea once
we have weighed the trade-offs further.
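
As a rough illustration of option (1), the snippet below shows how the socket mode could be set using golang.org/x/sys/unix; the helper name is an assumption, not code from this change:

    package main

    import "golang.org/x/sys/unix"

    // enablePMTUDWant puts an IPv4 UDP socket into IP_PMTUDISC_WANT mode:
    // the kernel performs per-route path MTU discovery but fragments
    // locally when a packet exceeds the cached path MTU.
    func enablePMTUDWant(fd int) error {
        return unix.SetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_MTU_DISCOVER, unix.IP_PMTUDISC_WANT)
    }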

Implementation
--------------

`IP_RECVERR` is set on the outside UDP socket so that we can receive the
FragmentationNeeded packets that come from upstream. We set up a goroutine
to poll for these events and update our cache of the route MTU.

When a new route is attempted, we first look up `IP_MTU` on the socket to
see what the OS has currently cached as the MTU for the route. Because we
are using `IP_PMTUDISC_WANT`, we will only receive error events when the
OS's cached `IP_MTU` is stale, so we need to check it before sending
packets. We internally cache the MTU for 1 minute before checking
`IP_MTU` again.
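
A minimal sketch of these two kernel interactions, using golang.org/x/sys/unix (function names and buffer sizes are assumptions for illustration):

    package main

    import (
        "unsafe"

        "golang.org/x/sys/unix"
    )

    // routeMTU returns the kernel's currently cached path MTU for the
    // destination the UDP socket is connected to (IP_MTU is per-route,
    // so the socket must be connected first).
    func routeMTU(fd int) (int, error) {
        return unix.GetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_MTU)
    }

    // drainErrQueue reads one event from the socket error queue (enabled
    // by IP_RECVERR) and, if it is an EMSGSIZE report triggered by a
    // FragmentationNeeded message, returns the discovered next-hop MTU.
    func drainErrQueue(fd int) (mtu uint32, ok bool, err error) {
        buf := make([]byte, 2048)
        oob := make([]byte, 1024)
        _, oobn, _, _, err := unix.Recvmsg(fd, buf, oob, unix.MSG_ERRQUEUE)
        if err != nil {
            return 0, false, err
        }
        cmsgs, err := unix.ParseSocketControlMessage(oob[:oobn])
        if err != nil {
            return 0, false, err
        }
        for _, cm := range cmsgs {
            if cm.Header.Level == unix.IPPROTO_IP &&
                cm.Header.Type == unix.IP_RECVERR &&
                len(cm.Data) >= int(unsafe.Sizeof(unix.SockExtendedErr{})) {
                ee := (*unix.SockExtendedErr)(unsafe.Pointer(&cm.Data[0]))
                if ee.Errno == uint32(unix.EMSGSIZE) {
                    return ee.Info, true, nil // ee_info carries the MTU
                }
            }
        }
        return 0, false, nil
    }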

@rawdigits rawdigits added the enhancement New feature or request label Dec 10, 2019
@nbrownus nbrownus added this to the v1.1.0 milestone Dec 11, 2019
@wadey wadey modified the milestones: v1.1.0, v1.2.0 Jan 21, 2020
@nbrownus nbrownus modified the milestones: v1.2.0, future Apr 6, 2020
@wadey wadey modified the milestones: v1.3.0, v1.4.0 Sep 15, 2020
@CLAassistant commented Feb 17, 2021

CLA assistant check
All committers have signed the CLA.

@wadey wadey removed this from the v1.4.0 milestone Apr 12, 2021
@gboddin commented Aug 9, 2021

Hey,

Could this missing feature help protocols like VXLAN (which won't fragment)?

At the moment, running VXLAN with a default MTU of 1500 breaks under Nebula 1.4.0, with mashed-up packets received by the other end.

This might prevent most container solutions (Swarm, k8s) from running their network over Nebula, as they would require developers to specify custom network settings in their stack files.

@nbrownus (Collaborator) left a comment

> Hey,
>
> Could this missing feature help protocols like VXLAN (which won't fragment)?
>
> At the moment, running VXLAN with a default MTU of 1500 breaks under Nebula 1.4.0, with mashed-up packets received by the other end.
>
> This might prevent most container solutions (Swarm, k8s) from running their network over Nebula, as they would require developers to specify custom network settings in their stack files.

Yes, and you should be able to accomplish this today by ensuring your VXLAN MTU is no larger than Nebula's `tun.mtu` setting.

    @@ -73,6 +81,12 @@ func NewListener(ip string, port int, multi bool) (*udpConn, error) {
            return nil, fmt.Errorf("unable to set SO_REUSEPORT: %s", err)
        }

        if pathMTUDiscovery {
            if err = syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_RECVERR, 1); err != nil {
Collaborator: We should switch these to `unix.` instead of `syscall.`
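
For instance (an illustrative fragment, not this PR's diff), the same call via the maintained golang.org/x/sys/unix package:

    package main

    import "golang.org/x/sys/unix"

    // setRecvErr mirrors syscall.SetsockoptInt(fd, syscall.IPPROTO_IP,
    // syscall.IP_RECVERR, 1) using golang.org/x/sys/unix, which exposes
    // the same constants but, unlike syscall, is not frozen.
    func setRecvErr(fd int) error {
        return unix.SetsockoptInt(fd, unix.IPPROTO_IP, unix.IP_RECVERR, 1)
    }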


    var addr syscall.SockaddrInet4
    copy(addr.Addr[:], target.To4())
    if err := syscall.Connect(*ss, &addr); err != nil {
Collaborator: `unix.` instead of `syscall.`

    ipOnly := &udpAddr{IP: ipInt}
    hosts := f.hostMap.QueryRemoteIP(ipOnly)
    for _, host := range hosts {
        host.SetRemoteMTU(ipOnly, int(mtu))
Collaborator: Shouldn't we just be setting the MTU on the IP we got the MTU for? If we roam to a new remote and that remote has a smaller MTU, we will punish the tunnel for a bit?

    @@ -232,6 +404,15 @@ func (u *udpConn) WriteTo(b []byte, addr *udpAddr) error {
        )

        if err != 0 {
            if u.pathMTUDiscovery && (err == syscall.EMSGSIZE || err == syscall.ECONNREFUSED) && tryCount < 10 {
Collaborator: Does this actually result in a failure to send?

@gboddin commented Aug 9, 2021

> > Hey,
> > Could this missing feature help protocols like VXLAN (which won't fragment)?
> > At the moment, running VXLAN with a default MTU of 1500 breaks under Nebula 1.4.0, with mashed-up packets received by the other end.
> > This might prevent most container solutions (Swarm, k8s) from running their network over Nebula, as they would require developers to specify custom network settings in their stack files.
>
> Yes, and you should be able to accomplish this today by ensuring your VXLAN MTU is no larger than Nebula's `tun.mtu` setting.

Yeah, I confirmed it works with an MTU lower than Nebula's interface.

Sadly, Docker Swarm (for instance) doesn't make it easy to fix the issue (moby/moby#34981), which makes re-using code for infra impossible :')

Thanks for the confirmation, I'll keep looking.

@wadey (Member, Author) commented Apr 29, 2024

Closing because I think we want to rework this to just do Path MTU discovery on the overlay.

@wadey wadey closed this Apr 29, 2024