
Remove NAT logic #115

Merged: 5 commits, merged Jan 31, 2020

Conversation

@kaechele (Contributor) commented Jan 2, 2020

Turns out, the NAT crutch to run all tunnels through one external port wasn't needed after all. The kernel handles this just fine.

Depends on #110

@kaechele mentioned this pull request Jan 2, 2020
@RalfJung (Member) commented Jan 2, 2020

Oh wow, this is amazing! Somehow it feels too good to be true. :D

Do you have any hypothesis why this NAT logic was added in the first place? Cc @mitar @kostko

One hypothesis: maybe whatever issues prompted the introduction of the NAT logic were fixed by #64?

@kaechele (Contributor, Author) commented Jan 2, 2020

> One hypothesis: maybe whatever issues prompted the introduction of the NAT logic were fixed by #64?

I believe this to be the case.

BTW, I've been running this version of the code (actually the IPv6 version of it) at Freifunk Leverkusen on all gateways ever since I finished working on it. So it's definitely working.

@RalfJung (Member) commented Jan 2, 2020

> BTW, I've been running this version of the code (actually the IPv6 version of it) at Freifunk Leverkusen on all gateways ever since I finished working on it. So it's definitely working.

Field-testing, great. :) Which kernel versions have you been using there?

@kaechele (Contributor, Author) commented Jan 2, 2020

> Which kernel versions have you been using there?

Currently all gateways run 5.3.16-300.fc31.x86_64
When I started off using this code we were running a 5.2 series Kernel.

@mitar (Member) commented Jan 2, 2020

So for us it was important that both data and control packets work over the same port (UDP port 53, DNS) so that simple routers and others do not filter them out. Sadly, at least at that time, each tunnel required a different port.

@kaechele (Contributor, Author) commented Jan 2, 2020

> So for us it was important that both data and control packets work over the same port (UDP port 53, DNS) so that simple routers and others do not filter them out. Sadly, at least at that time, each tunnel required a different port.

In case anyone is interested: the way this works in the kernel these days is that once you link a socket to an L2TPv3 tunnel, the kernel will snatch all packets it deems to belong to an active tunnel session and process them directly. It distinguishes sessions by the session ID in the L2TPv3 data packet header (the tunnel ID has no significance here). It then drops the decapsulated data packets onto the corresponding interface for each tunnel (i.e. session).

If the header of a packet does not match an L2TPv3 data packet, or the data packet is invalid in the context of the session, the packet drops through to userspace, where the Python broker can pull it out of the socket and deal with it. That's why the broker can still see the control channel packets even though the kernel deals with the L2TPv3 data packets by itself at a lower level, without the broker's involvement. In fact, the broker won't even see any valid L2TPv3 data packets, as the kernel will already have pulled them out of the socket before the Python broker could get to them.
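As a rough illustration of that split (a sketch, not the broker's actual code; the only assumption is the leading header bit described later in this thread):

```python
import struct

# Rough illustration (not the broker's actual code) of the demux
# described above: the kernel consumes valid L2TPv3 data packets itself;
# packets whose first header bit (the T bit) is set are not data
# packets, so they fall through to the userspace socket for the broker.
def falls_through_to_broker(datagram: bytes) -> bool:
    if len(datagram) < 2:
        return True  # too short to be a valid data packet
    (flags,) = struct.unpack("!H", datagram[:2])
    return bool(flags & 0x8000)  # T bit set: not a data packet

# Tunneldigger control packets start with 0x80 0x73, so the T bit is
# set and the broker gets to see them.
assert falls_through_to_broker(b"\x80\x73\xa7\x01")
# An L2TPv3 data packet header has the T bit clear; the kernel would
# consume it before the broker could read the socket.
assert not falls_through_to_broker(b"\x00\x03" + bytes(8))
```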

@mitar (Member) commented Jan 2, 2020

Yes, in the past they used ports to discriminate tunnels, not the session ID. Great that this was fixed. This then also allows IPv6 to work!

@mitar (Member) commented Jan 2, 2020

BTW, with these major changes, I suggest you do a major version bump when releasing the next version.

@kaechele (Contributor, Author) commented Jan 2, 2020

> This then also allows IPv6 to work!

Yes, see #116 :)

@mitar (Member) commented Jan 2, 2020

Yes, I saw it! :-)

@RalfJung (Member) commented Jan 7, 2020

> The way this works in the kernel these days is that once you link a socket to an L2TPv3 tunnel, the kernel will snatch all packets it deems to belong to an active tunnel session and process them directly. It distinguishes sessions by the session ID in the L2TPv3 data packet header (the tunnel ID has no significance here). It then drops the decapsulated data packets onto the corresponding interface for each tunnel (i.e. session). If the header of a packet does not match an L2TPv3 data packet, or the data packet is invalid in the context of the session, the packet drops through to userspace, where the Python broker can pull it out of the socket and deal with it.

My concern here is: isn't it possible that some packet of the tunneldigger protocol "looks like" an L2TP packet, and ends up being interpreted by the kernel instead of forwarded to userspace?

@kaechele (Contributor, Author) commented Jan 7, 2020

> My concern here is: isn't it possible that some packet of the tunneldigger protocol "looks like" an L2TP packet, and ends up being interpreted by the kernel instead of forwarded to userspace?

This would be highly unlikely. Also, it would be a problem even with today's implementation since that piece didn't change.

@mitar (Member) commented Jan 7, 2020

I mean, we can ensure this is not possible, not just that it is unlikely. So we should just make sure that the L2TP payload header does not look like a control packet header.

@mitar (Member) commented Jan 7, 2020

If I remember correctly, L2TP even has a proposed structure for the control packets. Maybe we are already using it.

@RalfJung (Member) commented Jan 7, 2020

> it would be a problem even with today's implementation since that piece didn't change.

How so? Don't we use NAT today to avoid ever actually having control data on a tunnel port?

@kaechele (Contributor, Author) commented Jan 7, 2020

> How so? Don't we use NAT today to avoid ever actually having control data on a tunnel port?

Nope. All the NAT stuff does is map traffic coming from a certain IP to the broker port onto another port internally. Control and payload traffic arrive on the same socket, just as they do with this change.

The only real change here is that we reuse the port and fully rely on the kernel to handle the separation of packets based on the socket 4-tuple (src host, src port, dest host, dest port). Basically what iptables has been doing for us before.

@RalfJung (Member) commented Jan 7, 2020

Okay then, makes sense.

@mitar (Member) commented Jan 7, 2020

> fully rely on the kernel to handle the separation of packets based on the socket 4-tuple (src host, src port, dest host, dest port)

Is it separating based on the 4-tuple or based on the payload? I understood it is looking into headers to determine data packets?

@kaechele (Contributor, Author) commented Jan 8, 2020

> Is it separating based on the 4-tuple or based on the payload? I understood it is looking into headers to determine data packets?

Sorry, actually I am mistaken. I did some more reading.

When using SO_REUSEPORT the traffic is evenly distributed between all listening sockets by the kernel. For us this doesn't matter, since only the kernel L2TP driver and the broker will be listening there, and thus any packet arriving on any socket would drop into the same code path anyway.
On an L2TP level, separation is indeed done using the session ID from the L2TP header.

Using SO_REUSEPORT also ensures that only the user that bound the first socket gets to bind any subsequent sockets on the shared port.
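A minimal sketch of this socket setup (Linux-only, since it needs SO_REUSEPORT; the loopback addresses and the client port are hypothetical, for illustration):

```python
import socket

# Sketch of the setup described above: one "broker" socket bound to the
# shared port, plus one connect()ed socket per tunnel reusing the same
# local port via SO_REUSEPORT.
broker = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
broker.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
broker.bind(("127.0.0.1", 0))          # 0 = let the OS pick a port
port = broker.getsockname()[1]

# Per-tunnel socket: same local 2-tuple, but connect() pins the remote
# end, turning it into a 4-tuple socket for that one client.
tunnel = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tunnel.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
tunnel.bind(("127.0.0.1", port))
tunnel.connect(("127.0.0.1", 9000))    # hypothetical client address

# Both sockets now share the same local port.
assert broker.getsockname()[1] == tunnel.getsockname()[1]
broker.close()
tunnel.close()
```

Note that SO_REUSEPORT must be set on every socket before it is bound, including the first one.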

@kaechele force-pushed the kaechele:no-nat branch from e65e730 to a208217 on Jan 8, 2020
@mitar (Member) commented Jan 8, 2020

> On an L2TP level separation is indeed done using the session ID information from the L2TP header.

So are control packets different enough from L2TP data packets so that it cannot ever be confused which packet should go where?

@kaechele (Contributor, Author) commented Jan 8, 2020

> So are control packets different enough from L2TP data packets so that it cannot ever be confused which packet should go where?

Yes.

@RalfJung (Member) commented Jan 8, 2020

@kaechele So part of this PR is to use SO_REUSEPORT, I see. But I don't understand what you said about how this helps the kernel tell apart the packets.

Also, which constraints does this put on our control message protocol, if any? I extended that protocol some time ago and I definitely did not try to avoid colliding with L2TP.^^

@kaechele (Contributor, Author) commented Jan 8, 2020

> So part of this PR is to use SO_REUSEPORT, I see. But I don't understand what you said about how this helps the kernel tell apart the packets.

It doesn't. SO_REUSEPORT allows us to re-use a port for multiple sockets. The separation is done in the L2TP Kernel code. That code has been there since at least 2010, so technically it would have worked right from the beginning.

> Also, which constraints does this put on our control message protocol, if any? I extended that protocol some time ago and I definitely did not try to avoid colliding with L2TP.^^

None. The first bit of the header magic of the packets the TD client constructs is set to 1 (either on purpose or by coincidence). This indicates a control packet (as opposed to a data packet). The kernel doesn't even look beyond this and passes all control packets to userspace (i.e., in our case, the broker). After that we're free to do whatever we want with the packet in the broker.
So as long as we don't change that header, we should be good.

I did notice, however, that client and broker do not produce valid control packets by L2TP standards. But that doesn't matter, since we deal with these packets only in our own code.
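A hedged sketch of such a control header (the magic bytes come from later in this thread; the layout after the magic is hypothetical, purely for illustration):

```python
import struct

# Sketch of a Tunneldigger-style control header. The magic is
# 0x80 0x73 0xA7 plus the version byte 0x01; the leading 1 bit is what
# marks the packet as a control packet to the kernel's L2TP driver.
# Everything after the magic here is a made-up example layout.
MAGIC = b"\x80\x73\xa7\x01"

def make_control_packet(msg_type: int, payload: bytes = b"") -> bytes:
    return MAGIC + struct.pack("!B", msg_type) + payload

pkt = make_control_packet(0x01)
(first16,) = struct.unpack("!H", pkt[:2])
assert first16 & 0x8000           # control bit set: kernel passes it up
assert first16 & 0x000F == 0x3    # version bits happen to read as v3
```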

@mitar (Member) commented Jan 8, 2020

> The first bit of the header magic of the packets the TD client constructs is set to 1 (either on purpose or by coincidence).

Yes, that was on purpose.

> I did notice that client and broker however do not produce valid control packets by L2TP standards. But that doesn't matter since we deal with these packets only in our own code.

That was also on purpose. We didn't want to get bogged down with a standard we do not really care about (also, we were unsure how it maps to mesh networks and so on).

@RalfJung (Member) commented Jan 13, 2020

I thought those "L2TP control packets" the RFC speaks about were something that the kernel was using to coordinate with the other side? Are you saying now that those are actually expected to be handled by userspace anyway, even in a non-tunneldigger environment? You make it sound like the kernel only implements the "data packets" part of L2TP. What would be an example implementation of the "actual" L2TP control protocol?

> Yes, that was on purpose.

Oh... given the total lack of comments for these magic numbers, I assumed they were basically picked at random. If there was any thought that went into picking that particular sequence of bytes (\x80\x73\xA7\x01), it would be good to get that documented. I know the last byte is the version number (that's indicated by some variable names somewhere), but the other three are never named anything but "magic".

@kaechele (Contributor, Author) commented Jan 13, 2020

> You make it sound like the kernel is only implementing the "data packets" part of L2TP.

Yes. That is because this is the case.
From the Kernel's Documentation/networking/l2tp.txt:

Design

The L2TP protocol separates control and data frames. The L2TP kernel
drivers handle only L2TP data frames; control frames are always
handled by userspace. L2TP control frames carry messages between L2TP
clients/servers and are used to setup / teardown tunnels and
sessions. An L2TP client or server is implemented in userspace.

> What would be an example implementation of the "actual" L2TP control protocol?

One would be https://prol2tp.com/ (used to be open source as OpenL2TP). This is from the company that created the Kernel L2TP driver. I know of no other open-source L2TPv3 specific broker.

> If there was any thought that went into picking that particular sequence of bytes (\x80\x73\xA7\x01), that would be good to get documented.

See https://tools.ietf.org/html/rfc3931#section-4.1.2.1

Let's discuss https://github.com/torvalds/linux/blob/master/net/l2tp/l2tp_core.c#L792

The kernel in this case looks at the first 16 bits (L#828).
Tunneldigger sends 0x8073 as its first 16 bits.
The kernel checks the version of the L2TP packet (L#832):

0x8073 & 0x000F = 3

OK cool. This is an L2TPv3 packet, as we expected.
Then the kernel goes on to check whether the packet is a control packet (L#843):

0x8073 & 0x8000 = 0x8000

This is nonzero, so the code within the if statement is executed. It does goto error; which puts the UDP header back and dumps the packet to userspace.

What exactly the significance of the other bits in the magic header is, in full detail, I don't know. That would be a question for @kostko or @mitar.
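The two checks above can be reproduced directly as plain bit arithmetic (nothing assumed beyond the quoted header bits):

```python
# Reproducing the two kernel checks discussed above on Tunneldigger's
# first 16 header bits (0x8073).
flags = 0x8073

# Version check: the low 4 bits must be 3 for L2TPv3.
assert flags & 0x000F == 3

# Control-bit check: the top bit (0x8000) is set, so the kernel treats
# the packet as a non-data packet and hands it back to userspace,
# which is how the broker ends up seeing it.
assert flags & 0x8000 == 0x8000
```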

@RalfJung (Member) commented Jan 13, 2020

@kaechele all right, I think I am convinced. Thanks for your patience. :)

Could you submit a PR putting some of these findings into comments at a suitable place in the code? (If you don't have time, I'll try to find some.)

@kaechele force-pushed the kaechele:no-nat branch from a208217 to 2edb931 on Jan 13, 2020
@kaechele (Contributor, Author) commented Jan 31, 2020

I agree that we should revert until we figure out why this is going all wrong on Debian.

I will try to find some time this weekend to test it on Debian.

I thought I had developed a good understanding of why it ended up working with this approach, but maybe that was just wrong.
What surprises me is that I am having zero issues with this on Fedora.

@kaechele (Contributor, Author) commented Feb 1, 2020

So just re-tested this on Fedora 31 with all debugging enabled.
145 tunnels later and I can say that there are no problems at all on Fedora.

I have just set up a Debian 9 VM (my first Debian install, btw :P) and will conduct some testing tomorrow.

@kaechele (Contributor, Author) commented Feb 1, 2020

OK, curiosity got the better of me. Had to do a quick preliminary test.

Your issue is reproducible on:
Debian 9 / 4.19.0-0.bpo.6-amd64
Debian 10 / 4.19.0-6-amd64

It is not reproducible on Debian 10 / 5.4.0-0.bpo.2-amd64

Looks like there's something in newer kernels that makes this work correctly. That's why I'm not seeing this on Fedora, as Fedora usually ships the latest kernels.

Now on to put the finger on exactly when it starts working.

@RalfJung mentioned this pull request Feb 1, 2020
@RalfJung (Member) commented Feb 1, 2020

> Your issue is reproducible on:

Good to hear that it's not just our setup. :)

I'll see if I can try this on Debian 10 (Buster) with the backports kernel.

@kaechele (Contributor, Author) commented Feb 3, 2020

Okay. So this is a bug in all kernels newer than 4.4.x and older than 5.2.17. This commit fixes it in 5.2.17:

commit 82369aa61ec7f3a006f8e687dd0b80c5782f8a83
Author: Willem de Bruijn <willemb@google.com>
Date:   Thu Sep 12 21:16:39 2019 -0400

    udp: correct reuseport selection with connected sockets
    
    [ Upstream commit acdcecc61285faed359f1a3568c32089cc3a8329 ]
    
    UDP reuseport groups can hold a mix unconnected and connected sockets.
    Ensure that connections only receive all traffic to their 4-tuple.
    
    Fast reuseport returns on the first reuseport match on the assumption
    that all matches are equal. Only if connections are present, return to
    the previous behavior of scoring all sockets.
    
    Record if connections are present and if so (1) treat such connected
    sockets as an independent match from the group, (2) only return
    2-tuple matches from reuseport and (3) do not return on the first
    2-tuple reuseport match to allow for a higher scoring match later.
    
    New field has_conns is set without locks. No other fields in the
    bitmap are modified at runtime and the field is only ever set
    unconditionally, so an RMW cannot miss a change.
    
    Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
    Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Acked-by: Craig Gallek <kraig@google.com>
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

We need to think about whether we can bump our minimum required Kernel version that high, since all LTS Kernels (and thus the Kernel of all LTS distros and Debian) will have this bug.
I'm fine with it as I don't use any LTS Kernels or Debian, but I believe Debian or Ubuntu LTS is the most popular choice among admins here.

@kaechele (Contributor, Author) commented Feb 3, 2020

Lucky us. It was backported to 4.19.75 as fdd60d80c4294b7203d6f9d075a57da0a8d85fba.
But it's not in 4.14 or any earlier LTS kernel.

@mitar (Member) commented Feb 3, 2020

Do we need reuseport?

@RalfJung (Member) commented Feb 3, 2020

@kaechele good catch! My attempt at upgrading one more of our gateways to Debian 10 so I could test the later backports kernels ended with that gateway not responding any more... we'll have to sort that out before I can make any further experiments.

However, I don't think I am entirely following. Given that UDP is stateless, what is the "connection" the commit message is talking about? I think the previous ("buggy") behavior of the kernel is perfectly in line with the documented behavior of SO_REUSEPORT. Which extra guarantees are provided by this patch?

@kaechele (Contributor, Author) commented Feb 3, 2020

> However, I don't think I am entirely following. Given that UDP is stateless, what is the "connection" the commit message is talking about? I think the previous ("buggy") behavior of the kernel is perfectly in line with the documented behavior of SO_REUSEPORT. Which extra guarantees are provided by this patch?

From the connect() manpage:

If the socket sockfd is of type SOCK_DGRAM, then addr is the address
to which datagrams are sent by default, and the only address from
which datagrams are received.

So we are basically setting the default sendto address and also specifying from which source address we will be accepting packets.
This patch ensures the kernel actually uses this information to assign incoming packets to the corresponding sockets, rather than assigning them randomly to the sockets listening on the same port.

You are right, the documented behaviour is incomplete, as it does not cover connect()ing sockets and its implications.
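The quoted connect() semantics can be demonstrated with plain UDP sockets, no tunneldigger involved (a self-contained sketch on loopback):

```python
import socket

# Demonstration of the connect() semantics quoted above: a connect()ed
# UDP socket delivers only datagrams whose source matches the connected
# peer address; everything else is dropped by the kernel.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(0.5)

peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
peer.bind(("127.0.0.1", 0))

stranger = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
stranger.bind(("127.0.0.1", 0))

recv.connect(peer.getsockname())  # fix the 4-tuple

peer.sendto(b"from-peer", recv.getsockname())
stranger.sendto(b"from-stranger", recv.getsockname())

assert recv.recv(64) == b"from-peer"   # accepted: source matches peer
try:
    recv.recv(64)                      # stranger's datagram never arrives
    delivered = True
except socket.timeout:
    delivered = False
assert not delivered

for s in (recv, peer, stranger):
    s.close()
```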

@kaechele (Contributor, Author) commented Feb 3, 2020

> Do we need reuseport?

Yes, if we want to keep the old behaviour of all data flowing into one single port on the remote side.

@RalfJung (Member) commented Feb 3, 2020

> Yes, if we want to keep the old behaviour of all data flowing into one single port on the remote side.

I am not convinced of this. I think we can alternatively have a single socket (per port) on the broker side and dispatch messages to the Tunnel instances based on the remote IP+port.
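The dispatching part of that alternative could look roughly like this (all names hypothetical; a sketch of the idea, not a proposed implementation):

```python
# Sketch of the single-socket alternative described above: the broker
# reads every datagram from one bound socket and dispatches to Tunnel
# instances keyed by the sender's (ip, port).
class Tunnel:
    def __init__(self, addr):
        self.addr = addr
        self.received = []

    def handle(self, data):
        self.received.append(data)

tunnels = {}

def dispatch(data, addr):
    # Look up (or create) the tunnel for this remote 2-tuple.
    if addr not in tunnels:
        tunnels[addr] = Tunnel(addr)
    tunnels[addr].handle(data)

# In a real broker, (data, addr) would come from sock.recvfrom(...).
dispatch(b"hello", ("192.0.2.10", 40000))
dispatch(b"again", ("192.0.2.10", 40000))
dispatch(b"other", ("192.0.2.20", 40001))

assert len(tunnels) == 2
assert tunnels[("192.0.2.10", 40000)].received == [b"hello", b"again"]
```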

@RalfJung (Member) commented Feb 3, 2020

> You are right, the documented behaviour is incomplete as it does not cover connect()ing sockets and its implications.

So let me see if I got this straight...
UDP is stateless, but there are still 2 kinds of UDP sockets: those opened with bind and a 2-tuple (local IP+port), and those opened with connect and a 4-tuple (local+remote IP+port). In tunneldigger, the "main" per-port broker socket is of the former kind, and the per-connection Tunnel socket is of the latter kind.
The bug here is that with SO_REUSEPORT, packets that come in can end up in any of the sockets of the reuseport group, even if that socket is of the 4-tuple kind and ought to just get packets from the particular peer it connected to. So, with the fix, we again have the guarantee that all packets coming in at a 4-tuple-UDP-socket actually match that 4-tuple.

Is that right? I think it matches both the behavior I observed and your statements.

And it does sound like it is almost what we need, except for one part: do we also have a guarantee that all packets that come in to a reuseport group with the remote IP+port matching a 4-tuple socket in that group, are dispatched to that socket? Or is it legal for the kernel to dispatch such packets to a 2-tuple-UDP-socket? Doing so would not violate the guarantee I stated above, but it would mean we couldn't use this for tunneldigger.

Given that your patch seems to work fine in recent kernels, I assume the answer is that current implementations will also guarantee that if a matching 4-tuple-UDP-socket exists, packets will not be dispatched to a 2-tuple-UDP-socket. Are we sure we can rely on this? If yes, that sounds quite elegant indeed (I didn't realize when reading the patch originally, and think we should have comments explaining this). Except, of course, that many kernels people will actually use for Freifunk fail to properly implement this...

@kaechele (Contributor, Author) commented Feb 3, 2020

> Is that right? I think it matches both the behavior I observed and your statements.

Yes, that is right.

> And it does sound like it is almost what we need, except for one part: do we also have a guarantee that all packets that come in to a reuseport group with the remote IP+port matching a 4-tuple socket in that group, are dispatched to that socket? Or is it legal for the kernel to dispatch such packets to a 2-tuple-UDP-socket? Doing so would not violate the guarantee I stated above, but it would mean we couldn't use this for tunneldigger.

The kernel scores the sockets based on their specificity (e.g. a 4-tuple connected socket that matches has a higher score than the 2-tuple bound one). So in any case, if there is a valid 4-tuple connected socket, it will be returned over any 2-tuple ones.

@RalfJung (Member) commented Feb 3, 2020

> The kernel scores the sockets based on their specificity (e.g. a 4-tuple connected socket that matches has a higher score than the 2-tuple bound one). So in any case, if there is a valid 4-tuple connected socket, it will be returned over any 2-tuple ones.

Okay. So assuming this is indeed guaranteed to remain the case (and it seems quite sensible, the usefulness being demonstrated by you relying on it), we basically have two options:

  • Rely on this behavior, and declare kernels with the bug unsupported. I understand that 5.2.17 is fixed and so is 4.19.75. Which mainline kernel is the first to have the fix, 5.3? Also, are all other kernels affected, so would we be effectively raising the minimal kernel version to 4.19.75? If the latter is the case, I think we should wait until at least Debian 10 and Ubuntu 18.04 have a fixed kernel. (Currently, Debian 10 is at 4.19.67, not sure what Ubuntu 18.04 is at.) Even then, I am worried about moving too fast.
  • Reengineer the broker to use a single socket per port, instead of opening an extra socket for each active tunnel -- and do the dispatching of messages to the tunnel instances ourselves. I don't think this is hard, but I could understand if you wouldn't be very motivated to implement this, and it might not mesh well with the broker's mixin structure.
@mitar (Member) commented Feb 3, 2020

For the first case, is there a way to test this at daemon startup time, so that we can warn people with a message (instead of relying on the kernel version or something)?

@RalfJung (Member) commented Feb 4, 2020

I don't see a good way to test this at startup time... but we could certainly add a message here to inform people that this is likely caused by the kernel being buggy.
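For what it's worth, a kernel-version-based warning could be sketched like this (thresholds taken from the commits identified earlier in this thread; kernel_may_have_reuseport_bug is a hypothetical helper and the version parsing is deliberately simplified):

```python
import platform

# Hypothetical startup check: warn when the running kernel likely has
# the SO_REUSEPORT dispatch bug discussed in this thread. Per the
# thread, the bug affects kernels newer than 4.4.x; it is fixed in
# 5.2.17 and was backported to 4.19.75, but not to 4.14 or earlier LTS.
def kernel_may_have_reuseport_bug(release: str) -> bool:
    # e.g. "5.3.16-300.fc31.x86_64" -> (5, 3, 16); parsing is simplified.
    version = tuple(int(n) for n in release.split("-")[0].split(".")[:3])
    if version < (4, 5, 0):
        return False  # bug introduced after the 4.4 series
    if (4, 19, 75) <= version < (4, 20, 0):
        return False  # fix backported to the 4.19 LTS series
    return version < (5, 2, 17)

assert not kernel_may_have_reuseport_bug("5.3.16-300.fc31.x86_64")
assert kernel_may_have_reuseport_bug("4.19.0-6-amd64")  # Debian 10 stock
assert not kernel_may_have_reuseport_bug("4.19.75")

if kernel_may_have_reuseport_bug(platform.release()):
    print("warning: this kernel may misdeliver SO_REUSEPORT UDP packets")
```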

@kaechele (Contributor, Author) commented Feb 4, 2020

Yes, I agree. This is the same spot I was thinking of putting a notice. Maybe we can create a GitHub Issue that explains the situation and link it in that message. That way we can create maximum clarity on what the situation is and how it can be mitigated.

@RalfJung (Member) commented Feb 4, 2020

> Maybe we can create a GitHub Issue that explains the situation and link it in that message. That way we can create maximum clarity on what the situation is and how it can be mitigated.

That makes sense.

However, I am still on the fence about whether it wouldn't be better to avoid SO_REUSEPORT. @kaechele, do you agree my plan for this would be feasible (independent of whether you think it is sensible)?

@kaechele (Contributor, Author) commented Feb 4, 2020

> However, I am still on the fence about whether it wouldn't be better to avoid SO_REUSEPORT. @kaechele, do you agree my plan for this would be feasible (independent of whether you think it is sensible)?

My take is that it is possible, technically, but only if we sacrifice a central design decision of Tunneldigger.

And let me just preface this with saying that I am not a Linux networking expert. So I may be missing an option that would achieve all we want without sacrificing anything, that I just don't know about.

Let me elaborate:
It is a design decision of Tunneldigger to run all traffic exclusively through one port. The reasoning behind this is that circumventing firewalls that only allow, for example, port 53 UDP is possible this way.
I work under the assumption that we cannot touch this design decision.

If we just ignore that for a second and accept that

  1. without SO_REUSEPORT we cannot bind more than one socket to the same local port
  2. without being able to do the above we would need NAT in the Kernel to de-multiplex incoming requests on that port to the respective tunnel sockets running on unique ports of their own locally

we are left with an option where we would both avoid NAT and also avoid SO_REUSEPORT:
The client would be assigned a unique destination port it will have to use to set up the tunnel.
To do that we would need to implement a way for the client to receive that information and act accordingly when setting up the tunnel.
This will lock out clients that are unable to deal with this (i.e. older clients) with no option to mitigate this with a compatibility layer that doesn't introduce either NAT or SO_REUSEPORT.

I initially removed NAT to make it less complex to implement IPv6 support. ip6tables, I am sure, is also able to do NAT the same way we do it on v4 for now. But I personally have no interest in carrying the NAT hack forward if there is a (in my opinion) better way to do it.

If we are trying to avoid SO_REUSEPORT for the mere benefit of not breaking Tunneldigger on Distros that are using buggy kernels (for which there is a fix in every LTS stream that is supported today) I have no gripes with waiting for these distros (from my testing it's only Debian 10 and RHEL 8.1) to catch up.

For your reference:
Testing and working (assuming latest updates, which is the case after netinstall):

  • Fedora 30
  • Fedora 31
  • Debian 10 with 5.3 backports Kernel
  • Ubuntu 16.04 LTS
  • Ubuntu 18.04 LTS (with linux-image-unsigned-5.3.0-28-generic)

Tested and broken:

  • CentOS/RHEL 8.1 (incl. with latest updates)
  • Debian 10 (incl. with latest updates)
  • Ubuntu 18.04 LTS (incl. with latest updates)

Assumed working:

  • Arch Linux (usually has a very recent kernel)
  • anything newer than Ubuntu 19.10 (comes with Kernel 5.3)

Assumed broken:

  • Ubuntu 19.04 (unless newer kernel image is used)
@mitar (Member) commented Feb 4, 2020

I think @RalfJung is asking if we could use bind instead of connect, just once (so only one socket), and Tunneldigger would then receive all control packets for all tunnels and forward them internally to the tunnels, instead of each tunnel opening a new socket.

I think using only one port is the main advantage of Tunneldigger.

@kaechele (Contributor, Author) commented Feb 4, 2020

> I think @RalfJung is asking if we could use bind instead of connect once (so only one socket), and Tunneldigger would then receive all control packets for all tunnels and then forward them internally to tunnels. Instead of us having each tunnel opening a new socket.

Ahh. Now I got it 😅. Sorry for my ignorance.

No, this wouldn't be possible. The kernel implementation of L2TPv3 over UDP is limited in that it requires each tunnel to have a unique socket. It doesn't care if this socket uses a re-used port but it needs to be an individual socket, unfortunately.

@mitar (Member) commented Feb 4, 2020

OK, then I think we should go with a clean codebase and support newer kernels. I think it is not too hard to control kernels on brokers, no?

We could also mention that if you need older kernels, you could use an older version of Tunneldigger with NAT. The question is: could we make it work so that you can have some brokers with NAT and some without and they all work together?

@kaechele (Contributor, Author) commented Feb 4, 2020

> OK, then I think we should go with a clean codebase and support newer kernels. I think it is not too hard to control kernels on brokers, no?

I agree.

> We could also mention that if you need older kernels, you could use an older version of Tunneldigger with NAT.

And, at this point, you wouldn't even lose any functionality. You just won't get any new stuff such as IPv6 support.
I did my Python 3 changes (which even included fixing the whole netfilter code) before removing NAT for exactly that reason. That way there is a version of the broker that runs on a supported version of Python while keeping the old behaviour.

The question is: could we make it work so that you can have some brokers with NAT and some without and they all work together?

The removal of NAT is completely transparent to the client. The client doesn't care if it is talking to a broker that uses NAT internally vs. one that doesn't.
So mixing old and new brokers just works.

Likewise running the old code on newer kernels works just fine. Even running old and new code running alongside each other serving different ports works fine.
The only thing that doesn't work is running the new code on a broken kernel.

@RalfJung

Member

RalfJung commented Feb 4, 2020

I certainly have no intent of keeping the NAT hack or changing the client-observable behavior.

No, this wouldn't be possible. The kernel implementation of L2TPv3 over UDP is limited in that it requires each tunnel to have a unique socket. It doesn't care if this socket uses a re-used port but it needs to be an individual socket, unfortunately.

Well, that's a shame. I think we could still do something where we just pool all those sockets on the broker side, put all messages into one bag, and then dispatch based on remote address+port ourselves (basically, doing the job that the kernel ought to be doing). But this starts to feel more like a hack...

So, I don't have a strong preference between the two approaches, and since you are both in favor of the clean approach that relies on a fixed kernel, we should go with that. But I would prefer to wait until non-backports Debian 10 + Ubuntu 18.04 are fixed, hoping that this will not take too long. On the Debian side, 4.19.98 is already in stable-proposed-updates. (I personally care mostly about Debian but it seems prudent to also wait for Ubuntu... it looks like both are applying LTS updates, after all.)

Working: [...] Ubuntu 16.04 LTS

Ah, so sufficiently old kernels are also unaffected?

@kaechele

Contributor Author

kaechele commented Feb 4, 2020

Ah, so sufficiently old kernels are also unaffected?

Yes. 4.4 and older are fine. The bug pops up in 4.5 and is around until 5.2.17.

As for Ubuntu 18.04 my guess is that unless someone opens a bug report they'll not automatically backport this fix to their in-house LTS version of 4.15. Newer kernels are easy to install, though.

Haven't tested 4.9 (which is also an LTS release) yet but I assume it's broken. I don't know of any still supported distribution that ships with 4.9 today.
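A naive shell check against those version numbers (taking 5.2.17 as the first fixed point release, per the figures above) could look like this. Big caveat: stable series received backports without crossing that boundary (e.g. Debian's 4.19.98 is fixed), and distro kernels may patch without changing the version string, so treat "may be affected" as a hint only, not a verdict:

```shell
# Is a kernel version in the window where connected L2TP sockets
# were broken (>= 4.5 and < 5.2.17)? Uses GNU sort -V for version
# comparison; does NOT account for distro backports of the fix.
affected() {
    v=$1
    # below 4.5 -> fine
    [ "$(printf '%s\n' 4.5 "$v" | sort -V | head -n1)" = "4.5" ] || return 1
    # 5.2.17 or newer -> fine
    [ "$v" = "5.2.17" ] && return 1
    [ "$(printf '%s\n' "$v" 5.2.17 | sort -V | head -n1)" = "$v" ] || return 1
    return 0
}

v=$(uname -r | cut -d- -f1)
if affected "$v"; then
    echo "kernel $v may be affected (check for distro backports)"
else
    echo "kernel $v looks fine, version-wise"
fi
```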

@RalfJung

Member

RalfJung commented Feb 4, 2020

in-house LTS version of 4.15

Oh, I didn't know Ubuntu did that. :/

Well, at least Debian should be getting the backport, I think. Last time it took 6 weeks though from stable-proposed-updates to stable.

@mitar

Member

mitar commented Feb 5, 2020

But I would prefer to wait until non-backports Debian 10 + Ubuntu 18.04 are fixed, hoping that this will not take too long.

I think this is reasonable.

Also, how much effort would it be to keep both the NAT and non-NAT versions around for now? That is, keep both behind a command-line switch, deprecating NAT and then removing it in a year or so?

As for Ubuntu 18.04 my guess is that unless someone opens a bug report they'll not automatically backport this fix to their in-house LTS version of 4.15.

So let's open a bug report? I can upvote it. :-)

@RalfJung

Member

RalfJung commented Feb 5, 2020

Also, how much effort would it be to keep both the NAT and non-NAT versions around for now? That is, keep both behind a command-line switch, deprecating NAT and then removing it in a year or so?

We don't exactly have an overabundance of manpower, so I wouldn't want to commit to maintaining two branches.

But the bug report can point to the commit (or maybe tag) of master right before the NAT removal gets merged; I don't foresee that broker ceasing to work any time soon.

@RalfJung

Member

RalfJung commented Feb 10, 2020

Debian stable now has 4.19.98.

Debian oldstable backports is still at 4.19.67, though. Not sure if they plan to update that (so far it remained roughly in sync with stable).
