Remove NAT logic #115
Conversation
I believe this to be the case. BTW, I've been running this version of the code (actually the IPv6 version of it) at Freifunk Leverkusen on all gateways ever since I finished working on it. So it's definitely working.
Field-testing, great. :) Which kernel versions have you been using there?
Currently all gateways run 5.3.16-300.fc31.x86_64.
So for us it was important that both data and control packets work over the same port (UDP port 53, DNS) so that simple routers and others do not filter it out. Sadly, at least at that time, each tunnel required a different port.
In case anyone is interested: the way this works in the kernel these days is that once you link a socket to an L2TPv3 tunnel, the kernel will snatch all packets it deems as belonging to an active tunnel session and process them directly. It distinguishes sessions by the session ID in the L2TPv3 data packet header (the tunnel ID has no significance here). It then drops the decapsulated data packets onto the corresponding interface for each tunnel (i.e. session).

If the header of the packet does not match an L2TPv3 data packet, or the data packet is invalid in the context of the session, the packet will drop through to userspace, where the Python broker can then pull it out of the socket and deal with it. That's why the broker can still see the control channel packets even though the kernel deals with the L2TPv3 data packets by itself at a lower level, without the broker's involvement. In fact, the broker won't even see any valid L2TPv3 data packets, as the kernel will already have pulled them out of the socket before the Python broker could get to them.
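The dispatch described above can be sketched in a few lines. This is a simplified illustration, not kernel code: the header layout is reduced to a flags/version word, a reserved field, and the session ID, and `dispatch`/`active_sessions` are hypothetical names.

```python
import struct

# Illustrative sketch of the kernel-side dispatch: data packets for a known
# session ID are consumed "in the kernel"; everything else (control packets,
# unknown sessions) falls through to the userspace broker.
T_BIT = 0x8000  # set in the first 16 bits of control packets

def dispatch(packet: bytes, active_sessions: set) -> str:
    (flags_ver,) = struct.unpack_from("!H", packet, 0)
    if flags_ver & T_BIT:
        return "userspace"   # control packet: left in the socket for the broker
    (session_id,) = struct.unpack_from("!I", packet, 4)
    if session_id in active_sessions:
        return "kernel"      # valid data packet: decapsulated onto the tunnel iface
    return "userspace"       # unknown/invalid session: falls through to the broker

# A data packet (T bit clear, version 3) for session 42:
data_pkt = struct.pack("!HHI", 0x0003, 0, 42)
print(dispatch(data_pkt, {42}))  # kernel
```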
Yes, in the past they used ports to discriminate tunnels, not the session ID. Great that this was fixed. This then also allows IPv6 to work!
BTW, with these major changes, I suggest you do a major version bump when releasing the next version.
Yes, see #116 :)
Yes, I saw it! :-)
My concern here is: isn't it possible that some packet of the tunneldigger protocol "looks like" an L2TP packet, and ends up being interpreted by the kernel instead of forwarded to userspace?
This would be highly unlikely. Also, it would be a problem even with today's implementation, since that piece didn't change.
I mean, we should ensure this is not possible, not just note that it is unlikely. So we should make sure that the payload L2TP header does not look like a control packet header.
If I remember correctly, L2TP even has a proposed structure for the control packets. Maybe we are already using it.
How so? Don't we use NAT today to avoid ever actually having control data on a tunnel port?
Nope. All the NAT stuff does is map traffic coming from a certain IP to the broker port onto another port internally. Control and payload traffic arrive on the same socket, just as they do with this change. The only real change here is that we reuse the port and fully rely on the kernel to handle the separation of packets based on the socket 4-tuple (src host, src port, dest host, dest port). Basically what iptables has been doing for us before.
Okay then, makes sense.
Is it separating based on the 4-tuple or based on payload? I understood it is looking into headers to determine data packets?
Sorry, actually I am mistaken. I did some more reading. When using SO_REUSEPORT, the traffic is evenly distributed between all listening sockets by the kernel. For us this doesn't matter, since only the kernel L2TP driver and the broker will be listening there, and thus any packet arriving on any socket would drop into the same code path anyway. Using SO_REUSEPORT ensures that only the user that bound the first socket gets to bind any subsequent shared-port sockets.
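As a quick illustration of the binding behaviour being described (a sketch using Python's socket module; the loopback address and kernel-picked port are arbitrary):

```python
import socket

# Sketch: SO_REUSEPORT allows several UDP sockets to bind the same
# address/port, provided each socket sets the option before bind()
# (and, on Linux, the same user created the first binding socket).
def reuseport_socket(port: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

a = reuseport_socket(0)          # let the kernel pick a free port
port = a.getsockname()[1]
b = reuseport_socket(port)       # a second bind to the same port succeeds
```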
So are control packets different enough from L2TP data packets that it cannot ever be confused which packet should go where?
Yes.
@kaechele So part of this PR is to use SO_REUSEPORT. Also, which constraints does this put on our control message protocol, if any? I extended that protocol some time ago and I definitely did not try to avoid colliding with L2TP.^^
It doesn't.
None. The first bit of the header magic of the packets the TD client constructs is set to 1 (either on purpose or by coincidence). This indicates a control packet (as opposed to a data packet). The kernel doesn't even look beyond this and passes all control packets to userspace (i.e., in our case, the broker). After that we're free to do what we want with the packet in the broker. I did notice, however, that client and broker do not produce valid control packets by L2TP standards. But that doesn't matter, since we deal with these packets only in our own code.
Yes, that was on purpose.
That was also on purpose. We didn't want to get bogged down with a standard we do not really care about (also, we were unsure how it maps to mesh networks and so on).
I thought those "l2tp control packets" that the RFC spoke about were something that the kernel was using to coordinate with the other side? Are you saying now that actually those are expected to be handled by the userspace anyway even in a non-tunneldigger environment? You make it sound like the kernel is only implementing the "data packets" part of l2tp. What would be an example implementation of the "actual" l2tp control protocol?
Oh... given the total lack of comments for these magic numbers, I assumed they were basically picked at random. If there was any thought that went into picking that particular sequence of bytes (0x8073) …
Yes. That is because this is indeed the case: the kernel only implements the data-packet part of L2TP.
One would be https://prol2tp.com/ (used to be open source as OpenL2TP). This is from the company that created the Kernel L2TP driver. I know of no other open-source L2TPv3 specific broker.
Let's discuss https://github.com/torvalds/linux/blob/master/net/l2tp/l2tp_core.c#L792. The kernel in this case looks at the first 16 bits of the header (L#828):

0x8073 & 0x000F = 3. OK cool, this is an L2TPv3 packet, as we expected.

0x8073 & 0x8000 = 0x8000, thus not null, and the code within the if statement is executed: the packet is treated as a control packet and passed through to userspace.

What exactly the significance of any of the other bits in the magic header is, in full detail, I don't know. That would be a question for @kostko or @mitar.
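Put as code, the two checks discussed here look like this (a sketch; the mask values come from the discussion above, and the constant names mirror the kernel source):

```python
# Sketch of the checks on the first 16 bits of an L2TP-over-UDP header.
L2TP_HDR_VER_MASK = 0x000F   # low 4 bits: protocol version
L2TP_HDRFLAG_T    = 0x8000   # T bit: 1 = control packet, 0 = data packet

def classify(first16: int) -> str:
    if first16 & L2TP_HDR_VER_MASK != 3:
        return "not L2TPv3"
    if first16 & L2TP_HDRFLAG_T:
        return "control packet -> passed through to userspace"
    return "data packet -> handled by the kernel driver"

print(classify(0x8073))  # the tunneldigger client's magic: a control packet
```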
@kaechele all right, I think I am convinced. Thanks for your patience. :) Could you submit a PR putting some of these findings into comments at a suitable place in the code? (If you don't have time, I'll try to find some.)
Good to hear that it's not just our setup. :) I'll see if I can try this on Debian 10 (Buster) with the backports kernel.
Okay. So this is a bug in all kernels newer than 4.4.x and older than 5.2.17. This commit fixes it in 5.2.17:
We need to think about whether we can bump our minimum required kernel version that high, since all LTS kernels (and thus the kernels of all LTS distros and Debian) will have this bug.
Lucky us. It was backported to 4.19.75 as …
Do we need reuseport?
@kaechele good catch! My attempt at upgrading one more of our gateways to Debian 10 so I could test the later backports kernels ended with that gateway not responding any more... we'll have to sort that out before I can make any further experiments.

However, I don't think I am entirely following. Given that UDP is stateless, what is the "connection" the commit message is talking about? I think the previous ("buggy") behavior of the kernel is perfectly in line with the documented behavior of SO_REUSEPORT. Which extra guarantees are provided by this patch?
From the …
So we are basically setting the default sendto address and also specifying from which source address we will be accepting packets. You are right, the documented behaviour is incomplete, as it does not contain this information about …
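For illustration, this is the standard connected-UDP-socket behaviour being paraphrased above (a sketch; the address and port are made up):

```python
import socket

# connect() on a datagram socket performs no handshake; it only fixes the
# 4-tuple: it sets the default destination for send() and tells the kernel
# to deliver only packets whose source matches the connected address/port.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("127.0.0.1", 0))
s.connect(("127.0.0.1", 9999))   # hypothetical remote endpoint
s.send(b"payload")               # no destination argument needed any more
```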
Yes, if we want to keep the old behaviour of all data flowing into one single port on the remote side.
I am not convinced of this. I think we can alternatively have a single socket (per port) on the broker side and dispatch messages to the …
So let me see if I got this straight... Is that right? I think it matches both the behavior I observed and your statements. And it does sound like it is almost what we need, except for one part: do we also have a guarantee that all packets that come in to a reuseport group with the remote IP+port matching a 4-tuple socket in that group are dispatched to that socket? Or is it legal for the kernel to dispatch such packets to a 2-tuple UDP socket? Doing so would not violate the guarantee I stated above, but it would mean we couldn't use this for tunneldigger.

Given that your patch seems to work fine in recent kernels, I assume the answer is that current implementations will also guarantee that if a matching 4-tuple UDP socket exists, packets will not be dispatched to a 2-tuple UDP socket. Are we sure we can rely on this? If yes, that sounds quite elegant indeed (I didn't realize this when reading the patch originally, and think we should have comments explaining it). Except, of course, that many kernels people will actually use for Freifunk fail to properly implement this...
Yes, that is right.
The kernel scores the sockets based on their specificity (e.g. a 4-tuple connected socket that matches has a higher score than the 2-tuple bound one). So in any case, if there is a valid 4-tuple connected socket, it will return this socket over any 2-tuple ones.
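This scoring can be observed with a small experiment (a sketch; it assumes a kernel with the fix discussed in this thread, e.g. 4.19.75+ or 5.2.17+, where the connected 4-tuple socket in the reuseport group wins over the plain bound one):

```python
import socket

def reuseport_udp(port: int = 0) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

broker = reuseport_udp()                  # 2-tuple socket, like the broker's
port = broker.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.bind(("127.0.0.1", 0))             # stands in for a tunnel client

tunnel = reuseport_udp(port)              # same shared port, then connect():
tunnel.connect(client.getsockname())      # now a 4-tuple connected socket

client.sendto(b"hello", ("127.0.0.1", port))
tunnel.settimeout(2)
data = tunnel.recv(16)                    # lands on the more specific socket
```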
Okay. So assuming this is indeed guaranteed to remain the case (and it seems quite sensible, the usefulness being demonstrated by you relying on it), we basically have two options:
For the first case, is there a way to test this at daemon startup time, so that we can warn people with a message (instead of relying on the kernel version or something)?
I don't see a good way to test this at startup time... but we could certainly add a message here to inform people that this is likely caused by the kernel being buggy.
Yes, I agree. This is the same spot I was thinking of putting a notice. Maybe we can create a GitHub issue that explains the situation and link it in that message. That way we can create maximum clarity on what the situation is and how it can be mitigated.
That makes sense. However, I am still on the fence as to whether it wouldn't be better to avoid SO_REUSEPORT.
My take is that it is possible, technically, but only if we sacrifice a central design decision of Tunneldigger. And let me just preface this by saying that I am not a Linux networking expert. So I may be missing an option that would achieve all we want without sacrificing anything, that I just don't know about.

Let me elaborate: if we just ignore that for a second and accept that sacrifice, we are left with an option where we would both avoid NAT and also avoid SO_REUSEPORT.

I initially removed NAT to make it less complex to implement IPv6 support. ip6tables, I am sure, is also able to do NAT the same way we do it on v4 for now. But I personally have no interest in carrying the NAT hack forward if there is a (in my opinion) better way to do it.

If we are trying to avoid SO_REUSEPORT … For your reference:
Tested and broken:
Assumed working:
Assumed broken:
I think @RalfJung is asking if we could use a single socket for all tunnels. I think using only one port is the main advantage of Tunneldigger.
Ahh. Now I got it 😅. Sorry for my ignorance. No, this wouldn't be possible. The kernel implementation of L2TPv3 over UDP is limited in that it requires each tunnel to have a unique socket. It doesn't care if this socket uses a re-used port, but it needs to be an individual socket, unfortunately.
OK, then I think we should go with the clean codebase and support newer kernels. I think it is not too hard to control kernels on brokers, no? We could also mention that if you need older kernels, you could use an older version of Tunneldigger with NAT. The question is: could we make it work so that you can have some brokers with NAT and some without, and they all work together?
I agree.
And, at this point, you wouldn't even lose any functionality. You just won't get any new stuff such as IPv6 support.
The removal of NAT is completely transparent to the client. The client doesn't care if it is talking to a broker that uses NAT internally vs. one that doesn't. Likewise, running the old code on newer kernels works just fine. Even old and new code running alongside each other, serving different ports, works fine.
I certainly have no intent of keeping the NAT hack or changing the client-observable behavior.
Well, that's a shame. I think we could still do something where we just pool all those sockets on the broker side, put all messages into one bag, and then dispatch based on remote address+port ourselves (basically, doing the job that the kernel ought to be doing). But this starts to feel more like a hack...

So, I don't have a strong preference between the two approaches, and since you are both in favor of the clean approach that relies on a fixed kernel, we should go with that. But I would prefer to wait until non-backports Debian 10 + Ubuntu 18.04 are fixed, hoping that this will not take too long. On the Debian side, 4.19.98 is already in stable-proposed-updates. (I personally care mostly about Debian, but it seems prudent to also wait for Ubuntu... it looks like both are applying LTS updates, after all.)
Ah, so sufficiently old kernels are also unaffected?
Yes. 4.4 and older are fine. The bug pops up in 4.5 and is around until it was fixed in 5.2.17. As for Ubuntu 18.04, my guess is that unless someone opens a bug report they'll not automatically backport this fix to their in-house LTS version of 4.15. Newer kernels are available for easy installation, though. I haven't tested 4.9 (which is also an LTS release) yet, but I assume it's broken. I don't know of any still-supported distribution that ships with 4.9 today.
Oh, I didn't know Ubuntu did that. :/ Well, at least Debian should be getting the backport, I think. Last time it took 6 weeks, though, from stable-proposed-updates to stable.
I think this is reasonable. Also, how much effort would it be to keep both NAT and non-NAT versions around for now? So, keep both behind a command line switch while deprecating NAT, and then remove it in a year or so?
So let's open a bug report? I can upvote it. :-)
We don't exactly have an overabundance of manpower, so I wouldn't want to commit to maintaining two branches. But the bug report can point to the commit (or maybe tag) of master right before the NAT removal gets merged; I don't foresee that broker version breaking any time soon.
Debian stable now has 4.19.98. Debian oldstable backports is still at 4.19.67, though. Not sure if they plan to update that (so far it remained roughly in sync with stable).
Given that this has been on stable for a few weeks now, shall we proceed with re-landing this patch?
Turns out, the NAT crutch to run all tunnels through one external port wasn't needed after all. The kernel handles this just fine.
Depends on #110