Incoming messages get incorrectly dispatched to broker #129

Closed · RalfJung opened this issue Mar 7, 2020 · 13 comments · Fixed by #135

Comments


RalfJung commented Mar 7, 2020

We re-landed @kaechele's NAT removal, but things are still going very wrong: once I have more than 1 or 2 clients (I am not sure if there is a fixed limit, but with 2 clients things still seemed to work fine), connections start to drop. The server thinks the clients timed out.

I added some extra logging, and from what I can see, incoming messages from the client often arrive at the (2-tuple) broker socket, not at the (4-tuple) tunnel socket. So despite what @kaechele said here, there does not seem to be a guarantee that messages matching a 4-tuple socket do indeed get delivered to that socket.

This is on Debian buster:

```
$ uname -a
Linux gw1.saar.freifunk.net 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux
```
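
For readers following along, here is a minimal sketch of the two-socket pattern under discussion, using Python's standard socket module; the address, port, and function name are illustrative, not tunneldigger's actual code:

```python
import socket

# Illustrative local address/port, not tunneldigger's actual defaults.
BROKER_ADDR = ("0.0.0.0", 8942)

# The 2-tuple "broker" socket: bound to the local address only; it
# receives packets from clients that do not have a tunnel yet.
broker_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
broker_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
broker_sock.bind(BROKER_ADDR)

def make_tunnel_socket(client_addr):
    """Create the 4-tuple "tunnel" socket for one client: bound to the
    same local address and then connect()ed to the client. With
    SO_REUSEPORT both sockets coexist on the same port; the kernel is
    expected to deliver packets from client_addr to this connected
    socket in preference to broker_sock -- the guarantee this issue
    shows is broken on the affected kernels."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(BROKER_ADDR)
    s.connect(client_addr)
    return s
```

On a correctly behaving kernel, once the connected socket for a client exists, the broker socket should never again see packets from that client; this issue is about that property failing.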

RalfJung commented Mar 7, 2020

When using the backports kernel, the issue goes away.

```
$ uname -a
Linux gw1.saar.freifunk.net 5.4.0-0.bpo.3-amd64 #1 SMP Debian 5.4.13-1~bpo10+1 (2020-02-07) x86_64 GNU/Linux
```

However, requiring a 5.4 kernel seems a bit excessive...


RalfJung commented Mar 7, 2020

I also tried some awful hacks that only use one socket per broker. However, then I get `OSError: Netlink error: Device or resource busy (16)` when the second tunnel is created. Maybe the kernel does not support more than one L2TP tunnel per socket?
Also, the tunnel didn't actually seem to transport any data, but that could easily be some silly mistake somewhere in my hacks.
EDIT: Yeah, `l2tp_validate_socket` looks like it returns EBUSY when there already is a tunnel on that socket. So there's no way to do what we want without SO_REUSEPORT.

Note that dispatching those messages to the right tunnel in userspace fixed the timeouts, but user traffic still did not work -- likely because the L2TP data traffic also arrived at the wrong socket and thus was never processed by the kernel.
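
As a rough illustration of the userspace re-dispatch described above (a sketch with hypothetical names throughout -- `tunnels`, `handle_message`, and `handle_new_client` are not the broker's real identifiers):

```python
def dispatch_broker_packet(data, addr, tunnels, broker):
    """Hand a packet received on the 2-tuple broker socket to the
    tunnel it actually belongs to, if one exists. `tunnels` is assumed
    to be a dict mapping (client_ip, client_port) to tunnel objects."""
    tunnel = tunnels.get(addr)
    if tunnel is not None:
        # A control message that should have hit the connected socket:
        # processing it here keeps keepalives working, but the L2TP
        # *data* traffic is still consumed by the kernel on the wrong
        # socket, so user traffic remains broken.
        tunnel.handle_message(data)
    else:
        # A genuinely new client contacting the broker.
        broker.handle_new_client(data, addr)
```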


RalfJung commented Mar 7, 2020

I'll make a guess and say that this commit could be the one fixing the bug here. It first appeared in 5.0, and unfortunately does not seem to have been backported.


kaechele commented Mar 9, 2020

> I'll make a guess and say that this commit could be the one fixing the bug here. It first appeared in 5.0, and unfortunately does not seem to have been backported.

Yes, this is actually the relevant commit. So that would bump the minimum kernel requirement up to at least 5.2.17, which would exclude quite a number of popular distros running their default non-backported kernels.


RalfJung commented Mar 9, 2020

Yeah: the fix is in 5.2.17, 5.3.1, and then 5.4 and later.

RalfJung commented:

So, how do we proceed? I have to admit that I'd really like to get rid of this NAT stuff from our gateways.^^ And using Debian stable with a backports kernel is not an entirely uncommon setup, I hope.

I see two options:

  • We make the use of the NAT hack configurable. So all the code would stick around for now, but those of us with sufficiently modern kernels at least don't have to run it.
  • We start a "legacy" branch at 50e22df, recommend that people with older kernels use it, and backport fixes if something sufficiently critical for the broker comes up. Nothing like that has come up since the session ID thing, so it will probably not be too much work to support the legacy branch for a few years until fixed kernels have spread more.

@kaechele @mitar any opinions?


mitar commented Mar 12, 2020

I think the second option is a good one, provided that we document it well and also make it easy for people to install one version or the other. (Which reminds me: we do not really offer any packages of Tunneldigger, no?)

kaechele commented:

I agree. A secondary legacy branch should not be too hard to maintain and it allows us to move on and start developing new features on the lean codebase.

RalfJung commented:

All right, let's go for that then. I created a legacy branch. I also updated #126 to describe the new situation.

> provided that we document it well

So what else should we do for that? We should probably call this out in the broker docs, describing the symptoms and linking to #126.

We could also detect this problem similarly to how we already detect a packet arriving at the wrong 4-tuple socket, so that we can print a very targeted error message. However, doing that requires a loop over every tunnel for each packet that arrives at the broker. Do you think that will be a performance problem?

> (Which reminds me: we do not really offer any packages of Tunneldigger, no?)

Of the broker? Not that I know of.


mitar commented Mar 17, 2020

> So what else should we do for that?

I think the README would be a good place, explaining that there are two versions and two branches.

> However, doing that requires a loop over every tunnel for each packet that arrives at the broker. Do you think that will be a performance problem?

How often? Could we throttle this somehow? In a way, this message only has to appear once in the logs to point you in the right direction.

RalfJung commented:

> How often? Could we throttle this somehow? In a way, this message only has to appear once in the logs to point you in the right direction.

The issue is rather the case where the error has not happened (yet), e.g. because we are on a good kernel. I am not worried about performance on broken kernels; stuff won't work there anyway.

However, we only have to do this for packets arriving at the broker socket, not for packets arriving on an already established connection.
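
A sketch of what such a targeted check could look like, assuming the broker keeps a dict of tunnels keyed by client address (all names here are hypothetical, not the broker's real code):

```python
import logging

logger = logging.getLogger("tunneldigger.broker")  # illustrative name
_warned_misdispatch = False  # emit the targeted error only once

def check_misdispatch(addr, tunnels):
    """Run only for packets arriving on the broker socket: if the
    source address matches an established tunnel, the kernel delivered
    the packet to the wrong SO_REUSEPORT socket."""
    global _warned_misdispatch
    if addr in tunnels and not _warned_misdispatch:
        _warned_misdispatch = True
        logger.error(
            "Packet from %s:%s arrived at the broker socket although a "
            "connected tunnel socket exists; your kernel probably lacks "
            "the SO_REUSEPORT dispatch fix (see issue #126).", *addr)
```

With an address-keyed dict, the per-packet cost on the broker socket is a single lookup rather than a loop over all tunnels, which should keep the overhead negligible on healthy kernels.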

kaechele commented:

> (Which reminds me: we do not really offer any packages of Tunneldigger, no?)

I maintain Fedora packages here: https://copr.fedorainfracloud.org/coprs/heffer/tunneldigger/
At some point I will submit them for inclusion with Fedora proper.

RalfJung commented:

Submitted PR: #135
