Incoming messages get incorrectly dispatched to broker #129

Closed · RalfJung opened this issue Mar 7, 2020 · 13 comments · Fixed by #135

Comments


RalfJung commented Mar 7, 2020

We re-landed @kaechele's NAT removal, but things are still going very wrong: once I have more than 1 or 2 clients (I am not sure if there is a fixed limit, but with 2 clients things still seemed to work fine), connections start to drop. The server thinks the clients timed out.

I added some extra logging, and from what I can see, incoming messages from the client often arrive at the (2-tuple) broker socket, not at the (4-tuple) tunnel socket. So despite what @kaechele said here, there does not seem to be a guarantee that messages matching a 4-tuple socket do indeed get delivered to that socket.

This is on Debian buster:

```
$ uname -a
Linux gw1.saar.freifunk.net 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux
```
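
For readers following along, here is a minimal sketch of the two-socket pattern under discussion, using Python's standard socket module; the address, port, and function name are illustrative, not tunneldigger's actual code:

```python
import socket

# Illustrative local address/port, not tunneldigger's actual defaults.
BROKER_ADDR = ("0.0.0.0", 8942)

# The 2-tuple "broker" socket: bound to the local address only; it
# receives packets from clients that do not have a tunnel yet.
broker_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
broker_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
broker_sock.bind(BROKER_ADDR)

def make_tunnel_socket(client_addr):
    """Create the 4-tuple "tunnel" socket for one client: bound to the
    same local address and then connect()ed to the client. With
    SO_REUSEPORT both sockets coexist on the same port; the kernel is
    expected to deliver packets from client_addr to this connected
    socket in preference to broker_sock -- the guarantee this issue
    shows is broken on the affected kernels."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(BROKER_ADDR)
    s.connect(client_addr)
    return s
```

On a correctly behaving kernel, once the connected socket for a client exists, the broker socket should never again see packets from that client; this issue is about that property failing.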

RalfJung commented Mar 7, 2020

When using the backports kernel, the issue goes away.

```
$ uname -a
Linux gw1.saar.freifunk.net 5.4.0-0.bpo.3-amd64 #1 SMP Debian 5.4.13-1~bpo10+1 (2020-02-07) x86_64 GNU/Linux
```

However, requiring a 5.4 kernel seems a bit excessive...


RalfJung commented Mar 7, 2020

I also tried some awful hacks that only use one socket per broker. However, then I get `OSError: Netlink error: Device or resource busy (16)` when the second tunnel is created. Maybe the kernel does not support more than one L2TP tunnel per socket?
Also, the tunnel didn't actually seem to transport any data, but that could easily be some silly mistake somewhere in my hacks.
EDIT: Yeah, `l2tp_validate_socket` looks like it returns EBUSY when there already is a tunnel on that socket. So there's no way to do what we want without SO_REUSEPORT.

Note that dispatching those messages to the right tunnel in userspace fixed the timeouts, but user traffic still did not work -- likely because the L2TP data traffic also arrived at the wrong socket and thus was never processed by the kernel.
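
As a rough illustration of the userspace re-dispatch described above (a sketch with hypothetical names throughout -- `tunnels`, `handle_message`, and `handle_new_client` are not the broker's real identifiers):

```python
def dispatch_broker_packet(data, addr, tunnels, broker):
    """Hand a packet received on the 2-tuple broker socket to the
    tunnel it actually belongs to, if one exists. `tunnels` is assumed
    to be a dict mapping (client_ip, client_port) to tunnel objects."""
    tunnel = tunnels.get(addr)
    if tunnel is not None:
        # A control message that should have hit the connected socket:
        # processing it here keeps keepalives working, but the L2TP
        # *data* traffic is still consumed by the kernel on the wrong
        # socket, so user traffic remains broken.
        tunnel.handle_message(data)
    else:
        # A genuinely new client contacting the broker.
        broker.handle_new_client(data, addr)
```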


RalfJung commented Mar 7, 2020

I'll make a guess and say that this commit could be the one fixing the bug here. It first appeared in 5.0, and unfortunately does not seem to have been backported.


kaechele commented Mar 9, 2020

> I'll make a guess and say that this commit could be the one fixing the bug here. It first appeared in 5.0, and unfortunately does not seem to have been backported.

Yes, this is actually the relevant commit. So that would bump the minimum kernel requirement up to at least 5.2.17, which would exclude quite a number of popular distros running their default non-backported kernels.


RalfJung commented Mar 9, 2020

Yeah: the fix is in 5.2.17, 5.3.1, and then 5.4 and later.

RalfJung commented:

So, how do we proceed? I have to admit that I'd really like to get rid of this NAT stuff from our gateways.^^ And using Debian stable with a backports kernel is not an entirely uncommon setup, I hope.

I see two options:

  • We make the use of the NAT hack configurable. So all the code would stick around for now, but those of us with sufficiently modern kernels at least don't have to run it.
  • We start a "legacy" branch at 50e22df, recommend that people with older kernels use it, and backport fixes if something sufficiently critical for the broker comes up. Nothing like that has come up since the session ID thing, so it will probably not be too much work to support the legacy branch for a few years until fixed kernels have spread more.

@kaechele @mitar any opinions?


mitar commented Mar 12, 2020

I think the second option is a good one, provided that we document it well and also make it easy for people to install one version or the other. (Which reminds me: we do not really offer any packages of Tunneldigger, no?)

kaechele commented:

I agree. A secondary legacy branch should not be too hard to maintain and it allows us to move on and start developing new features on the lean codebase.

RalfJung commented:

All right, let's go for that then. I created a legacy branch. I also updated #126 to describe the new situation.

> provided that we document it well

So what else should we do for that? We should probably call this out in the broker docs, describing the symptoms and linking to #126.

We could also detect this problem similarly to how we already detect a packet arriving at the wrong 4-tuple socket, so that we can print a very targeted error message. However, doing that requires a loop over every tunnel for each packet that arrives at the broker. Do you think that will be a performance problem?

> (Which reminds me: we do not really offer any packages of Tunneldigger, no?)

Of the broker? Not that I know of.


mitar commented Mar 17, 2020

> So what else should we do for that?

I think the README would be a good place, explaining that there are two versions and two branches.

> However, doing that requires a loop over every tunnel for each packet that arrives at the broker. Do you think that will be a performance problem?

How often? Could we throttle this somehow? In a way, this message only has to appear once in the logs to point you in the right direction.

RalfJung commented:

> How often? Could we throttle this somehow? In a way, this message only has to appear once in the logs to point you in the right direction.

The issue is rather the case where the error has not happened (yet), e.g. because we are on a good kernel. I am not worried about performance on broken kernels; stuff won't work there anyway.

However, we only have to do this for packets arriving at the broker socket, not for packets arriving on an already established connection.
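
A sketch of what such a targeted check could look like, assuming the broker keeps a dict of tunnels keyed by client address (all names here are hypothetical, not the broker's real code):

```python
import logging

logger = logging.getLogger("tunneldigger.broker")  # illustrative name
_warned_misdispatch = False  # emit the targeted error only once

def check_misdispatch(addr, tunnels):
    """Run only for packets arriving on the broker socket: if the
    source address matches an established tunnel, the kernel delivered
    the packet to the wrong SO_REUSEPORT socket."""
    global _warned_misdispatch
    if addr in tunnels and not _warned_misdispatch:
        _warned_misdispatch = True
        logger.error(
            "Packet from %s:%s arrived at the broker socket although a "
            "connected tunnel socket exists; your kernel probably lacks "
            "the SO_REUSEPORT dispatch fix (see issue #126).", *addr)
```

With an address-keyed dict, the per-packet cost on the broker socket is a single lookup rather than a loop over all tunnels, which should keep the overhead negligible on healthy kernels.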

kaechele commented:

> (Which reminds me: we do not really offer any packages of Tunneldigger, no?)

I maintain Fedora packages here: https://copr.fedorainfracloud.org/coprs/heffer/tunneldigger/
At some point I will submit them for inclusion with Fedora proper.

RalfJung commented:

Submitted PR: #135
