[pull] master from upstream #18

pull · 2025-12-03T14:59:34Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

The warning message we print of SO_BINDTODEVICE fails is incorrect: we include EPOLL_TYPE_STR(proto), but the proto variable is not an epoll type so this will generate misleading garbage. This has been wrong ever since 3401644 but the message is rare enough that we never noticed. Correct that, and reword the message a bit for clarity while we're there. Fixes: 3401644 ("epoll: Generalize epoll_ref to cover things other than sockets") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

sock_probe_mem() currently checks whether we're able to allocate large socket buffers. Extend it to also check whether the SO_BINDTODEVICE socket option is available. Rename to sock_probe_features() to reflect the new functionality. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Add whitespace around "-" in sock_probe_features()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Before 5.7, the kernel didn't allow SO_BINDTODEVICE to be called unprivileged. That means for earlier kernels, we can't implement binding listening sockets to a specific interface (e.g. -t %eth0/80). Currently we'll generate an error on this at the point we actually attempt the SO_BINDTODEVICE setsockopt(), at which point the connection to the command line option might not be entirely clear. Use the fact we now probe for SO_BINDTODEVICE support to make a clearer error message at the time we parse the forwarding option. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

This was there when the structure was created, but never used. Looks like a copy-pasta error. Fixes: 781164e ("flow: Helper to create sockets based on flowside") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

sockaddr_inany can contain either an IPv4 or IPv6 socket address, so the relevant length for bind() or connect() can vary. In pif_sockaddr() we return that length, and in sock_l4_sa() we take it as an extra parameter. However, sockaddr_inany always contains exactly a sockaddr_in or a sockaddr_in6 each with a fixed size. Therefore we can derive the relevant length from the family, and don't need to pass it around separately. Make a tiny helper to get the relevant address length, and update all interfaces to use that approach instead. In the process, fix a buglet in tcp_flow_repair_bind(): we passed sizeof(union sockaddr_inany) to bind() instead of the specific length for the address family. Since the sizeof() is always longer than the specific length, this is probably fine, but not theoretically correct. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether to set the IPV6_V6ONLY socket option. Usually it's set when the given address is IPv6, but not when we want to create a dual stack listening socket. The latter only makes sense when the address is :: however. Clarify this by only keeping the v6only option in an internal helper sock_l4_(). External users will call either sock_l4() which always creates a socket bound to a specific IP version, or sock_l4_dualstack() which creates a dual stack socket, but takes only a port not an address. We drop the '_sa' suffix while we're at it - it exists because this used to be an internal version with a sock_l4() wrapper. The wrapper no longer exists so the '_sa' is no longer useful. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Surprisingly little logic is shared between the path for creating a listen()ing socket in the guest namespace versus in the host namespace. Improve this, by extending tcp_sock_init_one() to take a pif parameter indicating where it should open the socket. This allows tcp_ns_sock_init[46]() to be removed entirely. We generalise tcp_sock_init() in the same way, although we don't fully use it yet, due to some subtle differences in how we bind for -t versus -T. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

udp_sock_init() takes an 'ns' parameter determining if it creates a socket in the guest namespace or host namespace. Alter it to take a pif parameter instead, like tcp_sock_init(), and use that change to slightly reduce code duplication between the HOST and SPLICE cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Inbound sockets are bound to the unspecified address by default, whereas outbound sockets are bound to the loopback address. This is currently handled in udp_sock_init() which ignores its addr argument in the outbound case. Move the handling of this special case to the caller, udp_port_rebind(). This makes the semantics of the 'addr' argument more consistent, and will make future changes easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it to 1, which we typically do on all IPv6 sockets except those explicitly for dual stack listening. That's not quite right in two ways: * Although IPV6_V6ONLY==0 is normally the default on Linux, that can be changed with the net.ipv6.bindv6only sysctl. It may also have different defaults on other OSes if we ever support them. We know we need it off for dual stack sockets, so explicitly set it to 0 in that case. * At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a specific address is harmless but pointless. Don't set the option at all in this case, saving a syscall. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

To save kernel memory we try to use "dual stack" sockets which can listen for both IPv4 and IPv6 connections where possible. To support kernels which don't allow dual stack sockets, we fall back to creating individual sockets for IPv4 and IPv6. This fallback causes some mild ugliness now, and will cause more difficulty with upcoming improvements to the forwarding logic. I don't think we need the fallback on the following grounds: 1) The fallback was broken since inception: The fallback was triggered if pif_sock_l4() failed attempting to create the dual stack socket. But even if the kernel didn't support them, pif_sock_l4() would not report a failure. - Dual stack sockets are distinguished by having the IPV6_V6ONLY sockopt set to 0. However, until the last patch, we only called setsockopt() if we wanted to set this to 1, so there was no kernel operation which could fail for dual stack sockets - we'd silently create a IPv6 only socket instead. - Even if we did call the setsockopt(), we only printed a debug() message for failures, we didn't report it to the caller 2) Dual stack sockets are not just a Linux extension The dual stack socket interface is described in RFC3493, specifically section 3.7 and section 5.3. It is supported on BSD: https://man.freebsd.org/cgi/man.cgi?query=ip6 and on Windows: https://learn.microsoft.com/en-us/windows/win32/winsock/ipproto-ipv6-socket-options 3) Linux has supported dual stack sockets for over 20 years According to ipv6(7) the IPV6_V6ONLY socket option was introduced in Linux 2.6 and Linux 2.4.21 both from 2003. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

…ress Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1. This has some disadvantages: * The guest can't connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6. The restriction exists for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications. We can achieve this a different way, however. Don't bind to a specific address, but _do_ use SO_BINDTODEVICE to restrict the sockets to the "lo" interface. We fall back to the old behaviour for older kernels where SO_BINDTODEVICE is not available unprivileged. Note that although traffic to a local but non-loopback address is passed over the 'lo' interface (as seen by netfilter and dumpcap), it doesn't count as attached to that interface for the purposes of SO_BINDTODEVICE (information from the routing table overrides the "physical" interface). So, this change doesn't help for bug 100. It's also not a complete fix for bug 113, it does however: * Get us a step closer to fixing bug 113 * Slightly simplify the code * Make things a bit easier to allow more flexible binding on the guest in in future Link: https://bugs.passt.top/show_bug.cgi?id=113 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Stefano correctly noted that the fact a socket is dual-stack doesn't necessarily imply that it is bound to a wildcard address. While that's the only case we use for dual-stack sockets, there may be others. Therefore rename this function to make it clearer that it always uses a wildcard bind. Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

When we receive a TCP connection, we get the peer address from the accept() call. In the case of a listening socket with an unspecified address (:: or 0.0.0.0) the local address of the accept()ed socket could vary. We don't get that from the accept() - we must explicitly call getsockname() to get it. Currently we avoid the latency of that extra syscall, and therefore don't populate the initiating 'oaddr' field of a flow created by an incoming TCP socket connection. This more or less works, because we rarely need that local address, but it does cause some oddities: * For migration we need the local address to recreate the socket on the destination, so we *do* call getsockname() in vhost-user mode * It limits our options in terms of forwarding connections flexibly based on the address to which they're received * It differs from UDP, where we explicitly use the IP_PKTINFO cmsg to populate oaddr. * It means (some) flow debug messages will contain wildcards instead of real local addresses In theory we can elide this call when accept()ing from a socket bound to a specific address instead of a wildcard. However to do that will need revisions to the data structures we use to keep track of listening sockets. The lack of this information is making it hard to implement some fixes we want. So, pay the price of the extra syscall to get this information, with the hope that we can later optimise it away for some cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

When forwarding "spliced" connections outwards (-T or -U) we listen on the guest's loopback and always forward to 127.0.0.1 (or ::1) on the host. However, it's also possible for clients on the guest to attempt connecting to other addresses in 127.0.0.0/8 (systemd-resolved uses 127.0.0.53 in practice). If the host side server is only listening on that specific non-standard loopback address, the forward won't work. Fix this by preserving the specific (loopback) address when forwarding such connections. Link: https://bugs.passt.top/show_bug.cgi?id=113 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

dgibson added 15 commits December 2, 2025 23:05

pull bot locked and limited conversation to collaborators Dec 3, 2025

pull bot added the ⤵️ pull label Dec 3, 2025

pull bot merged commit 237a9d9 into master Dec 3, 2025

pull bot deleted the upstream branch December 3, 2025 14:59

kroese restored the upstream branch December 3, 2025 15:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[pull] master from upstream #18

[pull] master from upstream #18

Uh oh!

pull bot commented Dec 3, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[pull] master from upstream #18

[pull] master from upstream #18

Uh oh!

Conversation

pull bot commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pull bot commented Dec 3, 2025 •

edited

Loading