@pull pull bot commented Dec 8, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


IEEE 802.3 requires a minimum frame payload of 46 bytes, without an
802.1Q tag. Add padding for the simple tap_send_single() case using
a zero-filled 60-byte buffer, copying data to it if needed.

In theory, we could add a further element in the iovec array, say:

	uint8_t padding[ETH_ZLEN] = { 0 };
	struct iovec iov[3];

	...

		if (l2len < ETH_ZLEN) {
			iov[iovcnt].iov_base = (void *)padding;
			iov[iovcnt].iov_len = ETH_ZLEN - l2len;
			iovcnt++;
		}

and avoid a copy, but that would substantially complicate the
vhost-user path, and it's questionable whether passing a reference
to a further buffer actually causes lower overhead than the simple
copy.
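The copy-based alternative mentioned above can be sketched as follows. This is an illustrative helper under assumed names, not passt's actual tap_send_single() code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ZLEN 60	/* minimum Ethernet frame length, excluding FCS */

/* If the frame is shorter than ETH_ZLEN, copy it into a zero-filled
 * 60-byte buffer so that the trailing zeroes act as padding.
 * pad_frame() is a hypothetical name for illustration only.
 */
static const uint8_t *pad_frame(const uint8_t *data, size_t *len)
{
	static uint8_t padded[ETH_ZLEN];

	if (*len >= ETH_ZLEN)
		return data;	/* long enough already, send as-is */

	memset(padded, 0, sizeof(padded));
	memcpy(padded, data, *len);
	*len = ETH_ZLEN;

	return padded;
}
```

The returned buffer is only valid until the next call, which is fine for a single-frame send path.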

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
...as I'm going to add a new value to it.

Fixes: 9560123 ("tcp: Replace TCP buffer structure by an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
...as I'm going to add a new value to it.

Fixes: 3f9bd86 ("udp: Split tap-bound UDP packets into multiple buffers using io vector")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
…-user modes

Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a
zero-filled buffer, and send out a part of it if needed to reach
the minimum frame length given by 802.3, that is, 60 bytes altogether.

The frames we might need to pad are IPv4 only (the IPv6 header is
larger), and are typically TCP ACK segments but can also be small
data segments or datagrams.

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…minimum)

For both TCP and UDP, instead of just expecting the first provided
buffer to be large enough to contain the headers we need (from 42
bytes for UDP data over IPv4 to 82 bytes for TCP with options over
IPv6), require that buffers are at least ETH_ZLEN (60 bytes) in
size.

This looks reasonable because there are no known vhost-user
hypervisor implementations that would give us smaller buffers than
that, and we would anyway hit an assertion failure with IPv6 if we
ever had less than 60 bytes per buffer.

At this point, all we need to do is to pad the first (and only)
buffer, should data and headers use less than that.

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I happened to touch these functions for a change that turned out to be
unnecessary, but coding style and documentation could still benefit
from some rephrasing and fixes.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…use it

Right now, the only need for this kind of function comes from
tcp_get_sndbuf(), which calculates the amount of sending buffer we
want to use depending on its own size: we want to use more of it
if it's smaller, as bookkeeping overhead is usually lower there and
we rely on auto-tuning, and less of it when it's bigger.

For this purpose, the new function is overly generic: @x is the same
as @y, that is, we want to use more or less of the buffer depending
on the size of the buffer itself.

However, an upcoming change will need that generality, as we'll want
to scale the amount of sending buffer we use depending on another
(scaled) factor.
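One possible shape for such a helper is sketched below. The bounds, percentages, and parameter names beyond @x and @y are assumptions for illustration, not passt's actual signature:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: scale @y by a usage factor derived from @x,
 * clamped between min_pct (applied at and above @hi) and max_pct
 * (applied at and below @lo), interpolating linearly in between.
 * In tcp_get_sndbuf()'s case, @x and @y would both be the buffer
 * size itself.
 */
static size_t clamped_scale(size_t x, size_t y, size_t lo, size_t hi,
			    unsigned min_pct, unsigned max_pct)
{
	unsigned pct;

	if (x <= lo)
		pct = max_pct;
	else if (x >= hi)
		pct = min_pct;
	else
		pct = max_pct - (x - lo) * (max_pct - min_pct) / (hi - lo);

	return y * pct / 100;
}
```

With lo = 64 KiB, hi = 4 MiB, and a 75% floor, a 4 MiB buffer scales to 3 MiB while a small buffer is used in full, matching the behaviour the next change describes.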

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we have a new clamped_scale() function, which makes it simple
to specify a precise usage factor, change the amount of sending buffer
we want to use at and above 4 MiB: 75% looks perfectly safe.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For non-local connections, we advertise the same window size as what
the peer in turn advertises to us, and limit it to the buffer size
reported via SO_SNDBUF.

That's not quite correct: in order to later avoid failures while
queueing data to the socket, we need to limit the window to the
available buffer size, not the total one.

Use the SIOCOUTQ ioctl and subtract the number of outbound queued
bytes from the total buffer size, then clamp to this value.
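A minimal sketch of this calculation, with an illustrative helper name and error handling trimmed:

```c
#include <assert.h>
#include <linux/sockios.h>	/* SIOCOUTQ */
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Clamp the advertised window to the *available* sending buffer:
 * SO_SNDBUF minus the bytes still queued on the socket, as reported
 * by the SIOCOUTQ ioctl.  usable_sndbuf() is a hypothetical name.
 */
static unsigned long usable_sndbuf(int s, unsigned long sndbuf)
{
	int queued = 0;

	if (ioctl(s, SIOCOUTQ, &queued) < 0)
		queued = 0;	/* fall back to the total buffer size */

	return (unsigned long)queued < sndbuf ? sndbuf - queued : 0;
}
```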

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…hecks

A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.

Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.

Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).

Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.
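The exponent encoding can be sketched as below: with rtt ≈ 100 us << e and e in [0, 15], a single nibble covers exactly the 100 us to 3.2768 s range. Names are illustrative; passt's actual storage differs:

```c
#include <assert.h>

#define RTT_MIN_US	100	/* smallest representable RTT */
#define RTT_MAX_EXP	15	/* 100 us << 15 == 3.2768 s */

/* Round an RTT in microseconds down to the nearest power-of-two
 * multiple of RTT_MIN_US, clamping to the representable range.
 */
static unsigned rtt_to_exp(unsigned long rtt_us)
{
	unsigned e = 0;

	while (e < RTT_MAX_EXP &&
	       ((unsigned long)RTT_MIN_US << (e + 1)) <= rtt_us)
		e++;

	return e;
}

static unsigned long exp_to_rtt(unsigned e)
{
	return (unsigned long)RTT_MIN_US << e;
}
```

Half of the decoded value would then be used as the socket check interval, approximating the one-way delay.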

This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint on
the same machine (possibly in a different namespace) if the RTT is
reported as 10 us or less.

This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.

Then, if the RTT is "low", we don't really need a timer to
acknowledge data, as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store local destinations, so the low RTT property of a connection
might toggle frequently. Because of this, it's actually helpful to
always have the RTT approximation stored.

This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We correctly avoid doing that at the beginning of tcp_prepare_flags(),
but we might clear the flag later on if we actually end up sending a
"flag" segment.

Make sure we don't, otherwise we might significantly delay window
updates after a zero-window condition, and noticeably affect
throughput.

In some cases, we're forcing peers to send zero-window probes or
keep-alive segments.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…ctive

...instead of checking if the current sending buffer is less than
SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
ACK segments: we rely on having enough data at once from the sender
to make the buffer grow by means of TCP buffer size tuning
implemented in the Linux kernel.

This is important if we're trying to maximise throughput, but not
desirable for interactive traffic, where we want to be as transparent
as possible and avoid introducing unnecessary latency.

Use the tcpi_delivery_rate field reported by the Linux kernel, if
available, to calculate the current bandwidth-delay product: if it's
significantly smaller than the available sending buffer, conclude that
we're not bandwidth-bound and this is likely to be interactive
traffic, so acknowledge data only as it's acknowledged by the peer.

Conversely, if the bandwidth-delay product is comparable to the size
of the sending buffer (more than 5%), we're probably bandwidth-bound
or... bound to be: acknowledge everything in that case.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the sender uses data clumping (including Nagle's algorithm) for
Silly Window Syndrome (SWS) avoidance, advertising less than a MSS
means the sender might stop sending altogether, and window updates
after a low window condition are just as important as they are in
a zero-window condition.

For simplicity, approximate that limit to zero, as we have an
implementation forcing window updates after zero-sized windows.
This matches the suggestion from RFC 813, section 4.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…rtisements

If the remote peer is advertising a bigger value than our current
sending buffer, it means that a bigger sending buffer is likely to
benefit throughput.

We can get a bigger sending buffer by means of the buffer size
auto-tuning performed by the Linux kernel, which is triggered by
aggressively filling the sending buffer.

Use an adaptive boost factor, up to 150%, depending on:

- how much data we sent so far: we don't want to risk retransmissions
  for short-lived connections, as the latency cost would be
  unacceptable, and

- the current RTT value, as we need a bigger buffer for higher
  transmission delays

The factor we use is not quite a bandwidth-delay product, as we're
missing the time component of the bandwidth, which is not interesting
here: we are trying to make the buffer grow at the beginning of a
connection, progressively, as more data is sent.

The amount of boost we apply was tuned somewhat empirically, but it
appears to yield the available throughput in rather different
scenarios (from ~ 10 Gbps bandwidth with 500 ns RTT to ~ 1 Gbps with
300 ms RTT), and it allows getting there rather quickly, within a
few seconds for the 300 ms case.

Note that we want to apply this factor only if the window advertised
by the peer is bigger than the current sending buffer, as we only need
this for auto-tuning, and we absolutely don't want to incur
unnecessary retransmissions otherwise.

The related condition in tcp_update_seqack_wnd() is not redundant,
as there's a subtractive factor, sendq, in the calculation of the
window limit. If the sending buffer is smaller than the peer's
advertised window, the limit we apply might be lower than it would
be otherwise.

Assuming that the sending buffer is reported as 100k, sendq is
20k, we could have these example cases:

1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
   buffer, so we boost its size to 150k, and we limit the window
   to 120k

2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
   buffer, so we aren't trying to trigger buffer auto-tuning and
   we'll stick to the existing, more conservative calculation,
   by limiting the window to 100 - 20 = 80k

If we omitted the new condition, we would always use the boosted
value, that is, 120k, even if potentially causing unnecessary
retransmissions.
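The two cases above can be sketched as below, with a 150% boost factor as in the text; the function name and the clamping of the boosted value are illustrative assumptions:

```c
#include <assert.h>

/* Boost the window limit only when the peer's advertised window
 * exceeds the sending buffer (to trigger kernel buffer auto-tuning);
 * otherwise keep the conservative sndbuf - sendq limit.
 */
static unsigned long wnd_limit(unsigned long peer_wnd,
			       unsigned long sndbuf, unsigned long sendq)
{
	if (peer_wnd > sndbuf) {
		unsigned long boosted = sndbuf * 3 / 2;	/* up to 150% */

		return peer_wnd < boosted ? peer_wnd : boosted;
	}

	return sndbuf > sendq ? sndbuf - sendq : 0;
}
```

With the numbers from the example (sndbuf 100k, sendq 20k), a 120k peer window yields a 120k limit, and a 90k peer window yields the conservative 80k limit.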

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...in order to trigger a fast retransmit as soon as possible. There's
no benefit in forcing the sender to wait for a longer time than that.

We already do this on partial failures (short socket writes), but for
historical reasons not on complete failures. Make these two cases
consistent with each other.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...we'll send a duplicate ACK right away in this case, and this
earlier check is not just redundant but potentially harmful: with it,
we'll now send a triple ACK, which might cause two retransmissions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
… Copr builders

For some reason, on Copr:

  Building target platforms: aarch64
  Building for target aarch64
  error: line 42: Unknown tag: %selinux_requires_min
  Child return code was: 1

Use %selinux_requires_min only starting from current Rawhide /
Fedora 44, where it works.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
@pull pull bot locked and limited conversation to collaborators Dec 8, 2025
@pull pull bot added the ⤵️ pull label Dec 8, 2025
@pull pull bot merged commit 4bb3687 into master Dec 8, 2025
@pull pull bot deleted the upstream branch December 8, 2025 14:59
@kroese kroese restored the upstream branch December 8, 2025 15:24