@pull pull bot commented Dec 8, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


IEEE 802.3 requires a minimum frame payload of 46 bytes, without an
802.1Q tag. Add padding for the simple tap_send_single() case using
a zero-filled 60-byte buffer, copying data to it if needed.

In theory, we could add a further element in the iovec array, say:

	uint8_t padding[ETH_ZLEN] = { 0 };
	struct iovec iov[3];

	...

		if (l2len < ETH_ZLEN) {
			iov[iovcnt].iov_base = (void *)padding;
			iov[iovcnt].iov_len = ETH_ZLEN - l2len;
			iovcnt++;
		}

and avoid a copy, but that would substantially complicate the
vhost-user path, and it's questionable whether passing a reference
to a further buffer actually causes lower overhead than the simple
copy.
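The copy-based alternative mentioned above can be sketched as follows. This is an illustrative helper under assumed names, not passt's actual tap_send_single() code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ZLEN 60	/* minimum Ethernet frame length, excluding FCS */

/* If the frame is shorter than ETH_ZLEN, copy it into a zero-filled
 * 60-byte buffer so that the trailing zeroes act as padding.
 * pad_frame() is a hypothetical name for illustration only.
 */
static const uint8_t *pad_frame(const uint8_t *data, size_t *len)
{
	static uint8_t padded[ETH_ZLEN];

	if (*len >= ETH_ZLEN)
		return data;	/* long enough already, send as-is */

	memset(padded, 0, sizeof(padded));
	memcpy(padded, data, *len);
	*len = ETH_ZLEN;

	return padded;
}
```

The returned buffer is only valid until the next call, which is fine for a single-frame send path.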

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
...as I'm going to add a new value to it.

Fixes: 9560123 ("tcp: Replace TCP buffer structure by an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
...as I'm going to add a new value to it.

Fixes: 3f9bd86 ("udp: Split tap-bound UDP packets into multiple buffers using io vector")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
…-user modes

Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a
zero-filled buffer, and send out a part of it if needed to reach
the minimum frame length given by 802.3, that is, 60 bytes altogether.

The frames we might need to pad are IPv4 only (the IPv6 header is
larger), and are typically TCP ACK segments but can also be small
data segments or datagrams.

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…minimum)

For both TCP and UDP, instead of just expecting the first provided
buffer to be large enough to contain the headers we need (from 42
bytes for UDP data over IPv4 to 82 bytes for TCP with options over
IPv6), require that buffers are at least ETH_ZLEN (60 bytes) in
size.

This looks reasonable because there are no known vhost-user
hypervisor implementations that would give us smaller buffers than
that, and we would anyway hit an assertion failure with IPv6 if we
ever had less than 60 bytes per buffer.

At this point, all we need to do is to pad the first (and only)
buffer, should data and headers use less than that.

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I happened to touch these functions for a change that turned out to be
unnecessary, but coding style and documentation could still benefit
from some rephrasing and fixes.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…use it

Right now, the only need for this kind of function comes from
tcp_get_sndbuf(), which calculates the amount of sending buffer we
want to use depending on its own size: we want to use more of it
if it's smaller, as bookkeeping overhead is usually lower there and
we rely on auto-tuning, and less of it when it's bigger.

For this purpose, the new function is overly generic: @x is the same
as @y, that is, we want to use more or less of the buffer depending
on the size of the buffer itself.

However, an upcoming change will need that generality, as we'll want
to scale the amount of sending buffer we use depending on another
(scaled) factor.
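One possible shape for such a helper is sketched below. The bounds, percentages, and parameter names beyond @x and @y are assumptions for illustration, not passt's actual signature:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: scale @y by a usage factor derived from @x,
 * clamped between min_pct (applied at and above @hi) and max_pct
 * (applied at and below @lo), interpolating linearly in between.
 * In tcp_get_sndbuf()'s case, @x and @y would both be the buffer
 * size itself.
 */
static size_t clamped_scale(size_t x, size_t y, size_t lo, size_t hi,
			    unsigned min_pct, unsigned max_pct)
{
	unsigned pct;

	if (x <= lo)
		pct = max_pct;
	else if (x >= hi)
		pct = min_pct;
	else
		pct = max_pct - (x - lo) * (max_pct - min_pct) / (hi - lo);

	return y * pct / 100;
}
```

With lo = 64 KiB, hi = 4 MiB, and a 75% floor, a 4 MiB buffer scales to 3 MiB while a small buffer is used in full, matching the behaviour the next change describes.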

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we have a new clamped_scale() function, which makes it simple
to specify a precise usage factor, change the amount of sending buffer
we want to use at and above 4 MiB: 75% looks perfectly safe.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For non-local connections, we advertise the same window size as what
the peer in turn advertises to us, and limit it to the buffer size
reported via SO_SNDBUF.

That's not quite correct: in order to later avoid failures while
queueing data to the socket, we need to limit the window to the
available buffer size, not the total one.

Use the SIOCOUTQ ioctl and subtract the number of outbound queued
bytes from the total buffer size, then clamp to this value.
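A minimal sketch of this calculation, with an illustrative helper name and error handling trimmed:

```c
#include <assert.h>
#include <linux/sockios.h>	/* SIOCOUTQ */
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Clamp the advertised window to the *available* sending buffer:
 * SO_SNDBUF minus the bytes still queued on the socket, as reported
 * by the SIOCOUTQ ioctl.  usable_sndbuf() is a hypothetical name.
 */
static unsigned long usable_sndbuf(int s, unsigned long sndbuf)
{
	int queued = 0;

	if (ioctl(s, SIOCOUTQ, &queued) < 0)
		queued = 0;	/* fall back to the total buffer size */

	return (unsigned long)queued < sndbuf ? sndbuf - queued : 0;
}
```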

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…hecks

A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.

Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.

Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).

Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.
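The exponent encoding can be sketched as below: with rtt ≈ 100 us << e and e in [0, 15], a single nibble covers exactly the 100 us to 3.2768 s range. Names are illustrative; passt's actual storage differs:

```c
#include <assert.h>

#define RTT_MIN_US	100	/* smallest representable RTT */
#define RTT_MAX_EXP	15	/* 100 us << 15 == 3.2768 s */

/* Round an RTT in microseconds down to the nearest power-of-two
 * multiple of RTT_MIN_US, clamping to the representable range.
 */
static unsigned rtt_to_exp(unsigned long rtt_us)
{
	unsigned e = 0;

	while (e < RTT_MAX_EXP &&
	       ((unsigned long)RTT_MIN_US << (e + 1)) <= rtt_us)
		e++;

	return e;
}

static unsigned long exp_to_rtt(unsigned e)
{
	return (unsigned long)RTT_MIN_US << e;
}
```

Half of the decoded value would then be used as the socket check interval, approximating the one-way delay.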

This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint on
the same machine (possibly in a different namespace) if the RTT is
reported as 10 us or less.

This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.

Then, if the RTT is "low", we don't really need a timer to
acknowledge data, as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store local destinations, so the low RTT property of a connection
might toggle frequently. Because of this, it's actually helpful to
always have the RTT approximation stored.

This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We correctly avoid doing that at the beginning of tcp_prepare_flags(),
but we might clear the flag later on if we actually end up sending a
"flag" segment.

Make sure we don't, otherwise we might significantly delay window
updates after a zero-window condition, and noticeably affect
throughput.

In some cases, we're forcing peers to send zero-window probes or
keep-alive segments.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…ctive

...instead of checking if the current sending buffer is less than
SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
ACK segments: we rely on having enough data at once from the sender
to make the buffer grow by means of TCP buffer size tuning
implemented in the Linux kernel.

This is important if we're trying to maximise throughput, but not
desirable for interactive traffic, where we want to be as transparent
as possible and avoid introducing unnecessary latency.

Use the tcpi_delivery_rate field reported by the Linux kernel, if
available, to calculate the current bandwidth-delay product: if it's
significantly smaller than the available sending buffer, conclude that
we're not bandwidth-bound and this is likely to be interactive
traffic, so acknowledge data only as it's acknowledged by the peer.

Conversely, if the bandwidth-delay product is comparable to the size
of the sending buffer (more than 5%), we're probably bandwidth-bound
or... bound to be: acknowledge everything in that case.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the sender uses data clumping (including Nagle's algorithm) for
Silly Window Syndrome (SWS) avoidance, advertising less than a MSS
means the sender might stop sending altogether, and window updates
after a low window condition are just as important as they are in
a zero-window condition.

For simplicity, approximate that limit to zero, as we have an
implementation forcing window updates after zero-sized windows.
This matches the suggestion from RFC 813, section 4.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
…rtisements

If the remote peer is advertising a bigger value than our current
sending buffer, it means that a bigger sending buffer is likely to
benefit throughput.

We can get a bigger sending buffer by means of the buffer size
auto-tuning performed by the Linux kernel, which is triggered by
aggressively filling the sending buffer.

Use an adaptive boost factor, up to 150%, depending on:

- how much data we sent so far: we don't want to risk retransmissions
  for short-lived connections, as the latency cost would be
  unacceptable, and

- the current RTT value, as we need a bigger buffer for higher
  transmission delays

The factor we use is not quite a bandwidth-delay product, as we're
missing the time component of the bandwidth, which is not interesting
here: we are trying to make the buffer grow at the beginning of a
connection, progressively, as more data is sent.

The amount of boost we apply was tuned somewhat empirically, but it
appears to yield the available throughput in rather different
scenarios (from ~ 10 Gbps bandwidth with 500 ns RTT to ~ 1 Gbps with
300 ms RTT), and it allows getting there rather quickly, within a
few seconds for the 300 ms case.

Note that we want to apply this factor only if the window advertised
by the peer is bigger than the current sending buffer, as we only need
this for auto-tuning, and we absolutely don't want to incur
unnecessary retransmissions otherwise.

The related condition in tcp_update_seqack_wnd() is not redundant,
as there's a subtractive factor, sendq, in the calculation of the
window limit. If the sending buffer is smaller than the peer's
advertised window, the limit we apply might be lower than it would
be otherwise.

Assuming that the sending buffer is reported as 100k, sendq is
20k, we could have these example cases:

1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
   buffer, so we boost its size to 150k, and we limit the window
   to 120k

2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
   buffer, so we aren't trying to trigger buffer auto-tuning and
   we'll stick to the existing, more conservative calculation,
   by limiting the window to 100 - 20 = 80k

If we omitted the new condition, we would always use the boosted
value, that is, 120k, even if potentially causing unnecessary
retransmissions.
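The two cases above can be sketched as below, with a 150% boost factor as in the text; the function name and the clamping of the boosted value are illustrative assumptions:

```c
#include <assert.h>

/* Boost the window limit only when the peer's advertised window
 * exceeds the sending buffer (to trigger kernel buffer auto-tuning);
 * otherwise keep the conservative sndbuf - sendq limit.
 */
static unsigned long wnd_limit(unsigned long peer_wnd,
			       unsigned long sndbuf, unsigned long sendq)
{
	if (peer_wnd > sndbuf) {
		unsigned long boosted = sndbuf * 3 / 2;	/* up to 150% */

		return peer_wnd < boosted ? peer_wnd : boosted;
	}

	return sndbuf > sendq ? sndbuf - sendq : 0;
}
```

With the numbers from the example (sndbuf 100k, sendq 20k), a 120k peer window yields a 120k limit, and a 90k peer window yields the conservative 80k limit.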

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...in order to trigger a fast retransmit as soon as possible. There's
no benefit in forcing the sender to wait for a longer time than that.

We already do this on partial failures (short socket writes), but for
historical reasons not on complete failures. Make these two cases
consistent with each other.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...we'll send a duplicate ACK right away in this case, and this
earlier check is not just redundant but potentially harmful: with it,
we'll now send a triple ACK, which might cause two retransmissions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
… Copr builders

For some reason, on Copr:

  Building target platforms: aarch64
  Building for target aarch64
  error: line 42: Unknown tag: %selinux_requires_min
  Child return code was: 1

Use %selinux_requires_min only starting from current Rawhide /
Fedora 44, where it works.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
@pull pull bot locked and limited conversation to collaborators Dec 8, 2025
@pull pull bot added the ⤵️ pull label Dec 8, 2025
@pull pull bot merged commit 4bb3687 into master Dec 8, 2025
@pull pull bot deleted the upstream branch December 8, 2025 14:59
@kroese kroese restored the upstream branch December 8, 2025 15:24