net: tcp: Rework data queueing #65251
Conversation
Add a new function to allocate additional buffers for a net_pkt, without any additional preconditions/checks - just allocate what was requested. Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Force-pushed a45bcb9 to 29c3c37
jukkar left a comment
I like this idea, but we need to tweak the code a bit.
subsys/net/ip/tcp.c (Outdated)
If we could only queue partial data from msg, what should happen to the rest of it?
In the previous version, we bailed out if we could not create a net_pkt and returned an error to the user. This is the behaviour described in the sendmsg() manual page:
If the message is too long to pass atomically through the underlying protocol,
the error EMSGSIZE is returned, and the message is not transmitted.
But now we queue the partial data so this sending is no longer atomic.
tl;dr: I don't think we modify the already existing behavior after all, see the explanation below.
In previous version, we bailed out if we could not create net_pkt and returned error to the user.
That's only correct for datagram sockets. For stream sockets, it'd still be possible to send partial data; however, we'd ignore the window size and use the MTU as the limit on the maximum transfer length.
- The packet is allocated here (up to the MTU size): https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/net_context.c#L1938
- The allocated length is verified here (but it only fails for a DGRAM socket): https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/net_context.c#L1944
- Finally, the data is copied to the net_pkt, but only up to the allocated space: https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/net_context.c#L2011
So the behaviour was pretty much the same; the only difference is that now we respect the window size.
If the message is too long to pass atomically through the underlying protocol,
the error EMSGSIZE is returned, and the message is not transmitted.
But now we queue the partial data, so this sending is no longer atomic.
That's one of those ambiguities of the POSIX documentation... Does that comment really apply to STREAM sockets? The same comment can be found in the regular send() documentation. Plus, the error code description says EMSGSIZE only applies to socket types that require message atomicity (I'd interpret that as DGRAM sockets?):
EMSGSIZE
The socket type requires that message be sent atomically, and the size of the message to be sent made this impossible.
If we could only queue partial data from msg, what should happen to the rest of it?
The application would need to verify the length sent, modify the msghdr content, and try again. We already do it like this, for example, in mqtt (https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/mqtt/mqtt_transport_socket_tcp.c#L108)
For datagram sockets the situation is clear, but for stream sockets it is more hazy, as you say. I think earlier we made sure that the net_pkt was created properly (so that we had enough net_bufs) before returning to the user. How the TCP queueing is done is a totally different thing; from the application's point of view the message is now sent (even if it is queued).
Earlier, in context_write_data(), we tried to create a full net_pkt, and if that failed, we returned an error to the user without trying to send partial data.
The documentation is indeed a bit vague, https://pubs.opengroup.org/onlinepubs/009604599/functions/sendmsg.html
It says for example that
Successful completion of a call to sendmsg() does not guarantee delivery of the message.
A return value of -1 indicates only locally-detected errors.
If space is not available at the sending socket to hold the message to be transmitted and
the socket file descriptor does not have O_NONBLOCK set, the sendmsg() function shall
block until space is available. If space is not available at the sending socket to hold the
message to be transmitted and the socket file descriptor does have O_NONBLOCK set,
the sendmsg() function shall fail.
So we could block when trying to send if there are no net_bufs available.
I also checked the code in the Zephyr tree, and it seems that, for example, mqtt and websocket try to send partial data. But, for example, the TLS socket did not.
So we could block when trying to send if there are no net_bufs available.
But what about the non-blocking case? For example, if part of the message was already queued for sending, what then? We can't return EAGAIN, as part of the message was already queued, and we shouldn't really block either.
I'm not really a fan of returning an error if the message size > TCP TX window size, because it imposes artificial limits on stream sockets using sendmsg and creates ambiguity. What should the application do in case of such an error? It'd need to guess whether it's OK to try again later, or whether the message should be trimmed because it exceeds the maximum TX window length (or, formerly, the interface MTU). IMO, partial send is much less ambiguous in such a case.
Plus, it seems that other implementations (Linux in this case) allow partial sendmsg sends:
https://stackoverflow.com/questions/57792687/how-to-deal-with-sendmsg2-when-it-send-a-message-partially-in-linux
But for example TLS socket did not.
That indeed looks a bit fishy TBH, could use some improvement. It seems potentially faulty in the case of non-blocking sockets.
OK, I am convinced; partial sendmsg seems to be a thing, although the documentation is a bit ambiguous. Let's go with this, as it improves TCP a lot.
Thanks!
@ssharks ping, please take a look if you have time.
ssharks left a comment
Nice modification, just one location where I found a potential problem.
A similar problem of allocating too many buffers happens in the RX path, where there is a pending issue: #55671
subsys/net/ip/tcp.c (Outdated)
What happens if send_win < send_data_total? I think there is potential for an unintended wrap when send_win decreases during runtime.
The diff is poorly formatted so it's not clearly visible: the tcp_window_full() check just a few lines above verifies that send_data_total < send_win (otherwise we have a full-window case and quit), and as we hold the connection mutex, neither of the two variables should change in between.
I see it could not go wrong for that reason. I think it would be better to add a comment that this edge case is already covered; someone modifying this code later on might not see this relation.
Sure - I've elaborated in the comment why it's safe to do the subtraction.
Force-pushed 29c3c37 to d28950c
ssharks left a comment
Thanks, good work
It may be good to say in the commit message what throughput increase this change gives.
Rework how data is queued for the TCP connections:
* net_context no longer allocates net_pkt for TCP connections. This
was not only inefficient (net_context has no knowledge of the TX
window size), but also error-prone in certain configurations (for
example, when IP fragmentation was enabled, net_context could attempt
to allocate an enormous packet, instead of letting the data be
fragmented for the TCP stream).
* Instead, implement the already defined `net_tcp_queue()` API, which
takes a raw buffer and length. This allows taking the TX window into
account and better managing the allocated net_buf's (for example,
avoiding allocation if there's still room in the buffer). As a
result, the TCP stack will not only no longer exceed the TX window,
but will also prevent empty gaps in allocated net_buf's, which should
lead to fewer out-of-mem issues with the stack.
* As the net_pkt-based `net_tcp_queue_data()` is no longer in use, it
was removed.
Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Otherwise, if the application was, for example, blocked on poll() pending POLLOUT, it would not be notified. Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
In case the underlying socket reported an error while waiting for TX, the errno value was not set accordingly. This commit fixes this. Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Make sure we send the entire packet buffer before bumping the packet counter; send() does not guarantee that all of the requested data will be sent at once with a STREAM socket. Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Force-pushed d28950c to 7cdd019
The throughput is more or less the same (I've observed just a marginal increase with these changes), but that was not the reason for this rework. The core reason was better memory management.
@rlubos, thanks for clarifying.
ssharks left a comment
Great work