
Conversation

@rlubos
Contributor

@rlubos rlubos commented Nov 15, 2023

Rework how data is queued for the TCP connections:

  • net_context no longer allocates net_pkt for TCP connections. This
    was not only inefficient (net_context has no knowledge of the TX
    window size), but also error-prone in certain configurations (for
    example, when IP fragmentation was enabled, net_context could
    attempt to allocate an enormous packet instead of letting the data
    be fragmented for the TCP stream).
  • Instead, implement the already defined net_tcp_queue() API, which
    takes a raw buffer and length. This allows the TX window to be
    taken into account and the allocated net_buf's to be managed better
    (for example, avoiding allocation if there's still room in the
    buffer). As a result, the TCP stack will not only no longer exceed
    the TX window, but will also avoid empty gaps in allocated
    net_buf's, which should lead to fewer out-of-mem issues with the
    stack.
  • As the net_pkt-based net_tcp_queue_data() is no longer in use, it
    was removed.

Add a new function to allocate additional buffers for net_pkt, w/o any
additional preconditions/checks - just allocate what was requested.
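As a standalone illustration of the "no preconditions, just allocate what was requested" idea, here is a minimal model in plain C. The struct and function names below are made up for the sketch, not the actual Zephyr net_pkt/net_buf API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 128  /* fixed data size per buffer in this model */

struct buf {
    struct buf *next;
    size_t len;                     /* bytes used */
    unsigned char data[CHUNK_SIZE];
};

/* Append enough fixed-size buffers to the chain at 'head' to cover
 * 'size' bytes of new capacity. No MTU or window preconditions are
 * checked: just allocate what was requested. Returns the number of
 * buffers added, or -1 on out-of-memory. */
static int alloc_buffer_raw(struct buf **head, size_t size)
{
    int added = 0;

    while (size > 0) {
        struct buf *b = calloc(1, sizeof(*b));

        if (b == NULL) {
            return -1; /* out of buffers */
        }

        /* push to the end of the chain */
        struct buf **tail = head;
        while (*tail != NULL) {
            tail = &(*tail)->next;
        }
        *tail = b;

        size -= (size > CHUNK_SIZE) ? CHUNK_SIZE : size;
        added++;
    }
    return added;
}
```

For example, requesting 300 bytes with 128-byte chunks allocates three buffers, with the last one only partially covered by the request.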

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
@zephyrbot zephyrbot requested a review from ssharks November 15, 2023 16:54
@rlubos rlubos force-pushed the net/rework-tcp-data-queueing branch 3 times, most recently from a45bcb9 to 29c3c37 Compare November 15, 2023 17:57
Member

@jukkar jukkar left a comment

I like this idea, but we need to tweak the code a bit.

Member

If we could only queue partial data from msg, what should happen to the rest of it?

In the previous version, we bailed out if we could not create the net_pkt and returned an error to the user. This is the way described in the sendmsg() manual page.

If the message is too long to pass atomically through the underlying protocol,
the error EMSGSIZE is returned, and the message is not transmitted. 

But now we queue the partial data, so this send is no longer atomic.

Contributor Author

tl;dr: I don't think we're modifying the already existing behavior after all; see the explanation below.

In the previous version, we bailed out if we could not create the net_pkt and returned an error to the user.

That's only correct for datagram sockets. For stream sockets, it'd still be possible to send partial data; however, we'd ignore the window size and use the MTU as the delimiter of the maximum transfer length.

So the behaviour was pretty much the same; the only difference is that now we respect the window size.

If the message is too long to pass atomically through the underlying protocol,
the error EMSGSIZE is returned, and the message is not transmitted.

But now we queue the partial data, so this send is no longer atomic.

That's one of those ambiguities of the POSIX documentation... Does that comment really apply to STREAM sockets? The same comment can be found in the regular send() documentation. Plus, the error code description says EMSGSIZE only applies to socket types that require message atomicity (I'd interpret that as DGRAM sockets?):

EMSGSIZE
The socket type requires that message be sent atomically, and the size of the message to be sent made this impossible.

If we could only queue partial data from msg, what should happen to the rest of it?

The application would need to verify the length sent, modify the msghdr content, and try again. We already do it like this in mqtt, for example (https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/mqtt/mqtt_transport_socket_tcp.c#L108)
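The retry pattern described here (check how much was sent, advance the iovec array in the msghdr, call sendmsg() again) can be sketched like this. This is a hedged, standalone model: struct iovec_s and msg_advance() are made-up names, not a real socket API:

```c
#include <assert.h>
#include <stddef.h>

struct iovec_s {             /* minimal stand-in for struct iovec */
    unsigned char *iov_base;
    size_t iov_len;
};

/* Advance an iovec array past 'sent' bytes that were already queued,
 * so the remaining data can be handed to sendmsg() again. Returns the
 * index of the first iovec that still has data to send. */
static size_t msg_advance(struct iovec_s *iov, size_t iovcnt, size_t sent)
{
    size_t i = 0;

    while (i < iovcnt && sent > 0) {
        if (sent >= iov[i].iov_len) {
            /* this iovec was fully consumed by the partial send */
            sent -= iov[i].iov_len;
            iov[i].iov_len = 0;
            i++;
        } else {
            /* partially consumed: shift base, shrink length */
            iov[i].iov_base += sent;
            iov[i].iov_len -= sent;
            sent = 0;
        }
    }
    return i;
}
```

A caller would loop: call sendmsg(), feed the return value to a helper like this, and retry from the returned index until everything is queued (or EAGAIN on a non-blocking socket).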

Member

For datagram sockets the situation is clear. But for stream sockets the situation is hazier, as you say. I think earlier we made sure that the net_pkt was created properly (so that we had enough net_bufs) before returning to the user. How the TCP queueing is done is a totally different thing; from the application's point of view, the message is now sent (even if it is queued).
Earlier, in context_write_data(), we tried to create a full net_pkt, and if it failed, we returned an error to the user without trying to send partial data.

Member

The documentation is indeed a bit vague: https://pubs.opengroup.org/onlinepubs/009604599/functions/sendmsg.html
It says, for example, that

Successful completion of a call to sendmsg() does not guarantee delivery of the message.
A return value of -1 indicates only locally-detected errors.

If space is not available at the sending socket to hold the message to be transmitted and
the socket file descriptor does not have O_NONBLOCK set, the sendmsg() function shall
block until space is available. If space is not available at the sending socket to hold the
message to be transmitted and the socket file descriptor does have O_NONBLOCK set,
the sendmsg() function shall fail.

So we could block if trying to send and there are no net_bufs available.
I also checked the code in the zephyr tree, and it seems that for example mqtt and websocket are trying to send partial data. But for example TLS socket did not.

Contributor Author

So we could block if trying to send and there are no net_bufs available.

But what about the non-blocking case? For example, part of the message was already queued for sending - what then? We can't return EAGAIN, as part of the message was already queued, and we shouldn't really block either.

I'm not really a fan of returning an error if the message size > TCP TX window size, because it inflicts artificial limits on stream sockets using sendmsg and creates ambiguity. What should the application do in case of such an error? It'd need to guess whether it's OK to try again later, or whether the message should be trimmed because it exceeds the maximum TX window length (or, formerly, the interface MTU). IMO, a partial send is much less ambiguous in such a case.

Plus, it seems that other implementations (Linux in this case) allow partial sendmsg sends:
https://stackoverflow.com/questions/57792687/how-to-deal-with-sendmsg2-when-it-send-a-message-partially-in-linux

But for example TLS socket did not.

That indeed looks a bit fishy TBH; it could use some improvement. It seems potentially faulty in the case of non-blocking sockets.

Member

Ok, I am convinced - partial sendmsg seems to be a thing, although the documentation is a bit ambiguous. Let's go with this, as it improves TCP a lot.

Contributor Author

Thanks!

jukkar
jukkar previously approved these changes Nov 16, 2023
@jukkar
Member

jukkar commented Nov 16, 2023

@ssharks ping, please take a look if you have time.

Contributor

@ssharks ssharks left a comment

Nice modification, just one location where I found a potential problem.
A similar problem of allocating too many buffers is happening in the RX path, where there is a pending issue: #55671

Contributor

What happens if send_win < send_data_total? I think there is a potential here for an unintended wrap when send_win decreases during runtime.

Contributor Author

The diff is poorly formatted, so it's not clearly visible: the tcp_window_full() check just a few lines above verifies that send_data_total < send_win (otherwise we have a full-window case and quit), and as we hold the connection mutex, neither of the two variables should change in between.
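For illustration, the relationship can be modeled like this. This is a simplified sketch reusing the names from the discussion (tcp_window_full(), send_win, send_data_total), not the actual stack code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the guard discussed above: the window-full check runs first,
 * so the subtraction below can never wrap around. */
static bool tcp_window_full(size_t send_data_total, size_t send_win)
{
    return send_data_total >= send_win;
}

/* How many of 'len' bytes may be queued given the current window. */
static size_t queue_len(size_t len, size_t send_data_total, size_t send_win)
{
    if (tcp_window_full(send_data_total, send_win)) {
        return 0; /* full window: caller bails out before subtracting */
    }

    /* Safe: the check above guarantees send_win > send_data_total. */
    size_t room = send_win - send_data_total;

    return (len < room) ? len : room;
}
```

Because both values are read under the connection mutex, the ordering (check, then subtract) is what makes the unsigned subtraction safe.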

Contributor

I see it could not go wrong for that reason. Still, I think it would be better to add a comment that this edge case is already covered. Someone modifying this code later on might not see this relation.

Contributor Author

Sure - I've elaborated in the comment why it's safe to do the subtraction.

ssharks
ssharks previously approved these changes Nov 17, 2023
Contributor

@ssharks ssharks left a comment

Thanks, good work

@talih0
Contributor

talih0 commented Nov 17, 2023

It may be good to say in the commit message what throughput increase this change gives

Rework how data is queued for the TCP connections:
  * net_context no longer allocates net_pkt for TCP connections. This
    was not only inefficient (net_context has no knowledge of the TX
    window size), but also error-prone in certain configurations (for
    example when IP fragmentation was enabled, net_context may attempt
    to allocate an enormous packet, instead of letting the data be
    fragmented for the TCP stream).
  * Instead, implement already defined `net_tcp_queue()` API, which
    takes raw buffer and length. This allows to take TX window into
    account and also better manage the allocated net_buf's (like for
    example avoid allocation if there's still room in the buffer). In
    result, the TCP stack will not only no longer exceed the TX window,
    but also prevent empty gaps in allocated net_buf's, which should
    lead to less out-of-mem issues with the stack.
  * As net_pkt-based `net_tcp_queue_data()` is no longer in use, it was
    removed.

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Otherwise, if the application was for example blocked on poll() pending
POLLOUT, it won't be notified.

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
In case the underlying socket reported an error while waiting for TX, the
errno value was not set accordingly. This commit fixes that.

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Make sure we send the entire packet buffer before bumping the packet
counter; send() does not guarantee that all of the requested data will
be sent at once with a STREAM socket.

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
@rlubos
Contributor Author

rlubos commented Nov 17, 2023

It may be good to say in the commit message what throughput increase this change gives

The throughput is more or less the same (I've observed just a marginal increase with these changes), but that was not the reason for this rework. The core reason was better net_buf management, so that we don't hit out-of-mem issues as often:

  • w/o these changes, we could exceed the TX window by the size of the interface MTU, which could easily consume most of the buffers for the queued TCP data, leading to out-of-mem issues during the actual transmission in configurations with a smaller net_buf count
  • and regardless of the window size, in the case of multiple send()/sendto() calls with small data portions, we could run out of buffers even w/o exceeding the TCP TX window. This was the case because each data chunk was placed in a separate net_buf. With these changes, queued data is now written into an already allocated net_buf if possible.
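The second point, writing new data into the tailroom of the last allocated buffer before allocating a fresh one, can be sketched as follows. This is a self-contained model with made-up names, not the actual net_buf API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64  /* fixed chunk capacity in this model */

struct chunk {
    size_t len;                    /* bytes used */
    unsigned char data[BUF_SIZE];
};

/* Copy 'len' bytes into 'c', using the leftover tailroom first.
 * Returns the number of fresh chunks the caller would still have to
 * allocate for the remainder. */
static size_t fill_tailroom(struct chunk *c, const unsigned char *src, size_t len)
{
    size_t room = BUF_SIZE - c->len;
    size_t copy = (len < room) ? len : room;

    memcpy(c->data + c->len, src, copy);
    c->len += copy;
    len -= copy;

    /* remaining data needs ceil(len / BUF_SIZE) new chunks */
    return (len + BUF_SIZE - 1) / BUF_SIZE;
}
```

With this scheme, several small send() calls coalesce into one chunk instead of each pinning a separate buffer, which is the behavior change described above.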

@talih0
Contributor

talih0 commented Nov 17, 2023

It may be good to say in the commit message what throughput increase this change gives

The throughput is more or less the same (I've observed just a marginal increase with these changes), but that was not the reason for this rework. The core reason was better net_buf management, so that we don't hit out-of-mem issues as often:

* w/o these changes, we could exceed the TX window by the size of the interface MTU, which could easily consume most of the buffers for the queued TCP data, leading to out-of-mem issues during the actual transmission in configurations with a smaller net_buf count

* and regardless of the window size, in the case of multiple `send()`/`sendto()` calls with small data portions, we could run out of buffers even w/o exceeding the TCP TX window. This was the case because each data chunk was placed in a separate net_buf. With these changes, queued data is now written into an already allocated net_buf if possible.

@rlubos, thanks for clarifying.

rlubos added a commit to rlubos/sdk-zephyr that referenced this pull request Nov 20, 2023
…onditions

Add new function to allocate additional buffers for net_pkt, w/o any
additional preconditions/checks. Just allocate what was requested.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
rlubos added a commit to rlubos/sdk-zephyr that referenced this pull request Nov 20, 2023
Rework how data is queued for the TCP connections:
  * net_context no longer allocates net_pkt for TCP connections. This
    was not only inefficient (net_context has no knowledge of the TX
    window size), but also error-prone in certain configurations (for
    example when IP fragmentation was enabled, net_context may attempt
    to allocate an enormous packet, instead of letting the data be
    fragmented for the TCP stream).
  * Instead, implement already defined `net_tcp_queue()` API, which
    takes raw buffer and length. This allows to take TX window into
    account and also better manage the allocated net_buf's (like for
    example avoid allocation if there's still room in the buffer). In
    result, the TCP stack will not only no longer exceed the TX window,
    but also prevent empty gaps in allocated net_buf's, which should
    lead to less out-of-mem issues with the stack.
  * As net_pkt-based `net_tcp_queue_data()` is no longer in use, it was
    removed.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
rlubos added a commit to rlubos/sdk-zephyr that referenced this pull request Nov 20, 2023
Otherwise, if the application was for example blocked on poll() pending
POLLOUT, it won't be notified.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
rlubos added a commit to rlubos/sdk-zephyr that referenced this pull request Nov 20, 2023
In case underlying socket reported error while waiting for TX, the
errno value was not set accordingly. This commit fixes this.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
rlubos added a commit to rlubos/sdk-zephyr that referenced this pull request Nov 20, 2023
Make sure we send the entire packet buffer before bumping the packet
counter, send() does not guarantee that all of the requested data will
be sent at once with STREAM socket.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
@rlubos rlubos requested a review from ssharks November 20, 2023 11:39
Contributor

@ssharks ssharks left a comment

Great work

@fabiobaltieri fabiobaltieri merged commit aa6f698 into zephyrproject-rtos:main Nov 21, 2023
rlubos added a commit to nrfconnect/sdk-zephyr that referenced this pull request Nov 21, 2023
…onditions

Add new function to allocate additional buffers for net_pkt, w/o any
additional preconditions/checks. Just allocate what was requested.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
rlubos added a commit to nrfconnect/sdk-zephyr that referenced this pull request Nov 21, 2023
Rework how data is queued for the TCP connections:
  * net_context no longer allocates net_pkt for TCP connections. This
    was not only inefficient (net_context has no knowledge of the TX
    window size), but also error-prone in certain configurations (for
    example when IP fragmentation was enabled, net_context may attempt
    to allocate an enormous packet, instead of letting the data be
    fragmented for the TCP stream).
  * Instead, implement already defined `net_tcp_queue()` API, which
    takes raw buffer and length. This allows to take TX window into
    account and also better manage the allocated net_buf's (like for
    example avoid allocation if there's still room in the buffer). In
    result, the TCP stack will not only no longer exceed the TX window,
    but also prevent empty gaps in allocated net_buf's, which should
    lead to less out-of-mem issues with the stack.
  * As net_pkt-based `net_tcp_queue_data()` is no longer in use, it was
    removed.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
rlubos added a commit to nrfconnect/sdk-zephyr that referenced this pull request Nov 21, 2023
Otherwise, if the application was for example blocked on poll() pending
POLLOUT, it won't be notified.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
rlubos added a commit to nrfconnect/sdk-zephyr that referenced this pull request Nov 21, 2023
In case underlying socket reported error while waiting for TX, the
errno value was not set accordingly. This commit fixes this.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
rlubos added a commit to nrfconnect/sdk-zephyr that referenced this pull request Nov 21, 2023
Make sure we send the entire packet buffer before bumping the packet
counter, send() does not guarantee that all of the requested data will
be sent at once with STREAM socket.

Upstream PR: zephyrproject-rtos/zephyr#65251

Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
@rlubos rlubos deleted the net/rework-tcp-data-queueing branch November 22, 2023 12:34