net: tcp: Rework data queueing #65251
Conversation
Add a new function to allocate additional buffers for a net_pkt, without any additional preconditions/checks - just allocate what was requested. Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Force-pushed a45bcb9 to 29c3c37
jukkar left a comment
I like this idea, but we need to tweak the code a bit.
subsys/net/ip/tcp.c (Outdated)
If we could only queue partial data from msg, what should happen to the rest of it?
In the previous version, we bailed out if we could not create a net_pkt and returned an error to the user. This is the behaviour described in the sendmsg() manual page:
If the message is too long to pass atomically through the underlying protocol,
the error EMSGSIZE is returned, and the message is not transmitted.
But now we queue the partial data so this sending is no longer atomic.
tl;dr: I don't think we modify the already existing behavior after all, see the explanation below.
In previous version, we bailed out if we could not create net_pkt and returned error to the user.
That's only correct for datagram sockets. For stream sockets, it'd still be possible to send partial data; however, we'd ignore the window size and use the MTU as the limit on the maximum transfer length.
- The packet is allocated here (up to the MTU size): https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/net_context.c#L1938
- The allocated length is verified here (but it only fails for a DGRAM socket): https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/net_context.c#L1944
- Finally, the data is copied to the net_pkt, but only up to the allocated space: https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/net_context.c#L2011
So the behaviour was pretty much the same; the only difference is that now we respect the window size.
If the message is too long to pass atomically through the underlying protocol,
the error EMSGSIZE is returned, and the message is not transmitted.
But now we queue the partial data, so this sending is no longer atomic.
That's one of those ambiguities of the POSIX documentation... Does that comment really apply to STREAM sockets? The same comment can be found in the regular send() documentation. Plus, the error code description says EMSGSIZE only applies to socket types that require message atomicity (I'd interpret that as DGRAM sockets?):
EMSGSIZE
The socket type requires that message be sent atomically, and the size of the message to be sent made this impossible.
If we could only queue partial data from msg, what should happen to the rest of it?
The application would need to verify the length sent, modify the msghdr content, and try again. We already do it like this, for example, in mqtt (https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/mqtt/mqtt_transport_socket_tcp.c#L108)
For datagram sockets the situation is clear, but for stream sockets it is more hazy, as you say. I think earlier we made sure that the net_pkt was created properly (so that we had enough net_bufs) before returning to the user. How the TCP queueing is done is a totally different thing; from the application's point of view the message is now sent (even if it is queued).
Earlier, in context_write_data(), we tried to create a full net_pkt, and if that failed, we returned an error to the user without trying to send partial data.
The documentation is indeed a bit vague, https://pubs.opengroup.org/onlinepubs/009604599/functions/sendmsg.html
It says for example that
Successful completion of a call to sendmsg() does not guarantee delivery of the message.
A return value of -1 indicates only locally-detected errors.
If space is not available at the sending socket to hold the message to be transmitted and
the socket file descriptor does not have O_NONBLOCK set, the sendmsg() function shall
block until space is available. If space is not available at the sending socket to hold the
message to be transmitted and the socket file descriptor does have O_NONBLOCK set,
the sendmsg() function shall fail.
So we could block when trying to send if there are no net_bufs available.
I also checked the code in the Zephyr tree, and it seems that, for example, mqtt and websocket try to send partial data. But, for example, the TLS socket did not.
So we could block when trying to send if there are no net_bufs available.
But what about the non-blocking case? For example, if part of the message was already queued for sending, what then? We can't return EAGAIN, as part of the message was already queued, and we shouldn't really block either.
I'm not really a fan of returning an error if the message size > TCP TX window size, because it imposes artificial limits on stream sockets using sendmsg and creates ambiguity. What should the application do in case of such an error? It'd need to guess whether it's OK to try again later, or whether the message should be trimmed because it exceeds the maximum TX window length (or, formerly, the interface MTU). IMO, partial send is much less ambiguous in such a case.
Plus, it seems that other implementations (Linux in this case) allow partial sendmsg sends:
https://stackoverflow.com/questions/57792687/how-to-deal-with-sendmsg2-when-it-send-a-message-partially-in-linux
But for example TLS socket did not.
That indeed looks a bit fishy TBH, could use some improvement. It seems potentially faulty in the case of non-blocking sockets.
OK, I am convinced; partial sendmsg seems to be a thing, although the documentation is a bit ambiguous. Let's go with this, as it improves TCP a lot.
Thanks!
@ssharks ping, please take a look if you have time.
ssharks left a comment
Nice modification, just one location where I found a potential problem.
A similar problem of allocating too many buffers happens in the RX path, where there is a pending issue: #55671
subsys/net/ip/tcp.c (Outdated)
What happens if send_win < send_data_total? I think there is potential for an unintended wrap when send_win decreases during runtime.
The diff is poorly formatted so it's not clearly visible: the tcp_window_full() check just a few lines above verifies that send_data_total < send_win (otherwise we have a full-window case and quit), and as we hold the connection mutex, neither of the two variables should change in between.
I see it could not go wrong for that reason. I think it would be better to add a comment that this edge case is already covered; someone modifying this code later on might not see this relation.
Sure - I've elaborated in the comment why it's safe to do the subtraction.
Force-pushed 29c3c37 to d28950c
ssharks left a comment
Thanks, good work
It may be good to say in the commit message what throughput increase this change gives.
Rework how data is queued for the TCP connections:
* net_context no longer allocates net_pkt for TCP connections. This
was not only inefficient (net_context has no knowledge of the TX
window size), but also error-prone in certain configurations (for
example, when IP fragmentation was enabled, net_context could attempt
to allocate an enormous packet, instead of letting the data be
fragmented for the TCP stream).
* Instead, implement the already defined `net_tcp_queue()` API, which
takes a raw buffer and length. This allows taking the TX window into
account and better managing the allocated net_buf's (for example,
avoiding allocation if there's still room in the buffer). As a
result, the TCP stack will not only no longer exceed the TX window,
but will also prevent empty gaps in allocated net_buf's, which should
lead to fewer out-of-mem issues with the stack.
* As the net_pkt-based `net_tcp_queue_data()` is no longer in use, it
was removed.
Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Otherwise, if the application was, for example, blocked on poll() pending POLLOUT, it would not be notified. Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
In case the underlying socket reported an error while waiting for TX, the errno value was not set accordingly. This commit fixes this. Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Make sure we send the entire packet buffer before bumping the packet counter; send() does not guarantee that all of the requested data will be sent at once with a STREAM socket. Signed-off-by: Robert Lubos <robert.lubos@nordicsemi.no>
Force-pushed d28950c to 7cdd019
The throughput is more or less the same (I've observed just a marginal increase with these changes), but that was not the reason for this rework. The core reason was better memory management.
@rlubos, thanks for clarifying.
ssharks left a comment
Great work