
TCP connection stalls #32270

Closed
mogenslu opened this issue Feb 12, 2021 · 6 comments
Labels: area: Networking, bug (The issue is a bug, or the PR is fixing a bug)

Comments

@mogenslu

My app is publishing MQTT messages over a TLS socket using Ethernet.
If I remove the Ethernet cable and reinsert it, the TCP connection stalls.

The log implies the TX buffers are all allocated, and they are neither sent nor freed when the Ethernet connection is re-established.
I have tried increasing the number of TX buffers, but that does not fix the problem.
I can see this with both CONFIG_NET_TCP1 and CONFIG_NET_TCP2, on Zephyr 2.4.0 and 2.5.0-RC4.
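
For reference, "increasing the TX buffers" here means raising the net_buf pool options in prj.conf, along these lines (the values below are only illustrative, not the exact ones I used):

    CONFIG_NET_PKT_TX_COUNT=32
    CONFIG_NET_BUF_TX_COUNT=64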

TCP1 seems to be able to handle it in some situations, but not all.
Getting an RST from the other end fixes it, but that is not a solution.

[00:00:19.002,000] eth_mcux: ETH_0 link down <--------- Remove ETH cable
[00:00:19.853,000] net_buf: pkt_alloc_buffer():867: Failed to get free buffer
[00:00:19.853,000] net_pkt: Data buffer (181) allocation failed (context_alloc_pkt:1321)
[00:00:20.490,000] net_if: iface 0x200060a8 is down
[00:00:20.954,000] net_buf: pkt_alloc_buffer():867: Failed to get free buffer
[00:00:20.954,000] net_pkt: Data buffer (181) allocation failed (context_alloc_pkt:1321)
[00:00:22.003,000] eth_mcux: ETH_0 enabled 100M full-duplex mode. <--------- Reinserted ETH cable
[00:00:22.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs low on buffers.
[00:00:22.054,000] net_buf: pkt_alloc_buffer():867: Failed to get free buffer
[00:00:22.054,000] net_pkt: Data buffer (181) allocation failed (context_alloc_pkt:1321)
[00:00:23.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs blocked for 1 secs
[00:00:23.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs low on buffers.
[00:00:23.154,000] net_buf: pkt_alloc_buffer():867: Failed to get free buffer
[00:00:23.154,000] net_pkt: Data buffer (181) allocation failed (context_alloc_pkt:1321)
[00:00:24.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs blocked for 2 secs
[00:00:24.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs low on buffers.
[00:00:24.254,000] net_buf: pkt_alloc_buffer():867: Failed to get free buffer
[00:00:24.254,000] net_pkt: Data buffer (181) allocation failed (context_alloc_pkt:1321)
[00:00:25.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs blocked for 3 secs
[00:00:25.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs low on buffers.
[00:00:25.354,000] net_buf: pkt_alloc_buffer():867: Failed to get free buffer
[00:00:25.354,000] net_pkt: Data buffer (181) allocation failed (context_alloc_pkt:1321)
[00:00:26.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs blocked for 4 secs
[00:00:26.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs low on buffers.
[00:00:26.455,000] net_buf: pkt_alloc_buffer():867: Failed to get free buffer
[00:00:26.455,000] net_pkt: Data buffer (181) allocation failed (context_alloc_pkt:1321)
[00:00:27.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs blocked for 5 secs
[00:00:27.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs low on buffers.
[00:00:27.555,000] net_buf: pkt_alloc_buffer():867: Failed to get free buffer
[00:00:27.555,000] net_pkt: Data buffer (181) allocation failed (context_alloc_pkt:1321)
[00:00:28.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs blocked for 6 secs
[00:00:28.003,000] net_buf: pkt_alloc_buffer():867: Pool tx_bufs low on buffers.
.....

@carlescufi added the area: Networking and bug labels on Feb 12, 2021
jukkar added a commit to jukkar/zephyr that referenced this issue Feb 14, 2021
Make sure we send any pending data when network interface
comes up.

Fixes zephyrproject-rtos#32270

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
@jukkar
Member

jukkar commented Feb 14, 2021

@lund-prevas could you try #32290 and see if it works for you. One problem is that you have too few network buffers, so it might be difficult to recover from this situation.

@mogenslu
Author

I think I can explain what is going on.
I have an app publishing MQTT messages. This continues while the Ethernet is disabled and slowly drains the TCP layer of net_buf buffers.
When Ethernet is enabled again, the TCP layer needs to allocate a net_pkt, which it can, but not the net_bufs needed for the data.
This is unrecoverable in the TCP layer and the connection stalls.
I will handle my issue by subscribing to the NET_EVENT_L4_CONNECTED and NET_EVENT_L4_DISCONNECTED events from net_mgmt and stopping sends while TCP is disconnected; this stops the draining of the net_buf buffers. A sketch of that approach is shown below.
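
A minimal sketch of that approach (the flag, function name, and publish guard are illustrative; the net_mgmt callback wiring is the standard Zephyr API):

    #include <zephyr.h>
    #include <net/net_mgmt.h>
    #include <net/net_event.h>

    static struct net_mgmt_event_callback l4_cb;
    static atomic_t link_up; /* guard checked before each MQTT publish */

    static void l4_event_handler(struct net_mgmt_event_callback *cb,
                                 uint32_t mgmt_event, struct net_if *iface)
    {
        if (mgmt_event == NET_EVENT_L4_CONNECTED) {
            atomic_set(&link_up, 1);
        } else if (mgmt_event == NET_EVENT_L4_DISCONNECTED) {
            atomic_set(&link_up, 0);
        }
    }

    /* Call once at startup (function name is illustrative). */
    void app_init_l4_monitor(void)
    {
        net_mgmt_init_event_callback(&l4_cb, l4_event_handler,
                                     NET_EVENT_L4_CONNECTED |
                                     NET_EVENT_L4_DISCONNECTED);
        net_mgmt_add_event_callback(&l4_cb);
    }

    /* In the publish loop: skip mqtt_publish() while !atomic_get(&link_up),
     * so pending application data does not drain the TX net_buf pool while
     * the interface is down. */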

I think it should be considered whether the TCP layer could be changed to handle this situation.

@jukkar
Member

jukkar commented Feb 17, 2021

I think it should be considered whether the TCP layer could be changed to handle this situation.

Yes, I think we could do something about this. I will try to cook up a patch. Thanks for the analysis!

@jukkar
Member

jukkar commented Feb 17, 2021

@lund-prevas I just sent a PR that can prevent the memory exhaustion you saw: #32423

@mogenslu
Author

Is it possible to limit the window size in Kconfig?
Then it could be documented in Kconfig / the Zephyr documentation that the window size and the number of network buffers should be chosen so that at least this many buffers are left free.

What if I have 32 KB of buffers and send 32 KB of data in one chunk? Then all buffers will also be used.

@jukkar
Member

jukkar commented Feb 19, 2021

Is it possible to limit the window size in Kconfig?

There is a config option for this, CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE; it might be possible to use it here. Would you be able to experiment and see if it works as expected? One possible sizing is sketched below.
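
As one possible sizing (illustrative numbers only, not a recommendation), the send window could be capped well below the total TX buffer space so that some buffers always stay free:

    # prj.conf
    CONFIG_NET_BUF_TX_COUNT=64
    # With the default 128-byte data buffers this pool holds about 8 KB,
    # so a 4 KB send window leaves roughly half the buffers free.
    CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE=4096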
