
net regression: Connection to Zephyr server non-deterministically leads to client timeout, ENOTCONN on server side #34964

Closed
pfalcon opened this issue May 7, 2021 · 9 comments · Fixed by #35466
Labels: area: Networking · bug · priority: low · Regression
Milestone: v2.6.0

pfalcon (Contributor) commented May 7, 2021

Describe the bug

When running the dumb_http_server sample on qemu_x86 and connecting to it with ab -n1000 http://192.0.2.1:8080/, the ab tool aborts due to a timeout while handling one of these 1000 requests. Previously (v2.5.0), 100K requests were served without any problem.

To Reproduce
Steps to reproduce the behavior:

  1. cd samples/net/sockets/dumb_http_server
  2. west build -b qemu_x86 -t run
  3. In another window, ab -n1000 http://192.0.2.1:8080/
  4. Example output:
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 542 requests completed

The exact request number at which ab fails varies, but over 10 tries, none of the 1000-request runs completed successfully.

Expected behavior
Both ab -n1000 http://192.0.2.1:8080/ and ab -n100000 http://192.0.2.1:8080/ should complete successfully. v2.5.0 behaved that way, and I verified that checking out that tag still works for me.

Impact
We had issues like this previously (early 2.x releases); 2.4 and 2.5 passed this test. With this issue, we're back to network stack instability.

Logs and console output

In addition to the ab output shown above, on the dumb_http_server sample side, the output is:

Connection #538 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #539 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #540 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #541 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #542 from 192.0.2.2
Error sending data to peer, errno: 128
Connection from 192.0.2.2 closed
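
For context, the sample's main loop is roughly the following. This is a simplified sketch (error handling and the full request-read loop omitted, assuming CONFIG_NET_SOCKETS_POSIX_NAMES; see samples/net/sockets/dumb_http_server for the real code), not the exact sample code. The failing send() is where "Error sending data to peer, errno: 128" (ENOTCONN, per the issue title) is reported:

/* Simplified sketch of the dumb_http_server main loop; not the exact sample code. */
#include <stdio.h>
#include <errno.h>
#include <net/socket.h>

static char buf[1024];

void serve(void)
{
	int serv = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_ANY),
		.sin_port = htons(8080),
	};

	bind(serv, (struct sockaddr *)&addr, sizeof(addr));
	listen(serv, 5);

	while (1) {
		int client = accept(serv, NULL, NULL);

		/* Read the HTTP request, then send a static response. */
		recv(client, buf, sizeof(buf), 0);
		if (send(client, "HTTP/1.0 200 OK\r\n\r\n", 19, 0) < 0) {
			/* This is the "Error sending data to peer" path seen in the log. */
			printf("Error sending data to peer, errno: %d\n", errno);
		}
		close(client);
	}
}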

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK 0.12.4
  • Commit: git master f05ea67
pfalcon added the "bug" label May 7, 2021
pfalcon (Contributor, Author) commented May 7, 2021

Just to show that the issue is fully reproducible:

pfalcon@zenbook:~$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 344 requests completed
pfalcon@zenbook:~$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 542 requests completed
pfalcon@zenbook:~$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 240 requests completed
pfalcon@zenbook:~$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 152 requests completed
pfalcon@zenbook:~$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 368 requests completed
pfalcon@zenbook:~$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 12 requests completed
pfalcon@zenbook:~$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 317 requests completed
pfalcon@zenbook:~$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 303 requests completed

pfalcon (Contributor, Author) commented May 7, 2021

I have now tested with frdm_k64f and cannot reproduce this issue with either -n1000 or -n100000.

Historically, handling of retransmits and other uncommon conditions in the network stack was not ideal. QEMU serial emulation (as used for the SLIP networking connection) is also not ideal, leading to some loss of transmitted data and thus retransmits/timeouts in the stack. So the fact that 100K requests could be handled in 2.5.0 with qemu_x86 (and even qemu_cortex_m3, which has even worse serial emulation problems) was a very good indication that retransmits, etc. are handled well in the stack. It's a bit unfortunate that this issue pops up again.

carlescufi added the "Regression" label May 7, 2021
jukkar (Member) commented May 10, 2021

I just tried this and did not see any issue with native_posix. With qemu + e1000, Zephyr occasionally printed

[00:00:08.070,000] <err> net_tcp: net_conn_register(): -2
[00:00:08.070,000] <err> net_tcp: Cannot allocate a new TCP connection

which indicates that the app has too few connections available.
So I increased them by setting -DCONFIG_NET_MAX_CONTEXTS=16 -DCONFIG_NET_MAX_CONN=16, and after that ab managed to complete the run without issues.

Could you check if this helps in your case?

jukkar added this to the v2.6.0 milestone May 11, 2021
galak added the "priority: low" label May 11, 2021
pfalcon (Contributor, Author) commented May 12, 2021

@jukkar:

So I increased them by setting -DCONFIG_NET_MAX_CONTEXTS=16 -DCONFIG_NET_MAX_CONN=16 and after that the ab managed to do the run without issues.

Could you check if this helps in your case?

I used the standard prj.conf for the sample and the test process described in https://docs.zephyrproject.org/latest/guides/networking/qemu_setup.html?highlight=slip#basic-setup . Setting the above config options didn't help; I was still getting errors from ab with about the same frequency.

I'll be doing bisect now.

pfalcon (Contributor, Author) commented May 12, 2021

OK, I finished the bisect; it shows that the first problematic commit is dde03c6.

jukkar (Member) commented May 18, 2021

The issue seems to be related to the SLIP connection between Zephyr and the host. When using an Ethernet connection via the e1000 driver, I was not able to see any problems.

Edit: or to be more precise, the issue shows up with the SLIP connection, although the actual problem is not there. After some digging: the socket layer receives the HTTP request data in zsock_received_cb(), but then the received data just disappears (or no one is reading the receive queue).

jukkar (Member) commented May 18, 2021

Managed to get a log of the issue:

[00:00:02.660,000] <dbg> net_sock.zsock_accepted_cb: (rx_q[0]): parent=0x130900, ctx=0x130a50, st=0
[00:00:02.660,000] <dbg> net_sock.zsock_accept_ctx: (main): accept: ctx=0x130a50, fd=1
[00:00:02.660,000] <dbg> sample.main: Connection #57 from 192.0.2.2
[00:00:02.660,000] <dbg> net_sock.wait_data: (main): -----------waiting data 0x130a50 from queue 0x130abc
[00:00:02.670,000] <dbg> net_sock.zsock_received_cb: (rx_q[0]): ctx=0x130a50, pkt=0x14c3f0, st=0, user_data=(nil)
[00:00:02.670,000] <dbg> net_sock.zsock_received_cb: (rx_q[0]): ----------ctx 0x130a50 putting pkt 0x14c3f0 to queue 0x130abc
[00:00:32.720,000] <dbg> net_sock.zsock_received_cb: (rx_q[0]): ctx=0x130a50, pkt=(nil), st=-104, user_data=(nil)
[00:00:32.720,000] <dbg> net_sock.zsock_received_cb: (rx_q[0]): Set EOF flag on pkt 0x14c3f0
[00:00:32.720,000] <dbg> net_sock.wait_data: (main): -----------done waiting 0x130a50
[00:00:32.720,000] <err> sample: Error sending data to peer, errno: 128
[00:00:32.720,000] <dbg> net_sock.z_impl_zsock_close: (main): close: ctx=0x130a50, fd=1

Here, the recv() in main() is able to run before the receive callback is called; this means the queue is empty and we start waiting for the data to arrive. The condition variable signal should release the wait, but that does not seem to happen here.
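
To illustrate the lost-wakeup pattern described above, here is a minimal sketch using Zephyr's k_fifo/k_mutex/k_condvar primitives. The names (rx_item, rx_q, rx_lock, rx_cond, rx_put, rx_wait) are made up for the illustration; this shows the shape of the race, not the actual net_sock code:

/* Hypothetical illustration of the lost wakeup; not the real socket-layer code. */
#include <kernel.h>

struct rx_item {
	void *fifo_reserved;  /* first word reserved for k_fifo's internal use */
	/* packet data would follow here */
};

K_FIFO_DEFINE(rx_q);
K_MUTEX_DEFINE(rx_lock);
K_CONDVAR_DEFINE(rx_cond);

/* RX work queue thread, i.e. the zsock_received_cb() side */
static void rx_put(struct rx_item *item)
{
	k_fifo_put(&rx_q, item);
	/* The signal fires here; if the application thread has already seen
	 * an empty queue but has not yet reached k_condvar_wait(), this
	 * wakeup is lost. */
	k_condvar_signal(&rx_cond);
}

/* Application thread, i.e. the recv()/wait_data() side */
static struct rx_item *rx_wait(void)
{
	struct rx_item *item;

	while ((item = k_fifo_get(&rx_q, K_NO_WAIT)) == NULL) {
		k_mutex_lock(&rx_lock, K_FOREVER);
		/* Race window: rx_put() may have queued the packet and
		 * signalled between the k_fifo_get() above and this wait,
		 * so we block here even though data is already queued. */
		k_condvar_wait(&rx_cond, &rx_lock, K_FOREVER);
		k_mutex_unlock(&rx_lock);
	}

	return item;
}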

jukkar (Member) commented May 18, 2021

Also, this log looks weird:

[00:00:15.410,000] <dbg> net_sock.zsock_accepted_cb: (<log_strdup alloc failed>): parent=0x130900, ctx=0x1309a8, st=0
[00:00:15.410,000] <dbg> net_sock.zsock_accept_ctx: (<log_strdup alloc failed>): accept: ctx=0x1309a8, fd=1
[00:00:15.410,000] <dbg> sample.main: Connection #881 from 192.0.2.2
[00:00:15.410,000] <dbg> net_sock.zsock_received_cb: (<log_strdup alloc failed>): ctx=0x1309a8, pkt=0x14c470, st=0, user_data=(nil)
[00:00:15.410,000] <dbg> net_sock.zsock_received_cb: (<log_strdup alloc failed>): ----------ctx 0x1309a8 putting pkt 0x14c470 to queue 0x130a14 cond 0x130a34
[00:00:15.410,000] <dbg> net_sock.wait_data: (<log_strdup alloc failed>): -----------waiting data 0x1309a8 from queue 0x130a14 condvar 0x130a34
[00:00:45.440,000] <dbg> net_sock.zsock_received_cb: (rx_q[0]): ctx=0x1309a8, pkt=(nil), st=-104, user_data=(nil)
[00:00:45.440,000] <dbg> net_sock.zsock_received_cb: (rx_q[0]): Set EOF flag on pkt 0x14c470
[00:00:45.440,000] <dbg> net_sock.wait_data: (main): -----------done waiting 0x1309a8
[00:00:45.440,000] <err> sample: Error sending data to peer, errno: 128

In this log, the receive callback is called before the recv(), but we still start to wait for data even though the receive queue should already have it.

jukkar added a commit to jukkar/zephyr that referenced this issue May 21, 2021
Fix a regression where the application is waiting for data but does
not notice it because the socket layer is not woken up.

This could happen because the application was waiting on a condition
variable but the signal to wake the condvar came before the wait
started. Normally, if there is a constant flow of incoming data to
the socket, the signal is given later. But if the peer is waiting
for Zephyr to reply, there may be a timeout at the peer.

The solution is to add locking in the socket receive callback so
that we only signal the condition variable after we have made sure
that the condition variable is actually waiting for the data.

Fixes zephyrproject-rtos#34964

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
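
In the same sketched form as the earlier illustration (same hypothetical names rx_q, rx_lock, rx_cond; the real change merged as #35466 is in the Zephyr socket receive path), the idea of the fix is to queue, check, signal, and wait all under the same lock, so the signal can no longer land in the window between the queue check and the wait:

/* RX thread: queue the packet and signal while holding the lock */
static void rx_put(struct rx_item *item)
{
	k_mutex_lock(&rx_lock, K_FOREVER);
	k_fifo_put(&rx_q, item);
	k_condvar_signal(&rx_cond);
	k_mutex_unlock(&rx_lock);
}

/* Application thread: re-check the queue with the lock held, so the
 * put-and-signal cannot slip in between the check and the wait */
static struct rx_item *rx_wait(void)
{
	struct rx_item *item;

	k_mutex_lock(&rx_lock, K_FOREVER);
	while ((item = k_fifo_get(&rx_q, K_NO_WAIT)) == NULL) {
		k_condvar_wait(&rx_cond, &rx_lock, K_FOREVER);
	}
	k_mutex_unlock(&rx_lock);

	return item;
}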
nashif pushed a commit that referenced this issue May 25, 2021
pfalcon (Contributor, Author) commented May 28, 2021

Confirming that #35466 fixed this; tested with ab -n100000 http://192.0.2.1:8080/ on both qemu_x86 and qemu_cortex_m3.
