
QEMU serial output is not reliable, may affect SLIP and thus network testing #8187

Closed

pfalcon opened this issue Jun 5, 2018 · 10 comments

Labels: area: Networking, area: QEMU, bug, priority: low

Comments

pfalcon (Contributor) commented Jun 5, 2018

This ticket provides a (partial) answer to why the issue described in #7831 (comment) happens, specifically:

  1. when running the samples/net/socket/dumb_http_server sample app on qemu_cortex_m3,
  2. and running ab -n1000 http://192.0.2.1:8080/,
  3. processing of requests gets stuck after just a few dozen requests, and ab eventually times out
  4. (ab can be restarted and a number of requests can still be processed, i.e. the app keeps running, but requests soon get stuck again)

So, it's a more or less known issue, but it's not always kept in mind: UART emulation in QEMU is less than ideal, and there can be problems with the serial communication which is used by SLIP and loop-slip-tap.sh. This is what happens here.

For example, SLIP driver logging:

[slip] [INF] slip_send: sent: pkt 0x20001ec4 llr: 14, len: 54
[slip] [INF] slip_send: sent: pkt 0x20001ec4 llr: 14, len: 1506
[slip] [INF] slip_send: sent: pkt 0x20001e78 llr: 14, len: 783
[slip] [INF] slip_send: sent: pkt 0x20001e2c llr: 14, len: 54
Connection from 192.0.2.2 closed
[slip] [INF] slip_send: sent: pkt 0x20001e78 llr: 14, len: 783

What we can see here is that pkt 0x20001e78 was transmitted twice. But here's what Wireshark sees:

[Screenshot: Wireshark capture, 2018-06-05 23:50:25]

As can be seen, instead of the first 783-byte packet it receives a broken 275-byte packet, which gets ignored by the host. That's what causes the retransmission, and the next time the packet gets through.
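
A quick way to confirm this kind of truncation in a capture is to compare each packet's declared IP total length against the number of bytes actually captured. A minimal sketch (not part of the original report; it assumes scapy is installed and the capture was saved to a hypothetical slip.pcap):

# truncation_check.py - illustrative sketch, assumes scapy and a capture file named "slip.pcap"
from scapy.all import rdpcap, IP

for i, pkt in enumerate(rdpcap("slip.pcap")):
    if IP not in pkt:
        continue
    declared = pkt[IP].len      # total length the sender put into the IP header
    captured = len(pkt[IP])     # bytes of that packet actually present in the capture
    if captured < declared:
        print(f"frame {i}: truncated, header says {declared} bytes, captured {captured}")

A frame flagged here would correspond to the 275-vs-783 byte mismatch visible in the capture above.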

galak (Collaborator) commented Nov 20, 2018

Still an issue with new qemu?

pfalcon (Contributor, Author) commented Nov 20, 2018

I can retest to be 100% sure, but I don't see that that part of QEMU has changed. Indirectly, it's the same: the MicroPython testsuite running over QEMU serial emulation has a ~50% chance to fail: https://ci.linaro.org/view/lite-iot-ci/job/lite-aeolus-micropython/

pfalcon (Contributor, Author) commented Nov 21, 2018

@rlubos, this was submitted as a result of investigating the issue reported by you (see the comment link in the description), so I wonder why you haven't been mentioned yet ;-). @jukkar, you should be in the loop on every networking-related issue ;-).

pfalcon (Contributor, Author) commented Nov 21, 2018

Well, I'm actually pleasantly surprised, because the situation has visibly improved when running with the QEMU from SDK 0.9.5.

First run of dumb_http_server/qemu_cortex_m3 with ab -n1000 http://192.0.2.1:8080/ went without a hitch.

I then proceeded with -n10000, and it soon failed:

Benchmarking 192.0.2.1 (be patient)
Completed 1000 requests
apr_socket_recv: Connection reset by peer (104)
Total of 1219 requests completed

But note that the type of failure is different from the original description above: there it was a hang with an eventual timeout, here a quick ECONNRESET. Another ab session can be run after that. This ECONNRESET may well be a different issue, e.g. an issue in the stack rather than at the UART comm level - or not.

The ECONNRESETs are repeatable; the longest of 3 runs I got was:

apr_socket_recv: Connection reset by peer (104)
Total of 4499 requests completed

pfalcon (Contributor, Author) commented Nov 21, 2018

But! Now qemu_x86 and qemu_cortex_m3 have switched places, i.e. SLIP comm with qemu_x86 seems to be significantly broken:

Benchmarking 192.0.2.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 115 requests completed

I got these timeouts 2 times in a row (again, well before the 1000th request). All of that happened with -n10000. Surprisingly, running with -n1000, I got 2 successful runs. The thing smells the big number and gives up early, but cheerfully chews through the less frightening ones ;-).

Summing up: yes, QEMU SLIP is still not reliable.
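
To quantify this flakiness across boards and request counts, the benchmark can simply be repeated and the failure mode recorded each time. A minimal sketch of such a harness (not from the original report; it assumes ab is installed, the sample is already running behind loop-slip-tap.sh, and the URL/request count used above):

# ab_flakiness.py - illustrative harness, assumes ab is installed and the server is up
import subprocess

URL = "http://192.0.2.1:8080/"
RUNS = 5
REQUESTS = 10000

for run in range(RUNS):
    result = subprocess.run(["ab", f"-n{REQUESTS}", URL],
                            capture_output=True, text=True)
    out = result.stdout + result.stderr
    if "Connection reset by peer" in out:
        verdict = "ECONNRESET"
    elif "timeout specified has expired" in out:
        verdict = "timeout"
    elif result.returncode == 0:
        verdict = "ok"
    else:
        verdict = f"failed (exit {result.returncode})"
    # ab prints "Total of N requests completed" when it aborts early
    completed = [line for line in out.splitlines() if "requests completed" in line]
    print(f"run {run}: {verdict} {completed[0] if completed else ''}")

Something like this would separate the timeout failures seen with qemu_x86 from the ECONNRESETs seen with qemu_cortex_m3 above.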

pfalcon (Contributor, Author) commented Nov 21, 2018

@rlubos, given that this comes from your report, can you please put qemu_x86/qemu_cortex_m3 through some ordeal too?

jukkar (Member) commented Nov 28, 2018

Have you tried with the native_posix board? Just wondering if this is something related to SLIP or the serial connection, or if the issue is in some other part of the networking stack.

pfalcon (Contributor, Author) commented Nov 28, 2018

Have you tried with the native_posix board?

Frankly speaking, no, as I find the network setup of native_posix kind of cumbersome.

Just wondering if this is something related to SLIP or the serial connection, or if the issue is in some other part of the networking stack.

As the title of this ticket suggests, the best hypothesis is that the problem is on the side of the QEMU emulation. The behavior I observed is that the SLIP driver gets e.g. a 783-byte packet and spools it into the UART, yet Wireshark sees just a truncated 275-byte packet, which of course gets discarded.
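
SLIP framing itself is trivial (RFC 1055): each packet goes out byte-by-byte between END markers, with ESC sequences for the two special bytes, so if the emulated UART silently drops part of the stream, the receiver still sees a closing END and delivers a shorter packet, which the host then discards. A minimal sketch of the encoding (illustrative only, not the actual Zephyr drivers/slip code):

# slip_frame.py - illustrative SLIP (RFC 1055) encoder, not the Zephyr driver
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD

def slip_encode(packet: bytes) -> bytes:
    """Wrap one packet into a SLIP frame for transmission over a UART."""
    frame = bytearray([END])                 # leading END flushes any line noise
    for b in packet:
        if b == END:
            frame += bytes([ESC, ESC_END])   # escape the frame delimiter
        elif b == ESC:
            frame += bytes([ESC, ESC_ESC])   # escape the escape byte itself
        else:
            frame.append(b)
    frame.append(END)                        # frame terminator
    return bytes(frame)

If bytes go missing between the two END markers, the decoder on the host side ends up with e.g. 275 bytes instead of 783, and the IP/TCP length and checksum checks reject the packet.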

jukkar (Member) commented Dec 14, 2018

QEMU now has native Ethernet support, so we could start to migrate away from SLIP. Closing this one.

jukkar closed this as completed Dec 14, 2018
pfalcon (Contributor, Author) commented Dec 14, 2018

Well, closing is a bit hasty, given the "we could start". And this ticket is about serial output non-reliability, not just networking as affected by it. So, I'll likely reopen it when hitting this issue again.
