IP Stack: Enabling MMU for qemu_x86 broke active connect support #752

pfalcon · 2017-07-11T13:04:20Z

dbd7052 seems to have broken TCP connect() support. E.g. echo_client crashes:

[QEMU] CPU: qemu32
qemu-system-i386: warning: Unknown firmware file in legacy mode: genroms/multiboot.bin

shell> ***** CPU Page Fault (error code 0x00000000)
Supervisor thread read address 0x00000000
Current thread ID = 0x00405b20
Faulting segment:address = 0x0008:0x000074bc
eax: 0x00000000, ebx: 0x0040908e, ecx: 0x0000000e, edx: 0x020200c0
esi: 0x0040d700, edi: 0x00000002, ebp: 0x0040e4e4, esp: 0x0040e4c4
eflags: 0x246
Fatal fault in essential thread! Spinning...
QEMU: Terminated
make[2]: Leaving directory '/mnt/hdd/projects-3rdparty/Embedded/Zephyr/zephyr/samples/net/http_client/outdir/qemu_x86'

I'm not exactly sure about operation which crashes, but initially discovered with MicroPython Zephyr port with socket.connect() call.

I wasn't able to find pull request which introduced that commit (to cross-link to this ticket), @andrewboie , can you please point to it?

The text was updated successfully, but these errors were encountered:

pfalcon · 2017-07-11T13:14:18Z

Btw, this is 3rd regression in row from the recent commits found in last 3 days. Which leads to thinking:

From the commit message of the above patch:

We have lots of RAM, this helps catch bugs.

Indeed, it helped ;-). But who's going to resolve them? How about reverting that patch, then fixing the issues it uncovered as resources permit, and only then enable it by default?

jukkar · 2017-07-11T13:53:19Z

Lets not revert the MMU patch but fix instead the errors in the echo-client instead.
I am actually trying echo-client as I am writing this and I do not see any errors so far. Can you provide your conf file that I can try it too.

jukkar · 2017-07-11T13:57:03Z

Ah, I could replicate this after all. Investigating the issue....

jukkar · 2017-07-11T14:21:18Z

The problem seems to be some stray pointer that only occacionally goes to NULL, I have hard time triggering this with gdb attached.

When the ARP message is received when the device is starting up, the network interface might not yet have IPv4 address setup correctly. In this case, the IP address pointer could be NULL and we must not use it for anything. Fixes zephyrproject-rtos#752 Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>

jukkar · 2017-07-11T14:42:55Z

Found an issue in ARP handling, depends lot of timing issues so a bit hard to replicate. This is only seen when the device is starting up.

andrewboie · 2017-07-11T14:43:59Z

@pfalcon we are not reverting the MMU patch.
The whole point of turning this on was to expose all these issues well in advance of the release.
I see no value in delaying it once sanitycheck is passing.
If you are still seeing lots of problems even with sanitycheck passing, maybe you should consider opening some user stories to improve test coverage.

pfalcon · 2017-07-11T15:53:03Z

@andrewboie : I don't see how submitting more user stories helps to resolve issues. I submitted GH-3559 "Fault injection framework for Zephyr" some time ago and set its Fix Version to 1.9.0, it was reset. While instead it would need involvement of folks experienced with fault injection e.g. in Linux to design something for Zephyr. And FI is required to even start talking about network code coverage. And as I mentioned, I found 3 issues in row in 3 days, all using MicroPython application, i.e. they aren't much findable using Zephyr own tree. (And I'm trying to have progress with MicroPython port to move along with more testing, and instead hit these recent regressions.)

pfalcon · 2017-07-11T16:01:04Z

@jukkar : Thanks. In my case, both crashes were 100% reproducible. #753 appears to fix echo_client issue, but MicroPython's test_dl.py script still crashes, i.e. there're more issues.

andrewboie · 2017-07-11T16:42:04Z

@pfalcon what exactly are you asking for?
sanitycheck is the means we have at our disposal to determine whether a change breaks the tree. the MMU patches weren't merged until sanitycheck was passing.
If there are additional NULL pointer dereferences lurking, I think you're just going to have to fix them, I don't think it is appropriate to revert the MMU patch.
The point I am trying to make about filing stories for coverage is that it seems clear at this point that there are parts of the network stack that are not being exercised by the current set of test cases, otherwise sanitycheck would have found them instead of you.

When the ARP message is received when the device is starting up, the network interface might not yet have IPv4 address setup correctly. In this case, the IP address pointer could be NULL and we must not use it for anything. Fixes #752 Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>

jukkar · 2017-07-11T17:21:06Z

MicroPython's test_dl.py script still crashes

I can look into it, but I would need information how to test MicroPython, is there some instructions somewhere describing how to do this?

jukkar · 2017-07-11T17:24:27Z

there are parts of the network stack that are not being exercised by the current set of test cases

This is currently the case, we are not doing proper end-to-end testing for the networking stack. We have lot of unit tests and some of them simulate networking but it is not really the same thing as end-to-end testing of the stack. This needs to be improved of course.

pfalcon · 2017-07-11T18:14:34Z

@andrewboie:

sanitycheck is the means we have at our disposal to determine whether a change breaks the tree.

Unfortunately, it's not easy to test networking code with sanitycheck as it is now, and many tests aren't even possible without additional infrastructure (like fault injection mention above).

what exactly are you asking for?

Well, I just appreciate discussion of the issues we have (particularly because I imagine you're member of TSC and can affect prioritization of test infra related tasks).

One question I was asking though and would appreciate an answer is:

I wasn't able to find pull request which introduced that commit (to cross-link to this ticket), @andrewboie , can you please point to it?

Searching commit subject line, I can find only https://github.com/zephyrproject-rtos/zephyr/pull/245/commits , but it lacks the exact commit in question. Thanks.

pfalcon · 2017-07-11T18:19:37Z

@jukkar :

I can look into it, but I would need information how to test MicroPython, is there some instructions somewhere describing how to do this?

There're, e.g. https://github.com/pfalcon/micropython/wiki/ZephyrSocket . But while I would appreciate you trying it, I may imagine, you have more important things to do, so no hurry with that, and I'll try to investigate issue with the connect() myself a bit later. (I'm just trying to refactor MicroPython port from out-of-tree socket code to the in-tree one, and can't establish solid testing baseline, each time I try, I find a new regression even with old out of tree code ;-) ).

jukkar · 2017-07-12T07:33:48Z

I'll try to investigate issue with the connect() myself a bit later.

Found the root cause for this connect crash. The local end point was not created which then caused a crash because context->local.sin_addr was NULL and was accessed when connecting. The micropython sample test started to work when I added

s.bind(("192.0.2.1",8888))

before s.connect() call. I suppose we would need to fix this by binding automatically in net_context_connect() if the user has not bound the local end point. I am a bit busy now with DTLS support so this will need to be postponed a bit, or if possible could you propose a patch that fixes this?

pfalcon · 2017-07-12T10:32:31Z

@jukkar : Thanks for investigating this! Yes, I'll look into this, hopefully this week.

jukkar · 2017-08-13T17:46:58Z

I suggest we close this. There are probably still issues in the stack regarding NULL pointer access, but the original issue reported by @pfalcon is fixed.

pfalcon · 2017-08-13T18:05:16Z

Oh, so the bugtracker is back, neat!

pfalcon assigned jukkar and andrewboie Jul 11, 2017

jukkar mentioned this issue Jul 11, 2017

net: arp: Do not try to access NULL pointer #753

Merged

jukkar closed this as completed in #753 Jul 11, 2017

pfalcon reopened this Jul 11, 2017

This was referenced Jul 13, 2017

net: Fix connectivity issues if only UDP or TCP is enabled #776

Merged

net: context: connect: Make sure local end is bound before connecting #785

Closed

pfalcon closed this as completed Aug 13, 2017

zephyrbot mentioned this issue Sep 23, 2017

Enabling MMU for qemu_x86 broke active connect support #3836

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IP Stack: Enabling MMU for qemu_x86 broke active connect support #752

IP Stack: Enabling MMU for qemu_x86 broke active connect support #752

pfalcon commented Jul 11, 2017

pfalcon commented Jul 11, 2017

jukkar commented Jul 11, 2017

jukkar commented Jul 11, 2017

jukkar commented Jul 11, 2017

jukkar commented Jul 11, 2017

andrewboie commented Jul 11, 2017

pfalcon commented Jul 11, 2017 •

edited by zephyrbot

pfalcon commented Jul 11, 2017

andrewboie commented Jul 11, 2017 •

edited

jukkar commented Jul 11, 2017

jukkar commented Jul 11, 2017

pfalcon commented Jul 11, 2017

pfalcon commented Jul 11, 2017

jukkar commented Jul 12, 2017

pfalcon commented Jul 12, 2017

jukkar commented Aug 13, 2017

pfalcon commented Aug 13, 2017

IP Stack: Enabling MMU for qemu_x86 broke active connect support #752

IP Stack: Enabling MMU for qemu_x86 broke active connect support #752

Comments

pfalcon commented Jul 11, 2017

pfalcon commented Jul 11, 2017

jukkar commented Jul 11, 2017

jukkar commented Jul 11, 2017

jukkar commented Jul 11, 2017

jukkar commented Jul 11, 2017

andrewboie commented Jul 11, 2017

pfalcon commented Jul 11, 2017 • edited by zephyrbot

pfalcon commented Jul 11, 2017

andrewboie commented Jul 11, 2017 • edited

jukkar commented Jul 11, 2017

jukkar commented Jul 11, 2017

pfalcon commented Jul 11, 2017

pfalcon commented Jul 11, 2017

jukkar commented Jul 12, 2017

pfalcon commented Jul 12, 2017

jukkar commented Aug 13, 2017

pfalcon commented Aug 13, 2017

pfalcon commented Jul 11, 2017 •

edited by zephyrbot

andrewboie commented Jul 11, 2017 •

edited