Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IP Stack: Enabling MMU for qemu_x86 broke active connect support #752

Closed
pfalcon opened this issue Jul 11, 2017 · 17 comments
Closed

IP Stack: Enabling MMU for qemu_x86 broke active connect support #752

pfalcon opened this issue Jul 11, 2017 · 17 comments
Assignees

Comments

@pfalcon
Copy link
Contributor

pfalcon commented Jul 11, 2017

dbd7052 seems to have broken TCP connect() support. E.g. echo_client crashes:

[QEMU] CPU: qemu32
qemu-system-i386: warning: Unknown firmware file in legacy mode: genroms/multiboot.bin

shell> ***** CPU Page Fault (error code 0x00000000)
Supervisor thread read address 0x00000000
Current thread ID = 0x00405b20
Faulting segment:address = 0x0008:0x000074bc
eax: 0x00000000, ebx: 0x0040908e, ecx: 0x0000000e, edx: 0x020200c0
esi: 0x0040d700, edi: 0x00000002, ebp: 0x0040e4e4, esp: 0x0040e4c4
eflags: 0x246
Fatal fault in essential thread! Spinning...
QEMU: Terminated
make[2]: Leaving directory '/mnt/hdd/projects-3rdparty/Embedded/Zephyr/zephyr/samples/net/http_client/outdir/qemu_x86'

I'm not exactly sure about operation which crashes, but initially discovered with MicroPython Zephyr port with socket.connect() call.

I wasn't able to find pull request which introduced that commit (to cross-link to this ticket), @andrewboie , can you please point to it?

@pfalcon
Copy link
Contributor Author

pfalcon commented Jul 11, 2017

Btw, this is 3rd regression in row from the recent commits found in last 3 days. Which leads to thinking:

From the commit message of the above patch:

We have lots of RAM, this helps catch bugs.

Indeed, it helped ;-). But who's going to resolve them? How about reverting that patch, then fixing the issues it uncovered as resources permit, and only then enable it by default?

@jukkar
Copy link
Member

jukkar commented Jul 11, 2017

Lets not revert the MMU patch but fix instead the errors in the echo-client instead.
I am actually trying echo-client as I am writing this and I do not see any errors so far. Can you provide your conf file that I can try it too.

@jukkar
Copy link
Member

jukkar commented Jul 11, 2017

Ah, I could replicate this after all. Investigating the issue....

@jukkar
Copy link
Member

jukkar commented Jul 11, 2017

The problem seems to be some stray pointer that only occacionally goes to NULL, I have hard time triggering this with gdb attached.

jukkar added a commit to jukkar/zephyr that referenced this issue Jul 11, 2017
When the ARP message is received when the device is starting up,
the network interface might not yet have IPv4 address setup
correctly. In this case, the IP address pointer could be NULL
and we must not use it for anything.

Fixes zephyrproject-rtos#752

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
@jukkar
Copy link
Member

jukkar commented Jul 11, 2017

Found an issue in ARP handling, depends lot of timing issues so a bit hard to replicate. This is only seen when the device is starting up.

@andrewboie
Copy link
Contributor

@pfalcon we are not reverting the MMU patch.
The whole point of turning this on was to expose all these issues well in advance of the release.
I see no value in delaying it once sanitycheck is passing.
If you are still seeing lots of problems even with sanitycheck passing, maybe you should consider opening some user stories to improve test coverage.

@pfalcon
Copy link
Contributor Author

pfalcon commented Jul 11, 2017

@andrewboie : I don't see how submitting more user stories helps to resolve issues. I submitted GH-3559 "Fault injection framework for Zephyr" some time ago and set its Fix Version to 1.9.0, it was reset. While instead it would need involvement of folks experienced with fault injection e.g. in Linux to design something for Zephyr. And FI is required to even start talking about network code coverage. And as I mentioned, I found 3 issues in row in 3 days, all using MicroPython application, i.e. they aren't much findable using Zephyr own tree. (And I'm trying to have progress with MicroPython port to move along with more testing, and instead hit these recent regressions.)

@pfalcon
Copy link
Contributor Author

pfalcon commented Jul 11, 2017

@jukkar : Thanks. In my case, both crashes were 100% reproducible. #753 appears to fix echo_client issue, but MicroPython's test_dl.py script still crashes, i.e. there're more issues.

@andrewboie
Copy link
Contributor

andrewboie commented Jul 11, 2017

@pfalcon what exactly are you asking for?
sanitycheck is the means we have at our disposal to determine whether a change breaks the tree. the MMU patches weren't merged until sanitycheck was passing.
If there are additional NULL pointer dereferences lurking, I think you're just going to have to fix them, I don't think it is appropriate to revert the MMU patch.
The point I am trying to make about filing stories for coverage is that it seems clear at this point that there are parts of the network stack that are not being exercised by the current set of test cases, otherwise sanitycheck would have found them instead of you.

jukkar added a commit that referenced this issue Jul 11, 2017
When the ARP message is received when the device is starting up,
the network interface might not yet have IPv4 address setup
correctly. In this case, the IP address pointer could be NULL
and we must not use it for anything.

Fixes #752

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
@jukkar
Copy link
Member

jukkar commented Jul 11, 2017

MicroPython's test_dl.py script still crashes

I can look into it, but I would need information how to test MicroPython, is there some instructions somewhere describing how to do this?

@jukkar
Copy link
Member

jukkar commented Jul 11, 2017

there are parts of the network stack that are not being exercised by the current set of test cases

This is currently the case, we are not doing proper end-to-end testing for the networking stack. We have lot of unit tests and some of them simulate networking but it is not really the same thing as end-to-end testing of the stack. This needs to be improved of course.

@pfalcon
Copy link
Contributor Author

pfalcon commented Jul 11, 2017

@andrewboie:

sanitycheck is the means we have at our disposal to determine whether a change breaks the tree.

Unfortunately, it's not easy to test networking code with sanitycheck as it is now, and many tests aren't even possible without additional infrastructure (like fault injection mention above).

what exactly are you asking for?

Well, I just appreciate discussion of the issues we have (particularly because I imagine you're member of TSC and can affect prioritization of test infra related tasks).

One question I was asking though and would appreciate an answer is:

I wasn't able to find pull request which introduced that commit (to cross-link to this ticket), @andrewboie , can you please point to it?

Searching commit subject line, I can find only https://github.com/zephyrproject-rtos/zephyr/pull/245/commits , but it lacks the exact commit in question. Thanks.

@pfalcon
Copy link
Contributor Author

pfalcon commented Jul 11, 2017

@jukkar :

I can look into it, but I would need information how to test MicroPython, is there some instructions somewhere describing how to do this?

There're, e.g. https://github.com/pfalcon/micropython/wiki/ZephyrSocket . But while I would appreciate you trying it, I may imagine, you have more important things to do, so no hurry with that, and I'll try to investigate issue with the connect() myself a bit later. (I'm just trying to refactor MicroPython port from out-of-tree socket code to the in-tree one, and can't establish solid testing baseline, each time I try, I find a new regression even with old out of tree code ;-) ).

@pfalcon pfalcon reopened this Jul 11, 2017
@jukkar
Copy link
Member

jukkar commented Jul 12, 2017

I'll try to investigate issue with the connect() myself a bit later.

Found the root cause for this connect crash. The local end point was not created which then caused a crash because context->local.sin_addr was NULL and was accessed when connecting. The micropython sample test started to work when I added

s.bind(("192.0.2.1",8888))

before s.connect() call. I suppose we would need to fix this by binding automatically in net_context_connect() if the user has not bound the local end point. I am a bit busy now with DTLS support so this will need to be postponed a bit, or if possible could you propose a patch that fixes this?

@pfalcon
Copy link
Contributor Author

pfalcon commented Jul 12, 2017

@jukkar : Thanks for investigating this! Yes, I'll look into this, hopefully this week.

@jukkar
Copy link
Member

jukkar commented Aug 13, 2017

I suggest we close this. There are probably still issues in the stack regarding NULL pointer access, but the original issue reported by @pfalcon is fixed.

@pfalcon
Copy link
Contributor Author

pfalcon commented Aug 13, 2017

Oh, so the bugtracker is back, neat!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants