Curious Timing Causes Failed Syscall on Cortex-M4 MCU #3109
Comments
From #3076 (comment) A few things I have tried:
All of these results are repeatable. Moving the app to a different location in memory by flashing other apps before it does not seem to fix things. An interesting process_console printout from the failed version:
To reproduce:
Also note:
Giving up for now, but I was able to make two versions of the analog comparator test app that differ in one instruction where one causes the error and the other doesn't. Basically inserting one meaningless instruction before the
Any changes to the app or kernel seem to make the error go away. Best ideas so far:
I did an experiment where I changed the memop number in the sbrk function to 0xf (rather than 0x1). I also then have memop() call It seems like what could be happening:
To add a little more data from my debugging:
I would like to work on this. @ppannuto Can you sketch out what the context handlers should be doing, i.e. expand on what
means?
Sorry, almost forgot to post this. The book I mentioned on the call is "The Definitive Guide to ARM Cortex-[ ]". There are several editions that fill in the blank, but the most relevant for us is probably the M3 edition rather than the M0 edition. It's almost certainly available via your university library, or Google finds a copy here: http://centaur.sch.bme.hu/~holcsik_t/sem/The%20Definitive%20Guide%20to%20the%20ARM%20Cortex-M3.pdf Chapter 7 is the general discussion of how nested handlers and PendSV interrupts work; Chapter 12 is also probably relevant, as it discusses how to implement a dual-stack (split kernel/userspace + MPU) system.
There has been some discussion on this on the #wg-context-switch channel on slack. I'll try to summarize a little here: We do not have an explanation which both matches 1) our understanding of how the cortex-m hardware behaves, and 2) the symptoms we observed from the process crashes on imix. What we observed on imix is:
The issue at this point is that even if The problem with that is that we don't know how this can happen. The SVC exception is a relatively high priority interrupt because its exception number is quite low, and the only exceptions with higher priority are fatal exceptions that will crash the system. The SVC exception is also synchronous. Everything I have come up with that would cause the exhibited behavior doesn't gel with these two properties of the SVC exception.
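To make the priority argument above concrete, here is a minimal sketch of how a Cortex-M core picks between two pending exceptions, as I understand the architecture: a lower priority value wins, and when the values are equal the lower exception number wins. The struct and function names are hypothetical, the sketch ignores priority grouping and sub-priorities, and it says nothing about how the Tock kernel actually configures priorities on imix.

```c
#include <stdbool.h>

/* Hypothetical model of a pending Cortex-M exception (illustration only).
 * Exception numbers: 3 = HardFault, 11 = SVCall, 14 = PendSV, 15 = SysTick.
 * Priority: lower value = more urgent. HardFault is fixed at -1; SVCall,
 * PendSV and SysTick are configurable and default to 0 after reset. */
struct exception {
  int number;
  int priority;
};

/* Returns true if `a` is taken before `b` when both are pending at once.
 * Ties in priority are broken by the lower exception number. */
static bool takes_precedence(struct exception a, struct exception b) {
  if (a.priority != b.priority) {
    return a.priority < b.priority;
  }
  return a.number < b.number;
}

/* Returns true if `a` can preempt a handler for `b` that is already running.
 * Preemption needs a strictly more urgent priority; the exception number
 * alone is not enough. */
static bool can_preempt(struct exception a, struct exception b) {
  return a.priority < b.priority;
}
```

Under these assumptions, with SVCall and SysTick both at priority 0, SVCall is taken first when both are pending, but neither can preempt the other once its handler is running. Only something with a strictly lower priority value, such as a fault, could interrupt the SVC handler.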
From #3305:
I sort of agree. I do wish this problem had happened on any other MCU, as we have reason to be suspicious of the SAM4L. I'm not sure what the right path forward is. Some options:
Number 2 seems reasonable, but we would still be left doubting whether it actually fixes the issue.
I think the right path forward if we want to spend more time on this is to take the approach @ppannuto suggested on slack to try to get an instruction trace of the buggy execution (since it is at least highly repeatable):
If we could get something close to a detailed instruction trace we could almost certainly figure out what is happening. I am not actually sure how to do this with the Jlink I have though -- is there a way to interact with the JTAG debugger programmatically via GDB in a way that would be fast enough to actually capture all of the instructions being executed?
I did want to point out that something similar happens on the Apollo3 boards as well. It isn't just a SAM4L issue.
We don't have context switch support for floating point chips, so it makes sense that there would be faults. |
I just want to document where I stand after working on this for about a week last month. I can't find the issue, or even a plausible explanation for the issue, so I think any change is as likely to fix whatever the bug is as it is to not fix it. I consider this to be something we just ignore at this point. I suppose we leave the issue open in case anyone tries to get an actual instruction execution trace, which seems like the only option for debugging this.
Originally discovered in #3076.
Observation
With a very specific kernel (on commit f6adf9d) compiled on Linux, running on imix, with the examples/tests/analog_comparator app (commit f5e164366028b5633006299ac136ba5bf0e0db74) as the only app running, the fourth syscall (a memop sbrk) fails in some way. In code, libtock-c memop() seems to get an invalid return code from the kernel (r0==1, which isn't a valid memop return code).
Comments
Observing this bug seems to take a VERY specific order of instructions and timing. The bug is very repeatable (Hudson and I can both observe it, on multiple imixes, over many trials, and it happens every time). However, modifying the kernel, OR the app, OR the number of apps all cause it to go away. This suggests that the particular timing of when the analog_comparator app calls sbrk matters significantly. For example, compile the same kernel on a Mac and the issue goes away. Add a second app (lengthening the boot process) and the issue goes away. Add even a single instruction before the svc 5 (memop) call in the app and the issue goes away.
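As a rough illustration of why r0==1 looks wrong for memop, the sketch below checks a raw r0 value against the return variants I would expect memop to produce. It assumes the Tock 2.0 system call return encoding (Failure = 0, Failure with u32 = 1, Success = 128, Success with u32 = 129) and that memop only returns Failure, Success, or Success with u32. The enum and helper names are hypothetical and are not part of libtock-c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return variant identifiers, assuming the Tock 2.0 syscall ABI.
 * These names are made up for this sketch. */
enum return_variant {
  RV_FAILURE     = 0,
  RV_FAILURE_U32 = 1,
  RV_SUCCESS     = 128,
  RV_SUCCESS_U32 = 129,
};

/* Hypothetical helper: returns true if the raw r0 value is one of the
 * variants memop is assumed to return (Failure, Success, Success with u32). */
static bool memop_variant_is_expected(uint32_t r0) {
  return r0 == RV_FAILURE || r0 == RV_SUCCESS || r0 == RV_SUCCESS_U32;
}
```

With this check, `memop_variant_is_expected(1)` is false, which matches the observation that r0==1 is not something memop should hand back to the app. A check like this could be dropped into a debug build of the app to flag the failure at the point it happens, although adding it would itself change the instruction sequence that the bug seems to depend on.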