Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Curious Timing Causes Failed Syscall on Cortex-M4 MCU #3109

Open
bradjc opened this issue Jul 26, 2022 · 13 comments
Open

Curious Timing Causes Failed Syscall on Cortex-M4 MCU #3109

bradjc opened this issue Jul 26, 2022 · 13 comments
Labels

Comments

@bradjc
Copy link
Contributor

bradjc commented Jul 26, 2022

Originally discovered in #3076.

Observation

With a very specific kernel (on commit f6adf9d) compiled on linux, running on imix, with the examples/tests/analog_comparator app (commit f5e164366028b5633006299ac136ba5bf0e0db74) as the only app running, the fourth syscall (a memop sbrk) fails in some way. In code, libtock-c memop() seems to get an invalid return code from the kernel (r0==1, which isn't a valid memop return code).

Comments

Observing this bug seems to take a VERY specific order of instructions and timing. The bug is very repeatable (Hudson and I can both observe it, on multiple imixes, over many trials, and it happens every time). However, modifying the kernel OR the app, OR the number of apps all cause it to go away. This suggests that the particular timing quirk of when the analog_comparator app calls sbrk matters significantly. For example, compiling the same kernel on mac and the issue goes away. Adding a second app (and lengthing the boot process), and the issue goes away. Adding even a single instruction before the svc 5 (memop) call in the app and the issue goes away.

@bradjc bradjc added the bug label Jul 26, 2022
@bradjc
Copy link
Contributor Author

bradjc commented Jul 26, 2022

From #3076 (comment)

A few things I have tried:

  • Current kernel master + libtock-c master acomp app: works
  • This PR + libtock-c master acomp app : fails
  • This PR + libtock-c acomp app with one 4 byte value added to the binary: works
  • This PR with a 48 byte string added to the kernel + libtock-c master acomp app: works
  • This PR with an 8 byte string removed from the kernel + libtock-c master acomp app: works
  • This PR w/ system call tracing enabled: works
  • Modifying master to add a 48 byte string, such that the master kernel is the exact same size as the kernel in this PR: works

All of these results are repeatable. Moving the app to a different location in memory by flashing other apps before it does not seem to fix things.

An interesting process_console printout from the failed version:

tock$ process analog_comparator
𝐀𝐩𝐩: analog_comparator   -   [Running]
 Events Queued: 0   Syscall Count: 3   Dropped Upcall Count: 0
 Restart Count: 0
 Last Syscall: Memop { operand: 11, arg0: 536906356 }
 Completion Code: None


 ╔═══════════╤══════════════════════════════════════════╗
 ║  Address  │ Region Name    Used | Allocated (bytes)  ║
 ╚0x2000A000═╪══════════════════════════════════════════╝
             │ Grant Ptrs      152
             │ Upcalls         360
             │ Process         880
  0x20009A90 ┼───────────────────────────────────────────
             │ ▼ Grant           0
  0x20009A90 ┼───────────────────────────────────────────
             │ Unused
  0x20008A74 ┼───────────────────────────────────────────
             │ ▲ Heap            0 |   4124               S
  0x20008A74 ┼─────────────────────────────────────────── R
             │ Data            628 |    628               A
  0x20008800 ┼─────────────────────────────────────────── M
             │ ▼ Stack         200 |   2048
  0x20008738 ┼───────────────────────────────────────────
             │ Unused
  0x20008000 ┴───────────────────────────────────────────
             .....
  0x00042000 ┬─────────────────────────────────────────── F
             │ App Flash      8128                        L
  0x00040040 ┼─────────────────────────────────────────── A
             │ Protected        64                        S
  0x00040000 ┴─────────────────────────────────────────── H

@bradjc
Copy link
Contributor Author

bradjc commented Jul 26, 2022

To reproduce:

  • Need Hudson's imix kernel (or to compile on linux, not mac).
  • Run only the analog comparator test app
  • The app calls exit when doing its first sbrk, and gets stuck in a while 1 loop in the exit function. The symptom appears to be the memop call returns an invalid return (fail with value, r0 == 1). However, the kernel does not record that the app ever called memop(1) (sbrk), so its not clear what is happening.

Also note:

  • sbrk uses r0=1, and the error is happening because r0==1 "after" the memop runs. so could be that r0 is not changing (ie not being set by the kernel).

Giving up for now, but I was able to make two versions of the analog comparator test app that differ in one instruction where one causes the error and the other doesn't. Basically inserting one meaningless instruction before the svc 5 memop call and the problem goes away (insert it after the svc 5 and still broken):

diff cortex-m4.userland_debug.lst.WORKING_MINOR_CHANGE cortex-m4.userland_debug.lst.BROKEN_MINOR_CHANGE
1962,1963c1962,1963
<    403c6:	4684      	mov	ip, r0
<    403c8:	df05      	svc	5
---
>    403c6:	df05      	svc	5
>    403c8:	4684      	mov	ip, r0
diff --git a/libtock/tock.c b/libtock/tock.c
index 04787bf..adb2941 100644
--- a/libtock/tock.c
+++ b/libtock/tock.c
@@ -348,16 +348,26 @@ allow_userspace_r_return_t allow_userspace_read(uint32_t driver,
 }

 memop_return_t memop(uint32_t op_type, int arg1) {
+  // uint32_t opppp=op_type;
+  // register uint32_t r5 __asm__ ("r5") = op_type;
   register uint32_t r0 __asm__ ("r0") = op_type;
   register int r1 __asm__ ("r1")      = arg1;
   register uint32_t val __asm__ ("r1");
   register uint32_t code __asm__ ("r0");
+  // register uint32_t myr3 __asm__ ("r3");
+  // register uint32_t opp __asm__ ("r5");
   __asm__ volatile (
-    "svc 5"
+    "mov r12, r0\n"
+    // "mov r12, r0\n"
+    "svc 5\n"
     : "=r" (code), "=r" (val)
     : "r" (r0), "r" (r1)
+    // : "r2", "r3", "memory"
     : "memory"
     );
+  // if (opp == 11) {
+  //   _exit(0xcc);
+  // }
   if (code == TOCK_SYSCALL_SUCCESS) {
     memop_return_t rv = {TOCK_STATUSCODE_SUCCESS, 0};
     return rv;
@@ -369,7 +379,9 @@ memop_return_t memop(uint32_t op_type, int arg1) {
     return rv;
   } else {
     // Invalid return type
-    exit(1);
+    exit(r0);
+    // exit(0xc);
+    // exit(op_type);
   }
 }

Any changes to the app or kernel seem to make the error go away.

Best ideas so far:

  • some sort of exception is happening during the svc call, which is confusing the svc_handler_arm_v7m code somehow and it ends up going back to the app without ever handling the syscall in the kernel.
  • we've stumbled on some very reproducible timing issue, where the svc memop sbrk is happening at the exact right moment, causing weird behavior. Even delaying the memop svc by one instruction causes the issue to go away. Using a different kernel which delays it removes the error, loading a second process changes the timing and the issue goes away, or even just inserting a single instruction before the svc.

@bradjc
Copy link
Contributor Author

bradjc commented Jul 26, 2022

I did an experiment where I changed the memop number in the sbrk function to 0xf (rather than 0x1). I also then have memop() call exit(r0) rather than exit(1) (this leaves r0 in r0). Running the analog_comparator app and sure enough after faulting the app, 0xf is still in r0. This suggests that the kernel is NOT running, and the app is somehow resuming from the SVC call with the same register state it had when it called the SVC. I can reproduce this with any small value (0x2e for example).

It seems like what could be happening:

  1. The app calls memop (sbrk).
  2. During the exception for the SVC, another exception occurs (?), or something happens. I don't know what other exception would happen so reliably, but this is consistent with the timing being crucial.
  3. Somehow the other exception causes the in-kernel svc exception handler to do the wrong thing and return to userspace.
  4. The memop function in libtock-c thinks the syscall was handled when it wasn't and tries to interpret what it called the svc with as the return values, causing an exit() call.

@hudson-ayers
Copy link
Contributor

To add a little more data from my debugging:

  • If you put a breakpoint inside the sbrk call in libtock-c and immediately continue from there, the bug does not occur. This suggests that the problem is not just specific to some memory address being overwritten, and that timing is an important factor.
  • There seem to be GPIO interrupts arriving despite GPIO interrupts not being enabled anywhere that I can find -- however, I am unsure if this is just an artifact of the JTAG debugger being attached somehow
  • I find that the bug is reproducible even with certain apps flashed alongside analog_comparator: flashing imix or lua_hello preserves the bug. Strangely, the bug only seems to exist if analog_comparator is the last app on the board (tockloader installs apps largest first)

@ppannuto
Copy link
Member

ppannuto commented Aug 6, 2022

This issue is in our svc handler. Specifically, we don't handle all of the possible exception entry cases:

This issue will happen if a higher-priority exception preempts a currently executing exception handler. (Not surprising it's so timing sensitive; lucky to hit really).

Specifically, when an exception handler starts executing, there are three possible states the processor could have been executing in before; the LR contents distinguishes which of these is the case:

  • 0xFFFF_FFFD — The core was not currently executing an exception handler and the core was in “Thread” (aka userspace) context
  • 0xFFFF_FFF9 — The core was not currently executing an exception handler and the core was in “Main” (aka kernel) context
  • 0xFFFF_FFF1 — The core was in the middle of executing an exception handler and a higher priority exception came in, so the current exception handler’s stack frame has been pushed, and we start a new exception handler

Currently, we do not handle the F1 case correctly (i.e. the code assumed only FD or F9 will ever happen).

It’s not so surprising that this hasn’t bit us yet as it’s rare: it will only happen if the core is servicing an exception when another exception arrives [and we expect to return successfully — i.e. this wasn’t a problem when we had HardFaults in our exception handlers because we just panic’d anyway].

The symptom is that we will never finish servicing the exception handler that was interrupted. The “return code” of r0==1 is almost certainly happening because this situation occurs during a sbrk, i.e. you’re actually seeing the passed in argument to the syscall still as the requested syscall didn’t finish executing, so it never overwrote the stack with new values.

But what about that half-complete first exception handler you ask? Its stack frame is actually still around, hanging out on the kernel stack.

tl;dr this is a bug; the quick hack check would be to add a compare for the F1 case and verify it goes away. The actual fix is probably to read the Interrupt Control and State Register (ICSR) and do the right thing from there, especially if we ever intend to support floating point, as they set of exception returns grows significantly:
image

@bradjc
Copy link
Contributor Author

bradjc commented Sep 19, 2022

I would like to work on this.

@ppannuto Can you sketch out what the context handlers should be doing, i.e. expand on what

read the Interrupt Control and State Register (ICSR) and do the right thing from there

means?

@ppannuto
Copy link
Member

ppannuto commented Oct 6, 2022

Sorry, almost forgot to post this. The book I mentioned on the call is "The Definitive Guide to ARM Cortex-[ ]" there are several fill in the blanks, but the most relevant for us is probably the M3 edition rather than the M0 edition: It's almost certainly available via your university library, or, google finds a copy, uh, here: http://centaur.sch.bme.hu/~holcsik_t/sem/The%20Definitive%20Guide%20to%20the%20ARM%20Cortex-M3.pdf

Chapter 7 is the general discussion of how nested handlers and PendSV interrupts work; Chapter 12 is also probably relevant as it discusses how to implement a dual-stack (split kernel/userspace + MPU) system.

@bradjc
Copy link
Contributor Author

bradjc commented Oct 24, 2022

There has been some discussion on this on the #wg-context-switch channel on slack. I'll try to summarize a little here:

We do not have an explanation which both matches 1) our understanding of how the cortex-m hardware behaves, and 2) the symptoms we observed from the process crashes on imix. What we observed on imix is:

  1. A process fault occurs during a system call where userspace began executing again (aka the syscall returned) without the kernel actually handling the syscall. That is, the register contents remained unchanged, so the process effectively interpreted its own system call as the system call return, and this caused a process crash.
  2. There is only place in the tock kernel for cortex-m where we switch from privileged mode to user mode, and that is in the SVC exception handler. We only do that if the SVC exception handler thinks the exception was triggered while in privileged mode (i.e. the svc instruction was called by the kernel). The SVC exception handler tries to check if svc was called by the kernel by examining the link register: if LR != 0xfffffff9 then we assume this is a syscall, otherwise we do a context switch to the process.

This issue at this point is that even if LR was being set to the nested exception value (i.e. LR=0xfffffff1), that still does not equal 0xfffffff9, and the svc handler should switch to the kernel (which is what we want). It seems like to explain the observed behavior, the svc handler must be executing with LR==0xfffffff9 when we don't intend it to (i.e., somehow even though the svc instruction was called by a process, the svc handler ends up executing with LR==0xfffffff9).

The problem with that we don't know how this can happen. The SVC exception is a relatively high priority interrupt because its exception number is quite low, and the only exceptions with higher priority are fatal exceptions that will crash the system. The SVC exception is also synchronous. Everything I have come up with at least which would cause the exhibited behavior doesn't gel with these two properties of the SVC exception.

@bradjc
Copy link
Contributor Author

bradjc commented Oct 25, 2022

From #3305:

Based on the recent discussion in slack, I still don't think we have a good enough understanding of the issue in #3109 to be confident that this fix will address it

I sort of agree. I do wish this problem had happened on any other MCU, as we have reason to be suspicious of the SAM4L.

I'm not sure what the right path forward is. Some options:

  1. We mark as "don't fix" because we can't find any within-spec explanation for what is happening, and attribute it to some oddity of the SAM4L.
  2. We implement context switching from the kernel-to-userspace manually (i.e. implement the register stacking, stack change, and privilege level change ourselves), and hope that eliminates the issue that led to this bug.
  3. We find an expert on architecture-level details of the Cortex-M exception handling and see if they can describe a possible execution order that would cause the syscall to be skipped.

Number 2 seems reasonable, but we would still have the doubt that it doesn't actually fix the issue.

@hudson-ayers
Copy link
Contributor

I think the right path forward if we want to spend more time on this is to take the approach @ppannuto suggested on slack to try to get an instruction trace of the buggy execution (since it is at least highly repeatable):

Okay, so the world’s most naive PC tracing might be to try to read DWT_PCSR in a tight loop
IIUC, that should not affect CPU operation/timing, as the CPU blindly dumps info into the DWT registers, and JTAG interface reads them out

If we could get something close to a detailed instruction trace we could almost certainly figure out what is happening. I am not actually sure how to do this with the Jlink I have though -- is there a way to interact with the JTAG debugger programatically via GDB in a way that would be fast enough to actually capture all of the instructions being executed?

@alistair23
Copy link
Contributor

I did want to point out that something similar happens on the Apollo3 boards as well. It isn't just a SAM4L issue

@bradjc
Copy link
Contributor Author

bradjc commented Nov 21, 2022

I did want to point out that something similar happens on the Apollo3 boards as well. It isn't just a SAM4L issue

We don't have context switch support for floating point chips, so it makes sense that there would be faults.

@bradjc
Copy link
Contributor Author

bradjc commented Nov 21, 2022

I just want to document where I stand after working on this for about a week last month. I can't find the issue, or even a plausible explanation for the issue, so I think any change is as likely to fix whatever the bug is as it is to not fix it. I consider this to be something we just ignore at this point.

I suppose we leave the issue open in case anyone tries to get an actual instruction execution trace, which seems like the only option for debugging this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants