
Fix assembly ordering in context switching code #1185

Merged
merged 1 commit into tock:master from alevy:bug/context_switch_assembly on Oct 13, 2018

Conversation

alevy (Member) commented Oct 11, 2018

Pull Request Overview

Closes #1160

This pull request fixes a bug in the context switching code that appears under certain circumstances where the compiler reorders volatile writes before inline assembly (which was not marked volatile). The fix is to mark them volatile as well as move the non-assembly code in hardfault_handler to a separate function so we can make hardfault_handler naked like the other ISRs.

This can defer execution of system calls or the handling of faults, which seems to interact in such a way that app code sometimes faults.

At a high level, the code that determines why a process was interrupted (e.g. for a system call, timeslice expiration, or fault) was being moved to before the context switch into the app, so the kernel always thought the switch was due to whatever the previous reason was.

This change also adds a new context switch condition, Interrupted, to distinguish cases where the context switch was due to an ISR (and it is used as the default case). I believe this is unrelated to the fix, but is meaningful and was helpful when I was debugging.

Testing Strategy

@hudson-ayers, @lthiery, and @hchiang tested that this fixes the bugs we were seeing. I also looked carefully at the resulting assembly.

TODO or Help Wanted

  • Replicate fixes for other architectures (Cortex-M0 & Cortex-M3)

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make formatall.

alevy (Member, Author) commented Oct 11, 2018

I'm still trying to trace through exactly why this may have caused the faults I (/we) were seeing. It seems related, but it could also have changed the code layout enough to just evade the bug in the particular scenario in which it was previously showing up. So stay tuned...

phil-levis (Collaborator) commented Oct 11, 2018

I can test tonight. Mostly worried about false negatives...

phil-levis (Collaborator) commented Oct 11, 2018

FWIW, your explanation and change are consistent with my mental model of what's going on (much more so than issues with Cell types and stuff). I'll think it through some more.

lthiery (Contributor) commented Oct 12, 2018

This fixed an application that was crashing consistently on my end. 👍
Just in time, too: I only spent a few hours probing this. Thanks @alevy

alevy (Member, Author) commented Oct 12, 2018

Temporary explanation:

To re-iterate, the context switch code in cortexm::syscall::switch_to_process is getting compiled such that the reason for the context switch is checked before actually context switching. The reason is determined by checking global variables that are set in one of hardfault_handler, svc_handler, or systick_handler. The compiler is moving the reads of those globals before the context switch, even though the call to switch_to_user comes first.
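
To make this concrete, here is a minimal sketch of the pattern being described (not the actual Tock source: it collapses switch_to_process and switch_to_user into one function, uses a SYSCALL_FIRED-style flag to stand in for those globals, and assumes the pre-2021 asm! syntax on a 2018-era nightly with #![feature(asm)]):

// Minimal sketch, not the actual Tock code; names are illustrative.
pub static mut SYSCALL_FIRED: usize = 0;

pub unsafe fn switch_to_process(user_stack: *const usize) -> *mut usize {
    let new_stack: *mut usize;
    // Trap into the process. Without the "volatile" option below, the
    // optimizer may treat this asm block as free of side effects and hoist
    // the read of SYSCALL_FIRED above it, i.e. observe the reason for the
    // previous context switch instead of this one.
    asm!("svc 0xff"
         : "={r0}"(new_stack)
         : "{r0}"(user_stack)
         : "memory"
         : "volatile");
    // Only after the process has run and trapped back into the kernel is
    // this read meaningful.
    if ::core::ptr::read_volatile(&SYSCALL_FIRED) != 0 {
        // ... dispatch the system call ...
    }
    new_stack
}

With the "volatile" option (and with the handlers' writes and these reads done as volatile accesses), the compiler has to keep the check after the trap, which is the ordering fix this PR makes.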

The effect is that when there is a system call that doesn't immediately follow another system call, the kernel thinks the context switch cause was something else (a timeslice expiring, or an interrupt) and does not execute the system call.

In the normal case, when a process isn't interrupted by hardware events and it doesn't exhaust its timeslice, the cause of the switch will always read as a system call, since the previous cause of a switch will always have been a system call as well.

The exception is the very first system call (brk(stack + heap) in the _start function). This explains the note @bradjc left when writing the new context switch code: if we panic when we don't have an explanation for the context switch, the first app faults immediately. But it turns out this first system call doesn't actually matter yet, since we default to exposing all non-grant memory to a process anyway (this is also sort of a bug, but @dverhaert and @cmcavity are fixing it in their new MPU code).

However, if we start adding debug! statements (or indeed do any I/O that might interrupt a process), everything falls to crap. Once a hardware event interrupts a process, the next system call will be seen by the kernel as caused by a hardware event, not a system call. It won't execute the system call, but it'll still eventually context switch back to the process, with an undefined return value (basically whatever was in the process's r0 already).

In my case, system calls were consistently interspersed with hardware events since I was doing a debug! on each context switch (which results in a hardware event shortly after, when the UART transfer completes).

I was seeing app faults happening in the second call to sbrk in malloc, called from printf. The first call to sbrk in newlib's malloc is sbrk(0) (increment the break by 0, so basically return the current break), which it uses for accounting, so it doesn't really matter. The second one, which was fairly consistently interrupted by hardware, was effectively not running, and the return value from sbrk was 1 (which is just the unchanged value of r0 that sbrk passes to memop). Of course, 0x1 is not a valid memory location, and I was getting an error writing to 0x4 (the extra three bytes are added by malloc and I'm guessing they are used for accounting).

Note that, originally, the hardfault errors were meaningless for two reasons that I needed to fix:

  1. They were imprecise data faults, meaning the error occurred some number of instructions before the hardfault ISR was invoked. This is fixed by turning off pipelining in the Cortex-M4 core (we don't want to do this generally, but it was useful for debugging).

  2. The hardfault handler had basically the same issue. It would run, return to the kernel, but the kernel wouldn't realize the app had faulted (this also explains why I was seeing cases where the app marked as faulting clearly was not the one that actually faulted).

So if you got error messages that are not 100% consistent with this explanation, but seem close, the delta could be that the error messages were actually only happening after the next system call, so the stack trace was wrong.

brghena (Contributor) commented Oct 12, 2018

Wow. This is amazing. Great job Amit (and others who debugged along the way)!

You mentioned that you had to turn off pipelining in order to get valid fault responses. How useful was this in tracking down the problem? Should we make disabling pipelining a reusable primitive, maybe activated for debug builds?

alevy (Member, Author) commented Oct 12, 2018

Should we make disabling pipelining a reusable primitive, maybe activated for debug builds?

Probably? It's literally just flipping bit 2 in the Cortex-M4's ACTLR register:

::core::ptr::write_volatile(0xE000E008 as *mut u32, 0x2);

Might be worth having as, e.g., a helper in the cortexm4 crate so someone can easily just add a call to the top of the board initialization to turn it off when debugging.
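
A possible shape for such a helper, as a sketch (the function name and exact placement are hypothetical); it just wraps the write above, where 0xE000E008 is the Cortex-M4 ACTLR register and the bit being set (DISDEFWBUF) disables the write buffer so that bus faults become precise:

/// Hypothetical debugging helper; call once at the top of board init.
/// Overwrites ACTLR with DISDEFWBUF set (matching the one-liner above,
/// rather than doing a read-modify-write), trading performance for
/// precise bus faults with meaningful fault addresses.
pub unsafe fn disable_write_buffer() {
    ::core::ptr::write_volatile(0xE000_E008 as *mut u32, 0x2);
}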

hudson-ayers (Collaborator) commented Oct 12, 2018

This fixed my bug and a similar bug that Holly ran into. Also, the description and changes make sense to me. I recognize that we may not want to immediately merge this into master so others can review it, but the sooner the better imo as this is one of the most annoying bugs imaginable for Tock.

bfarvalid,
bfar
);
kernel_hardfault(faulting_stack);

brghena (Contributor) commented Oct 12, 2018

Was splitting this up into two functions important? Or is this just an artifact of debugging?

hudson-ayers (Collaborator) commented Oct 12, 2018

"The fix is to mark them volatile as well as move the non-assembly code in hardfault_handler to a separate function so we can make hardfault_handler naked like the other ISRs." I think the non-assembly needs the prologue/epilogue functions but the raw assembly is just fine without them?

brghena (Contributor) commented Oct 12, 2018

Hmmm. It's not clear to me what prologue/epilogue the rest of the hardfault handler would need. Registers should already be stacked before entry into the handler.

alevy (Member, Author) commented Oct 12, 2018

It's not directly related to the bug, because the issue with the hardfault handler only matters once we actually have a hard fault of some sort. But...

Without separating them (and thus being able to mark the hardfault handler naked), the prologue was pushing LR onto the stack, and the epilogue was popping it directly into PC.

So, when an app faulted, our changes to the LR register to return to the kernel instead of back to the process had no effect. Hardcoding a BX LR in the assembly isn't enough because at that point the kernel stack has changed (since the procedure does push about 52 bytes of actual values on the stack), and the hardware unstacking is incorrect.

Separating it out in this way allows us to mark the ISR as naked, ensuring the compiler doesn't do any special handling of the LR register, while all of the variable stacking/unstacking happens in the helper function, which is marked inline(never).
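
Roughly, the shape is something like this sketch (simplified and hypothetical, not the exact Tock source: it ignores the kernel-vs-process distinction the real handler makes when picking msp or psp, assumes a signature for kernel_hardfault, uses the pre-2021 asm! syntax, and needs #![feature(asm, naked_functions)]):

// Simplified sketch of the naked-ISR + helper split; not the actual handler.
#[naked]
#[no_mangle]
pub unsafe extern "C" fn hardfault_handler() {
    let faulting_stack: *mut u32;
    // Naked: no compiler-generated prologue/epilogue, so LR (the EXC_RETURN
    // value) and the stack are exactly as the hardware left them.
    asm!("mrs $0, msp" : "=r"(faulting_stack) : : : "volatile");
    kernel_hardfault(faulting_stack);
}

#[inline(never)]
unsafe fn kernel_hardfault(faulting_stack: *mut u32) -> ! {
    // Ordinary Rust lives here; because this is a regular, never-inlined
    // function, any register and stack bookkeeping the compiler generates
    // stays out of the naked ISR above.
    panic!("hard fault, faulting stack at {:p}", faulting_stack);
}

Because hardfault_handler is naked, nothing pushes LR or pops it into PC around this body, so modifying LR to return to the kernel when an app faults actually takes effect; and because kernel_hardfault is marked inline(never), whatever stacking it needs happens inside the helper, not in the ISR.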

ppannuto (Member) commented Oct 12, 2018

This all makes sense and sounds good, but I believe it means that #1176 is still a problem that needs addressing (independent of this PR)

alevy (Member, Author) commented Oct 12, 2018

Correct, it does not fix #1176 (it may somewhat isolate it, but is basically unrelated).

brghena (Contributor) commented Oct 12, 2018

I'm still missing something. Why can't you mark the function as naked but still keep the helper function code in the function as well?

alevy (Member, Author) commented Oct 13, 2018

A naked function doesn't, e.g., stack callee-saved registers (r4-r11), but the helper function code might (in this case definitely does) overwrite those registers. You could explicitly stack them in assembly with a custom prelude, but then you have to unstack them, and you don't necessarily have a guarantee at the end of your naked function what exactly the helper code has done with the stack.

So maybe it's possible, but it's certainly simpler to reason about by moving the actual Rust code to a separate helper function.

brghena (Contributor) commented Oct 13, 2018

Oh! What I had missed here was that the NVIC doesn't stack r4-r11 for you, so you need to do it yourself.

phil-levis (Collaborator) commented Oct 12, 2018

Haven't tested (sounds like others have been able to test better than me). Read the code and the explanation carefully. LGTM.

FWIW, I am absolutely horrified that anyone would ever think it's OK to optimize inline assembly and that you need to mark it volatile.

phil-levis (Collaborator) commented Oct 12, 2018

Awesome debugging job, @alevy ! 💥 🔛 🍰

lthiery added a commit to helium/tock that referenced this pull request Oct 12, 2018

dverhaert (Contributor) left a comment

Amazing catch Amit!

niklasad1 (Member) commented Oct 12, 2018

@alevy 🥇 🥇

Amazing work! However, did anyone try this out with optimization level 0 or 1 instead of z?

niklasad1 (Member) left a comment

LGTM, but can you fix the commit message?

Fixes a bug in the context switching code that appears under certain
circumstances where the compiler reorders volatile writes before inline
assembly (which was not marked volatile). The fix is to mark them
volatile as well as move the non-assembly code in hardfault_handler to a
separate function so we can make hardfault_handler naked like the other
ISRs.

This can defer execution of system calls or the handling of faults,
which seems to interact in such a way that app code sometimes faults.

At a high level, the code that determines why a process was interrupted
(e.g. for a system call, timeslice expiration, or fault) was being moved
to before the context switch into the app, so the kernel always thought
the switch was due to whatever the previous reason was.

This change also adds a new context switch condition, Interrupted, to
distinguish cases where the context switch was due to an ISR (and it is
used as the default case). I believe this is unrelated to the fix, but
is meaningful and was helpful when I was debugging.
alevy dismissed stale reviews from niklasad1, dverhaert, and hudson-ayers via 88a5fa4 Oct 12, 2018
alevy force-pushed the alevy:bug/context_switch_assembly branch from fcd4755 to 88a5fa4 Oct 12, 2018

alevy (Member, Author) commented Oct 12, 2018

FYI, @phil-levis @niklasad1 @hudson-ayers and @dverhaert, we need your re-approval, since my cleanup dismissed your previous reviews.

alevy (Member, Author) commented Oct 13, 2018

bors r+

bors bot added a commit that referenced this pull request Oct 13, 2018
Merge #1185
1185: Fix assembly ordering in context switching code r=alevy a=alevy

Co-authored-by: Amit Aryeh Levy <amit@amitlevy.com>

bors bot merged commit 88a5fa4 into tock:master Oct 13, 2018
3 checks passed:
  • bors: Build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • deploy/netlify: Deploy preview ready!

phil-levis (Collaborator) commented Oct 13, 2018

WOOOOOHOOOOO!
