Fix assembly ordering in context switching code #1185
I'm still trying to trace through exactly why this may have caused the faults I (we) were seeing. It seems related, but it could also have changed the code layout enough to just evade the bug in the particular scenario in which it was previously showing up. So stay tuned...
I can test tonight. Mostly worried about false negatives...
FWIW your explanation and change are consistent with my mental model of what's going on (much more so than issues with Cell types and stuff). I'll think it through some more.
This fixed an application that was crashing consistently on my end.
Temporary explanation:
To reiterate, the context switch code in
The effect is that when a system call does not immediately follow another system call, the kernel thinks the context switch was caused by something else (a timeslice expiring, or an interrupt) and does not execute the system call.
In the normal case, when a process isn't interrupted by hardware events and doesn't exhaust its timeslice, the cause of the switch will always read as a system call, since the previous cause of a switch will always have been a system call as well. The exception is the very first system call (
However, if we start adding
In my case, system calls were consistently interspersed with hardware events since I was doing a
I was seeing app faults happening in the second call to
Note that, originally, the hardfault errors were meaningless for two reasons that I needed to fix:
So if you got error messages that are not 100% consistent with this explanation, but seem close, the delta could be that the error messages were actually only happening after the next system call, so the stack trace was wrong.
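The stale-reason effect described above can be modelled on a host machine. The sketch below is a simplified model, not the kernel's code: `SwitchReason`, `Flags`, `Event` and all function names are hypothetical, and "running the process" is reduced to setting a flag the way an exception handler would. It contrasts classifying the switch after the process has run with the reordered version, where the classification happens first.

```rust
/// Why the kernel regained control from a process (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SwitchReason {
    SyscallFired,
    TimesliceExpired,
    Interrupted,
}

/// Flags that the exception handlers would set while the process runs.
#[derive(Debug, Default)]
pub struct Flags {
    pub syscall_fired: bool,
    pub systick_expired: bool,
}

/// What ends the process's turn in this model.
#[derive(Debug, Clone, Copy)]
pub enum Event {
    Svc,
    Systick,
    Irq,
}

/// Stand-in for switching to the process and coming back.
pub fn run_process(flags: &mut Flags, event: Event) {
    match event {
        Event::Svc => flags.syscall_fired = true,
        Event::Systick => flags.systick_expired = true,
        Event::Irq => {} // a plain interrupt sets neither flag
    }
}

/// Read the flags, pick a reason, and clear the flags for the next switch.
pub fn classify_and_clear(flags: &mut Flags) -> SwitchReason {
    let reason = if flags.syscall_fired {
        SwitchReason::SyscallFired
    } else if flags.systick_expired {
        SwitchReason::TimesliceExpired
    } else {
        SwitchReason::Interrupted
    };
    *flags = Flags::default();
    reason
}

/// Intended order: run the process, then ask why it stopped.
pub fn switch_in_order(flags: &mut Flags, event: Event) -> SwitchReason {
    run_process(flags, event);
    classify_and_clear(flags)
}

/// Reordered: the classification is hoisted above the switch, so it
/// reports whatever the previous turn left behind.
pub fn switch_reordered(flags: &mut Flags, event: Event) -> SwitchReason {
    let reason = classify_and_clear(flags);
    run_process(flags, event);
    reason
}
```

Feeding both functions the sequence interrupt, system call, system call shows the difference: the in-order version reports the first system call as `SyscallFired`, while the reordered version reports it as `Interrupted` and only reports a system call on the following switch.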
Wow. This is amazing. Great job Amit (and others who debugged along the way)! You mentioned that you had to turn off pipelining in order to get valid fault responses. How useful was this in tracking down the problem? Should we make disabling pipelining a reusable primitive, maybe activated for debug builds?
Probably? It's literally just flipping bit 2 in the Cortex-M4's ACTLR register:
::core::ptr::write_volatile(0xE000E008 as *mut u32, 0x2);
Might be worth having as, e.g., a helper in the cortexm4 crate so someone can easily add a call at the top of the board initialization to turn it off when debugging.
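A helper along those lines might look like the sketch below. The function and constant names are hypothetical and not part of any existing crate. The value 0x2 written in the thread corresponds to bit 1 when counting from zero, which ARM's Cortex-M4 documentation calls DISDEFWBUF (disable the default write buffer, which makes bus faults precise). Unlike the one-line write, this sketch does a read-modify-write so other ACTLR bits are preserved, and it takes the register address as a parameter so the logic can be exercised on a host.

```rust
use core::ptr;

/// Address of the Auxiliary Control Register (ACTLR) on Cortex-M3/M4.
pub const ACTLR_ADDR: usize = 0xE000_E008;

/// ACTLR bit 1 (mask 0x2): DISDEFWBUF.
pub const ACTLR_DISDEFWBUF: u32 = 1 << 1;

/// Hypothetical helper: set DISDEFWBUF in the register at `actlr`,
/// leaving all other bits unchanged.
///
/// # Safety
/// `actlr` must be valid for volatile reads and writes.
pub unsafe fn disable_write_buffer_at(actlr: *mut u32) {
    unsafe {
        let current = ptr::read_volatile(actlr);
        ptr::write_volatile(actlr, current | ACTLR_DISDEFWBUF);
    }
}

/// Hypothetical on-target wrapper, intended to be called early in board
/// initialization when debugging.
///
/// # Safety
/// Must only be called on a Cortex-M3/M4 core in privileged mode.
pub unsafe fn disable_write_buffer() {
    unsafe { disable_write_buffer_at(ACTLR_ADDR as *mut u32) }
}
```

Only `disable_write_buffer_at` is safe to try off-target, for example against a local `u32`; `disable_write_buffer` touches the fixed hardware address and is meaningful only on the microcontroller.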
This fixed my bug and a similar bug that Holly ran into. Also, the description and changes make sense to me. I recognize that we may not want to immediately merge this into master so others can review it, but the sooner the better imo as this is one of the most annoying bugs imaginable for Tock.
    bfarvalid,
    bfar
);
kernel_hardfault(faulting_stack);
brghena
Oct 12, 2018
Contributor
Was splitting this up into two functions important? Or is this just an artifact of debugging?
hudson-ayers
Oct 12, 2018
Collaborator
"The fix is to mark them volatile as well as move the non-assembly code in hardfault_handler to a separate function so we can make hardfault_handler naked like the other ISRs." I think the non-assembly needs the prologue/epilogue functions but the raw assembly is just fine without them?
brghena
Oct 12, 2018
Contributor
Hmmm. It's not clear to me what prologue/epilogue the rest of the hardfault handler would need. Registers should already be stacked before entry into the handler.
alevy
Oct 12, 2018
Author
Member
It's not directly related to the bug, because the issue with the hardfault handler only matters once we actually have a hard fault of some sort. But...
Without separating them and marking the hardfault handler naked, the prologue was pushing LR onto the stack and the epilogue was popping directly into PC.
So, when an app faulted, our changes to the LR register to return to the kernel instead of back to the process had no effect. Hardcoding a BX LR in the assembly isn't enough, because at that point the kernel stack has changed (the procedure pushes about 52 bytes of actual values onto the stack), so the hardware unstacking is incorrect.
Separating it out in this way allows us to mark the ISR as naked, which ensures the compiler doesn't do any special handling of the LR register, while all of the variable stacking/unstacking happens in the helper function, which is marked inline(never).
ppannuto
Oct 12, 2018
Member
This all makes sense and sounds good, but I believe it means that #1176 is still a problem that needs addressing (independent of this PR)
alevy
Oct 12, 2018
Author
Member
Correct, it does not fix #1176 (it may isolate it somewhat, but is basically unrelated).
brghena
Oct 12, 2018
Contributor
I'm still missing something. Why can't you mark the function as naked but still keep the helper function code in the function as well?
alevy
Oct 13, 2018
Author
Member
A naked function doesn't, e.g., stack the callee-saved registers (r4-r11), but the helper function code might (in this case definitely does) overwrite those registers. You could explicitly stack them in assembly with a custom prologue, but then you have to unstack them, and at the end of your naked function you don't necessarily have a guarantee about what exactly the helper code has done with the stack.
So maybe it's possible, but it's certainly simpler to reason about if the actual Rust code moves to a separate helper function.
brghena
Oct 13, 2018
Contributor
Oh! What I had missed here was that the NVIC doesn't stack r4-r11 for you, so you need to do it yourself.
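For reference, the registers that a Cortex-M core does stack on exception entry are r0-r3, r12, LR, PC and xPSR, in that order from the lowest address upward. The sketch below shows how a fault helper could view that frame, assuming the basic eight-word frame without floating-point state. `ExceptionFrame` and `read_from` are hypothetical names, not the kernel's types.

```rust
use core::ptr;

/// The eight words pushed by hardware on exception entry, lowest address
/// first. r4-r11 are not part of this frame.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ExceptionFrame {
    pub r0: u32,
    pub r1: u32,
    pub r2: u32,
    pub r3: u32,
    pub r12: u32,
    pub lr: u32,
    pub pc: u32,
    pub xpsr: u32,
}

impl ExceptionFrame {
    /// Copy the stacked frame starting at `stack`.
    ///
    /// # Safety
    /// `stack` must point to at least eight readable, 4-byte-aligned words.
    pub unsafe fn read_from(stack: *const u32) -> ExceptionFrame {
        unsafe { ptr::read_volatile(stack as *const ExceptionFrame) }
    }
}
```

Anything outside these eight words, including r4-r11, has to be saved and restored by software if the handler's code is going to use it.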
Haven't tested (sounds like others have been able to better than me). Read the code carefully and the explanation. LGTM. FWIW, I am absolutely horrified that anyone would ever think it's OK to optimize inline assembly and that you need to mark it volatile.
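As a point of comparison, the `asm!` macro in current stable Rust postdates this PR and has no volatile marker. To the best of my understanding, a block is assumed to have side effects and to access memory unless it is given options such as `pure`, `nomem` or `readonly`. The sketch below uses a `nop` as a placeholder for the real context-switch assembly, and the function name is hypothetical. It only illustrates the shape of the code (volatile write, assembly, volatile read) and does not prove anything about ordering on its own.

```rust
use core::arch::asm;
use core::ptr;

/// Hypothetical stand-in for a switch routine: clear a flag, "switch",
/// then read the flag back.
///
/// # Safety
/// `flag` must be valid for volatile reads and writes.
pub unsafe fn clear_flag_then_switch(flag: *mut u32) -> u32 {
    unsafe {
        // Volatile write that is meant to happen before the switch.
        ptr::write_volatile(flag, 0);
        // Placeholder for the context-switch assembly. No `pure`, `nomem`
        // or `readonly` option is given, so the block is treated as
        // having side effects and touching memory.
        asm!("nop");
        // Volatile read that is meant to happen after the switch.
        ptr::read_volatile(flag)
    }
}
```

On the target, an exception handler could set the flag while the process runs; here nothing does, so the value read back is simply the cleared flag.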
Awesome debugging job, @alevy!
Amazing catch Amit!
@alevy Amazing work! However, did anyone try this out with optimization-level
LGTM, but can you fix the commit message?
Fixes a bug in the context switching code that appears under certain circumstances where the compiler reorders volatile writes before inline assembly (which was not marked volatile). The fix is to mark the inline assembly volatile as well, and to move the non-assembly code in hardfault_handler to a separate function so we can make hardfault_handler naked like the other ISRs.
The bug can lead to deferred execution of system calls or deferred handling of faults, which seems to interact in such a way that app code sometimes faults.
At a high level, the code that determines why a process was interrupted (e.g. for a system call, timeslice, or fault) was moved to before the context switch to the app, so the kernel always thought the switch was due to whatever the previous reason was.
This change also adds a new context switch condition, Interrupted, to distinguish cases where the context switch was due to an ISR (it is used as the default case). I believe this is unrelated to the fix, but it is meaningful and was helpful when I was debugging.
fcd4755 to 88a5fa4
Having it as a helper in the m4 crate would be fine.
Phil
On Oct 11, 2018, at 5:51 PM, Amit Levy wrote:
Should make disabling pipelining a reusable primitive, maybe activated for debug builds?
Probably? It's literally just flipping bit 2 in the Cortex-M4's ACTLR register:
::core::ptr::write_volatile(0xE000E008 as *mut u32, 0x2);
Might be worth having as, e.g., a helper in the cortexm4 crate so someone can easily just add a call to the top of the board initialization to turn it off when debugging.
FYI, @phil-levis @niklasad1 @hudson-ayers and @dverhaert, we need your re-approval, since my cleanup dismissed your previous reviews.
bors r+
1185: Fix assembly ordering in context switching code r=alevy a=alevy
### Pull Request Overview
Closes #1160
This pull request fixes a bug in the context switching code that appears under certain circumstances where the compiler reorders volatile writes before inline assembly (which was _not_ marked volatile). The fix is to mark the inline assembly volatile as well, and to move the non-assembly code in hardfault_handler to a separate function so we can make `hardfault_handler` `naked` like the other ISRs.
The bug can lead to deferred execution of system calls or deferred handling of faults, which _seems_ to interact in such a way that app code sometimes faults.
At a high level, the code that determines why a process was interrupted (e.g. for a system call, timeslice, or fault) was moved to _before_ the context switch to the app, so the kernel always thought the switch was due to whatever the _previous_ reason was.
This change also adds a new context switch condition, `Interrupted`, to distinguish cases where the context switch was due to an ISR (it is used as the default case). I believe this is unrelated to the fix, but it is meaningful and was helpful when I was debugging.
### Testing Strategy
@hudson-ayers, @lthiery, and @hchiang tested that this fixes the bugs we were seeing. I also looked carefully at the resulting assembly.
### TODO or Help Wanted
- [x] Replicate fixes for other architectures (Cortex-M0 & Cortex-M3)
### Documentation Updated
- [x] ~~Updated the relevant files in `/docs`, or no updates are required.~~
### Formatting
- [x] Ran `make formatall`.
Co-authored-by: Amit Aryeh Levy <amit@amitlevy.com>
Build succeeded
WOOOOOHOOOOO!
alevy commented Oct 11, 2018 (edited)
Pull Request Overview
Closes #1160
This pull request fixes a bug in the context switching code that appears under certain circumstances where the compiler reorders volatile writes before inline assembly (which was not marked volatile). The fix is to mark the inline assembly volatile as well, and to move the non-assembly code in hardfault_handler to a separate function so we can make hardfault_handler naked like the other ISRs.
The bug can lead to deferred execution of system calls or deferred handling of faults, which seems to interact in such a way that app code sometimes faults.
At a high level, the code that determines why a process was interrupted (e.g. for a system call, timeslice, or fault) was moved to before the context switch to the app, so the kernel always thought the switch was due to whatever the previous reason was.
This change also adds a new context switch condition, Interrupted, to distinguish cases where the context switch was due to an ISR (it is used as the default case). I believe this is unrelated to the fix, but it is meaningful and was helpful when I was debugging.
Testing Strategy
@hudson-ayers, @lthiery, and @hchiang tested that this fixes the bugs we were seeing. I also looked carefully at the resulting assembly.
TODO or Help Wanted
Replicate fixes for other architectures (Cortex-M0 & Cortex-M3)
Documentation Updated
Updated the relevant files in /docs, or no updates are required.
Formatting
Ran make formatall.