KERNEL_COHERENCE on xtensa doesn't quite work yet #32705
Comments
Calling out @lyakh, @lgirdwood, and @nashif, who will all want to follow this and probably have input.
@andyross FWIW, SOF does not use the default Cadence HAL and xtos, but modified versions with numerous enhancements to work better on Intel hardware. It may be worth checking the git history for any changes. The history may be missing full details in commit messages (it was squashed and came directly from an internal codebase), but the changes to the default Cadence code may still be helpful.
Well, a little progress to report. Here's the temporary rig I'm working on in a branch on top of the in-progress Xtensa pull request: https://github.com/andyross/zephyr/tree/adsp-spill-rig It's just a modified hello_world: https://github.com/andyross/zephyr/blob/adsp-spill-rig/samples/hello_world/src/main.c What this does (very carefully!) is:
The idea here is to catch the theoretical situation mentioned above where the GPR window state isn't spilled until too late, leaving dirty cache lines. Unfortunately... it works. This passes perfectly. In fact it seems like the work involved in getting into the minimal interrupt uses enough stack to flush out all 64 registers on its own, so by the time we reach the second thread it's all spilled already. While this is probably reasonably considered a performance bug (should a minimal interrupt really be using >256 bytes of register stack on top of its assembly save areas?), I'm a little afraid that what we thought was the bug can't be the only problem. Now I'm back to digging through code to see if there's a shallower way to get through the interrupt (or exception?) path and skip the spill.
OK, that rig turned out to be mostly wasted effort. The core (final?) bug was much simpler: when I added the code to invalidate the unused bottom area of the stack, I'd forgotten that the Xtensa ABI reserves 16 bytes of space underneath the stack pointer as a spill area for the caller's registers. Almost always, the caller has not been spilled (you'd have had to have taken an interrupt or just returned from a deep call tree), so this was hidden. BUT, the fix for the problem detailed above (which is real!) would have the effect of spilling those registers preemptively before arch_cohere_stacks(), which meant that they'd be invalidated immediately after they'd been spilled. To wit: they were being destroyed reliably, where before it would have been a rare-to-impossible condition. Which is great! Because it means that we have both a fix for the bug detailed above and a convincing explanation for why adding this fix earlier broke everything. Code is up in #32356 now. With that series, I can't make anything fail anywhere in the kernel tests, even spuriously.
The big PR merged this morning, but I realized that I didn't have a "Fixed" header anywhere in it. Closing this.
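For illustration, here is a minimal sketch of the range calculation that the fix implies: the region that is safe to invalidate stops 16 bytes short of the stack pointer, not at the stack pointer itself. The names `CALLER_SPILL_BYTES`, `struct inval_range`, and `unused_stack_range()` are hypothetical and are not taken from the Zephyr tree; the sketch also ignores cache-line rounding, which real code would have to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: the ABI-reserved spill area directly
 * below SP is 16 bytes, as described in the comment above. */
#define CALLER_SPILL_BYTES 16u

/* Hypothetical helper type: a half-open byte range [start, start + len). */
struct inval_range {
    uintptr_t start;
    size_t len;
};

/* Compute the part of a downward-growing stack that holds no live data.
 * stack_base is the lowest address of the stack buffer, sp the current
 * stack pointer. The 16 bytes just below sp are excluded because they
 * may hold spilled caller registers. */
static struct inval_range unused_stack_range(uintptr_t stack_base, uintptr_t sp)
{
    struct inval_range r = { stack_base, 0 };

    assert(sp >= stack_base);
    if (sp - stack_base > CALLER_SPILL_BYTES) {
        r.len = (size_t)(sp - stack_base) - CALLER_SPILL_BYTES;
    }
    return r;
}
```

Invalidating all the way up to `sp` is the bug described above: it is harmless while the caller's registers are still live in the window, and destructive as soon as they have been spilled into that reserved area.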
The KERNEL_COHERENCE framework in use on the intel_adsp_cavs15 device is designed to allow the thread and interrupt stack to be placed in (much faster) cached/incoherent memory. And the context switch code is then responsible for synchronizing the stacks properly when needed.
This... doesn't quite work yet. Right now, at the tip of my branch in the not-yet-merged #32356, a full twister run works except for three tests: lifo_usage, queue, and lib/p4wq. If I apply the tiny patch below, which moves the stacks (and only the stacks) back into uncached memory, then these tests work correctly.
Note that the exact set of failing tests will change over time because of its sensitivity to memory use patterns. But in general these failures depend only on order of context switches and not interrupt timing, so they're very deterministic and will either pass or fail every time. Which means they tend to be "fixed" by voodoo that isn't addressing the root cause, making debugging a huge pain.
Here's what I believe the last hole is, found mostly by @lyakh:
We use an innovative[1] "cross stack call" trick on interrupt entry that allows us to avoid spilling the Xtensa register windows from the interrupted user code. We then spill the registers only when needed, if the interrupt exit tells us we're switching to a different thread. But this spill happens after the code in arch_cohere_stacks(), so any such registers written out land only in the local CPU's L1. So another CPU that runs that thread may see incorrect/unspilled register contents on resume, and even if it's tolerant of this it may see its live stack contents clobbered at runtime when the original CPU decides to flush its dirty lines.
Unfortunately, absent a hardware debugger that can investigate cache state, this is taking a while to work through. The "obvious" fix would be to spill the register window contents earlier, but that either doesn't fix the bug or has the effect of just moving the failures around (cf. the voodoo problem from above).
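The ordering problem can be shown with a toy model. This is only a sketch under loud assumptions: a word-granular write-back cache per CPU over a shared array, with hypothetical names (`toy_cache`, `toy_flush()`, `switch_out_and_resume()`) that have nothing to do with the real Xtensa cache operations or the Zephyr code. It models a register spill as a plain cached store and the coherence step as a flush of all dirty words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WORDS 8

/* Shared backing memory, standing in for SRAM. */
static uint32_t sram[WORDS];

/* Toy private write-back cache: one entry per word, no real line geometry. */
struct toy_cache {
    uint32_t data[WORDS];
    bool valid[WORDS];
    bool dirty[WORDS];
};

/* A store lands in the local cache only and marks the word dirty. */
static void toy_write(struct toy_cache *c, int i, uint32_t v)
{
    c->data[i] = v;
    c->valid[i] = true;
    c->dirty[i] = true;
}

/* A load hits the cache if valid, otherwise fills from shared memory. */
static uint32_t toy_read(struct toy_cache *c, int i)
{
    if (!c->valid[i]) {
        c->data[i] = sram[i];
        c->valid[i] = true;
        c->dirty[i] = false;
    }
    return c->data[i];
}

/* Write every dirty word back to shared memory. */
static void toy_flush(struct toy_cache *c)
{
    for (int i = 0; i < WORDS; i++) {
        if (c->valid[i] && c->dirty[i]) {
            sram[i] = c->data[i];
            c->dirty[i] = false;
        }
    }
}

/* Drop all cached contents without writing anything back. */
static void toy_invalidate(struct toy_cache *c)
{
    memset(c->valid, 0, sizeof(c->valid));
    memset(c->dirty, 0, sizeof(c->dirty));
}

/* CPU0 switches a thread out, CPU1 resumes it. Returns the value CPU1
 * reads from the stack slot where CPU0 spilled a register. */
static uint32_t switch_out_and_resume(bool spill_before_flush)
{
    struct toy_cache cpu0, cpu1;
    const int slot = 3;
    const uint32_t reg = 0xA5A5A5A5u;

    memset(&cpu0, 0, sizeof(cpu0));
    memset(&cpu1, 0, sizeof(cpu1));
    memset(sram, 0, sizeof(sram));

    if (spill_before_flush) {
        toy_write(&cpu0, slot, reg);   /* spill, then cohere */
        toy_flush(&cpu0);
    } else {
        toy_flush(&cpu0);              /* cohere, then spill: too late */
        toy_write(&cpu0, slot, reg);
    }

    toy_invalidate(&cpu1);
    return toy_read(&cpu1, slot);
}
```

In this model, `switch_out_and_resume(true)` returns the spilled value, while `switch_out_and_resume(false)` returns the stale zero from shared memory because the spill is still sitting dirty in CPU0's cache. The second hazard from the paragraph above (CPU0 later writing that dirty word back over live data) follows from the same state.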
Right now I'm working up a minimally invasive test rig that deliberately arranges to take an interrupt and context switch with a full window state and known register contents. Ideally I can get to a "smoking gun" kind of bug and then narrow the focus.
(Thankfully one trick we do have available is that the SoC gives us two mappings for the memory, so we can see the underlying SRAM state independently of the cached results.)
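A rough sketch of how such a dual mapping can be used for debugging, assuming the two views differ by a single address bit. `TOY_CACHED_ALIAS_BIT` is a placeholder chosen for illustration, not a value taken from the SoC documentation, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder assumption: the cached view of an address has this bit
 * set and the uncached view has it clear. Check the SoC memory map
 * before relying on any particular value. */
#define TOY_CACHED_ALIAS_BIT ((uintptr_t)1 << 29)

/* Translate an address to its uncached alias. */
static uintptr_t toy_uncached_addr(uintptr_t addr)
{
    return addr & ~TOY_CACHED_ALIAS_BIT;
}

/* Translate an address to its cached alias. */
static uintptr_t toy_cached_addr(uintptr_t addr)
{
    return addr | TOY_CACHED_ALIAS_BIT;
}

/* True if the address is in the cached view under this assumption. */
static bool toy_is_cached(uintptr_t addr)
{
    return (addr & TOY_CACHED_ALIAS_BIT) != 0;
}
```

On target, reading a stack word through the uncached alias and comparing it with the same word read through the cached alias would show whether a value has actually reached SRAM or only exists in the local L1.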
[1] No, really, I'm very proud of that gadget.
Hack to disable cached stacks: