Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KERNEL_COHERENCE on xtensa doesn't quite work yet #32705

Closed
andyross opened this issue Feb 26, 2021 · 5 comments
Closed

KERNEL_COHERENCE on xtensa doesn't quite work yet #32705

andyross opened this issue Feb 26, 2021 · 5 comments
Assignees
Labels
area: Xtensa Xtensa Architecture bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug

Comments

@andyross
Copy link
Contributor

The KERNEL_COHERENCE framework in use on the intel_adsp_cavs15 device is designed to allow the thread and interrupt stack to be placed in (much faster) cached/incoherent memory. And the context switch code is then responsible for synchronizing the stacks properly when needed.

This... doesn't quite work yet. Right now, at the tip of my branch in the not-yet-merged #32356 a full twister run works, except three tests: lifo_usage, queue, and lib/p4wq. If I apply the tiny patch below, which moves the stacks (and only the stacks) back into uncached memory, then these tests work correctly.

Note that the exact set of failing tests will change over time because of its sensitivity to memory use patterns. But in general these failures depend only on order of context switches and not interrupt timing, so they're very deterministic and will either pass or fail every time. Which means they tend to be "fixed" by voodoo that isn't addressing the root cause, making debugging a huge pain.

Here's what I believe the last hole is, found mostly by @lyakh:

We use an innovative[1] "cross stack call" trick on interrupt entry that allows us to avoid spilling the Xtensa register windows from the interrupted user code. We will then spill the registers only when needed if the interrupt exit tells us we're switching to a different thread. But this spill happens after the code in arch_cohere_stacks(), so any such registers written out land only in the local CPU's L1. So another CPU that runs that thread may see incorrect/unspilled regiser contents on resume, and even if it's tolerant of this it may see its live stack contents clobbered at runtime when the original CPU decided to flush its dirty lines.

Unfortunately, absent a hardware debugger that can investigate cache state this is taking a while to work through. The "obvious" fix would be to spill the register window contents earlier, but that either doesn't fix the bug or has the effect of just moving the failures around (c.f. the voodoo problem from above).

Right now I'm working up a minimally invasive test rig that deliberately arranges to take an interrupt and context switch with a full window state and known register contents. Ideally I can get to a "smoking gun" kind of bug and then narrow the focus.

(Thankfully one trick we do have available is that the SoC gives us two mappings for the memory, so we can see the underlying SRAM state independently of the cached results.)

[1] No, really, I'm very proud of that gadget.

Hack to disable cached stacks:

diff --git a/arch/xtensa/include/kernel_arch_func.h b/arch/xtensa/include/kernel_arch_func.h
index 8f45391cb7..7dbc25bfcd 100644
--- a/arch/xtensa/include/kernel_arch_func.h
+++ b/arch/xtensa/include/kernel_arch_func.h
@@ -64,9 +64,10 @@ static inline void arch_switch(void *switch_to, void **switched_from)
 #if defined(CONFIG_SOC_FAMILY_INTEL_ADSP) && defined(CONFIG_KERNEL_COHERENCE)
 static inline bool arch_mem_coherent(void *ptr)
 {
-	size_t addr = (size_t) ptr;
+	return true;
+	//size_t addr = (size_t) ptr;
 
-	return addr >= 0x80000000 && addr < 0xa0000000;
+	//return addr >= 0x80000000 && addr < 0xa0000000;
 }
 #endif
 
diff --git a/include/linker/section_tags.h b/include/linker/section_tags.h
index 86e383e7a0..aac9338f68 100644
--- a/include/linker/section_tags.h
+++ b/include/linker/section_tags.h
@@ -43,7 +43,8 @@
 #endif /* CONFIG_NOCACHE_MEMORY */
 
 #if defined(CONFIG_KERNEL_COHERENCE)
-#define __incoherent __in_section_unique(cached)
+//#define __incoherent __in_section_unique(cached)
+#define __incoherent
 #define __stackmem __incoherent
 #define __kstackmem __stackmem
 #else
diff --git a/kernel/thread.c b/kernel/thread.c
index 989f17f134..6a14738b1a 100644
--- a/kernel/thread.c
+++ b/kernel/thread.c
@@ -563,7 +563,7 @@ char *z_setup_new_thread(struct k_thread *new_thread,
 	 * still cached!
 	 */
 	__ASSERT_NO_MSG(arch_mem_coherent(new_thread));
-	__ASSERT_NO_MSG(!arch_mem_coherent(stack));
+	//__ASSERT_NO_MSG(!arch_mem_coherent(stack));
 #endif
 
 	arch_new_thread(new_thread, stack, stack_ptr, entry, p1, p2, p3);
-- 
2.25.1
@andyross andyross added the bug The issue is a bug, or the PR is fixing a bug label Feb 26, 2021
@andyross andyross self-assigned this Feb 26, 2021
@andyross andyross added priority: high High impact/importance bug area: Xtensa Xtensa Architecture labels Feb 26, 2021
@andyross
Copy link
Contributor Author

Call out @lyakh @lgirdwood @nashif who will all want to follow this and probably have input.

@lgirdwood
Copy link
Collaborator

@andyross Fwiw, SOF does not use the default Cadence HAL & xtos, but modified versions with numerous enhancements to work better on Intel hardware. It may be worth checking the git history for any changes. Fwiw, the history may be missing full details in commit messages (since it was squashed and came directly from internal codebase) but the changes (to default Cadence code) may be helpful.

@andyross
Copy link
Contributor Author

andyross commented Mar 1, 2021

Well, a little progress to report. Here's the temporary rig I'm working on in a branch on top of the in-progress Xtensa pull request:

https://github.com/andyross/zephyr/tree/adsp-spill-rig

It's just a modified hello_world:

https://github.com/andyross/zephyr/blob/adsp-spill-rig/samples/hello_world/src/main.c

What this does (very carefully!) is:

  1. Use some custom assembly to fill up the register window state (i.e. all 64 registers unspilled)

  2. Make a copy of the current stack in uncached memory

  3. Clobber the underlying SRAM "underneath" the current stack to detect spill failures

  4. Deliberately trigger a custom interrupt

  5. The ISR does no work beyond waking up a separate thread

  6. That thread then immediately invalidates the old stack (to be sure that any unspilled contents don't reach SRAM) and checks it vs. the copy that was made earlier.

The idea here is to catch the theoretical situation mentioned above where the GPR window state isn't spilled until too late, leaving dirty cache pages.

Unfortunately... it works. This passes perfectly. In fact it seems like the work involved in getting into the minimal interrupt uses enough stack to flush out all 64 registers on its own, so by the time we reach the second thread it's all spilled already.

While this is probably reasonably considered a performance bug (should a minimal interrupt really be using >256 bytes of register stack on top of its assembly save areas?), I'm a little afraid that what we thought was the bug can't be the only problem.

Now I'm back to digging through code to see if there's a shallower way to get through the interrupt (or exception?) path and skip the spill.

@andyross
Copy link
Contributor Author

andyross commented Mar 2, 2021

OK, that rig turned out to be mostly wasted effort. The core (final?) bug was much simpler:

I'd forgotten when I added the code to invalidate the unused bottom area of the stack that the Xtensa ABI reserves 16 bytes of space underneath the stack pointer as a spill area for caller functions. Almost always, the caller has not been spilled (you'd have had to have taken an interrupt or just returned form a deep call tree), so this was hidden. BUT, the fix for the problem detailed above (which is real!) would have the effect of spilling those registers preemptively before arch_cohere_stacks(), which meant that they'd be invalidated immediately after they'd been spilled. To wit: they were being destroyed reliably where before it would have been a rare-to-impossible condition.

Which is great! Because it means that we have both a fix for the bug detailed above and a convincing explanation for why adding this fix earlier broke everything.

Code is up in #32356 now. With that series, I can't make anything fail anywhere in the kernel tests, even spuriously.

@andyross
Copy link
Contributor Author

andyross commented Mar 8, 2021

The big PR merged this morning, but I realized that I didn't have a Fixed header anywhere in it. Close this.

@andyross andyross closed this as completed Mar 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Xtensa Xtensa Architecture bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Projects
None yet
Development

No branches or pull requests

2 participants