New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Big Xtensa architecture cleanup #32356
Conversation
Add @sylvioalves for esp32 eyes. Unfortunately my board (which I haven't touched in over a year) seems to be having trouble. All the validation on Espressif was build-only. Similarly the Intel Sue Creek changes are unvalidated, though I'm more confident there because of the similarly to cavs. |
Extend with some intel_adsp work to clean up multiprocessor startup. The major bug addressed was a stack collision in cached memory at the top of RAM. Now all startup paths use the "Zephyr standard" scheme where each CPU gets a dedicated interrupt stack for startup use. |
aecdb8a
to
3110aec
Compare
Rebase. Clean up a little bit of voodoo in the cache patch whose root cause turned out to be the ADSP SMP stack collision thing fixed later. Also adds a document on "how Xtensa's crazy register windows work" that I sketched out long ago and have been sitting on. There's no particular reason why Zephyr should be responsible for documenting an architecture feature, but no one else does it very well and IMHO this may be useful to someone in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so weird... I was sure to have submitted this review :-/ I'll look at the new version again, but maybe these comments can be addressed too, if they're still relevant
include/arch/xtensa/atomic_xtensa.h
Outdated
* Evaluates to the old value which was atomically replaced. | ||
*/ | ||
|
||
#define ATOMIC_XCHG(expr) ({ \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be better to make target
also a macro parameter, especially since this is defined in a header, i.e. visible to and usable by all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a weird macro to generate a function body, so I went with minimal parameters to maximize readability. But maybe I should rename this to something obscure like "Z_XT_GEN_XCHG()" or something so it doesn't look like something anyone is expected to call.
/* Some architectures need their own implementation */ | ||
# ifdef CONFIG_XTENSA | ||
/* Not all Xtensa toolchains support GCC-style atomic intrinsics */ | ||
# include <arch/xtensa/atomic_xtensa.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be made a generic <arch/atomic.h>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another of those project policy things from which I tend to hide. I think the rule here is that this is a public API, therefore it belongs in the top-level include directory, and therefore has to have an architecture-specific pathname and can't be generic. But I might be wrong.
Quite honestly I typically just drop stuff where I think it's supposed to be and wait for someone else to submit a patch to move it where it's supposed to go. We're constantly moving stuff around anyway, so I view this as developer parallelism. :)
@@ -0,0 +1,143 @@ | |||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: this file is already in an xtensa directory, just calling it atomic.h should be enough, also following other files in the same directory?
arch/xtensa/core/xtensa-asm2.c
Outdated
{ | ||
#if XCHAL_DCACHE_SIZE | ||
size_t first = ((size_t) addr) & ~(Z_DLINE - 1); | ||
size_t last = (((size_t) addr) + bytes - 1) & ~(Z_DLINE - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ROUND_DOWN()
/ ROUND_UP()
. Shouldn't the type be uintptr_t
? For line
too of course
include/arch/xtensa/cache.h
Outdated
#define Z_DLINE (XCHAL_DCACHE_LINESIZE) | ||
|
||
#if XCHAL_DCACHE_SIZE | ||
#define Z_IS_POW2(x) (((x) > 0) && (((x) & ((x)-1)) == 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if someone defines int x = 0x80000000;
(arguably silly as it might be) it won't be recognised as a power of 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#define Z_IS_POW2(x) (((x) > 0) && (((x) & ((x)-1)) == 0)) | |
#define Z_IS_POW2(x) (((x) != 0) && (((x) & ((x)-1)) == 0)) |
include/arch/xtensa/cache.h
Outdated
#if XCHAL_DCACHE_SIZE | ||
char *start = (char *)addr; | ||
|
||
for (char *end = start + bytes; start <= end; start += Z_DLINE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also potentially one line too many
include/arch/xtensa/cache.h
Outdated
/* For complicated reasons in arch_cohere_stacks(), this has to be a | ||
* proper function call and not an inline so the emitted frame does | ||
* not overlap the area to be invalidated. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but stack frames aren't cache-line size aligned, are they? They still wouldn't overlap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are. The platform code guarantees alignment of static stacks, and will align the stack top correctly within a user-provided buffer. It's a kconfig somewhere, but I should probably check to make sure that it's set correctly. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Oh, and this bug was found and fixed, so that comment got removed. All the code is inlined the same way)
@@ -23,6 +24,8 @@ | |||
#define MANIFEST_BASE IMR_BOOT_LDR_MANIFEST_BASE | |||
#endif | |||
|
|||
int ANDY_PRINT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug left-over?
@@ -15,4 +15,43 @@ | |||
#define SOC_DCACHE_INVALIDATE(addr, size) \ | |||
z_xtensa_cache_inv((addr), (size)) | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a native English speaker but my dictionary tells me, that "canonize" means "declare holy" whereas "canonicalize" "make standard" :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Canonize" does mean that, and that was the intent: taking a previously ad-hoc trick and making it an official part of the (SoC) API. But yeah, it's maybe a bit too poetic. I'll rephrase.
*/ | ||
static inline void *z_soc_uncached_ptr(void *p) | ||
{ | ||
return ((void *)(((size_t)p) & ~0x20000000)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it a common use for size_t
for address-like integers under Zephyr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vaguely recall a bikeshed about "size_t vs. uintptr_t" being resolved in favor of size_t, but I don't think there's a firm policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vaguely recall a bikeshed about "size_t vs. uintptr_t" being resolved in favor of size_t, but I don't think there's a firm policy.
I understand Zephyr will never support every bell and whistle but I don't understand this desire to "simplify" basic C99 types. Anyway there are hundreds/thousands occurrences of each right now.
In theory SIZE_MAX
can be very small, 65535 is allowed.
https://pubs.opengroup.org/onlinepubs/009696799/basedefs/stdint.h.html
Segmentation is quoted as one possible reason. Looking at the name of this file I doubt portability is an issue here.
Theoretical limits aside, I think it's more readable to use each one for the purpose they were intended for, so: uintptr_t
for absolute pointers and size_t
for an array size, something that could be returned by offsetof
etc. There may be grey areas but this does not seem like one of them.
I compared with the updated version, apart from the documentation (great to have it, thanks!) there's only one change, moving code from a .c to .h, so, all comments should still hold. |
@andyross I have tried this PR with a few code samples on ESP32 and found no issues. Anyway, I'll review the PR soon as well. |
arch/xtensa/core/README-WINDOWS.rst
Outdated
until cleared by a function return (by RETW, see below). Other bits | ||
stay zero. So there is one set bit in WINDOWSTART corresponding to | ||
each call frame that is live in hardware registers, and it will be | ||
followed by 0, 1 or 2 zero bits that tell you have "big" (how many |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that tell you how "big"
arch/xtensa/core/README-WINDOWS.rst
Outdated
they are naturally found not via the frame's own stack pointer but by | ||
negative offsets from the caller's (!) stack pointer. This is so that | ||
A1 can be modified at runtime (e.g. to push/pop data via traditional | ||
techniques) at runtime without having to have a separate register to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified at runtime (…) at runtime without
-- one "at runtime" seems extraneous
arch/xtensa/core/README-WINDOWS.rst
Outdated
A1 can be modified at runtime (e.g. to push/pop data via traditional | ||
techniques) at runtime without having to have a separate register to | ||
store the frame pointer -- the caller's stack pointer obviously won't | ||
change until we return and our frame doesn't exist any more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a different explanation for this layout. The function knows maximum size of a call it makes, so it can reserve enough space in its own stack frame for the save area for these registers.
Register windows of all callers may be spilled when A1 is modified, so this modification is normally done with the movsp opcode that will take care (by raising the alloca exception) of copying spill area below A1 to a new location if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this bit could stand to be re-written. What I think I was trying to discuss here is the way that external code (a debugger or exception handler) finds those registers' spots on the stack. On other architectures, you just read the frame pointer in e.g. %RBP and offset from that. Or if that was optimized out the toolchain is expected to give you an elaborate framework for how to extract a base pointer from a current stack pointer and program counter. But Xtensa doesn't have a frame pointer and doesn't need external metadata, you just roll back to the parent caller in the windows and you can find the current frame without context (albeit in kind of a voodoo way).
And yeah, I never wrote anything about MOVSP/alloca because quite honestly I'm not sure I ever completely understood it. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But Xtensa doesn't have a frame pointer
This is not accurate too. Both windowed and call0 ABIs specify frame pointer register (A7 and A15 respectively). But because of the way addressing works in load/store opcodes (only non-negative offsets are available) the frame pointer points not to the top of a frame, but to the bottom. For that reason the frame pointer is less useful for the stack unwinding. But it is till used, e.g. in functions that use alloca.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, maybe I should just remove the paragraph. But the point remains: on other architectures a stack frame in the ABI is either unambiguously interpretable without context (e.g. classic x86 debugger can walk the stack of an unknown binary by following nothing but the BP values pushed by the ABI) or requires external context to understand (build with -fomit-frame-pointer and now you really can't do anything without the DWARF info absent a full RE effort).
Xtensa lives in this odd middle space where there is no way for a debugger to reconstruct the stack from a function's own registers, but where it is possible by rolling up the window state. And that struck me as worth calling out, I guess. Again the intended audience for the document is people like me trying to build an intuition about an odd architecture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And that struck me as worth calling out, I guess.
Sure, I'm all for it. I'm pointing out that the proposed explanation of the a0-a3 / a4-a11 spill location split wasn't clear to me.
* | ||
* This function takes a pointer to any addressible object (either in | ||
* cacheable memory or not) and returns a pointer that can be used to | ||
* refer to the same memory through the L1 data cache. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for myself: I documented these two backwards! This function's result bypasses the cache.
Add one more fix for a cache coherence bug on thread creation that has been plaguing @lyakh I still need to address most of the review feedback above, and I note this is now failing to build after a rebase (something in the device model?! probably just a transitive header glitch from all the code motion?) that I need to fix. |
There was a bunch of dead historical cruft floating around in the arch/xtensa tree, left over from older code versions. It's time to do a cleanup pass. This is entirely refactoring and size optimization, no behavior changes on any in-tree devices should be present. Among the more notable changes: + xtensa_context.h offered an elaborate API to deal with a stack frame and context layout that we no longer use. + xtensa_rtos.h was entirely dead code + xtensa_timer.h was a parallel abstraction layer implementing in the architecture layer what we're already doing in our timer driver. + The architecture thread structs (_callee_saved and _thread_arch) aren't used by current code, and had dead fields that were removed. Unfortunately for standards compliance and C++ compatibility it's not possible to leave an empty struct here, so they have a single byte field. + xtensa_api.h was really just some interrupt management inlines used by irq.h, so fold that code into the outer header. + Remove the stale assembly offsets. This architecture doesn't use that facility. All told, more than a thousand lines have been removed. Not bad. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The xtensa atomics layer was written with hand-coded assembly that had to be called as functions. That's needlessly slow, given that the low level primitives are a two-instruction sequence. Ideally the compiler should see this as an inline to permit it to better optimize around the needed barriers. There was also a bug with the atomic_cas function, which had a loop internally instead of returning the old value synchronously on a failed swap. That's benign right now because our existing spin lock does nothing but retry it in a tight loop anyway, but it's incorrect per spec and would have caused a contention hang with more elaborate algorithms (for example a spinlock with backoff semantics). Remove the old implementation and replace with a much smaller inline C one based on just two assembly primitives. This patch also contains a little bit of refactoring to address the scheme has been split out into a separate header for each, and the ATOMIC_OPERATIONS_CUSTOM kconfig has been renamed to ATOMIC_OPERATIONS_ARCH to better capture what it means. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
This whole file is written to assume XEA2, so there's no value to using an abstraction call here. Write to the RSIL instruction directly. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The trace output layer was using this transformation already, make it an official API. There are other places doing similar logic that can benefit. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The multiprocessor entry code here had some bits that look to have been copied from esp32, including a clumsy stack switch that's needed there. But it wasn't actually switching the stack at all, which on this device is pointed at the top of HP-SRAM and can stay there until the second CPU swaps away into a real thread (this will need to change once we support >2 CPUS though). Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
There's no need to muck with the cache directly as long as we're careful about addressing the shared start record through an uncached volatile pointer. Correct a theoretical bug with the initial cache invalidate on the second CPU which was actually doing a flush (and thus potentially pushing things the boot ROM wrote into RAM now owned by the OS). Optimize memory layout a bit when using KERNEL_COHERENCE; we don't need a full cache line for the start record there as it's already in uncached memory. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The kernel passes the CPU's interrupt stack expected that it will start on that, so do it. Pass the initial stack pointer from the SOC layer in the variable "z_mp_stack_top" and set it in the assembly startup before calling z_mp_entry(). Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Zephyr's normal architecture is to do all initialization in the interrupt stacks. The CAVS code was traditionally written to start the stack at the end of HP-SRAM, where it has no protection against overlap with other uses (e.g. MP startup used the same region for stacks and saw cache collisions, and the SOF heap lives in this area too). Put it where Zephyr expects and we'll have fewer surprises. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Instead of passing the crt1 _start function as the entry code for auxiliary CPUs, use a tiny assembly stub instead which can avoid the runtime testing needed to skip the work in _start. All the crt1 code was doing was clearing BSS (which must not happen on a second CPU) and setting the stack pointer (which is wrong on the second CPU). This allows us to clean out the SMP code in crt1. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Back when I started work on this stuff, I had a set of notes on register windows that slowly evolved into something that looks like formal documentation. There really isn't any overview-style documentation of this stuff on the public internet, so it couldn't hurt to commit it here for posterity. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The Xtensa L1 cache layer has straightforward semantics accessible via single-instructions that operate on cache lines via physical addresses. These are very amenable to inlining. Unfortunately the Xtensa HAL layer requires function calls to do this, leading to significant code waste at the calling site, an extra frame on the stack and needless runtime instructions for situations where the call is over a constant region that could elide the loop. This is made even worse because the HAL library is not built with -ffunction-sections, so pulling in even one of these tiny cache functions has the effect of importing a 1500-byte object file into the link! Add our own tiny cache layer to include/arch/xtensa/cache.h and use that instead. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Both new thread creation and context switch had the same mistake in cache management: the bottom of the stack (the "unused" region between the lower memory bound and the live stack pointer) needs to be invalidated before we switch, because otherwise any dirty lines we might have left over can get flushed out on top of the same thread on another CPU that is putting live data there. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Putting spinlocks (or things containing them) onto the stack is a KERNEL_COHERENCE violation. This doesn't need to be there so just make it static. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
When we reach this code in interrupt context, our upper GPRs contain a cross-stack call that may still include some registers from the interrupted thread. Those need to go out to memory before we can do our cache coherence dance here. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
These tests would pass pointers to data on their own stacks to other threads, which is forbidden when CONFIG_KERNEL_COHERENCE (because stack memory isn't cache-coherent). Make the variables static. Also, queue had two sleeps of 2 ticks (having been written in an era where that meant "20-30ms"), and on a device with a 50 kHz tick rate that's not very much time at all. It would sometimes fail spuriously because the spawned threads didn't consume the queue entries in time. How about 10ms of real time instead? Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The code here was written to "get out of the way just long enough for the trivial context switch and callback to execute". But on a machine with 50 kHz ticks, that's not reliably enough time and this was failing spuriously. Which would have been a reasonably forgivable mistake to make had I not written this code with this very machine in mind... Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Think I got this now. I have reliable passing (modulo the annoying twister --device-testing bug) results now for ever test run under twister, plus the handful of non-default ones I knew to try and run. New material includes what I believe to be the final fix for the coherence trouble detailed in #32705 and a bunch of minor test fixups unmasked once everything was working reliably otherwise. Should be good to merge. Please re-/review as needed. |
Oh, and to repeat form before: the "QUOTED_WHITESPACE_BEFORE_NEWLINE" checkpatch errors are a false positive. They're present in an asm() where IMHO they greatly improve readability. The error was designed to catch bad format strings. |
Ping this one again. @lyakh , are you OK with this at this stage? The series as it exists seems very solid to me; I have reliable passing results for every test that is configured for the board. And once the work is unblocked to submit, I have another series with minor platform/kconfig fixups that gets us passes on all but five tests across the whole suite. :) (edit: see https://github.com/andyross/zephyr/tree/xtensa-wip , which is a combined tree with #32356, #32678, #32784 and some final fixup work) |
Other architectures do this too by e.g. pushing the incoming stack | ||
pointer onto the stack as a standard "frame pointer" defined in the | ||
platform ABI. Xtensa wraps this together with the natural spill area | ||
for register windows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andyross fwiw, I've noticed some differences between GCC and XCC here wrt stack and register layout when debugging exceptions. The SOF exception code will dump registers and stack but this looks different between XCC and GCC. SOF has a python script to decode the dump (works on XCC not GCC). Something to be aware off here for later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not reviewing earlier, I was on vacation last week. This passes the synchronization test in my setup now! I've posted one question to one of the commits, let's clarify this today and then it should be good to merge!
*/ | ||
__asm__ volatile("mov %0, a1" : "=r"(osp)); | ||
osp -= 16; | ||
z_xtensa_cache_flush((void *)osp, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand how we're sure that only a single line has to be flushed? "The calling function" that you're referring to in the commend - is it the function, "calling" this inline function (in quotes, because as an inline this function won't really generate a function frame on stack, right?), or the function one more level up? But even 16 bytes can cross the (64-byte long) cache line boundary, cannot they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of the lines get flushed in the middle of arch_switch()
. This one line is here because there may be overlap in that line between the bytes that are live at the bottom of the stack and the bytes below the stack that have to be invalidated. We have to be sure that there is no dirty data in the stack, but can't accidentally invalidate real data at the bottom. So we flush the one line where there may be overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And yes, this detail, and the existence of the 16 byte spill area I forgot in the original version, was the hardest bit.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andyross so the case of the spill area crossing 2 cache lines is impossible? Can our current stack pointer be something like 0x1008 then you'd need to flush two lines: at 0xfc0 and 0x1000 - assuming 0x40 byte cache lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller spill area just moves the boundary. Again, this isn't the whole cache operation, the rest is completed later, in arch_switch()
. The theory is that we have two regions of interest:
- The top of the stack with live data that has to be flushed
- The bottom of the stack with stale data that might be dirty and must be flushed or invalidated, we can't have dirty data in the cache once the thread stops running (because it will end up being flushed out later on top of the same thread running on another CPU).
This code here, in arch_cohere_stacks()
, is only concerned with region 2 (because we're still running C code and dirtying our own stack, we can't flush yet -- that happens in assembly right before we switch stacks into the new context).
But there's a complexity, because there can be one cache line that has bytes from region 1 and region 2. So we flush the top line instead of invalidating it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, and you had a question above that I skipped: the "caller" in this case refers to the ABI calling context -- the last spot where a CALLn instruction was issued, not necessarily a C function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the case of the spill area crossing 2 cache lines is impossible?
Spill area is 16 bytes in size and a cache line may be as small as 4 bytes. So in the worst case the spill area can cover 4 cache lines.
Can our current stack pointer be something like 0x1008
According to xtensa ABI stack pointer must be aligned to 16 bytes.
Finally got a chance this weekend (snowed in! release tagged! family activities cancelled! no impending dissolution of the democratic republic!) to revisit a bunch of "large but basically minor" xtensa changes that I've been sitting on in varying states of completion for a while. The dead code removal is the most pleasing aspect.