
Big Xtensa architecture cleanup #32356

Merged: 17 commits, Mar 8, 2021

Commits on Mar 2, 2021

  1. kernel/swap: Move arch_cohere_stacks() back under the lock

    Commit 6b84ab3 ("kernel/sched: Adjust locking in z_swap()") moved
    the call to arch_cohere_stacks() out of the scheduler lock while doing
    some reorganizing.  On further reflection, this is incorrect.  When
    done outside the lock, the two arch_cohere_stacks() calls will race
    against each other.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit c669607
  2. arch/xtensa: General cleanup, remove dead code

    There was a bunch of dead historical cruft floating around in the
    arch/xtensa tree, left over from older code versions.  It's time to do
    a cleanup pass.  This is entirely refactoring and size optimization;
    there should be no behavior changes on any in-tree devices.
    
    Among the more notable changes:
    
    + xtensa_context.h offered an elaborate API to deal with a stack frame
      and context layout that we no longer use.
    
    + xtensa_rtos.h was entirely dead code
    
    + xtensa_timer.h was a parallel abstraction layer implementing in the
      architecture layer what we're already doing in our timer driver.
    
    + The architecture thread structs (_callee_saved and _thread_arch)
      aren't used by current code, and had dead fields that were removed.
      Unfortunately for standards compliance and C++ compatibility it's
      not possible to leave an empty struct here, so they have a single
      byte field.
    
    + xtensa_api.h was really just some interrupt management inlines used
      by irq.h, so fold that code into the outer header.
    
    + Remove the stale assembly offsets.  This architecture doesn't use
      that facility.
    
    All told, more than a thousand lines have been removed.  Not bad.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 4d61aff
  3. arch/xtensa: Inline atomics

    The xtensa atomics layer was written with hand-coded assembly that had
    to be called as functions.  That's needlessly slow, given that the low
    level primitives are a two-instruction sequence.  Ideally the compiler
    should see this as an inline to permit it to better optimize around
    the needed barriers.
    
    There was also a bug with the atomic_cas function, which had a loop
    internally instead of returning the old value synchronously on a
    failed swap.  That's benign right now because our existing spin lock
    does nothing but retry it in a tight loop anyway, but it's incorrect
    per spec and would have caused a contention hang with more elaborate
    algorithms (for example a spinlock with backoff semantics).
    
    Remove the old implementation and replace with a much smaller inline C
    one based on just two assembly primitives.
    
    This patch also contains a little bit of refactoring: each atomics
    implementation scheme has been split out into a separate header, and
    the ATOMIC_OPERATIONS_CUSTOM kconfig has been renamed to
    ATOMIC_OPERATIONS_ARCH to better capture what it means.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 088d1cf
  4. arch/xtensa: Remove XTOS dependency in irq_lock()

    This whole file is written to assume XEA2, so there's no value in
    using an abstraction call here.  Use the RSIL instruction directly.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 1a6d7b6
  5. soc/intel_adsp: Elevate cached/uncached mapping to a SoC API

    The trace output layer was using this transformation already, make it
    an official API.  There are other places doing similar logic that can
    benefit.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit ed7a7e4
  6. soc/intel_adsp: Clean up MP startup

    The multiprocessor entry code here had some bits that look to have
    been copied from esp32, including a clumsy stack switch that's needed
    there.  But it wasn't actually switching the stack at all, which on
    this device is pointed at the top of HP-SRAM and can stay there until
    the second CPU swaps away into a real thread (this will need to change
    once we support >2 CPUS though).
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 81ed70a
  7. soc/intel_adsp: Clean up cache handling in MP startup

    There's no need to muck with the cache directly as long as we're
    careful about addressing the shared start record through an uncached
    volatile pointer.
    
    Correct a theoretical bug with the initial cache invalidate on the
    second CPU which was actually doing a flush (and thus potentially
    pushing things the boot ROM wrote into RAM now owned by the OS).
    
    Optimize memory layout a bit when using KERNEL_COHERENCE; we don't
    need a full cache line for the start record there as it's already in
    uncached memory.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 962039a
  8. soc/intel_adsp: Use the correct MP stack pointer

    The kernel passes in the CPU's interrupt stack and expects the CPU to
    start on it, so do that.  Pass the initial stack pointer from the SOC
    layer in the variable "z_mp_stack_top" and set it in the assembly
    startup before calling z_mp_entry().
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 12b4bc4
  9. soc/intel_adsp: Put initial stack into the CPU0 interrupt stack

    Zephyr's normal architecture is to do all initialization on the
    interrupt stacks.  The CAVS code was traditionally written to start
    the stack at the end of HP-SRAM, where it has no protection against
    overlap with other uses (e.g. MP startup used the same region for
    stacks and saw cache collisions, and the SOF heap lives in this area
    too).  Put it where Zephyr expects and we'll have fewer surprises.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit efa05d1
  10. arch/xtensa: soc/intel_adsp: Rework MP code entry

    Instead of passing the crt1 _start function as the entry code for
    auxiliary CPUs, use a tiny assembly stub instead which can avoid the
    runtime testing needed to skip the work in _start.  All the crt1 code
    was doing was clearing BSS (which must not happen on a second CPU) and
    setting the stack pointer (which is wrong on the second CPU).
    
    This allows us to clean out the SMP code in crt1.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 811aa96
  11. arch/xtensa: Add an arch-internal README on register windows

    Back when I started work on this stuff, I had a set of notes on
    register windows that slowly evolved into something that looks like
    formal documentation.  There really isn't any overview-style
    documentation of this stuff on the public internet, so it couldn't
    hurt to commit it here for posterity.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 5a47aa0
  12. arch/xtensa: Add non-HAL caching primitives

    The Xtensa L1 cache layer has straightforward semantics accessible via
    single instructions that operate on cache lines by physical address.
    These are very amenable to inlining.
    
    Unfortunately the Xtensa HAL layer requires function calls to do this,
    leading to significant code waste at the calling site, an extra frame
    on the stack and needless runtime instructions for situations where
    the call is over a constant region that could elide the loop.  This is
    made even worse because the HAL library is not built with
    -ffunction-sections, so pulling in even one of these tiny cache
    functions has the effect of importing a 1500-byte object file into the
    link!
    
    Add our own tiny cache layer to include/arch/xtensa/cache.h and use
    that instead.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 882c24d
  13. arch/xtensa: Invalidate bottom of outbound stacks

    Both new thread creation and context switch had the same mistake in
    cache management: the bottom of the stack (the "unused" region between
    the lower memory bound and the live stack pointer) needs to be
    invalidated before we switch, because otherwise any dirty lines we
    might have left over can get flushed out on top of the same thread on
    another CPU that is putting live data there.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 7130571
  14. tests/fifo_api: Move k_fifo off stack

    Putting spinlocks (or things containing them) onto the stack is a
    KERNEL_COHERENCE violation.  This doesn't need to be there so just
    make it static.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 1aab761
  15. arch/xtensa: Remember to spill windows in arch_cohere_stacks()

    When we reach this code in interrupt context, our upper GPRs contain a
    cross-stack call that may still include some registers from the
    interrupted thread.  Those need to go out to memory before we can do
    our cache coherence dance here.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 1e3b4af
  16. tests/queue: tests/lifo_usage: Address ADSP/coherence issues

    These tests would pass pointers to data on their own stacks to other
    threads, which is forbidden when CONFIG_KERNEL_COHERENCE (because
    stack memory isn't cache-coherent).  Make the variables static.
    
    Also, queue had two sleeps of 2 ticks (having been written in an era
    where that meant "20-30ms"), and on a device with a 50 kHz tick rate
    that's not very much time at all.  It would sometimes fail spuriously
    because the spawned threads didn't consume the queue entries in time.
    How about 10ms of real time instead?
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 3ead8f1
  17. tests/p4wq: Fix impossible sleep interval

    The code here was written to "get out of the way just long enough for
    the trivial context switch and callback to execute".  But on a machine
    with 50 kHz ticks, that's not reliably enough time and this was
    failing spuriously.  Which would have been a reasonably forgivable
    mistake to make had I not written this code with this very machine in
    mind...
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit c780c20