
Big Xtensa architecture cleanup #32356

Merged
Merged 17 commits into zephyrproject-rtos:master on Mar 8, 2021

Conversation

andyross
Contributor

Finally got a chance this weekend (snowed in! release tagged! family activities cancelled! no impending dissolution of the democratic republic!) to revisit a bunch of "large but basically minor" xtensa changes that I've been sitting on in varying states of completion for a while. The dead code removal is the most pleasing aspect.

@andyross andyross requested a review from lyakh February 16, 2021 01:14
@github-actions bot added labels: area: API, area: Kernel, area: Timer, area: Xtensa on Feb 16, 2021
@zephyrbot added labels: area: Base OS, platform: Intel ADSP on Feb 16, 2021
@andyross
Contributor Author

Add @sylvioalves for esp32 eyes. Unfortunately my board (which I haven't touched in over a year) seems to be having trouble, so all the validation on Espressif was build-only. Similarly, the Intel Sue Creek changes are unvalidated, though I'm more confident there because of the similarity to cAVS.

@andyross
Contributor Author

Extend with some intel_adsp work to clean up multiprocessor startup. The major bug addressed was a stack collision in cached memory at the top of RAM. Now all startup paths use the "Zephyr standard" scheme where each CPU gets a dedicated interrupt stack for startup use.

@andyross
Contributor Author

Rebase. Cleaned up a little bit of voodoo in the cache patch; its root cause turned out to be the ADSP SMP stack collision fixed later in the series. Also added a document on "how Xtensa's crazy register windows work" that I sketched out long ago and have been sitting on. There's no particular reason why Zephyr should be responsible for documenting an architecture feature, but no one else does it very well, and IMHO this may be useful to someone in the future.

@lyakh left a comment (Collaborator)

This is so weird... I was sure I had submitted this review :-/ I'll look at the new version again, but maybe these comments can be addressed too, if they're still relevant.

* Evaluates to the old value which was atomically replaced.
*/

#define ATOMIC_XCHG(expr) ({ \
Collaborator

wouldn't it be better to make target also a macro parameter, especially since this is defined in a header, i.e. visible to and usable by all?

Contributor Author

It's a weird macro to generate a function body, so I went with minimal parameters to maximize readability. But maybe I should rename it to something obscure like "Z_XT_GEN_XCHG()" so it doesn't look like something anyone is expected to call.
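
For readers following along, a minimal sketch of the statement-expression pattern under discussion. xtensa_cas() stands in for a compare-and-swap helper and is hypothetical here; the implicit dependence on a local named "target" is exactly what the comment above raises.

/* hypothetical: atomic_val_t xtensa_cas(atomic_t *t, atomic_val_t cmp, atomic_val_t new); */

#define ATOMIC_XCHG(expr) ({ \
	atomic_val_t old_val; \
	do { \
		old_val = *target;  /* relies on a local named "target" in scope */ \
	} while (xtensa_cas(target, old_val, (expr)) != old_val); \
	old_val;                    /* the macro evaluates to the replaced value */ \
})

/* A typical expansion site: */
static inline atomic_val_t atomic_or(atomic_t *target, atomic_val_t value)
{
	return ATOMIC_XCHG(old_val | value);
}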

/* Some architectures need their own implementation */
# ifdef CONFIG_XTENSA
/* Not all Xtensa toolchains support GCC-style atomic intrinsics */
# include <arch/xtensa/atomic_xtensa.h>
Collaborator

can this be made a generic <arch/atomic.h>?

Contributor Author

This is another of those project policy things from which I tend to hide. I think the rule here is that this is a public API, therefore it belongs in the top-level include directory, and therefore has to have an architecture-specific pathname and can't be generic. But I might be wrong.

Quite honestly I typically just drop stuff where I think it's supposed to be and wait for someone else to submit a patch to move it where it's supposed to go. We're constantly moving stuff around anyway, so I view this as developer parallelism. :)

@@ -0,0 +1,143 @@
/**
Collaborator

Naming: this file is already in an xtensa directory, so just calling it atomic.h should be enough, and would also follow the other files in the same directory.

{
#if XCHAL_DCACHE_SIZE
size_t first = ((size_t) addr) & ~(Z_DLINE - 1);
size_t last = (((size_t) addr) + bytes - 1) & ~(Z_DLINE - 1);
Collaborator

Use ROUND_DOWN() / ROUND_UP(). And shouldn't the type be uintptr_t? For the second line too, of course.
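
A sketch of what that suggestion would look like, assuming Zephyr's ROUND_DOWN() helper from sys/util.h; the function name and the per-line operation are placeholders, not the actual patch:

static inline void z_cache_op_range(void *addr, size_t bytes)
{
	uintptr_t first = ROUND_DOWN((uintptr_t)addr, Z_DLINE);
	uintptr_t last = ROUND_DOWN((uintptr_t)addr + bytes - 1, Z_DLINE);

	for (uintptr_t line = first; line <= last; line += Z_DLINE) {
		/* issue one per-line cache instruction here, e.g. dhwb */
	}
}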

#define Z_DLINE (XCHAL_DCACHE_LINESIZE)

#if XCHAL_DCACHE_SIZE
#define Z_IS_POW2(x) (((x) > 0) && (((x) & ((x)-1)) == 0))
Collaborator

so if someone defines int x = 0x80000000; (arguably silly as it might be) it won't be recognised as a power of 2?

Collaborator

Suggested change
#define Z_IS_POW2(x) (((x) > 0) && (((x) & ((x)-1)) == 0))
#define Z_IS_POW2(x) (((x) != 0) && (((x) & ((x)-1)) == 0))
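
A small standalone demonstration of the difference (purely illustrative, not from the patch; the cast avoids signed overflow in the x - 1 term):

#include <stdint.h>
#include <stdio.h>

#define IS_POW2_GT0(x)  (((x) > 0) && (((x) & ((x) - 1)) == 0))
#define IS_POW2_NE0(x)  (((x) != 0) && (((x) & ((x) - 1)) == 0))

int main(void)
{
	int32_t x = INT32_MIN;	/* bit pattern 0x80000000: a single set bit */

	printf("%d\n", IS_POW2_GT0(x));            /* 0: rejected by the signed compare */
	printf("%d\n", IS_POW2_NE0((uint32_t)x));  /* 1: accepted once treated as unsigned */
	return 0;
}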

#if XCHAL_DCACHE_SIZE
char *start = (char *)addr;

for (char *end = start + bytes; start <= end; start += Z_DLINE) {
Collaborator

also potentially one line too many
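
A worked example of the concern, assuming the 64-byte line size discussed elsewhere in this thread:

/* With addr = 0x1000 and bytes = 0x40 the region is exactly one line
 * (0x1000..0x103f), but end = 0x1040 and the "start <= end" test runs
 * the body for start = 0x1000 and again for start = 0x1040, touching
 * the following line as well.  Rounding the address of the region's
 * last byte down to a line boundary and comparing against that avoids
 * the extra iteration.
 */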

/* For complicated reasons in arch_cohere_stacks(), this has to be a
* proper function call and not an inline so the emitted frame does
* not overlap the area to be invalidated.
*/
Collaborator

but stack frames aren't cache-line size aligned, are they? They still wouldn't overlap?

Contributor Author

They are. The platform code guarantees alignment of static stacks, and will align the stack top correctly within a user-provided buffer. It's a kconfig somewhere, but I should probably check to make sure that it's set correctly. :)

Contributor Author

(Oh, and this bug was found and fixed, so that comment got removed. All the code is inlined the same way)

@@ -23,6 +24,8 @@
#define MANIFEST_BASE IMR_BOOT_LDR_MANIFEST_BASE
#endif

int ANDY_PRINT;
Collaborator

debug left-over?

@@ -15,4 +15,43 @@
#define SOC_DCACHE_INVALIDATE(addr, size) \
z_xtensa_cache_inv((addr), (size))

/**
Collaborator

Not a native English speaker, but my dictionary tells me that "canonize" means "declare holy" whereas "canonicalize" means "make standard" :-)

Contributor Author

"Canonize" does mean that, and that was the intent: taking a previously ad-hoc trick and making it an official part of the (SoC) API. But yeah, it's maybe a bit too poetic. I'll rephrase.

*/
static inline void *z_soc_uncached_ptr(void *p)
{
return ((void *)(((size_t)p) & ~0x20000000));
Collaborator

Is it common to use size_t for address-like integers under Zephyr?

Contributor Author

I vaguely recall a bikeshed about "size_t vs. uintptr_t" being resolved in favor of size_t, but I don't think there's a firm policy.

Collaborator

I vaguely recall a bikeshed about "size_t vs. uintptr_t" being resolved in favor of size_t, but I don't think there's a firm policy.

I understand Zephyr will never support every bell and whistle, but I don't understand this desire to "simplify" basic C99 types. Anyway, there are hundreds or thousands of occurrences of each right now.

In theory SIZE_MAX can be very small, 65535 is allowed.
https://pubs.opengroup.org/onlinepubs/009696799/basedefs/stdint.h.html
Segmentation is quoted as one possible reason. Looking at the name of this file I doubt portability is an issue here.

Theoretical limits aside, I think it's more readable to use each one for the purpose they were intended for, so: uintptr_t for absolute pointers and size_t for an array size, something that could be returned by offsetof etc. There may be grey areas but this does not seem like one of them.
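
For concreteness, the helper from the hunk above rewritten with uintptr_t as suggested, plus a cached-alias counterpart; the 0x20000000 alias bit comes from the patch, while the paired helper shown here is an assumption for illustration:

static inline void *z_soc_uncached_ptr(void *p)
{
	return (void *)((uintptr_t)p & ~(uintptr_t)0x20000000);
}

/* Hypothetical counterpart mapping back into the cached window. */
static inline void *z_soc_cached_ptr(void *p)
{
	return (void *)((uintptr_t)p | (uintptr_t)0x20000000);
}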

@lyakh
Collaborator

lyakh commented Feb 17, 2021

This is so weird... I was sure I had submitted this review :-/ I'll look at the new version again, but maybe these comments can be addressed too, if they're still relevant.

I compared with the updated version; apart from the documentation (great to have it, thanks!) there's only one change, moving code from a .c to a .h, so all the comments should still hold.

@sylvioalves
Collaborator

sylvioalves commented Feb 18, 2021

Add @sylvioalves for esp32 eyes. Unfortunately my board (which I haven't touched in over a year) seems to be having trouble, so all the validation on Espressif was build-only. Similarly, the Intel Sue Creek changes are unvalidated, though I'm more confident there because of the similarity to cAVS.

@andyross I have tried this PR with a few code samples on ESP32 and found no issues. Anyway, I'll review the PR soon as well.

until cleared by a function return (by RETW, see below). Other bits
stay zero. So there is one set bit in WINDOWSTART corresponding to
each call frame that is live in hardware registers, and it will be
followed by 0, 1 or 2 zero bits that tell you have "big" (how many
Contributor

that tell you how "big"

they are naturally found not via the frame's own stack pointer but by
negative offsets from the caller's (!) stack pointer. This is so that
A1 can be modified at runtime (e.g. to push/pop data via traditional
techniques) at runtime without having to have a separate register to
Contributor

modified at runtime (…) at runtime without

-- one "at runtime" seems extraneous

A1 can be modified at runtime (e.g. to push/pop data via traditional
techniques) at runtime without having to have a separate register to
store the frame pointer -- the caller's stack pointer obviously won't
change until we return and our frame doesn't exist any more.
Contributor

I have a different explanation for this layout. The function knows the maximum size of any call it makes, so it can reserve enough space in its own stack frame for the save area for those registers.

Register windows of all callers may be spilled when A1 is modified, so this modification is normally done with the MOVSP opcode, which will take care (by raising the alloca exception) of copying the spill area below A1 to a new location if needed.

Contributor Author

Yeah, this bit could stand to be re-written. What I think I was trying to discuss here is the way that external code (a debugger or exception handler) finds those registers' spots on the stack. On other architectures, you just read the frame pointer in e.g. %RBP and offset from that. Or if that was optimized out the toolchain is expected to give you an elaborate framework for how to extract a base pointer from a current stack pointer and program counter. But Xtensa doesn't have a frame pointer and doesn't need external metadata, you just roll back to the parent caller in the windows and you can find the current frame without context (albeit in kind of a voodoo way).

And yeah, I never wrote anything about MOVSP/alloca because quite honestly I'm not sure I ever completely understood it. :)

Contributor

But Xtensa doesn't have a frame pointer

This is not accurate either. Both windowed and call0 ABIs specify a frame pointer register (A7 and A15 respectively). But because of the way addressing works in load/store opcodes (only non-negative offsets are available) the frame pointer points not to the top of a frame, but to the bottom. For that reason the frame pointer is less useful for stack unwinding. But it is still used, e.g. in functions that use alloca.

@andyross (Contributor Author) Feb 19, 2021

Fair enough, maybe I should just remove the paragraph. But the point remains: on other architectures a stack frame in the ABI is either unambiguously interpretable without context (e.g. classic x86 debugger can walk the stack of an unknown binary by following nothing but the BP values pushed by the ABI) or requires external context to understand (build with -fomit-frame-pointer and now you really can't do anything without the DWARF info absent a full RE effort).

Xtensa lives in this odd middle space where there is no way for a debugger to reconstruct the stack from a function's own registers, but where it is possible by rolling up the window state. And that struck me as worth calling out, I guess. Again the intended audience for the document is people like me trying to build an intuition about an odd architecture.

Contributor

And that struck me as worth calling out, I guess.

Sure, I'm all for it. I'm pointing out that the proposed explanation of the a0-a3 / a4-a11 spill location split wasn't clear to me.

*
* This function takes a pointer to any addressible object (either in
* cacheable memory or not) and returns a pointer that can be used to
* refer to the same memory through the L1 data cache.
Contributor Author

Note for myself: I documented these two backwards! This function's result bypasses the cache.

@andyross
Contributor Author

Add one more fix for a cache coherence bug on thread creation that has been plaguing @lyakh

I still need to address most of the review feedback above, and I note this is now failing to build after a rebase (something in the device model?! probably just a transitive header glitch from all the code motion?) that I need to fix.

Andy Ross added 16 commits March 1, 2021 16:21
There was a bunch of dead historical cruft floating around in the
arch/xtensa tree, left over from older code versions.  It's time to do
a cleanup pass.  This is entirely refactoring and size optimization,
no behavior changes on any in-tree devices should be present.

Among the more notable changes:

+ xtensa_context.h offered an elaborate API to deal with a stack frame
  and context layout that we no longer use.

+ xtensa_rtos.h was entirely dead code

+ xtensa_timer.h was a parallel abstraction layer implementing in the
  architecture layer what we're already doing in our timer driver.

+ The architecture thread structs (_callee_saved and _thread_arch)
  aren't used by current code, and had dead fields that were removed.
  Unfortunately for standards compliance and C++ compatibility it's
  not possible to leave an empty struct here, so they have a single
  byte field.

+ xtensa_api.h was really just some interrupt management inlines used
  by irq.h, so fold that code into the outer header.

+ Remove the stale assembly offsets.  This architecture doesn't use
  that facility.

All told, more than a thousand lines have been removed.  Not bad.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The xtensa atomics layer was written with hand-coded assembly that had
to be called as functions.  That's needlessly slow, given that the low
level primitives are a two-instruction sequence.  Ideally the compiler
should see this as an inline to permit it to better optimize around
the needed barriers.

There was also a bug with the atomic_cas function, which had a loop
internally instead of returning the old value synchronously on a
failed swap.  That's benign right now because our existing spin lock
does nothing but retry it in a tight loop anyway, but it's incorrect
per spec and would have caused a contention hang with more elaborate
algorithms (for example a spinlock with backoff semantics).

Remove the old implementation and replace with a much smaller inline C
one based on just two assembly primitives.

This patch also contains a little bit of refactoring: each atomics implementation scheme has been split out into a separate header, and the ATOMIC_OPERATIONS_CUSTOM kconfig has been renamed to
ATOMIC_OPERATIONS_ARCH to better capture what it means.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
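
For reference, a sketch of the two-instruction primitive mentioned above, built on the standard Xtensa S32C1I conditional store and its SCOMPARE1 special register; the function name and constraints here are illustrative, not the merged code:

static inline int32_t xtensa_cas(volatile int32_t *addr, int32_t cmp, int32_t new_val)
{
	/* s32c1i stores new_val to *addr only if *addr == SCOMPARE1,
	 * and in either case loads the old memory value into the
	 * new_val register.
	 */
	__asm__ volatile("wsr %1, SCOMPARE1\n\t"
			 "s32c1i %0, %2, 0"
			 : "+r"(new_val)
			 : "r"(cmp), "r"(addr)
			 : "memory");

	return new_val;	/* equal to cmp iff the swap happened */
}
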
This whole file is written to assume XEA2, so there's no value to
using an abstraction call here.  Write to the RSIL instruction
directly.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The trace output layer was using this transformation already, make it
an official API.  There are other places doing similar logic that can
benefit.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The multiprocessor entry code here had some bits that look to have
been copied from esp32, including a clumsy stack switch that's needed
there.  But it wasn't actually switching the stack at all, which on
this device is pointed at the top of HP-SRAM and can stay there until
the second CPU swaps away into a real thread (this will need to change
once we support >2 CPUs though).

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
There's no need to muck with the cache directly as long as we're
careful about addressing the shared start record through an uncached
volatile pointer.

Correct a theoretical bug with the initial cache invalidate on the
second CPU which was actually doing a flush (and thus potentially
pushing things the boot ROM wrote into RAM now owned by the OS).

Optimize memory layout a bit when using KERNEL_COHERENCE; we don't
need a full cache line for the start record there as it's already in
uncached memory.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The kernel passes the CPU's interrupt stack expecting that it will
start on that, so do it.  Pass the initial stack pointer from the SOC
layer in the variable "z_mp_stack_top" and set it in the assembly
startup before calling z_mp_entry().

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
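
A sketch of the handoff described: z_mp_stack_top is the variable named above, arch_start_cpu() is the kernel's SMP bring-up hook of this era, and whether the merged code computes the value exactly this way is an assumption.

char *z_mp_stack_top;	/* read by the assembly startup stub */

void arch_start_cpu(int cpu_num, k_thread_stack_t *stack, int sz,
		    arch_cpustart_t fn, void *arg)
{
	/* Stacks grow downward, so the initial SP is the high end of
	 * the interrupt stack the kernel handed us.
	 */
	z_mp_stack_top = Z_KERNEL_STACK_BUFFER(stack) + sz;

	/* ...then kick the secondary core; its assembly stub loads
	 * z_mp_stack_top into a1 before calling z_mp_entry().
	 */
}
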
Zephyr's normal architecture is to do all initialization on the
interrupt stacks.  The CAVS code was traditionally written to start
the stack at the end of HP-SRAM, where it has no protection against
overlap with other uses (e.g. MP startup used the same region for
stacks and saw cache collisions, and the SOF heap lives in this area
too).  Put it where Zephyr expects and we'll have fewer surprises.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Instead of passing the crt1 _start function as the entry code for
auxiliary CPUs, use a tiny assembly stub which can avoid the
runtime testing needed to skip the work in _start.  All the crt1 code
was doing was clearing BSS (which must not happen on a second CPU) and
setting the stack pointer (which is wrong on the second CPU).

This allows us to clean out the SMP code in crt1.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Back when I started work on this stuff, I had a set of notes on
register windows that slowly evolved into something that looks like
formal documentation.  There really isn't any overview-style
documentation of this stuff on the public internet, so it couldn't
hurt to commit it here for posterity.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The Xtensa L1 cache layer has straightforward semantics accessible via
single instructions that operate on cache lines via physical
addresses.  These are very amenable to inlining.

Unfortunately the Xtensa HAL layer requires function calls to do this,
leading to significant code waste at the calling site, an extra frame
on the stack and needless runtime instructions for situations where
the call is over a constant region that could elide the loop.  This is
made even worse because the HAL library is not built with
-ffunction-sections, so pulling in even one of these tiny cache
functions has the effect of importing a 1500-byte object file into the
link!

Add our own tiny cache layer to include/arch/xtensa/cache.h and use
that instead.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
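
The single-instruction primitives referred to above look roughly like this when inlined (names illustrative; each acts on the cache line containing the given address):

static inline void dcache_line_writeback(void *p)
{
	__asm__ volatile("dhwb %0, 0" :: "r"(p) : "memory");
}

static inline void dcache_line_invalidate(void *p)
{
	__asm__ volatile("dhi %0, 0" :: "r"(p) : "memory");
}

static inline void dcache_line_writeback_inv(void *p)
{
	__asm__ volatile("dhwbi %0, 0" :: "r"(p) : "memory");
}
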
Both new thread creation and context switch had the same mistake in
cache management: the bottom of the stack (the "unused" region between
the lower memory bound and the live stack pointer) needs to be
invalidated before we switch, because otherwise any dirty lines we
might have left over can get flushed out on top of the same thread on
another CPU that is putting live data there.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Putting spinlocks (or things containing them) onto the stack is a
KERNEL_COHERENCE violation.  This doesn't need to be there so just
make it static.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
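
The shape of that fix, as a hedged before/after sketch with an illustrative name:

/* Before: a spinlock on the stack, which other CPUs must observe --
 * a coherence violation when stacks are cached and incoherent.
 *
 *	struct k_spinlock lock;
 *
 * After: static storage, which lives in coherent memory.
 */
static struct k_spinlock lock;
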
When we reach this code in interrupt context, our upper GPRs contain a
cross-stack call that may still include some registers from the
interrupted thread.  Those need to go out to memory before we can do
our cache coherence dance here.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
These tests would pass pointers to data on their own stacks to other
threads, which is forbidden when CONFIG_KERNEL_COHERENCE (because
stack memory isn't cache-coherent).  Make the variables static.

Also, queue had two sleeps of 2 ticks (having been written in an era
where that meant "20-30ms"), and on a device with a 50 kHz tick rate
that's not very much time at all.  It would sometimes fail spuriously
because the spawned threads didn't consume the queue entries in time.
How about 10ms of real time instead?

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
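
A sketch of the two adjustments described, with illustrative names; k_msleep() and K_TICKS() are standard kernel API:

static struct k_fifo queue;	/* was a stack local handed to other threads */

static void producer_delay(void)
{
	/* was: k_sleep(K_TICKS(2));  -- only ~40 us at a 50 kHz tick rate */
	k_msleep(10);		/* 10 ms of real time */
}
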
The code here was written to "get out of the way just long enough for
the trivial context switch and callback to execute".  But on a machine
with 50 kHz ticks, that's not reliably enough time and this was
failing spuriously.  Which would have been a reasonably forgivable
mistake to make had I not written this code with this very machine in
mind...

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@andyross
Contributor Author

andyross commented Mar 2, 2021

Think I got this now. I have reliable passing results (modulo the annoying twister --device-testing bug) for every test run under twister, plus the handful of non-default ones I knew to try and run. New material includes what I believe to be the final fix for the coherence trouble detailed in #32705 and a bunch of minor test fixups unmasked once everything else was working reliably.

Should be good to merge. Please re-/review as needed.

@andyross
Contributor Author

andyross commented Mar 2, 2021

Oh, and to repeat from before: the "QUOTED_WHITESPACE_BEFORE_NEWLINE" checkpatch errors are a false positive. They're present in an asm() where IMHO they greatly improve readability. The check was designed to catch bad format strings.

@andyross
Contributor Author

andyross commented Mar 3, 2021

Ping this one again. @lyakh, are you OK with this at this stage? The series as it exists seems very solid to me; I have reliable passing results for every test that is configured for the board. And once the work is unblocked to submit, I have another series with minor platform/kconfig fixups that gets us passes on all but five tests across the whole suite. :)

(edit: see https://github.com/andyross/zephyr/tree/xtensa-wip , which is a combined tree with #32356, #32678, #32784 and some final fixup work)

@nashif nashif requested a review from lyakh March 3, 2021 16:22
Other architectures do this too by e.g. pushing the incoming stack
pointer onto the stack as a standard "frame pointer" defined in the
platform ABI. Xtensa wraps this together with the natural spill area
for register windows.
Collaborator

@andyross FWIW, I've noticed some differences between GCC and XCC here wrt stack and register layout when debugging exceptions. The SOF exception code will dump registers and stack, but this looks different between XCC and GCC. SOF has a python script to decode the dump (works on XCC, not GCC). Something to be aware of here for later on.

@lyakh left a comment (Collaborator)

Sorry for not reviewing earlier, I was on vacation last week. This passes the synchronization test in my setup now! I've posted one question to one of the commits, let's clarify this today and then it should be good to merge!

*/
__asm__ volatile("mov %0, a1" : "=r"(osp));
osp -= 16;
z_xtensa_cache_flush((void *)osp, 1);
Collaborator

Not sure I understand how we can be sure that only a single line has to be flushed? "The calling function" that you're referring to in the comment - is it the function "calling" this inline function (in quotes, because as an inline this function won't really generate a function frame on the stack, right?), or the function one more level up? But even 16 bytes can cross the (64-byte) cache line boundary, can't they?

Contributor Author

The rest of the lines get flushed in the middle of arch_switch(). This one line is here because there may be overlap in that line between the bytes that are live at the bottom of the stack and the bytes below the stack that have to be invalidated. We have to be sure that there is no dirty data in the stack, but can't accidentally invalidate real data at the bottom. So we flush the one line where there may be overlap.

Contributor Author

(And yes, this detail, and the existence of the 16 byte spill area I forgot in the original version, was the hardest bit.)

Collaborator

@andyross so the case of the spill area crossing 2 cache lines is impossible? Can our current stack pointer be something like 0x1008? Then you'd need to flush two lines: at 0xfc0 and 0x1000, assuming 0x40-byte cache lines.

Contributor Author

The caller spill area just moves the boundary. Again, this isn't the whole cache operation, the rest is completed later, in arch_switch(). The theory is that we have two regions of interest:

  1. The top of the stack with live data that has to be flushed
  2. The bottom of the stack with stale data that might be dirty and must be flushed or invalidated; we can't have dirty data in the cache once the thread stops running (because it would end up being flushed out later on top of the same thread running on another CPU).

This code here, in arch_cohere_stacks(), is only concerned with region 2 (because we're still running C code and dirtying our own stack, we can't flush yet -- that happens in assembly right before we switch stacks into the new context).

But there's a complexity, because there can be one cache line that has bytes from region 1 and region 2. So we flush the top line instead of invalidating it.
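
An illustrative-only restatement of the above in code form, for an outgoing thread whose stack occupies [base, top) with live data from sp upward. z_xtensa_cache_flush()/z_xtensa_cache_inv() are the helpers from this series; everything else (the names, the explicit 16-byte spill adjustment) just re-expresses the discussion and is not the real arch_cohere_stacks():

static void cohere_outgoing_stack(char *base, char *sp, char *top)
{
	/* Step below the caller's 16-byte spill area, then find the
	 * cache line that may hold both live and dead bytes.
	 */
	char *boundary = (char *)ROUND_DOWN((uintptr_t)(sp - 16), Z_DLINE);

	/* The straddling line is flushed, never invalidated, so no
	 * live data can be lost.
	 */
	z_xtensa_cache_flush(boundary, 1);

	/* Region 2: stale bytes below must not remain dirty. */
	z_xtensa_cache_inv(base, boundary - base);

	/* Region 1: the live bytes from sp to top are flushed later,
	 * in assembly, once C code stops dirtying this stack.
	 */
}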

Contributor Author

Oh sorry, and you had a question above that I skipped: the "caller" in this case refers to the ABI calling context -- the last spot where a CALLn instruction was issued, not necessarily a C function.

Contributor

so the case of the spill area crossing 2 cache lines is impossible?

Spill area is 16 bytes in size and a cache line may be as small as 4 bytes. So in the worst case the spill area can cover 4 cache lines.

Can our current stack pointer be something like 0x1008

According to the Xtensa ABI, the stack pointer must be aligned to 16 bytes.

@nashif nashif merged commit 29b8358 into zephyrproject-rtos:master Mar 8, 2021