
Big Xtensa architecture cleanup #32356

Merged: 17 commits, Mar 8, 2021

Commits on Mar 2, 2021

  1. kernel/swap: Move arch_cohere_stacks() back under the lock

    Commit 6b84ab3 ("kernel/sched: Adjust locking in z_swap()") moved
    the call to arch_cohere_stacks() out of the scheduler lock while doing
    some reorganizing.  On further reflection, this is incorrect.  When
    done outside the lock, the two arch_cohere_stacks() calls will race
    against each other.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit c669607
  2. arch/xtensa: General cleanup, remove dead code

    There was a bunch of dead historical cruft floating around in the
    arch/xtensa tree, left over from older code versions.  It's time to do
    a cleanup pass.  This is entirely refactoring and size optimization;
    there should be no behavior changes on any in-tree devices.
    
    Among the more notable changes:
    
    + xtensa_context.h offered an elaborate API to deal with a stack frame
      and context layout that we no longer use.
    
    + xtensa_rtos.h was entirely dead code
    
    + xtensa_timer.h was a parallel abstraction layer implementing in the
      architecture layer what we're already doing in our timer driver.
    
    + The architecture thread structs (_callee_saved and _thread_arch)
      aren't used by current code, and had dead fields that were removed.
      Unfortunately for standards compliance and C++ compatibility it's
      not possible to leave an empty struct here, so they have a single
      byte field.
    
    + xtensa_api.h was really just some interrupt management inlines used
      by irq.h, so fold that code into the outer header.
    
    + Remove the stale assembly offsets.  This architecture doesn't use
      that facility.
    
    All told, more than a thousand lines have been removed.  Not bad.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 4d61aff
  3. arch/xtensa: Inline atomics

    The xtensa atomics layer was written with hand-coded assembly that had
    to be called as functions.  That's needlessly slow, given that the low
    level primitives are a two-instruction sequence.  Ideally the compiler
    should see this as an inline to permit it to better optimize around
    the needed barriers.
    
    There was also a bug with the atomic_cas function, which had a loop
    internally instead of returning the old value synchronously on a
    failed swap.  That's benign right now because our existing spin lock
    does nothing but retry it in a tight loop anyway, but it's incorrect
    per spec and would have caused a contention hang with more elaborate
    algorithms (for example a spinlock with backoff semantics).
    
    Remove the old implementation and replace with a much smaller inline C
    one based on just two assembly primitives.
    
    This patch also contains a little bit of refactoring: each atomics
    implementation scheme has been split out into a separate header, and
    the ATOMIC_OPERATIONS_CUSTOM kconfig has been renamed to
    ATOMIC_OPERATIONS_ARCH to better capture what it means.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 088d1cf
  4. arch/xtensa: Remove XTOS dependency in irq_lock()

    This whole file is written to assume XEA2, so there's no value in
    using an abstraction call here.  Use the RSIL instruction directly.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 1a6d7b6
  5. soc/intel_adsp: Elevate cached/uncached mapping to a SoC API

    The trace output layer was using this transformation already, make it
    an official API.  There are other places doing similar logic that can
    benefit.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit ed7a7e4
  6. soc/intel_adsp: Clean up MP startup

    The multiprocessor entry code here had some bits that look to have
    been copied from esp32, including a clumsy stack switch that's needed
    there.  But it wasn't actually switching the stack at all, which on
    this device is pointed at the top of HP-SRAM and can stay there until
    the second CPU swaps away into a real thread (this will need to change
    once we support >2 CPUS though).
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 81ed70a
  7. soc/intel_adsp: Clean up cache handling in MP startup

    There's no need to muck with the cache directly as long as we're
    careful about addressing the shared start record through an uncached
    volatile pointer.
    
    Correct a theoretical bug with the initial cache invalidate on the
    second CPU which was actually doing a flush (and thus potentially
    pushing things the boot ROM wrote into RAM now owned by the OS).
    
    Optimize memory layout a bit when using KERNEL_COHERENCE; we don't
    need a full cache line for the start record there as it's already in
    uncached memory.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 962039a
  8. soc/intel_adsp: Use the correct MP stack pointer

    The kernel passes in the CPU's interrupt stack and expects the CPU to
    start on it, so do that.  Pass the initial stack pointer from the SOC
    layer in the variable "z_mp_stack_top" and set it in the assembly
    startup before calling z_mp_entry().
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 12b4bc4
  9. soc/intel_adsp: Put initial stack into the CPU0 interrupt stack

    Zephyr's normal architecture is to do all initialization on the
    interrupt stacks.  The CAVS code was traditionally written to start
    the stack at the end of HP-SRAM, where it has no protection against
    overlap with other uses (e.g. MP startup used the same region for
    stacks and saw cache collisions, and the SOF heap lives in this area
    too).  Put it where Zephyr expects and we'll have fewer surprises.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit efa05d1
  10. arch/xtensa: soc/intel_adsp: Rework MP code entry

    Instead of passing the crt1 _start function as the entry code for
    auxiliary CPUs, use a tiny assembly stub instead which can avoid the
    runtime testing needed to skip the work in _start.  All the crt1 code
    was doing was clearing BSS (which must not happen on a second CPU) and
    setting the stack pointer (which is wrong on the second CPU).
    
    This allows us to clean out the SMP code in crt1.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 811aa96
  11. arch/xtensa: Add an arch-internal README on register windows

    Back when I started work on this stuff, I had a set of notes on
    register windows that slowly evolved into something that looks like
    formal documentation.  There really isn't any overview-style
    documentation of this stuff on the public internet, so it couldn't
    hurt to commit it here for posterity.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 5a47aa0
  12. arch/xtensa: Add non-HAL caching primitives

    The Xtensa L1 cache layer has straightforward semantics accessible via
    single instructions that operate on cache lines by physical address.
    These are very amenable to inlining.
    
    Unfortunately the Xtensa HAL layer requires function calls to do this,
    leading to significant code waste at the calling site, an extra frame
    on the stack and needless runtime instructions for situations where
    the call is over a constant region that could elide the loop.  This is
    made even worse because the HAL library is not built with
    -ffunction-sections, so pulling in even one of these tiny cache
    functions has the effect of importing a 1500-byte object file into the
    link!
    
    Add our own tiny cache layer to include/arch/xtensa/cache.h and use
    that instead.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 882c24d
  13. arch/xtensa: Invalidate bottom of outbound stacks

    Both new thread creation and context switch had the same mistake in
    cache management: the bottom of the stack (the "unused" region between
    the lower memory bound and the live stack pointer) needs to be
    invalidated before we switch, because otherwise any dirty lines we
    might have left over can get flushed out on top of the same thread on
    another CPU that is putting live data there.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 7130571
  14. tests/fifo_api: Move k_fifo off stack

    Putting spinlocks (or things containing them) onto the stack is a
    KERNEL_COHERENCE violation.  This doesn't need to be there so just
    make it static.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 1aab761
  15. arch/xtensa: Remember to spill windows in arch_cohere_stacks()

    When we reach this code in interrupt context, our upper GPRs contain a
    cross-stack call that may still include some registers from the
    interrupted thread.  Those need to go out to memory before we can do
    our cache coherence dance here.
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 1e3b4af
  16. tests/queue: tests/lifo_usage: Address ADSP/coherence issues

    These tests would pass pointers to data on their own stacks to other
    threads, which is forbidden when CONFIG_KERNEL_COHERENCE (because
    stack memory isn't cache-coherent).  Make the variables static.
    
    Also, queue had two sleeps of 2 ticks (having been written in an era
    where that meant "20-30ms"), and on a device with a 50 kHz tick rate
    that's not very much time at all.  It would sometimes fail spuriously
    because the spawned threads didn't consume the queue entries in time.
    How about 10ms of real time instead?
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit 3ead8f1
  17. tests/p4wq: Fix impossible sleep interval

    The code here was written to "get out of the way just long enough for
    the trivial context switch and callback to execute".  But on a machine
    with 50 kHz ticks, that's not reliably enough time and this was
    failing spuriously.  Which would have been a reasonably forgivable
    mistake to make had I not written this code with this very machine in
    mind...
    
    Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
    Andy Ross committed Mar 2, 2021
    commit c780c20