Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xtensa: Enable userspace with Xtensa MMU #61303

Merged
merged 33 commits into from Nov 21, 2023

Conversation

ceolin
Copy link
Member

@ceolin ceolin commented Aug 8, 2023

Implement support for userspace on Xtensa using its MMU.

  • Syscalls are not inline functions like in other architectures because
    some compiler issues when using multiple registers to pass parameters
    to the syscall. So here we have a function call so we can use
    registers as we need.
  • TLS is not supported by xcc in xtensa and reading PS register is
    a privileged instruction. So, we have to use threadptr to know if a
    thread is an user mode thread.
  • Support double memory mapping

For more details look individual commit messages.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very brief once-over, haven't had time to look at all the patches past the first one.

The big showstopper for me is that I'm having a lot of trouble getting past the "entirely separate context switch path for syscalls" problem. We just can't have that, especially since the architecture tried so hard to make it symmetric for us. Surely there's a way to unify this, really it just looks like a stack swap to me you could patch around with maybe an extra SR somewhere.

arch/xtensa/core/syscall_helper.c Show resolved Hide resolved
.pushsection .text.z_xtensa_do_syscall, "ax"
.global z_xtensa_do_syscall
.align 4
z_xtensa_do_syscall:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part I don't understand at all. It looks like you're duplicating all the existing context save code just so you can do a stack swap? But syscalls are just Level 1 exceptions on xtensa, they're supposed to share the same handler path. I don't see why this should be any different than taking any exception or interrupt that needs to return to the context? Or for that matter if you need the stack swap here, surely you also need it on interrupt handling too?

To be clear, I could imagine the opposite tactic being useful: testing at the top of the L1 handler for syscalls so they can take a special fast path that avoids all the spilling and just presents their register state to the kernel handler directly, since they're parameters that will need to be interpreted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some complications in using the same handler, first the whole L1 handler is a bunch of macros that have static parameters, that are not the same for syscall. The L1 handler is composed by an entry code (defined by the macro EXCINT_HANDLER) and a static C handler. EXCIANDLER (entry code) is a macro with static parameter to basically tell which stack to use in the exception.
All exceptions/interruptions switch stack, from the current thread stack to the interruption stack, while on syscalls we have to switch to the current's thread privileged stack, also handling the syscall in a C handler can be quite complicated because of syscall parameters and all callN calls that happened until we get into the C code.

So while it is doable, it needs reasonably changes in these macros and handlers. It is not sure for me that this would make the code cleaner, honestly navigate through all these macro is already confuse enough IMHO.

rsync
l32i a2, a0, ___xtensa_irq_bsa_t_a2_OFFSET
l32i a3, a0, ___xtensa_irq_bsa_t_a3_OFFSET
SPILL_ALL_WINDOWS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm especially confused by the spilling of register windows. Syscalls, like exceptions, return back into the interrupted context unconditionally. We want to leave those registers in place and let them spill naturally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done before switch to privileged stack and the intention is to avoid mixing user/kernel context. Though, we have already used a0-3 in this window ...

include/zephyr/arch/xtensa/xtensa_mmu.h Outdated Show resolved Hide resolved
arch/xtensa/core/syscall_helper.c Outdated Show resolved Hide resolved
arch/xtensa/core/xtensa-asm2-util.S Outdated Show resolved Hide resolved
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More notes. Full disclosure, I've been playing in this space too w/r/t the 8195's MPU hardware, so some of the comments are coming from a perspective of "MMU and USERSPACE aren't quite the same thing and this isn't how I thought it would work at all". :)

arch/xtensa/core/CMakeLists.txt Show resolved Hide resolved
arch/xtensa/core/gen_zsr.py Outdated Show resolved Hide resolved
fixup_out:
#endif
if (is_dblexc) {
__asm__ volatile("wsr.depc %0" : : "r"(0));
Copy link
Contributor

@andyross andyross Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like it builds? Or maybe there's a compiler shorthand I'm not aware of? The RWSR instruction takes a register argument, not an immediate. You need to put a zero in some variable and pass that as the constraint.

That being said: what's the point here? The only reason you write to DEPC is because you're going to use RDE to return from the exception, and if you do that with a NULL in the PC address you'll obviously just fault again, which clearly isn't the intent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler understands that and copy the immediate to a register that is used in the instruction. e.g

  0xa00362bb <main+3>     movi.n  a2, 0
  0xa00362bd <main+5>     wsr.depc        a2

A double exception can happen because of a page fault during an exception. In the double exception handler we try to load the page from the address that caused the fault and then we rfde. But there are certain code that are allowed to fault, e,.g arch_string_user_nlen, in these cases the double exception is handled as any other exception and depc is cleared out here (and we don't rfde). That happens for the next exception/interruption we need to handle we know that it was not a double exception.

arch/xtensa/core/fatal.c Outdated Show resolved Hide resolved
@@ -87,6 +87,8 @@ char *z_xtensa_exccause(unsigned int cause_code)
case 63:
/* i.e. z_except_reason */
return "zephyr exception";
case 64:
return "kernel oops";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... obviously this isn't an architectural value for EXCCAUSE. Can we make it something different, like 0xffffff00 or whatever to make sure they don't collide? Also, somewhere else in the PR I saw that the dump handling was extended to handle a double exception, which also seems like a good candidate for a "special Zephyr cause value", but I can't find the patch again at the moment.

* i.e. vecbase, so that we need to reserve those
* space or else k_mem_map() would be mapping those,
* resulting in faults.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the wrong spot for this. "Area between CONFIG_SRAM_BASE_ADDRESS and z_mapped_start is reserved" is a fine rule, but if it's a rule it's a rule at the SOC layer somewhere and not the arch, at least as we're doing linking currently. Maybe delegate to a z_soc_reserve_pages() or something and put this code there, where you can be more assertive than "linker scripts usually put" (e.g. "On TGL/ACE the vectors and iram section live below SRAM, reserve them", along with your own linker-emitted addresses on exactly the boundary).

Note that the mtk_adsp setup (which to be fair doesn't have an MMU) links most of its stuff to a DRAM region and not above SRAM_BASE_ADDRESS, puts vecbase at an arbitrary address (or will in the next version). There's likely to be a lot of variation.

* so we just invalidate the cache for all L1 page tables.
*/
sys_cache_data_invd_range((void *)l1_page_table, sizeof(l1_page_table));
sys_cache_data_invd_range((void *)l2_page_tables, sizeof(l2_page_tables));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does TLB data live in the dcache? Pretty sure the hardware doesn't populate the cache on TLB fetches (because, of course, the TLB is the cache for fetches of translation data). Or is the idea here that the local CPU might have unflushed data from having changed the page tables locally? It might, but if it does that's a race anyway; it's about to IPI in the other direction and someone has to lose.

The only reason I ask is because that invalidation is actually fairly expensive, it has to loop over every line in the page tables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called from IPI when another CPU changes the page table contents.

{
}

void z_xtensa_mmu_tlb_shootdown(void)
Copy link
Contributor

@andyross andyross Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned this in the more specific comment below, but honestly our implementation of shootdown on x86 has always felt racy to me. I know this has been a perenial source of bugs on Linux too. I know (edit: sorry) don't know for sure that this is wrong, it's just too hard for that to fit in my brain at once. Maybe a cleaner architecture would be to e.g. refcount the memory domain in the scheduler such that we can know that there's another CPU running on these page tables, and if there is come up with some lockout protocol via an IPI such that we know both threads are suspended/preempted while the operation happens?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable, and we can look try this out, I just want to avoid this change now, this code is being used / tested for some time and it is consistently working including stress tests. My proposal is to iterate over this later and try to align other platforms, as x86, too.

/* Skip TLB IPI when updating page tables.
* This allows us to send IPI only after the last
* changes of a series.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replying to this patch as a whole:

When adding a thread to a memory domain, we need to also update
the mapped page table if it is the current running thread on
the same CPU. If it's not on the same CPU, we need to notify
the other CPUs in case the thread is running in one of them.

Wait, why? Surely we don't asychronously add running threads to a domain, no? That seems crazy to me. Either you're adding a suspended thread, or you're adding _current. No way should it ever be legal to try to swap the entire memory map of a thread while it's spinning on another CPU. No OS does that. Do we actually have a test that tries this? Can we just say no and fix the API?

@@ -244,8 +244,7 @@ static inline uint32_t *alloc_l2_table(void)
static ALWAYS_INLINE void switch_page_tables(uint32_t *ptables, bool dtlb_inv, bool cache_inv)
{
if (cache_inv) {
sys_cache_data_invd_range((void *)ptables, XTENSA_L1_PAGE_TABLE_SIZE);
sys_cache_data_invd_range((void *)l2_page_tables, sizeof(l2_page_tables));
sys_cache_data_flush_and_invd_all();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't mind the "flush cache on page table change" as an implementation choice. But I don't understand the commit message about way size aliasing with pages? You can't have data from alien memory maps in the cache, period. It's just a correctness thing, has nothing to do with what the address bits mean.

That said: I believe Xtensa has an ASID field in the TLB to allow the OS to elide exactly this flush. Would be very good to support that instead. But flush-the-world is fine as an intermediate step.

But can we clarify the commit message, because I'm really kinda confused.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many things here, let me try to answer one by one:

  • Regarding ASID, the way it works is, we set RASID with ASID for all four rings. When the hardware populate the TLB, it gets the RING from the page and then looks the correspondent ASID (from RASID) and use it on TLB. That is one reason to flush the cache for a page, otherwise the cached page could have a wrong RING/value/permission. Only flushing the TLB does not solve the problem because the TLB could be re-populated with stale data. That does not explain the aliasing issue though.

  • About the aliasing problem, that was trick and I noticed it only later. The HW that I am using has 16kb in each way. So we need 14 bits to address everything, while the page size we are using is 4kb, it means that the page offset is 12 bits. In this situation, a physical memory mapped in two different virtual address may end up in two different cache entries. So, why this is a problem here, if we basically do 1:1 map and since we invalidating the TLB , a miss there would make it read from the physical memory ? The answer is:

    • The page table itself leaves in the virtual memory, since it has to be a contiguous 4MB block, we make a 2 level page table that leaves in the virtual memory and only few pages are really mapped. So, because of this the physical memory for the page table is mapped twice, the correspondent virtual address to its physical address ans this arbitrary virtual address of the page tables (that is the one used by the MMU). This means that a page table entry may have two entries in the cache.
    • Now why flushing the whole TLB does not solve the problem ? The problem here is that the MMU has auto-refill, so, when we have a TLB miss the hw automatically (without any exception) try to load the right page and if this page is on cache it gets from there and remember, a page entry may be mapped in two different virtual address and consequently have two entries in the cache. If one of them is stale we may have a problem

That's why we have cache aliasing and it is a problem. Obviously if the target has a smaller cache or uses a page size bigger than the cache size this flush is not needed ... We can make it build option, but without other uses for now, this additional complexity does not make much sense. The other option is having it more elaborated coloring the cache ... which I am thinking to do later.

One more thing, I tried to manually track possible duplicated entries in the cache but I end up with a lot of checks and conditions and occasional bugs :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused: I'm gathering (and maybe I'm wrong?) that you're trying to handle the case where you have a single physical page of entries in your page table itself mapped into multiple locations in the table!? Yikes? I mean I guess that probably works from a hardware perspective, but just... why?! Just duplicate the entries, it's only a tiny overhead.

Alternatively: why is the page table living in the dcache anyway? By definition this is read (on performance paths) only by the hardware itself, and the hardware has an entirely separate cache (the TLB) to optimize repeated reads. Just map the page tables uncached and this whole problem goes away, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine you have a page frame that is located in phys mem 0x00000000 and we set the ptevaddr to 0x20000000. This page frame will be mapped in two virtual addresses, 0x00000000 and 0x20000000. Note that the virtual address for the whole page table is mapped (0x20000000 ... 0x20400000).

So, while removing cache for that range and just having the cached for the 1:1 would probably avoid the need for flushing the whole cache, there is no restriction (AFAIK) to avoid someone from mapping a physical address in different virtual addresses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This page frame will be mapped in two virtual addresses

No? Only if you have them both in the page table (like, you want the page tables in userspace mappings to have a different address than their physical location --- why?!). I'm still not understand the scheme here, but it really seems like it's needlessly overcomplicated.

And again: Just Don't Touch Page Tables Through the Data Cache (I'm like 70% certain this is the rule in Linux too, FWIW). There's near-zero value and almost no complication involved in making that change. IIRC you can't even have caching enabled until the MMU is active, right? Just leave the physical mappings uncached.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(like, you want the page tables in userspace mappings to have a different address than their physical location --- why?!)

Actually this comment is wrong. Page tables are inherently virtual in Xtensa so you can have holes & whatnot. But the point about caching remains the same. These just don't belong in the dcache, period.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This possibly solve the page table, but there is nothing forbidden mapping a physical address in multiple virtual address that is the root cause of the problem and why the flush is being done.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking on the syscall entry problem, and I think we actually have a serious problem here. The handler here is doing SPILL_ALL_WINDOWS to write out the userspace window state, which I had fixated on as an inneficiency vs. the CROSS_STACK_CALL that the standard handler does (which leaves the window state in place and arranges the calls to spill naturally via exceptions during the handler).

But... those are BOTH very wrong when you have an untrusted userspace! The window spills work by writing the various 4-register quads into their spill locations found via offsets against the stack pointers also found in those registers. Which is to say, any exception, interrupt or syscall is an instruction to the kernel to write arbitrary userspace-thread-controlled data to arbitrary addresses. Needless to say, we can't do that.

I think I have a few ideas here: for the syscall case, it's actually not so bad because the process can just spill its own registers using the exceptions (although it can't do it with the assembly we use for SPILL_ALL_WINDOWS: ROTW is a priviledged instruction, we'd have to arrange to do it with callx12 et. al.) and then wipe the WINDOWSTART bits from kernel mode knowing that the registers are clear.

The interrupt/exception case is hardest, since we can't rely on that. We'd need to walk the registers manually and validate all the stack pointers ourselves, by testing directly against the page tables (or MPU table) I guess, or maybe just validating against the thread's stack extents if we want to cheat (although that would break if we have userspace wanting to do things like green threads).

Which... brings up another idea I've had bouncing around in my head for a long time: what if we just give up and build using the CALL0 ABI (i.e. just use a standard 16 register set and don't do any window nonsense) instead? It eliminates this problem entirely. And in fact there are significant advantages, because you can use the extra registers for e.g. interrupted frames (i.e. just ROTW on interrupt/exception entry and you don't need to do any context switch!) or as "cached" context switch handles (i.e. the last thread or two would still be "stored" in registers waiting to resume).

As mentioned, I have a MPU rig more or less waiting to play with, so I think we can parallelize as I don't have MTL hardware (though I do have the toolchain -- do you have a recipe for running this against xt-sim[1]?)

(Sticking a -1 on this just to prevent anyone from merging it accidentally while we work out the security design.)

[1] Another digression is that we don't have a qemu emulator working for either the MPU or MMU. Has anyone looked at that? It would be pretty annoying if the only emulation environment for this code was either hardware or a proprietary simulator. We really need CI validation of userspace if we want to ship it.

@andyross
Copy link
Contributor

andyross commented Aug 9, 2023

FWIW: Cadence seems to have plugged a few of these holes with XEA3, frankly I think they got ahead of themselves with the memory protection design and had to rework stuff after the fact. But that's still a futures kind of thing for us, all Zephyr hardware planned is still XEA2 AFAIK. (It might be worth checking Linux to see what that port does. IIRC it's doing the full register spill with non-HAL code already, we could probably use that for a starting point.)

@ceolin
Copy link
Member Author

ceolin commented Aug 11, 2023

I've been thinking on the syscall entry problem, and I think we actually have a serious problem here. The handler here is doing SPILL_ALL_WINDOWS to write out the userspace window state, which I had fixated on as an inneficiency vs. the CROSS_STACK_CALL that the standard handler does (which leaves the window state in place and arranges the calls to spill naturally via exceptions during the handler).

Yep, we can't let window exception happens with different contents from different contexts.

But... those are BOTH very wrong when you have an untrusted userspace! The window spills work by writing the various 4-register quads into their spill locations found via offsets against the stack pointers also found in those registers. Which is to say, any exception, interrupt or syscall is an instruction to the kernel to write arbitrary userspace-thread-controlled data to arbitrary addresses. Needless to say, we can't do that.

I think I have a few ideas here: for the syscall case, it's actually not so bad because the process can just spill its own registers using the exceptions (although it can't do it with the assembly we use for SPILL_ALL_WINDOWS: ROTW is a priviledged instruction, we'd have to arrange to do it with callx12 et. al.) and then wipe the WINDOWSTART bits from kernel mode knowing that the registers are clear.

The interrupt/exception case is hardest, since we can't rely on that. We'd need to walk the registers manually and validate all the stack pointers ourselves, by testing directly against the page tables (or MPU table) I guess, or maybe just validating against the thread's stack extents if we want to cheat (although that would break if we have userspace wanting to do things like green threads).

Which... brings up another idea I've had bouncing around in my head for a long time: what if we just give up and build using the CALL0 ABI (i.e. just use a standard 16 register set and don't do any window nonsense) instead? It eliminates this problem entirely. And in fact there are significant advantages, because you can use the extra registers for e.g. interrupted frames (i.e. just ROTW on interrupt/exception entry and you don't need to do any context switch!) or as "cached" context switch handles (i.e. the last thread or two would still be "stored" in registers waiting to resume).

Entirely remove window abi and using call0 abi everywhere ? Even ask compiler to follow this ABI ?

As mentioned, I have a MPU rig more or less waiting to play with, so I think we can parallelize as I don't have MTL hardware (though I do have the toolchain -- do you have a recipe for running this against xt-sim[1]?)

I was using an emulator built on top of xt-sim, so it is possible. But there are things that it does not emulates well, the cache aliasing issue just happened in the hw.

(Sticking a -1 on this just to prevent anyone from merging it accidentally while we work out the security design.)

[1] Another digression is that we don't have a qemu emulator working for either the MPU or MMU. Has anyone looked at that? It would be pretty annoying if the only emulation environment for this code was either hardware or a proprietary simulator. We really need CI validation of userspace if we want to ship it.

Qemu MMU implementation is quite limited (I don't remember all limitations, but auto-refill is one thing that is not there)

@jcmvbkbc
Copy link
Contributor

Qemu MMU implementation is quite limited (I don't remember all limitations, but auto-refill is one thing that is not there)

@ceolin This is not true. QEMU MMU implementation is rather full, and specifically auto-refill is there, otherwise Xtensa linux would not work in QEMU. Please let me know if you have more questions regarding xtensa support in QEMU.

@jcmvbkbc
Copy link
Contributor

I was using an emulator built on top of xt-sim, so it is possible. But there are things that it does not emulates well, the cache aliasing issue just happened in the hw.

@ceolin xt-sim is capable of precise hardware emulation, just run it without the --turbo option.

@ceolin
Copy link
Member Author

ceolin commented Aug 12, 2023

Qemu MMU implementation is quite limited (I don't remember all limitations, but auto-refill is one thing that is not there)

@ceolin This is not true. QEMU MMU implementation is rather full, and specifically auto-refill is there, otherwise Xtensa linux would not work in QEMU. Please let me know if you have more questions regarding xtensa support in QEMU.

@jcmvbkbc I meant the sample controller target (the one we have).

File: zephyr/soc/sample_controller/xtensa/config/core-isa.h
 617  24 #define XCHAL_HAVE_PTP_MMU		0	/* full MMU (with page table
 619  24 						   usable for an MMU-based OS */
 623  15 #define XCHAL_MMU_ASID_BITS		0	/* number of bits in ASIDs */
 624  15 #define XCHAL_MMU_RINGS			1	/* number of rings (1..4) */
 625  15 #define XCHAL_MMU_RING_BITS		0	/* num of bits in RING field */

Please if there is another target that we can use I would be more than happy to check.

@ceolin
Copy link
Member Author

ceolin commented Aug 12, 2023

I was using an emulator built on top of xt-sim, so it is possible. But there are things that it does not emulates well, the cache aliasing issue just happened in the hw.

@ceolin xt-sim is capable of precise hardware emulation, just run it without the --turbo option.

Good to know, will try it out. It was really annoying figure out this problem late.

@jcmvbkbc
Copy link
Contributor

I meant the sample controller target (the one we have).

That core does not have the MMU feature configured, it only has the region protection.

if there is another target that we can use I would be more than happy to check.

dc233c has been the default MMU-capable core for the xtensa linux, the configuration overlay for it is available e.g. here.

@ceolin
Copy link
Member Author

ceolin commented Aug 15, 2023

Hi @andyross, I've addressed some comments and replied for many others, could you take another look please ? Specially the comment regarding re-using the same L1 handler.

@andyross
Copy link
Contributor

dc233c has been the default MMU-capable core for the xtensa linux,

OK, excellent. For sure we want to get that integrated to the toolchain and a platform added for it. FWIW, for anyone interested in picking that up: note that we don't want to eliminate sample_controller, as that's a good exerciser for the XCHAL_NUM_AREGS=32 case, which is otherwise a somewhat obscure configuration (though NXP and one of the AMD DSPs do that, per my local collection of core-isa.h files).

Though the more I think about this the more I really would like to see numbers on the CALL0 idea if I can find some way to get them cleanly. Register windows only really help in the "deep leafy" call tree case, but all the performance cases we're looking at for DSP work tend to be flat compute work in single stack frames with lots of inlining and[1] lots of use of SIMD registers that notably don't even have windowing![2].

Having a "userspace needs CALL0" rule really seems like not so much a hardship to me...

[1] In a handful of backend cases, though frustratingly not in SOF itself yet due to issues with toolchain availability.

[2] i.e. what's the point of having clean lazy spill/fill of your GPRs when most of your data is in SIMD registers that the ABI is going to make you spill anyway

@andyross
Copy link
Contributor

Coming back. I... really don't think we can merge this with a giant security hole. That's just not being responsible. I know (or think) we still have some weasel wording in the userspace docs about not trusting it as a security boundary, so maybe there's some flexibility there. But we all know that's not how people interpret it and for sure that's not where this particular userspace implementation is aimed.

And, less firmly, we really "shouldn't" merge it without a mechanism for exercising the code in community work and in particular in the project CI. We know this is going to need significant rework (not the TLB handling per se, but all the entry/exit/context code which involves the MMU), and that becomes impossible if the only people who can do that are working on pre-production hardware for one vendor[1]. I mean, consider that this merges and then I start work on CALL0 arch variant and/or a full-window-spill context layer (something I've kinda/sorta already started); now I really can't without breaking things, because I can't run the code I need to modify.

Bottom line: this just isn't finished and can't be used in the form it's in right now. Let's hold off and get it right before merge.

As a compromise, maybe we can prune out a utility layer to the TLB/MMU handling such that you can exercise it with the mem_domain/partition code, but leave all the thread lifecycle stuff unchanged? That actually might make for good cleanup regardless.

[1] Though FWIW: I have a MTL board on the way now for unrelated work, so for me personally this looks like it may resolve itself. But the principle remains the same: we're an open source project and shouldn't be rushing to merge stuff from hardware vendors that can't be exercised in the community.

@nashif
Copy link
Member

nashif commented Aug 17, 2023

dc233c has been the default MMU-capable core for the xtensa linux,

OK, excellent. For sure we want to get that integrated to the toolchain and a platform added for it. FWIW, for anyone interested in picking that up: note that we don't want to eliminate sample_controller, as that's a good exerciser for the XCHAL_NUM_AREGS=32 case, which is otherwise a somewhat obscure configuration (though NXP and one of the AMD DSPs do that, per my local collection of core-isa.h files).

yes, we need a new target with that, leave everything else as it is and introduce qemu_xtensa_mmu or whatever.

And, less firmly, we really "shouldn't" merge it without a mechanism for exercising the code in community work and in particular in the project CI.

We did introduce ARC userspace support in the past without CI (f81dee0), actually the architecture was maintained with any means to do anything in CI until Qemu was added later. This comes with its own issues and we had a few of those because of lack of any CI, however, we currently have Xtensa MMU in the tree, also without CI, userspace is on top of that and userspace itself and in general is covered by other architectures. Doing this will need to come with a commitment to run some CI, somewhere and address issues similar to how we do anything else without CI coverage (and we have plenty of HW features not covered in CI).

[1] Though FWIW: I have a MTL board on the way now for unrelated work, so for me personally this looks like it may resolve itself. But the principle remains the same: we're an open source project and shouldn't be rushing to merge stuff from hardware vendors that can't be exercised in the community.

Not sure his will help, MTL ADAP HW does not have MMU support enabled.

@jcmvbkbc
Copy link
Contributor

jcmvbkbc commented Aug 17, 2023

we don't want to eliminate sample_controller, as that's a good exerciser for the XCHAL_NUM_AREGS=32 case

dc233c also has 32 windowed registers.

I really would like to see numbers on the CALL0 idea if I can find some way to get them cleanly.

FWIW a while ago I've measured linux kernel boot times (specifically timestamp of the message "Run /sbin/init as init process" that marks start of the userspace) on esp32s3 for the kernels built with windowed and call0 ABIs. Got 3.164921s for the call0 and 2.798135s for the windowed ABI, all other settings being the same.

Register windows only really help in the "deep leafy" call tree case,

My observation is that generally windowed ABI generates smaller code (above mentioned kernels were 1803172 bytes for call0 and 1595820 bytes for windowed ABI, that's more than 10% difference).

@andyross
Copy link
Contributor

andyross commented Aug 17, 2023

[Max's numbers showing CALL0 Linux builds having a 13% penalty on initialization time and 10% on code size]

So... actually I'm thinking that doesn't look so bad at all? :)

I mean, kernel boot is basically pessimal for CALL0: it's a huge tree of rapid calls being done all at once all across every subsystem and driver in the image. By definition, it's not spending time doing "compute", it's not touching HiFi registers at all, it's not in hot loops, it's showing very poor locality of reference. And... just 13%!

Did you get numbers on syscall performance or interrupt latency? That's where a CALL0 scheme is going to shine. (I'll straight up say that I'm salivating at the idea of interrupt entry just doing the equivalent of "rotw 4; mov a1, _isr_stack; call0 _isr_handler_in_c"!).

And FWIW, the code size issue is surely real, but in a uC/DSP environment it needs to be weighed against the much smaller context code in the kernel needed to support it. We're spending like 2kb or something like that on this stuff (c.f. this PR that had to fork off an entirely separate entry path) that would shrink very significantly.

Also we'd almost certainly be deploying CALL0 only on userspace-enabled apps, where we're already committing to significant excess memory usage due to all the page alignment. 10% on the .text segment isn't that big a deal in comparison.

l32i a2, a0, ___xtensa_irq_bsa_t_a2_OFFSET
l32i a3, a0, ___xtensa_irq_bsa_t_a3_OFFSET
SPILL_ALL_WINDOWS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to clarify: this bit right here just can't merge. We can't do ABI-defined window spills on behalf of unprivileged code, period.[1] Now, it's 100% true that that's not really the fault of this PR, our existing entry code has always done it this way because the author didn't think about how that was going to translate to privilege boundaries, and that's on me and not you. But still, there's just no way to salvage this particular entry path; whatever we use can't look like this. We can't merge it.

[1] The spill has to be cooked through an interpretation layer of some form that prevents following the untrusted pointers. Design note digression: In practice I'm basically convinced that the only solid way to do this is to unconditionally spill all 32/64 registers, maybe with some WINDOWSTART logic to elide quads that aren't showing as part of ABI frames (though that optimization would then break "creative" userspace users that want to do non-ABI stuff like green threads or coroutines, etc...). But the tests are almost as expensive as the spills, so not sure that's worth it. Though there are two ways to do this: interrupts need to do the spill in kernel mode, but syscalls can be optimized to do it in userspace before entering the kernel, which can be done with exceptions and thus automatically elide spills for unused frames. Then the kernel can just clobber the registers knowing that userspace promised it doesn't care. And of course, on the way OUT of the kernel we need to write them all with something (doesn't matter what) to avoid leaking stale kernel-stored values in those registers.

@nashif
Copy link
Member

nashif commented Aug 17, 2023

we don't want to eliminate sample_controller, as that's a good exerciser for the XCHAL_NUM_AREGS=32 case

dc233c also has 32 windowed registers.

FWIW, someone is already looking into enabling this with the right toolchain. Let's see how this will work out.

dcpleung and others added 24 commits November 20, 2023 09:10
There are known exceptions which are not fatal, and we need to
handle them properly by returning to the fixup addresses as
indicated. This adds the code necessary in the exception
handler for this situation.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
This extracts the printing of fatal exception information into
its own function to declutter xtensa_excint1_c().

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
This function is needed by userspace.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
Trigger exception on Xtensa requires kernel privileges. Add
a new syscall that is used when ARCH_EXCEPT is invoked from userspace.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
When kernel OOPS is raised, we need to actually go through
the process of terminating the offending thread, instead of
simply printing the stack and continue running. This change
employs similar mechanism to xtensa_arch_except() to use
illegal instruction to raise hardware exception, and going
through the fatal exception path.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
After changing content of page table(s), it is needed to notify
the other CPUs that the page table(s) have been changed so they
can do the necessary steps to use the updated version. Note that
the actual way to send IPI is SoC specific as Xtensa does not
have a common way to do this at the moment.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
When adding a thread to a memory domain, we need to also update
the mapped page table if it is the current running thread on
the same CPU. If it's not on the same CPU, we need to notify
the other CPUs in case the thread is running in one of them.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Signed-off-by: Anas Nashif <anas.nashif@intel.com>
Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
Swap page tables at exit of exception handler if we are going to
be restored to another thread context. Or else we would be using
the outgoing thread's page tables which is not going to work
correctly due to mapping and permissions.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
This test requires architecture specific code to work.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
Signed-off-by: Daniel Leung <daniel.leung@intel.com>
This enables tests/kernel/mem_protect/mem_protect to be tested
on Xtensa.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
When the target has a cache way size (cache size / cache wasy) bigger
than the page size we have cache aliasing, since the number of bits
required by the cache index is bigger than the number of bits in the page
offset.

To avoid this problem we flush the whole cache on context switch or when
the current page table is changed.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
We need to use the mmu spin lock when invalidating the cache during
tlb shootdown, otherwise it is possible that this happens when another
thread is updating the page tables.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
Signed-off-by: Anas Nashif <anas.nashif@intel.com>
Do not map the heap area by default if we are not using
heap at all.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Consolidate all syscall helpers into one functions.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
During arch_kernel_init(), the interrupt stack is being
initialized. However, if the current in-use stack is
the interrupt stack, it would wipe all the data up to
that point in stack, and might result in crash. So skip
initializing the interrupt stack if the current stack
pointer is within the boundary of interrupt stack.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Only clear the user stack to 0xAA if CONFIG_INIT_STACKS is
enabled. Otherwise, write 0x00 as if the stack is in BSS.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
This massages kconfig and linker script to enable userspace
support on dc233c core.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Use thread local storage to check whether or not a thread is running
in user mode. This allows to use threadptr to properly support tls.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
Add a Kconfig option to tell whether or not using thread
local storage to store current thread.

The function using it can be called from ISR and using
TLS variables in this context may (should ???) not be
allowed

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
Xtensa clears out threadptr durint ISR when userspace
is enabled.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
Add a Kconfig option (and build warning) alerting about the problem
of the kernel spilling register in behave of the userspace.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
This file doesn't need the asm2 header, and the preprocessor logic
around whether to include the backtrace header is needless (all it
does is declare functions).

Signed-off-by: Andy Ross <andyross@google.com>
Initial work to split page table manipulation from
mmu hardware interaction.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
Andy Ross re-implementation of MMU layer with some subtle changes,
like re-using existent macros, fix page table cache property when
direct mapping it in TLB.

From Andy's original commit message:

This is a reworked MMU layer, sitting cleanly below the page table
handling in the OS.  Notable differences from the original work:

+ Significantly smaller code and simpler API (just three functions to
  be called from the OS/userspace/ptable layer).

+ Big README-MMU document containing my learnings over the process, so
  hopefully fewer people need to go through this in the future.

+ No TLB flushing needed.  Clean separation of ASIDs, just requires
  that the upper levels match the ASID to the L1 page table page
  consistently.

+ Vector mapping is done with a 4k page and not a 4M page, leading to
  much more flexibility with hardware memory layout.  The original
  scheme required that the 4M region containing vecbase be mapped
  virtually to a location other than the hardware address, which makes
  confusing linkage with call0 and difficult initialization
  constraints where the exception vectors run at different addresses
  before and after MMU setup (effectively forcing them to be PIC
  code).

+ More provably correct initialization, all MMU changes happen in a
  single asm block with no memory accesses which would generate a
  refill.

Signed-off-by: Andy Ross <andyross@google.com>
Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
@ceolin
Copy link
Member Author

ceolin commented Nov 20, 2023

Fixed rebase conflict only @dcpleung

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This integration looks good to me, thanks. I still worry a little about commiting the asm2 entry code when we know it's going to need a rewrite, but it's pretty well-isolated and easy to remove.

@carlescufi carlescufi merged commit c47880a into zephyrproject-rtos:main Nov 21, 2023
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants