
xtensa: add MPU support #67938

Merged
merged 9 commits into from Mar 20, 2024

Conversation

dcpleung
Member

@dcpleung dcpleung commented Jan 22, 2024

This enables support for MPU on Xtensa.

Note that a SoC config of sample_controller + MPU - s32c1i was used to develop the code, and tested on QEMU.

@dcpleung dcpleung changed the title xtensa: add MPU support for kernel mode xtensa: add MPU support Jan 22, 2024
@dcpleung dcpleung linked an issue Jan 22, 2024 that may be closed by this pull request
@dcpleung dcpleung force-pushed the xtensa/mpu branch 2 times, most recently from 76e71a8 to f66b5a5 Compare January 24, 2024 00:16
@dcpleung dcpleung requested review from ceolin, andyross and nashif and removed request for ceolin and andyross January 30, 2024 23:34
@dcpleung dcpleung marked this pull request as ready for review January 31, 2024 21:05
@@ -255,6 +292,7 @@ xtensa_userspace_enter:
l32i a6, a1, 24
call4 xtensa_swap_update_page_tables

#if XCHAL_HAVE_THREADPTR
Member

Isn't CONFIG_THREAD_LOCAL_STORAGE checking for this condition? I mean, there is no TLS without THREADPTR.

Member Author

This is needed as the #else part inside #ifdef CONFIG_THREAD_LOCAL_STORAGE still needs THREADPTR.

" syscall\n"
" mov %0, a2\n"
: "=r"(ret) : : "a2");

Member

Starting to think that we should always use syscall for this. It would simplify the code a lot to have only one way of doing this. With this change we have three different ways:

  1. Using THREADPTR directly
  2. Using TLS
  3. With syscall

Member Author

The obvious downside is speed. So I guess we can have syscall and TLS, and remove THREADPTR entirely. Though we can do that in future commits if we decide to do so.

Comment on lines +35 to +46

movi a0, PS_RING_MASK
rsr.ps a2
and a2, a2, a0

/* Need to set return to 1 if RING != 0,
* so we won't be leaking which ring we are in
* right now.
*/
beqz a2, _is_user_context_return

movi a2, 1
Member

We could simplify this logic by having this exception handled in the User/Kernel vectors and setting the UM bit properly. I have a prototype doing it, but without the EPC1 trick. This is very clever.

Member Author

I thought about it... but then all the existing code would need to be audited to make sure it works in both non-UM/UM modes. So for now, doing it this way keeps the PR a little bit simpler.

*
* This makes a hole in the MPU entry array so that a new entry can be inserted
* into the one indexed at @a start_idx. The entries [@a start_idx, @a end_idx - 1]
* will be moved to [@a start_idx + 1, @a end_idx].
Member

Is that correct? Isn't it going down, copying end_idx to end_idx - 1 ... resulting in [start_idx, end_idx - 1]?

Member Author

The concept is kind of like a stack, where the lower index is on top. So moving down means copying end_idx - 1 to end_idx (overwriting end_idx). start_idx will then be a duplicate of start_idx + 1.
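Both shift directions boil down to a memmove over a contiguous slice of the entry array. A host-side sketch of the hole-making (shift-up) variant described in the doc comment, where the entry struct is a simplified stand-in for the real register-pair layout:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in; the real entry holds raw as/at register values. */
struct mpu_entry {
	uint32_t start;
	uint32_t attr;
};

/* Make a hole at start_idx: entries [start_idx, end_idx - 1] move to
 * [start_idx + 1, end_idx]. Afterwards entries[start_idx] is a stale
 * duplicate of entries[start_idx + 1] until the caller fills it in.
 */
static void shift_one_up(struct mpu_entry *entries,
			 uint8_t start_idx, uint8_t end_idx)
{
	memmove(&entries[start_idx + 1], &entries[start_idx],
		(size_t)(end_idx - start_idx) * sizeof(entries[0]));
}
```

Because the moved range stays sorted and the hole duplicates its neighbor, the map remains monotonic between the shift and the insertion of the new entry.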

Contributor

@andyross left a comment

Nice. Will need to come back by to do a full review.

One big gotcha that bit me when dealing with the same hardware for mt8195 (which doesn't do full memory_domain support, but does need the MPU to enable caching):

The monotonicity requirement of MPU entries on real hardware is strict. If you have entries that "go backwards" for any reason at an address below that of an active memory access, the hardware will trap. This is true even if none of the overlapping region is relevant to the fetch at all. So the only way to do MPU reprogramming is to walk backwards from the end of the array. It strikes me that it's possible that qemu is less strict, which might hide this. Something to check for sure; took a long time to figure out.

Contributor

@andyross left a comment

Some quick notes. Definitely a little confused by the amount of work being spent on the background map feature, seems sorta needless? Is there hardware coming that has complicated background mappings we want to exploit for efficiency?

* - Background map entries are copied to foreground map to
* ensure correct alignment for hardware... which means it may
* waste some foreground entries if there are no custom regions
* between background entries.
Contributor

FWIW: the "background" feature in the hardware doesn't seem very useful to me. Existing hardware I can see core-isa.h files for never does anything more complicated than e.g. "all of memory is mapped uncached". I'd suggest dropping this code and just using the pre-existing SOC-provided memory map we already have. At most, we should just be validating the map provided by the background vs. what we know from dts/kconfig to be "The Truth" as a check on SOC config.

Member Author

AFAIK, if you cross a background map "boundary", the address needs to appear in the foreground map. In other words, no foreground region can cross the boundaries of any background region. Also, entry[0], if enabled, must be the same as one of the background entries. So I just took the easy and simpler route here of copying the BG entries to FG as a starting point. Having to account for BG entries when adding new regions is going to complicate the code quite a bit... as the logic is already overly complicated, I simply do not want to add to the confusion.

Contributor

To be clear: my suggestion is "ignore the background mapping" and just have a fully-populated mapping in the foreground (which takes no more entries). The requirement of not crossing background region boundaries is AFAICT incorrect, and if it's not a requirement there's little to no value in trying to support the background stuff.

The only reason we'd even want it would be if you had a device with genuinely useful fine-grained background mappings that would allow us to save foreground entries for register blocks or whatever. But that seems not to be the case on the hardware I've seen where "background" just means "something that works OK at boot" (e.g. on mt8195 it's all uncached, which is basically useless).

Member Author

Do you happen to have simulator or hardware setup to test this quickly? I can try it on QEMU and see.

Member Author

QEMU seems to be happy about it. So I removed the copying of background mappings.

* XCHAL_MPU_ALIGN_BITS provided by the toolchain indicates
* that only bits at and left of this value are valid. This
* corresponds to the minimum segment size (MINSEGMENTSIZE)
* defined in the processor configuration.
Contributor

This also seems like needless complication? No hardware exists AFAIK with an MPU page size other than 4k, and IIRC there's language in the databook that implies strongly this is the only possible setup.

Member Author

I am mostly getting the info from the ISA Reference Manual where segment size can be as small as 32 bytes.

Which databook is it? From what I have:

  • The LX6 one does not have an MPU chapter.
  • The NX one mentions that the configurable address granularity is minimally 4KB.
  • On LX7, it depends on whether the MPU Small Page option is enabled in hardware. If not, it is 4KB. If enabled, the segment size depends on fetch width, cache line size, etc.

*
* According to manual, newer processor configurations require that
* the foreground entries must be aligned to the background map,
* in addition to the foreground entry requirements:
Contributor

It is? That's surprising to me, because I'm all but certain I had experimented successfully with things like "single MPU region covering all of memory". I can't find language like that in the ISA manual after a quick glance, but I do note that Figure 28 in section 5.5.3 clearly shows a foreground mapping labeled "jj" that spans the boundary of two background maps ("zz" and "yy"). Pretty sure you don't need this.

Member Author

This is coming from the comments in the HAL. If the HAL implements it in such a way, I guess it is better for us to do it also.

Contributor

Maybe @jcmvbkbc has input? My gut says exactly the opposite: we should implement it per hardware and spec, as the HAL is known to be sort of a mess.

* - Each MPU region is described by TWO entries:
* [entry_a_address, entry_b_address). For contiguous memory regions,
* this should not be much of an issue. However, disjoint memory regions
* "waste" another entry to describe the end of those regions.
Contributor

One more: this waste is significant, and seems straightforward to fix? Just pre-compile your array of boundary entries into a "big enough" array on the stack, then pass over it slurping up duplicate boundaries that define zero-length segments. Cheap and easy, and only has to be done once per change to the domain. Obviously it needs some kind of handling for situations where we try to enable too many segments, but that's part of the userspace API already (and a K_OOPS() of the thread would be satisfactory handling anyway).

Member Author

I don't quite understand what you mean here. For regions of 0x1000 - 0x3000 and 0x3000 - 0x4000, it needs 3 entries. But if you have 0x1000 - 0x2000 and 0x3000 - 0x4000, you will need 4 entries to describe them.

(Well... for other arch's MPUs where you program both start and end addresses in the same entry, both examples only need 2 entries to describe.)

Could you describe in more detail how to pre-compile the array?

Contributor

Your initial pass might emit an array like:

StartAddr  Access
=========  ======
 0x1000     rw-
 0x3000     ---
 0x3000     r-x
 0x4000     ---

Then you just go down the initial array and skip any entry where the following entry has the same address (i.e. it's a definition of a zero-length segment), producing:

StartAddr  Access
=========  ======
 0x1000     rw-
 0x3000     r-x
 0x4000     ---

...which is the optimized form you want. The same process on the disjoint second example wouldn't change the array. The first array can be constructed cheaply on the stack at maximal size (e.g. 64 entries for a 32-register MPU). Now we can have bigger and more complicated mappings as long as many of them are adjacent (which is reasonably common: .text is next to .rodata, kernel stacks next to user stacks, etc.).
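The skip pass described above is cheap to prototype on the host. A sketch (the struct layout and names are hypothetical, not the PR's actual code) using separate read/write indexes so the compaction is done in place in O(n):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical boundary entry: start address plus encoded access
 * rights ("---" modeled as 0).
 */
struct boundary {
	uint32_t start;
	uint32_t access;
};

/* Drop any entry whose successor starts at the same address: such an
 * entry defines a zero-length segment. Compacts in place with separate
 * read/write indexes and returns the new entry count.
 */
static size_t elide_zero_len(struct boundary *b, size_t n)
{
	size_t wr = 0;

	for (size_t rd = 0; rd < n; rd++) {
		if (rd + 1 < n && b[rd + 1].start == b[rd].start) {
			continue; /* zero-length segment, skip it */
		}
		b[wr++] = b[rd];
	}

	return wr;
}
```

Running this over the four-entry example table above drops the zero-length {0x3000, "---"} entry and leaves the three-entry optimized form.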

Member Author

I can amend consolidate_entries to do this.

Member Author

Have amended consolidate_entries so that if two entries have the same address, the lower indexed one is removed.

@dcpleung
Member Author

Nice. Will need to come back by to do a full review.

One big gotcha that bit me when dealing with the same hardware for mt8195 (which doesn't do full memory_domain support, but does need the MPU to enable caching):

The monotonicity requirement of MPU entries on real hardware is strict. If you have entries that "go backwards" for any reason at an address below that of an active memory access, the hardware will trap. This is true even if none of the overlapping region is relevant to the fetch at all. So the only way to do MPU reprogramming is to walk backwards from the end of the array. It strikes me that it's possible that qemu is less strict, which might hide this. Something to check for sure; took a long time to figure out.

QEMU also requires monotonically increasing addresses. So to erase entries, you have to go forward from the start, while you need to program the entries backwards from the end.
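That ordering argument can be demonstrated with a host-side model, where each slot holds just a start address and 0 models a disabled/erased entry (helper names are made up for illustration). Erasing forward to zero and then programming backward keeps the array monotonically non-decreasing at every intermediate step, which is what the hardware requires:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model: a slot is just a start address; 0 means erased. */
static bool map_monotonic(const uint32_t *map, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (map[i] < map[i - 1]) {
			return false;
		}
	}
	return true;
}

/* Erase forward from index 0, then program the new entries backward
 * from the end. The assert checks that the map never "goes backwards"
 * in any intermediate state.
 */
static void reprogram(uint32_t *map, size_t n, const uint32_t *new_map)
{
	for (size_t i = 0; i < n; i++) {
		map[i] = 0; /* erase forward */
		assert(map_monotonic(map, n));
	}
	for (size_t i = n; i-- > 0;) {
		map[i] = new_map[i]; /* program backward */
		assert(map_monotonic(map, n));
	}
}
```

Doing either pass in the opposite direction can create a transient state where a lower-indexed entry has a higher address than its successor, which is exactly the state that traps on real hardware.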

@dcpleung dcpleung force-pushed the xtensa/mpu branch 2 times, most recently from 6fae680 to f7afaf0 Compare February 21, 2024 01:00
@@ -20,6 +22,37 @@
.global xtensa_do_syscall
.align 4
xtensa_do_syscall:
#if XCHAL_HAVE_THREADPTR == 0
Contributor

Dumb question: but which hardware has an MPU but no THREADPTR? This seems a little academic? If it's just the qemu device, maybe we can reconfigure it?

Member Author

For now it is QEMU only. Changing that would require a new configuration, which would have to come from Cadence.

Contributor

@andyross left a comment

More notes after about an 80% readthrough. None fatal, though I will say that the ad-hoc sorting stuff creeps me out (that kind of code is subtle and really hard to get right and maintain).

Also the definition of a Really Big C API to expose all the hardware details of an MPU entry seems like a lot of typing for no value, since all this code actually needs to do in practice is translate to/from the much simpler Zephyr userspace requirements.

#include <xtensa/config/core-isa.h>
#include <xtensa/config/core.h>

bool xtensa_is_user_context(void)
Contributor

Can't this be just a regular __syscall with an impl function that does the right thing? Not really seeing why there's value in hand-coding assembly here except a few cycles of performance, and this is AFAICT just a fallback for weird emulation environments. No one (famous last words) would instantiate hardware with an MPU and no THREADPTR, right?

Member Author

I think this is used frequently enough to warrant special treatment. If this is implemented as a normal syscall, it will need to go through all the level 1 exception code (e.g. crossing stacks, calling into C functions, marshaling of args, etc.).

TBH, it is here to cover our bases just in case such a configuration exists...

Contributor

Ah, but the counterargument is that a regular __syscall is portable and doesn't require special handling in the call0 stuff I'm finally banging on again. :)

Not a big deal, but I reserve the right to include a patch removing the entry code for this with a justification of "it's only for qemu anyway".


if XTENSA_MPU

menuconfig XTENSA_MPU_CUSTOM_MEMFLAGS
Contributor

Just a general whine that this seems like way too much configuration. Most apps aren't going to be touching the defaults here. And where you want something other than default, it will be because of SOC-level memory map details that are better configured in DTS.

Basically: this is 100+ lines of kconfig definitions that almost no one is going to understand or know how to tune except for the original SOC integrators. And they'd be better served by devicetree or C APIs and not kconfig.

This comment was marked as outdated.

Member Author

I removed these and replaced with just a hex kconfig for the default memory type value.

* current Zephyr image. This information must be available and
* needs to be processed upon MPU initialization.
*/
static const struct xtensa_mpu_range mpu_zephyr_ranges[] = {
Contributor

Any reason we couldn't use a k_mem_partition here? I understand this isn't the userspace part of the PR, but there's no good reason we couldn't repurpose the structs, which are almost 1:1. Having a separate type here means that we'll actually have three distinct representations for every region (the MPU entry itself, the xtensa arch thing, and the userspace partition).

Member Author

k_mem_partition doesn't have enough flags/attributes to cover the possible attributes of MPU entries. One example is if someone really wants an execute-only region, which cannot be described with k_mem_partition.

.end = (uintptr_t)__text_region_start,
.access_rights = XTENSA_MPU_ACCESS_P_RX_U_RX,
.memory_flags = XTENSA_MPU_MEMFLAGS_DEFAULT,
.name = "vecbase",
Contributor

Not a big waste in the grand scheme of things, but production apps don't need a "name" field for their memory regions and so shouldn't be responsible for carrying around the memory. Debugging features should generally be optional.

Member Author

Will amend it so it only appears when debugging is enabled. It helps quite a bit when debugging the MPU init code, though.

Member Author

Removed.

* @param end_idx Index of end of region to be moved.
*/
static void mpu_entries_shift_one_up(struct xtensa_mpu_entry *entries,
uint8_t start_idx, uint8_t end_idx)
Contributor

Design nit: not loving the ad-hoc sorting logic here, which seems kinda inefficient and needless. Would IMHO be cleaner and smaller to just sort the array once using qsort(), or keep them in a rbtree, etc... Even doing a one-time O(N^2) extraction sort pass over the ~32 elements seems simpler than this.

Member Author

Let me think about it a bit. Most of this code is really just to find a place to insert a new entry, and to make a hole in the array to put the entry there. Doing it this way also keeps the map valid (almost) all of the time, even between operations.

Member Author

I have replaced it with qsort instead.
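A one-shot qsort() ordering pass, as suggested, might look roughly like this (a sketch only; the entry struct is a simplified stand-in, since the real one packs the address into a register value):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for an MPU entry. */
struct mpu_entry {
	uint32_t start;
};

/* qsort comparator: order entries by ascending start address. */
static int cmp_entry(const void *a, const void *b)
{
	const struct mpu_entry *ea = a;
	const struct mpu_entry *eb = b;

	if (ea->start != eb->start) {
		return ea->start < eb->start ? -1 : 1;
	}
	return 0;
}

/* Sort the whole map once, instead of maintaining sorted order with
 * ad-hoc shifts on every insertion.
 */
static void sort_entries(struct mpu_entry *entries, size_t n)
{
	qsort(entries, n, sizeof(entries[0]), cmp_entry);
}
```

The comparator avoids subtracting the two addresses directly, since a plain `a - b` on 32-bit values can overflow the `int` return type.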

struct xtensa_mpu_entry *check_addr_in_mpu_entries(const struct xtensa_mpu_entry *entries,
size_t num_entries,
uintptr_t addr, bool *exact,
uint8_t *entry_idx)
Contributor

This is another function I don't understand the need for? You can probe the active MPU config with a hardware instruction, we shouldn't need to be reading the table ourselves.

Member Author

??? What if we are manipulating a memory domain for a thread that is not currently running? We cannot probe the hardware MPU config to get the map of a thread that is not the currently running one.

}

if (remove) {
mpu_entries_shift_one_down(entries, new_first, idx);
Contributor

Clumsy O(N^2) way to do the compaction, IMHO. Just keep a "write pointer" and a "read pointer" into the array, which will diverge if you skip a write.

Member Author

The idea is to keep the map valid (almost) all of the time, so that if any operation fails, the map can still be programmed to hardware. This is probably my paranoia here. Any future modification to the code could leave the map invalid (or rather, unknowingly invalid), which would cause an exception in hardware when programmed.

Member Author

Thought about it a bit... but doing it this way is less error prone, as we would otherwise need to replace old entries at the top of the map when needed.

static int mpu_map_region_add(struct xtensa_mpu_map *map,
uintptr_t start_addr, uintptr_t end_addr,
uint32_t access_rights, uint32_t memory_type,
uint8_t *first_idx)
Contributor

This again is looking way too complicated to me. Don't keep the MPU config in sorted order at all, just keep an unordered[1] set of regions/partitions and sort/validate/optimize only once at the end of the process, before you turn it into the nice/small/tight/compact array of MPU entry registers.

[1] Or implicitly ordered in e.g. an rbtree

Contributor

This function still looks complicated, though I do see that it's no longer doing any sorting internally. But it's still doing a lot of work trying to keep the mpu entries in a correctly formatted form at all times, every time you want to modify even a single one of them. Seems like it would be simpler to just keep a list of regions to "compile" to a single array and do it all at once? That would require a failure path in the case where you try to compile an invalid map or whatever. But AFAICT userspace is tolerant of that.

* variable to guard against this (... if done correctly).
* Besides, this will almost always be overridden by the SoC layer.
* So tell GCC to ignore this.
*/
Contributor

I remember this comment from somewhere else too. While this works, dealing with compiler pragmas seems needless. Just declare the symbol extern here and put the __weak definition in another C file where it doesn't get accessed.
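The suggested split might look roughly like this, using the symbol names already in the PR (struct fields abbreviated, `__weak` spelled out as the underlying GCC attribute; in the real tree the two halves would live in a header and a separate C file, shown here in one listing for illustration):

```c
#include <assert.h>
#include <stdint.h>

struct xtensa_mpu_range {
	uintptr_t start;
	uintptr_t end;
};

/* In the header consumed by the arch core: declarations only,
 * no pragmas needed at the use site.
 */
extern const struct xtensa_mpu_range xtensa_soc_mpu_ranges[];
extern const int xtensa_soc_mpu_ranges_num;

/* In a separate C file the arch core never reads directly:
 * overridable empty defaults. SoCs provide strong definitions
 * to replace these.
 */
__attribute__((weak)) const struct xtensa_mpu_range xtensa_soc_mpu_ranges[1];
__attribute__((weak)) const int xtensa_soc_mpu_ranges_num;
```

Since the weak definitions live in their own translation unit, the compiler never sees a definition and a use of the zero-default together, so no diagnostics need to be suppressed.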

Member Author

I turned the whole thing into "required" so no more weak symbol and this is no longer needed.

extern char __data_end[];

__weak const struct xtensa_mpu_range xtensa_soc_mpu_ranges[0];
__weak int xtensa_soc_mpu_ranges_num;
Contributor

Not really my fight to pick, but for sure the Official Way To Manage Memory Maps in Zephyr is now in Devicetree, with a whole schema and everything. This doesn't bug me personally but I guarantee someone (@carlocaione @stephanosio seem like good candidates) is going to hate it.

Member Author

For now, it is simpler to do it this way without cluttering the code with devicetree macros. Using devicetree memory regions is a future enhancement.

I think only the ARM MPU is using devicetree to describe memory regions. The ARM MMU does not seem to do it. The ARC MPU does not do it either. I think none of the MMU code is using devicetree regions either.

Calling z_mrsh_* functions requires 7 arguments, where the 7th is
the stack frame. Only the first 6 arguments are passed via
registers; the 7th must be passed via the stack. However, this
is not being done and an incorrect argument was being passed to
the z_mrsh_* functions as stack frame pointer. An obvious issue
would be dumping of stack during kernel oops, as incorrect data
was being printed or crashes due to inaccessible memory. So fix
it by properly populating the stack with correct stack frame
pointer as outgoing argument for the caller of z_mrsh_*
functions.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
For CPUs without THREADPTR, we need an alternative way to figure
out if we are in user context. This extends the user context
check to do that via a brief syscall.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Both CONFIG_XTENSA_SYSCALL_USE_HELPER and
CONFIG_XTENSA_INSECURE_USERSPACE are also applicable to MPU.
So move them out of the CPU_HAS_MMU block.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
This enables support for MPU on Xtensa. Currently this is
for kernel mode only.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
Add support to test for Xtensa MPU.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
With memory domain enabled, all threads within the same domain
have access to each other's stacks, especially with
CONFIG_ARCH_MEM_DOMAIN_SYNCHRONOUS_API enabled (as that is
the expected behavior). So update the conditions to skip both
the tests that read and write other threads' stacks.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
This extends the Xtensa MPU to support userspace.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
This allows the SoC to have total control over which MPU ranges
are programmed at boot. This overrides the generic ranges
in the architecture core code.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
@nashif nashif requested a review from andyross March 17, 2024 22:19
@andyross
Contributor

The Xtensa MPU code happens to support memory domains and CONFIG_ARCH_MEM_DOMAIN_SYNCHRONOUS_API, which means the tests that read/write other threads' stacks need to be skipped, as these tests expect a fault due to denied access.

This is exactly what I was freaked out about though, and why I thought we needed to do this discussion in a separate bug. But to do it here:

  1. We have userspace, which separates thread stacks by default. This is a security promise.
  2. We have a synchronous variant of the mem_domain API, which is an internal thing having to do with whether or not the hardware needs a hook called at various points in the context lifecycle. This is not a security feature, just an implementation detail.
  3. Turning on MEM_DOMAIN_SYNCHRONOUS_API (for reasons I don't understand, honestly) causes those thread stacks to not be separated anymore.

That just doesn't sound acceptable. Some architectures will be secure but others won't?

I get this isn't related to this PR. But it's more important and for sure has to be fixed first, right? What's the root cause here that is messing with domain assignments for thread stacks?

Contributor

@andyross left a comment

FWIW I'll +1 if someone files a bug on the security issue separately (or explains to me why it's not a security issue).

@dcpleung
Member Author

FWIW I'll +1 if someone files a bug on the security issue separately (or explains to me why it's not a security issue).

Could you file the bug, since you will be the one who can articulate what is wrong in detail?

@andyross
Contributor

Could you file the bug, since you will be the one who can articulate what is wrong in detail?

But I can't. I don't actually understand this, I honestly thought you did. :)

I wrote that list above about how it's a security problem in the expectation that it was at least 60% likely that one of the steps would be a mistake on my part. But sure, I'll take a look and write something up.

Basically the question here is "Why are thread stacks not isolated on architectures with synchronous mem_domain implemented?".

@dcpleung
Member Author

Could you file the bug, since you will be the one who can articulate what is wrong in detail?

But I can't. I don't actually understand this, I honestly thought you did. :)

I wrote that list above about how it's a security problem in the expectation that it was at least 60% likely that one of the steps would be a mistake on my part. But sure, I'll take a look and write something up.

Basically the question here is "Why are thread stacks not isolated on architectures with synchronous mem_domain implemented?".

My intention is more for the community, where you file an issue if you see something wrong, so that we can have a discussion there and make decisions if needed, and so that it can be searched and referenced in the future.

@ceolin (Member) commented Mar 19, 2024

#70457

FYI !

@andyross (Contributor)

OK, so here's how I think this mess happened: It's not a framework security bug, it's a test weirdness and a historical wart of the ARM MPU implementation.

The ztest_test_skip() in question arrived in commit eeab568 where it explains "MMU threads within the same memory domain have access to each other's stacks." Which is unambiguously true about mem_domains as documented: all the threads running in the domain get the same view of memory, ergo they can read each other's stacks. QED.

But I guess the early MPU implementations (realistically probably just ARM -- x86 MMU was the second userspace platform, wasn't it?) had some code to isolate stack memory by default, presumably dating from the days before the mem_domain API was finalized. So these tests would pass (i.e. fail to access cross-thread stack memory) on the early MPU implementations. But that behavior was never promised or documented AFAICT. And of course now you've produced an MPU driver that honors the documented behavior and not "whatever ARM did", so it fails.

Basically: I think this is a historical wart. These tests are exercising undocumented behavior from an early platform variant that isn't portable to modern Zephyr. If there's a bug, it's the converse: "legacy MPUs" disallow access to thread stacks that the documentation requires.

I think the correct fix is to remove the tests entirely. They aren't testing anything but a failure condition. I'll submit a patch.

My only remaining fear is the dependence on CONFIG_ARCH_MEM_DOMAIN_SYNCHRONOUS_API. As far as I can tell, that kconfig has absolutely nothing to do with thread stacks or mem_domain. I think (?) you added it as a proxy for "MPU that doesn't do the weird stuff ARM did", am I right? The actual synchronous API is a driver/platform-level thing.

@andyross (Contributor)

See #70461 for an IMHO cleaner fix that removes the test cases that are "testing" undocumented behavior.

@dcpleung (Member Author)

> My only remaining fear is the dependence on CONFIG_ARCH_MEM_DOMAIN_SYNCHRONOUS_API. As far as I can tell, that kconfig has absolutely nothing to do with thread stacks or mem_domain. I think (?) you added it as a proxy for "MPU that doesn't do the weird stuff ARM did", am I right? The actual synchronous API is a driver/platform-level thing.

Yes, that kconfig is the closest I could find to differentiate the behavior.

@andyross (Contributor)

> Yes, that kconfig is the closest I could find to differentiate the behavior.

Yeah, let's not play games like that. My vote is to remove the tests entirely, obviously. But if you want to leave them in place and kludge around the skip logic, it's probably best to actually code to the specific platforms rather than pretend that an unrelated feature is important.

There's also an alternative where we "bless" the current madness and invent a kconfig like CONFIG_MEM_DOMAIN_THREAD_PROTECTION that the old MPUs would select. But that sounds like a terrible idea to me personally: the worst place for API variance like that is in memory protection schemes.
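For concreteness only, the rejected alternative would amount to a Kconfig symbol along these lines. The symbol name comes from the comment above; the prompt and help text are a sketch written here, not anything proposed in the tree:

```
config MEM_DOMAIN_THREAD_PROTECTION
	bool "Isolate member threads' stacks within a memory domain"
	help
	  Hypothetical sketch: restore the legacy MPU behavior of
	  denying a thread in a memory domain access to other member
	  threads' stacks, instead of giving all members the same view
	  of memory. Shown only to make the rejected option concrete.
```

Making the variance an opt-in symbol like this is exactly the "blessing the madness" the comment warns against: two userspace configurations would then disagree on what a memory domain guarantees.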

@andyross (Contributor) left a comment:

Switch to a +1 (and feel free to assume refresh). If this merges first with the existing patch still present I can always fix up #70461 later.

@nashif (Member) commented Mar 20, 2024

> Switch to a +1 (and feel free to assume refresh). If this merges first with the existing patch still present I can always fix up #70461 later.

let's get this in and deal with tests and documentation as a followup.

@nashif nashif merged commit f716539 into zephyrproject-rtos:main Mar 20, 2024
18 checks passed
@dcpleung dcpleung deleted the xtensa/mpu branch March 20, 2024 17:32
Development

Successfully merging this pull request may close these issues.

Adding Xtensa MPU Support
6 participants