X86 optimize context switch #17788

andrewboie · 2019-07-26T01:25:19Z

The PR does the following:

Major changes:

x86 now uses per-thread page tables, with a simple CR3 update on context switch or when trampolining to/from user mode
memory domain architecture interface no longer makes decisions on whether the arch code needs to be notified, necessary when domain configuration is managed per-thread instead of common data or registers

Minor changes:

thread->stack_obj is now set when z_new_thread() is called
stack overflow guard pages on x86 are now read-only instead of non-present, making debugging easier
added debug functions for dumping page tables at runtime
added inline functions to get/set active page tables on x86
fatal exceptions on x86 now report CR3 register (which points to the active page tables)
memory domain documentation clarification
some test case adjustments

Please read individual commit messages for more details.

This patch leaves the build-time page table generation alone, although this new code doesn't care how the master page tables were created and when we switch to runtime generated page tables at boot, this code can remain the same.

Addresses #15135 on x86
Fixes: #13003
Contains groundwork necessary for: #13441 #13074 #15223

zephyrbot · 2019-07-26T01:31:18Z

All checks are passing now.

Review history of this comment for details about previous failed status.
Note that some checks might have not completed yet.

arch/x86/core/ia32/x86_mmu.c

include/arch/x86/ia32/arch.h

arch/x86/Kconfig

arch/x86/core/ia32/thread.c

arch/x86/core/ia32/swap.S

dbkinder

No doc changes to review

andrewboie · 2019-08-02T00:49:43Z

@ioannisg @vonhust @ruuddw any objections to the two patches which touch the mem domain API and documentation changes? they are fairly trivial, mostly just moved the current thread check into the arch implementations

The relevant patches in this PR are

"userspace: adjust arch memory domain interface"
"userspace: clarify k_mem_partition_add()"

ioannisg · 2019-08-02T06:04:06Z

@ioannisg @vonhust @ruuddw any objections to the two patches which touch the mem domain API and documentation changes? they are fairly trivial, mostly just moved the current thread check into the arch implementations

The relevant patches in this PR are
* "userspace: adjust arch memory domain interface"

* "userspace: clarify k_mem_partition_add()"

@andrewboie sorry for being late on this.
I had a look in the mem_domain related changes, and they reflect your intention of moving the decisions to arch-specific code. So I think they are fine.

So, AFAICS, it is the arch-specific code which decides to re-program MPU (ARC, ARM), if the functions are called for the current thread. Does this mean that this principle is no longer needed, generally, e.g. for other architectures, such as x86?

In that case, I believe we lack some documentation in arm_core_mpu.c, arc_core_mpu.c, or in the corresponding API header, that clarifies why these checks are required. IMHO, maybe some notes on the function docs would suffice...

andrewboie · 2019-08-02T07:30:39Z

So, AFAICS, it is the arch-specific code which decides to re-program MPU (ARC, ARM), if the functions are called for the current thread. Does this mean that this principle is no longer needed, generally, e.g. for other architectures, such as x86?

I'm not quite sure what you mean, it depends on the arch code. On x86 it's any thread because there is arch-specific state that has to be managed for any thread that was in the domain being modified (in this case the per-thread pagetables)

So if I destroy a memory domain, I need to update all the per-thread page tables for the threads that were running in it, because they are no longer in a memory domain.

The key thing about per-thread state management is it removes having to compute anything at context switch, at the expense of more overhead when the memory domain is configured.

In that case, I believe we lack some documentation in arm_core_mpu.c, arc_core_mpu.c, or in the corresponding API header, that clarifies why these checks are required. IMHO, maybe some notes on the function docs would suffice...

The checks are required because the arch code was assuming that any changes were for the active thread and the MPU hardware gets immediately reprogrammed.

For 2.1 I'm prioritizing #13074 for at least ARMv7 MPU and these will get rewritten

include/kernel.h

ruuddw · 2019-08-02T07:40:38Z

Moving the check for 'MPU HW' reprogramming or not to the arch specific code is ok for me if it helps other arches. Not sure yet if it would help or be required for #13074 - managing such a 'TLB' would preferably be generic and not arch specific code. I can imagine a 3 layer approach instead of current 2 layers as well: logical/api mem domain management -> logical to physical mapping (and lazy programming/TLB management) -> physical MPU programming (as in current arch specific code).

andrewboie · 2019-08-02T18:36:19Z

Moving the check for 'MPU HW' reprogramming or not to the arch specific code is ok for me if it helps other arches. Not sure yet if it would help or be required for #13074 - managing such a 'TLB' would preferably be generic and not arch specific code.

That doesn't matter, it manages per-thread data so it will require this

ioannisg · 2019-08-02T20:13:31Z

I'm not quite sure what you mean, it depends on the arch code. On x86 it's any thread because there is arch-specific state that has to be managed for any thread that was in the domain being modified (in this case the per-thread pagetables)

So if I destroy a memory domain, I need to update all the per-thread page tables for the threads that were running in it, because they are no longer in a memory domain.

OK, so I understand that ARC and ARM only call the on-line reprogramming if it is the current thread, while in x86 it is any thread. Thanks for clarifying.

So, as I said above (maybe not clear enough) is that I would like us to document the reasoning why we only need to do on-line re-programming if it is the current thread in ARM Core MPU code. Earlier, these checks for current-thread were not present in arch-code, because they were in kernel/mem_domain, but since they are now in ARM core mpu, i think it would be nice to comment why we do them, to help explicitly the ARM developer understand the code better.

Populate thread->stack_obj earlier in the thread initialization process such that it is set when z_new_thread() is called. There was nothing specific about its position, or the rest of the code in that CONFIG_USERSPACE block, so just move it all up.. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Some options like stack canaries use more stack space, and on x86 this is not quite enough for ztest's main thread stack to be 512 bytes. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Has the same effect of catching stack overflows, but makes debugging with GDB simpler since we won't get errors when inspecting such regions. Making these areas non-present was more than we needed, read-only is sufficient. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Currently page tables have to be re-computed in an expensive operation on context switch. Here we reserve some room in the page tables such that we can have per-thread page table data, which will be much simpler to update on context switch at the expense of memory. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

These turned out to be quite useful when debugging MMU issues, commit them to the tree. The output format is virtually the same as gen_mmu_x86.py's verbose output. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Wrapper to assembly code working with CR3 register. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Lets us know what set of page tables were in use when the error occurred. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Need to enumerate the constraints on adding a partition to a memory domain, some may not be obvious. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

The current API was assuming too much, in that it expected that arch-specific memory domain configuration is only maintained in some global area, and updates to domains that are not currently active have no effect. This was true when all memory domain state was tracked in page tables or MPU registers, but no longer works when arch-specific memory management information is stored in thread-specific areas. This is needed for: zephyrproject-rtos#13441 zephyrproject-rtos#13074 zephyrproject-rtos#15135 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Previously, context switching on x86 with memory protection enabled involved walking the page tables, de-configuring all the partitions in the outgoing thread's memory domain, and then configuring all the partitions in the incoming thread's domain, on a global set of page tables. We now have a much faster design. Each thread has reserved in its stack object a number of pages to store page directories and page tables pertaining to the system RAM area. Each thread also has a toplevel PDPT which is configured to use the per-thread tables for system RAM, and the global tables for the rest of the address space. The result of this is on context switch, at most we just have to update the CR3 register to the incoming thread's PDPT. The x86_mmu_api test was making too many assumptions and has been adjusted to work with the new design. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Makes the code that defines stacks, and code referencing areas within the stack object, much clearer. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

zephyrbot added area: X86 x86 Architecture (32-bit) area: API Changes to public APIs area: Kernel labels Jul 26, 2019