Replace the blanket TLBI VMALLE1IS that ran after every page-table-modifying syscall with a per-VA TLBI VAE1IS path bounded by 16 pages, upgrading to broadcast for larger ranges. Common cases (RELRO mprotect, small munmap, MAP_FIXED PROT_NONE invalidation) now keep unrelated TLB entries alive across the syscall return.
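A minimal sketch of that staging decision; only the 16-page bound and the upgrade-to-broadcast rule come from the change itself, while tlbi_stage_range() and the tlbi_req layout are illustrative:

```c
#include <stdint.h>

#define PAGE_SHIFT         12
#define PAGE_SIZE          (1ull << PAGE_SHIFT)
#define TLBI_SELECTIVE_MAX 16          /* per-VA flushes cover at most 16 pages */

enum tlbi_kind { TLBI_NONE, TLBI_BROADCAST, TLBI_SELECTIVE };

struct tlbi_req {
    enum tlbi_kind kind;
    uint64_t start;                    /* page-aligned start VA */
    uint64_t pages;                    /* page count */
};

static void tlbi_stage_range(struct tlbi_req *req, uint64_t va, uint64_t len)
{
    uint64_t start = va & ~(PAGE_SIZE - 1);
    uint64_t pages = (va + len - start + PAGE_SIZE - 1) >> PAGE_SHIFT;

    if (pages == 0 || req->kind == TLBI_BROADCAST)
        return;                        /* nothing to do, or already as wide as it gets */

    if (req->kind == TLBI_SELECTIVE) {
        /* Merge with the pending range: widen to cover both. */
        uint64_t lo     = start < req->start ? start : req->start;
        uint64_t hi_old = req->start + (req->pages << PAGE_SHIFT);
        uint64_t hi_new = start + (pages << PAGE_SHIFT);
        uint64_t hi     = hi_new > hi_old ? hi_new : hi_old;
        start = lo;
        pages = (hi - lo) >> PAGE_SHIFT;
    }

    if (pages > TLBI_SELECTIVE_MAX) {  /* too wide: broadcast is cheaper */
        req->kind = TLBI_BROADCAST;
        return;
    }
    req->kind  = TLBI_SELECTIVE;
    req->start = start;
    req->pages = pages;
}
```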
Stage requests on a per-vCPU TLS slot (cpu_tlbi_req in core/guest.h) rather than a guest-global accumulator. A global slot let one vCPU's syscall epilogue drain another vCPU's pending request before the second vCPU eret'd back to EL0, leaving stale translations live until the broadcast TLBI from the first vCPU caught up. With per-vCPU TLS, each thread strictly owns its own request and no concurrent vCPU can read, clear, or partially observe it. The slot is C11 _Thread_local, so fork-child and CLONE_THREAD workers start with TLBI_NONE for free.
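Building on the sketch above, the slot and its zero-init guarantee could look like this; cpu_tlbi_req and TLBI_NONE are names from the change, the rest is assumed:

```c
_Thread_local struct tlbi_req cpu_tlbi_req;

/* C11 zero-initializes thread-storage-duration objects, so every new
 * thread (fork child, CLONE_THREAD worker) starts with no flush
 * pending, provided TLBI_NONE is the zero enumerator. */
_Static_assert(TLBI_NONE == 0, "zero-init of TLS must mean 'no flush pending'");
```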
Extend the X8 wire protocol after HVC #5: 0 skips the flush, 1 keeps the broadcast meaning, 2 stays reserved for the execve drop-frame marker the shim handles separately, and 3 selects the new selective path with X9 carrying the page-aligned start VA and X10 the page count. The shim's tlbi_selective branch issues TLBI VAE1IS in a loop with a defensive cbz x10 guard against a stray zero-count request, and tails with DSB ISH + IC IALLU + DSB + ISB so callers like file-backed mmap of executable pages still see the same I-cache invalidation as the broadcast path.
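A C rendering of what that shim branch does with X9/X10 after dispatching on X8 == 3; the real code is AArch64 assembly in shim.S, the function name is mine, and the ISH domain on the second barrier is an assumption (the description says only "DSB"):

```c
#include <stdint.h>

static void tlbi_selective(uint64_t start_va, uint64_t pages)
{
    if (pages == 0)                    /* mirrors the defensive cbz x10 guard */
        return;

    for (uint64_t va = start_va; pages--; va += 4096)
        /* TLBI VAE1IS takes VA[55:12] in Xt[43:0]; ASID field left zero here */
        __asm__ volatile("tlbi vae1is, %0" :: "r"(va >> 12) : "memory");

    __asm__ volatile(
        "dsb ish\n\t"                  /* complete the invalidations */
        "ic  iallu\n\t"                /* same I-cache invalidation as broadcast */
        "dsb ish\n\t"
        "isb" ::: "memory");
}
```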
Switch the W^X HVC #9 fault handler in shim.S to single-page TLBI VAE1IS using FAR_EL1. Per ARM ARM B2.2.5.6, TLBI VAE1IS for any VA invalidates every cached entry containing that VA, so the per-page TLBI also retires any 2 MiB block entry the prior split_l2_block left behind. guest_split_block therefore no longer requests a separate TLBI: every caller follows it with guest_invalidate_ptes or guest_update_perms on the actually-changing range, and that subsequent per-page TLBI is sufficient.
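Sketched in C for readability, the single-page path might reduce to the following; the real handler is assembly in shim.S and the function name is illustrative:

```c
#include <stdint.h>

static void wx_fault_flush_one(void)
{
    uint64_t far;
    __asm__ volatile("mrs %0, far_el1" : "=r"(far));

    /* One VA suffices: TLBI VAE1IS invalidates every cached entry
     * containing that VA, including a stale 2 MiB block descriptor
     * left behind by the preceding split_l2_block. */
    __asm__ volatile("tlbi vae1is, %0" :: "r"(far >> 12) : "memory");
    __asm__ volatile("dsb ish\n\tisb" ::: "memory");
}
```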
guest_update_perms now tracks the smallest sub-range whose L3 descriptor actually changed and only requests TLBI for that sub-range, eliminating the broadcast-on-no-op false positive previously emitted by adjacent same-perm mprotect storms (the common shape of dynamic-linker RELRO).
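A sketch of that tracking, reusing tlbi_stage_range and cpu_tlbi_req from the sketches above; l3_entry() and perms_to_attrs() are hypothetical stand-ins for the real page-table walker:

```c
#include <stdint.h>

uint64_t *l3_entry(uint64_t va);                  /* hypothetical walker */
uint64_t  perms_to_attrs(uint64_t desc, int prot);

void guest_update_perms_sketch(uint64_t start, uint64_t pages, int prot)
{
    uint64_t lo = UINT64_MAX, hi = 0;

    for (uint64_t i = 0; i < pages; i++) {
        uint64_t va   = start + (i << 12);
        uint64_t *pte = l3_entry(va);
        uint64_t next = perms_to_attrs(*pte, prot);
        if (next == *pte)
            continue;                 /* same-perm mprotect: descriptor untouched */
        *pte = next;
        if (va < lo) lo = va;
        if (va > hi) hi = va;
    }

    if (lo <= hi)                     /* request TLBI only for what changed */
        tlbi_stage_range(&cpu_tlbi_req, lo, hi - lo + 4096);
}
```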
Clear the per-vCPU slot at the end of guest_bootstrap_create_vcpu: guest_build_page_tables and the boot-time guest_invalidate_ptes calls (stack guard, null page) accumulate TLBI requests on the main thread's TLS, but the shim's _start does its own TLBI VMALLE1IS before enabling the MMU, so the first guest syscall must not redundantly broadcast on top.
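The reset itself is a one-liner; only the relevant tail is shown, with an illustrative wrapper name:

```c
/* End of guest_bootstrap_create_vcpu, sketched. */
static void bootstrap_drop_pending_tlbi(void)
{
    /* Page-table construction staged requests on this thread's TLS,
     * but the shim's _start does its own TLBI VMALLE1IS before
     * enabling the MMU, so discard the pending work. */
    cpu_tlbi_req = (struct tlbi_req){ .kind = TLBI_NONE };
}
```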
Summary by cubic
Replaces always-broadcast TLB flushes with selective per-VA invalidation and a per-vCPU accumulator to reduce flush cost and avoid cross-vCPU races.

- _Thread_local cpu_tlbi_req accumulates TLB work; the syscall epilogue maps it to X8/X9/X10 and then clears it. Bootstrap and execve also clear the slot.
- guest_split_block no longer requests its own TLBI; subsequent per-page invalidation retires the old block entry.
- guest_update_perms tracks the minimal changed sub-range and only requests TLBI for that range.