Commits on May 5, 2023

  1. Merge tag 'kvm-x86-mmu-6.4-2' of https://github.com/kvm-x86/linux into HEAD

    Fix a long-standing flaw in x86's TDP MMU where unloading roots on a vCPU can
    result in the root being freed even though the root is completely valid and
    can be reused as-is (with a TLB flush).
    bonzini committed May 5, 2023

Commits on May 1, 2023

  1. Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

    Pull kvm updates from Paolo Bonzini:
     "s390:
    
       - More phys_to_virt conversions
    
       - Improvement of AP management for VSIE (nested virtualization)
    
      ARM64:
    
       - Numerous fixes for the pathological lock inversion issue that
         plagued KVM/arm64 since... forever.
    
       - New framework allowing SMCCC-compliant hypercalls to be forwarded
         to userspace, hopefully paving the way for some more features being
         moved to VMMs rather than implemented in the kernel.
    
       - Large rework of the timer code to allow a VM-wide offset to be
         applied to both virtual and physical counters as well as a
         per-timer, per-vcpu offset that complements the global one. This
         last part allows the NV timer code to be implemented on top.
    
       - A small set of fixes to make sure that we don't change anything
         affecting the EL1&0 translation regime just after having
         taken an exception to EL2 until we have executed a DSB. This
         ensures that speculative walks started in EL1&0 have completed.
    
       - The usual selftest fixes and improvements.
    
      x86:
    
       - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
         enabled, and by giving the guest control of CR0.WP when EPT is
         enabled on VMX (VMX-only because SVM doesn't support per-bit
         controls)
    
       - Add CR0/CR4 helpers to query single bits, and clean up related code
         where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
         return as a bool
    
       - Move AMD_PSFD to cpufeatures.h and purge KVM's definition
    
       - Avoid unnecessary writes+flushes when the guest is only adding new
         PTEs
    
       - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
         optimizations when emulating invalidations
    
       - Clean up the range-based flushing APIs
    
       - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
         single A/D bit using a LOCK AND instead of XCHG, and skip all of
         the "handle changed SPTE" overhead associated with writing the
         entire entry
    
       - Track the number of "tail" entries in a pte_list_desc to avoid
         having to walk (potentially) all descriptors during insertion and
         deletion, which gets quite expensive if the guest is spamming
         fork()
    
       - Disallow virtualizing legacy LBRs if architectural LBRs are
         available, the two are mutually exclusive in hardware
    
       - Disallow writes to immutable feature MSRs (notably
         PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
    
       - Overhaul the vmx_pmu_caps selftest to better validate
         PERF_CAPABILITIES
    
       - Apply PMU filters to emulated events and add test coverage to the
         pmu_event_filter selftest
    
       - AMD SVM:
           - Add support for virtual NMIs
           - Fixes for edge cases related to virtual interrupts
    
       - Intel AMX:
           - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
             XTILE_DATA is not being reported due to userspace not opting in
             via prctl()
           - Fix a bug in emulation of ENCLS in compatibility mode
           - Allow emulation of NOP and PAUSE for L2
           - AMX selftests improvements
           - Misc cleanups
    
      MIPS:
    
       - Constify MIPS's internal callbacks (a leftover from the hardware
         enabling rework that landed in 6.3)
    
      Generic:
    
       - Drop unnecessary casts from "void *" throughout kvm_main.c
    
       - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
         struct size by 8 bytes on 64-bit kernels by utilizing a padding
         hole
    
      Documentation:
    
       - Fix goof introduced by the conversion to rST"
    
    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
      KVM: s390: pci: fix virtual-physical confusion on module unload/load
      KVM: s390: vsie: clarifications on setting the APCB
      KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
      KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
      KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
      KVM: selftests: Test the PMU event "Instructions retired"
      KVM: selftests: Copy full counter values from guest in PMU event filter test
      KVM: selftests: Use error codes to signal errors in PMU event filter test
      KVM: selftests: Print detailed info in PMU event filter asserts
      KVM: selftests: Add helpers for PMC asserts in PMU event filter test
      KVM: selftests: Add a common helper for the PMU event filter guest code
      KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
      KVM: arm64: vhe: Drop extra isb() on guest exit
      KVM: arm64: vhe: Synchronise with page table walker on MMU update
      KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
      KVM: arm64: nvhe: Synchronise with page table walker on TLBI
      KVM: arm64: Handle 32bit CNTPCTSS traps
      KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
      KVM: arm64: vgic: Don't acquire its_lock before config_lock
      KVM: selftests: Add test to verify KVM's supported XCR0
      ...
    torvalds committed May 1, 2023

Commits on Apr 28, 2023

  1. Merge tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull SMP cross-CPU function-call updates from Ingo Molnar:
    
     - Remove diagnostics and adjust config for CSD lock diagnostics
    
     - Add a generic IPI-sending tracepoint, as currently there's no easy
       way to instrument IPI origins: it's arch dependent and for some major
       architectures it's not even consistently available.
    
    * tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      trace,smp: Trace all smp_function_call*() invocations
      trace: Add trace_ipi_send_cpu()
      sched, smp: Trace smp callback causing an IPI
      smp: reword smp call IPI comment
      treewide: Trace IPIs sent via smp_send_reschedule()
      irq_work: Trace self-IPIs sent via arch_irq_work_raise()
      smp: Trace IPIs sent via arch_send_call_function_ipi_mask()
      sched, smp: Trace IPIs sent via send_call_function_single_ipi()
      trace: Add trace_ipi_send_cpumask()
      kernel/smp: Make csdlock_debug= resettable
      locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging
      locking/csd_lock: Remove added data from CSD lock debugging
      locking/csd_lock: Add Kconfig option for csd_debug default
    torvalds committed Apr 28, 2023

Commits on Apr 26, 2023

  1. KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

    Preserve TDP MMU roots until they are explicitly invalidated by gifting
    the TDP MMU itself a reference to a root when it is allocated.  Keeping a
    reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
    performance, and can potentially even soft-hang a vCPU, if a vCPU
    frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
    
    When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
    and RSM, KVM unloads all of the vCPU's roots (KVM keeps a small per-vCPU
    cache of previous roots).  Unloading roots is a simple way to ensure KVM
    flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
    when allocating a "new" root (from the vCPU's perspective).
    
    In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
    a per-VM hash table.  Unloading a shadow MMU root just wipes it from the
    per-vCPU cache; the root is still tracked in the per-VM hash table.  When
    KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
    in the per-VM hash table.
    
    Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
    per-VM structure, where "active" in this case means a root is either
    in-use or cached as a previous root by at least one vCPU.  When a TDP MMU
    root becomes inactive, i.e. the last vCPU reference to the root is put,
    KVM immediately frees the root (asterisk on "immediately" as the actual
    freeing may be done by a worker, but for all intents and purposes the root
    is gone).
    
    The TDP MMU behavior is especially problematic for 1-vCPU setups, as
    unloading all roots effectively frees all roots.  The issue is mitigated
    to some degree in multi-vCPU setups as a different vCPU usually holds a
    reference to an unloaded root and thus keeps the root alive, allowing the
    vCPU to reuse its old root after unloading (with a flush+sync).
    
    The TDP MMU flaw has been known for some time, as until very recently,
    KVM's handling of CR0.WP also triggered unloading of all roots.  The
    CR0.WP toggling scenario was eventually addressed by not unloading roots
    when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
    for emulating SMM as KVM must emulate a full TLB flush on entry and exit
    to/from SMM.  Given that the shadow MMU plays nice with unloading roots
    at will, teaching the TDP MMU to do the same is far less complex than
    modifying KVM to track which roots need to be flushed before reuse.
    
    Note, preserving all possible TDP MMU roots is not a concern with respect
    to memory consumption.  Now that the role for direct MMUs doesn't include
    information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
    are _at most_ six possible roots (where "guest_mode" here means L2):
    
      1. 4-level !SMM !guest_mode
      2. 4-level  SMM !guest_mode
      3. 5-level !SMM !guest_mode
      4. 5-level  SMM !guest_mode
      5. 4-level !SMM guest_mode
      6. 5-level !SMM guest_mode
    
    And because each vCPU can track 4 valid roots, a VM can already have all
    6 root combinations live at any given time.  Not to mention that, in
    practice, no sane VMM will advertise different guest.MAXPHYADDR values
    across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
    a single VM.  Furthermore, the vast majority of modern hypervisors will
    utilize EPT/NPT when available, thus the guest_mode=true cases are also
    unlikely to be utilized.
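
    A minimal sketch of the refcounting idea (the free helper's name is
    illustrative, field names follow the TDP MMU code): the TDP MMU takes its
    own reference when a root is allocated, so a vCPU putting its last
    reference no longer frees the root; only explicit invalidation does.

            /* Allocation: one reference for the vCPU, one for the TDP MMU. */
            refcount_set(&root->tdp_mmu_root_count, 2);

            /* Putting a reference frees the root only at zero, so a vCPU
             * unloading its roots can no longer be the final put. */
            if (refcount_dec_and_test(&root->tdp_mmu_root_count))
                    free_tdp_mmu_root(kvm, root);   /* illustrative name */

            /* Explicit invalidation drops the TDP MMU's own reference. */
            root->role.invalid = true;
            if (refcount_dec_and_test(&root->tdp_mmu_root_count))
                    free_tdp_mmu_root(kvm, root);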
    
    Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
    Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
    Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
    Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
    Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
    Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
    Cc: Ben Gardon <bgardon@google.com>
    Cc: David Matlack <dmatlack@google.com>
    Cc: stable@vger.kernel.org
    Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
    Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 26, 2023
  2. Merge tag 'kvm-x86-vmx-6.4' of https://github.com/kvm-x86/linux into HEAD

    KVM VMX changes for 6.4:
    
     - Fix a bug in emulation of ENCLS in compatibility mode
    
     - Allow emulation of NOP and PAUSE for L2
    
     - Misc cleanups
    bonzini committed Apr 26, 2023
  3. Merge tag 'kvm-x86-svm-6.4' of https://github.com/kvm-x86/linux into HEAD

    KVM SVM changes for 6.4:
    
     - Add support for virtual NMIs
    
     - Fixes for edge cases related to virtual interrupts
    bonzini committed Apr 26, 2023
  4. Merge tag 'kvm-x86-selftests-6.4' of https://github.com/kvm-x86/linux into HEAD

    KVM selftests, and an AMX/XCR0 bugfix, for 6.4:
    
     - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
       not being reported due to userspace not opting in via prctl()
    
     - Overhaul the AMX selftests to improve coverage and clean up the test
    
     - Misc cleanups
    bonzini committed Apr 26, 2023
  5. Merge tag 'kvm-x86-pmu-6.4' of https://github.com/kvm-x86/linux into HEAD

    KVM x86 PMU changes for 6.4:
    
     - Disallow virtualizing legacy LBRs if architectural LBRs are available,
       the two are mutually exclusive in hardware
    
     - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
       after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
       validate PERF_CAPABILITIES
    
     - Apply PMU filters to emulated events and add test coverage to the
       pmu_event_filter selftest
    
     - Misc cleanups and fixes
    bonzini committed Apr 26, 2023
  6. Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD

    KVM x86 MMU changes for 6.4:
    
     - Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
       guest is only adding new PTEs
    
     - Overhaul .sync_page() and .invlpg() to share the .sync_page()
       implementation, i.e. utilize .sync_page()'s optimizations when emulating
       invalidations
    
     - Clean up the range-based flushing APIs
    
     - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
       A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
       changed SPTE" overhead associated with writing the entire entry
    
     - Track the number of "tail" entries in a pte_list_desc to avoid having
       to walk (potentially) all descriptors during insertion and deletion,
       which gets quite expensive if the guest is spamming fork()
    
     - Misc cleanups
    bonzini committed Apr 26, 2023
  7. Merge tag 'kvm-x86-misc-6.4' of https://github.com/kvm-x86/linux into HEAD

    KVM x86 changes for 6.4:
    
     - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
       and by giving the guest control of CR0.WP when EPT is enabled on VMX
       (VMX-only because SVM doesn't support per-bit controls)
    
     - Add CR0/CR4 helpers to query single bits, and clean up related code
       where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
       as a bool
    
     - Move AMD_PSFD to cpufeatures.h and purge KVM's definition
    
     - Misc cleanups
    bonzini committed Apr 26, 2023
  8. Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

    KVM/arm64 updates for 6.4
    
    - Numerous fixes for the pathological lock inversion issue that
      plagued KVM/arm64 since... forever.
    
    - New framework allowing SMCCC-compliant hypercalls to be forwarded
      to userspace, hopefully paving the way for some more features
      being moved to VMMs rather than implemented in the kernel.
    
    - Large rework of the timer code to allow a VM-wide offset to be
      applied to both virtual and physical counters as well as a
      per-timer, per-vcpu offset that complements the global one.
      This last part allows the NV timer code to be implemented on
      top.
    
    - A small set of fixes to make sure that we don't change anything
      affecting the EL1&0 translation regime just after having
      taken an exception to EL2 until we have executed a DSB. This
      ensures that speculative walks started in EL1&0 have completed.
    
    - The usual selftest fixes and improvements.
    bonzini committed Apr 26, 2023
  9. Merge tag 'v6.4-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

    Pull crypto updates from Herbert Xu:
     "API:
       - Total usage stats now include all that returned errors (instead of
         just some)
       - Remove maximum hash statesize limit
       - Add cloning support for hmac and unkeyed hashes
       - Demote BUG_ON in crypto_unregister_alg to a WARN_ON
    
      Algorithms:
       - Use RIP-relative addressing on x86 to prepare for PIE build
       - Add accelerated AES/GCM stitched implementation on powerpc P10
       - Add some test vectors for cmac(camellia)
       - Remove failure case where jent is unavailable outside of FIPS mode
         in drbg
       - Add permanent and intermittent health error checks in jitter RNG
    
      Drivers:
       - Add support for 402xx devices in qat
       - Add support for HiSTB TRNG
       - Fix hash concurrency issues in stm32
       - Add OP-TEE firmware support in caam"
    
    * tag 'v6.4-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (139 commits)
      i2c: designware: Add doorbell support for Mendocino
      i2c: designware: Use PCI PSP driver for communication
      powerpc: Move Power10 feature PPC_MODULE_FEATURE_P10
      crypto: p10-aes-gcm - Remove POWER10_CPU dependency
      crypto: testmgr - Add some test vectors for cmac(camellia)
      crypto: cryptd - Add support for cloning hashes
      crypto: cryptd - Convert hash to use modern init_tfm/exit_tfm
      crypto: hmac - Add support for cloning
      crypto: hash - Add crypto_clone_ahash/shash
      crypto: api - Add crypto_clone_tfm
      crypto: api - Add crypto_tfm_get
      crypto: x86/sha - Use local .L symbols for code
      crypto: x86/crc32 - Use local .L symbols for code
      crypto: x86/aesni - Use local .L symbols for code
      crypto: x86/sha256 - Use RIP-relative addressing
      crypto: x86/ghash - Use RIP-relative addressing
      crypto: x86/des3 - Use RIP-relative addressing
      crypto: x86/crc32c - Use RIP-relative addressing
      crypto: x86/cast6 - Use RIP-relative addressing
      crypto: x86/cast5 - Use RIP-relative addressing
      ...
    torvalds committed Apr 26, 2023

Commits on Apr 25, 2023

  1. Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

    Pull vfs fget updates from Al Viro:
     "fget() to fdget() conversions"
    
    * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
      fuse_dev_ioctl(): switch to fdget()
      cgroup_get_from_fd(): switch to fdget_raw()
      bpf: switch to fdget_raw()
      build_mount_idmapped(): switch to fdget()
      kill the last remaining user of proc_ns_fget()
      SVM-SEV: convert the rest of fget() uses to fdget() in there
      convert sgx_set_attribute() to fdget()/fdput()
      convert setns(2) to fdget()/fdput()
    torvalds committed Apr 25, 2023

Commits on Apr 24, 2023

  1. Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux

    Pull RCU updates from Joel Fernandes:
    
     - Updates and additions to MAINTAINERS files, with Boqun being added to
       the RCU entry and Zqiang being added as an RCU reviewer.
    
       I have also transitioned from reviewer to maintainer; however, Paul
       will be taking over sending RCU pull-requests for the next merge
       window.
    
     - Resolution of hotplug warning in nohz code, achieved by fixing
       cpu_is_hotpluggable() through interaction with the nohz subsystem.
    
       Tick dependency modifications by Zqiang, focusing on fixing usage of
       the TICK_DEP_BIT_RCU_EXP bitmask.
    
     - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
       kernels, fixed by Zqiang.
    
     - Improvements to rcu-tasks stall reporting by Neeraj.
    
     - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
       increased robustness, affecting several components like mac802154,
       drbd, vmw_vmci, tracing, and more.
    
       A report by Eric Dumazet showed that the API could be unknowingly
       used in an atomic context, so we'd rather make sure they know what
       they're asking for by being explicit:
    
          https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
    
     - Documentation updates, including corrections to spelling,
       clarifications in comments, and improvements to the srcu_size_state
       comments.
    
     - Better srcu_struct cache locality for readers, by adjusting the size
       of srcu_struct in support of SRCU usage by Christoph Hellwig.
    
     - Teach lockdep to detect deadlocks between srcu_read_lock() vs
       synchronize_srcu() contributed by Boqun.
    
       Previously lockdep could not detect such deadlocks, now it can.
    
     - Integration of rcutorture and rcu-related tools, targeted for v6.4
       from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
       module parameter, and more
    
     - Miscellaneous changes, various code cleanups and comment improvements
    
    * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
      checkpatch: Error out if deprecated RCU API used
      mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
      rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
      ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
      net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
      net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
      lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
      tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
      misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
      drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
      rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
      rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
      rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
      rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
      rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
      rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
      rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
      rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
      rcu/trace: use strscpy() to instead of strncpy()
      tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
      ...
    torvalds committed Apr 24, 2023

Commits on Apr 21, 2023

  1. SVM-SEV: convert the rest of fget() uses to fdget() in there
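
    For reference, the fget() -> fdget() conversion pattern generally looks
    like the following sketch (do_something() stands in for the real callee):

            /* Before: full reference via fget()/fput(). */
            struct file *file = fget(fd);
            if (!file)
                    return -EBADF;
            ret = do_something(file);
            fput(file);

            /* After: fdget()/fdput() can avoid the refcount bump when the
             * file table isn't shared. */
            struct fd f = fdget(fd);
            if (!f.file)
                    return -EBADF;
            ret = do_something(f.file);
            fdput(f);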

    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Al Viro committed Apr 21, 2023

Commits on Apr 11, 2023

  1. KVM: x86: Filter out XTILE_CFG if XTILE_DATA isn't permitted

    Filter out XTILE_CFG from the supported XCR0 reported to userspace if the
    current process doesn't have access to XTILE_DATA.  Attempting to set
    XTILE_CFG in XCR0 will #GP if XTILE_DATA is also not set, and so keeping
    XTILE_CFG as supported results in explosions if userspace feeds
    KVM_GET_SUPPORTED_CPUID back into KVM and the guest doesn't sanity check
    CPUID.
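
    Combined with the kvm_get_filtered_xcr0() helper added in the next entry,
    the filtering boils down to roughly the following (sketch, details may
    differ from the actual code):

            static inline u64 kvm_get_filtered_xcr0(void)
            {
                    u64 permitted_xcr0 = kvm_caps.supported_xcr0;

                    /* XTILE_DATA requires a per-process opt-in via prctl(). */
                    permitted_xcr0 &= xstate_get_guest_group_perm();

                    /* Don't advertise XTILE_CFG without XTILE_DATA; setting
                     * only XTILE_CFG in XCR0 #GPs. */
                    if ((permitted_xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE)
                            permitted_xcr0 &= ~XFEATURE_MASK_XTILE;

                    return permitted_xcr0;
            }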
    
    Fixes: 445ecdf ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID")
    Reported-by: Aaron Lewis <aaronlewis@google.com>
    Reviewed-by: Aaron Lewis <aaronlewis@google.com>
    Tested-by: Aaron Lewis <aaronlewis@google.com>
    Link: https://lore.kernel.org/r/20230405004520.421768-3-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 11, 2023
  2. KVM: x86: Add a helper to handle filtering of unpermitted XCR0 features

    Add a helper, kvm_get_filtered_xcr0(), to dedup code that needs to account
    for XCR0 features that require explicit opt-in on a per-process basis.  In
    addition to documenting when KVM should/shouldn't consult
    xstate_get_guest_group_perm(), the helper will also allow sanitizing the
    filtered XCR0 to avoid enumerating architecturally illegal XCR0 values,
    e.g. XTILE_CFG without XTILE_DATA.
    
    No functional changes intended.
    
    Signed-off-by: Aaron Lewis <aaronlewis@google.com>
    Reviewed-by: Mingwei Zhang <mizhang@google.com>
    [sean: rename helper, move to x86.h, massage changelog]
    Reviewed-by: Aaron Lewis <aaronlewis@google.com>
    Tested-by: Aaron Lewis <aaronlewis@google.com>
    Link: https://lore.kernel.org/r/20230405004520.421768-2-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    suomilewis authored and sean-jc committed Apr 11, 2023
  3. KVM: nVMX: Emulate NOPs in L2, and PAUSE if it's not intercepted

    Extend VMX's nested intercept logic for emulated instructions to handle
    "pause" interception, in quotes because KVM's emulator doesn't filter out
    NOPs when checking for nested intercepts.  Failure to allow emulation of
    NOPs results in KVM injecting a #UD into L2 on any NOP that collides with
    the emulator's definition of PAUSE, i.e. on all single-byte NOPs.
    
    For PAUSE itself, honor L1's PAUSE-exiting control, but ignore PLE to
    avoid unnecessarily injecting a #UD into L2.  Per the SDM, the first
    execution of PAUSE after VM-Entry is treated as the beginning of a new
    loop, i.e. will never trigger a PLE VM-Exit, and so L1 can't expect any
    given execution of PAUSE to deterministically exit.
    
      ... the processor considers this execution to be the first execution of
      PAUSE in a loop. (It also does so for the first execution of PAUSE at
      CPL 0 after VM entry.)
    
    All that said, the PLE side of things is currently a moot point, as KVM
    doesn't expose PLE to L1.
    
    Note, vmx_check_intercept() is still wildly broken when L1 wants to
    intercept an instruction, as KVM injects a #UD instead of synthesizing a
    nested VM-Exit.  That issue extends far beyond NOP/PAUSE and needs far
    more effort to fix, i.e. is a problem for the future.
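
    Roughly, the intercept check described above looks like the following
    sketch (helper and field names are approximations, not taken from the
    actual patch):

            case x86_intercept_pause:
                    /* The emulator classifies every single-byte NOP as PAUSE
                     * (PAUSE is F3 90, i.e. REPE + NOP).  Honor L1's
                     * PAUSE-exiting control only for an actual PAUSE, and
                     * ignore PLE since L1 can't rely on any given PAUSE
                     * exiting. */
                    if (!insn_is_pause(info) ||
                        !nested_cpu_has(vmcs12, CPU_BASED_PAUSE_EXITING))
                            return X86EMUL_CONTINUE;
                    break;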
    
    Fixes: 07721fe ("KVM: nVMX: Don't emulate instructions in guest mode")
    Cc: Mathias Krause <minipli@grsecurity.net>
    Cc: stable@vger.kernel.org
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Link: https://lore.kernel.org/r/20230405002359.418138-1-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 11, 2023

Commits on Apr 10, 2023

  1. KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults

    Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for
    permission faults when emulating a guest memory access and CR0.WP may be
    guest owned.  If the guest toggles only CR0.WP and triggers emulation of
    a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a
    stale CR0.WP, i.e. use stale protection bits metadata.
    
    Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP
    is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't
    support per-bit interception controls for CR0.  Don't bother checking for
    EPT vs. NPT as the "old == new" check will always be true under NPT, i.e.
    the only cost is the read of vcpu->arch.cr0 (SVM unconditionally grabs CR0
    from the VMCB on VM-Exit).
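
    An approximation of the refresh (helper and field names follow KVM's
    conventions but aren't guaranteed to match the actual patch):

            static void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
                                                         struct kvm_mmu *mmu)
            {
                    const bool cr0_wp = kvm_is_cr0_bit_set(vcpu, X86_CR0_WP);

                    /* CR0.WP can only be guest owned when TDP is enabled. */
                    if (!tdp_enabled || mmu->cpu_role.base.cr0_wp == cr0_wp)
                            return;

                    mmu->cpu_role.base.cr0_wp = cr0_wp;
                    reset_guest_paging_metadata(vcpu, mmu);
            }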
    
    Reported-by: Mathias Krause <minipli@grsecurity.net>
    Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net
    Fixes: fb509f7 ("KVM: VMX: Make CR0.WP a guest owned bit")
    Tested-by: Mathias Krause <minipli@grsecurity.net>
    Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 10, 2023
  2. KVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V code

    Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages
    pair instead of a struct, and bury said struct in Hyper-V specific code.
    
    Passing along two params generates much better code for the common case
    where KVM is _not_ running on Hyper-V, as forwarding the flush on to
    Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range()
    becomes a tail call.
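
    In terms of interfaces, the change is roughly (sketch):

            /* Before: callers build a struct just to pass two values. */
            int hv_remote_flush_tlb_with_range(struct kvm *kvm,
                                               struct kvm_tlb_range *range);

            /* After: plain gfn + count, so kvm_flush_remote_tlbs_range() can
             * forward the flush as a tail call; the Hyper-V specific struct
             * is built inside hv_flush_remote_tlbs_range() itself. */
            int hv_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn,
                                           gfn_t nr_pages);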
    
    Cc: David Matlack <dmatlack@google.com>
    Reviewed-by: David Matlack <dmatlack@google.com>
    Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
    Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 10, 2023
  3. KVM: x86: Rename Hyper-V remote TLB hooks to match established scheme

    Rename the Hyper-V hooks for TLB flushing to match the naming scheme used
    by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code,
    arch hooks from common code, etc.
    
    Reviewed-by: David Matlack <dmatlack@google.com>
    Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
    Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 10, 2023

Commits on Apr 7, 2023

  1. KVM: x86/pmu: Prevent the PMU from counting disallowed events

    When counting "Instructions Retired" (0xc0) in a guest, KVM will
    occasionally increment the PMU counter regardless of if that event is
    being filtered. This is because some PMU events are incremented via
    kvm_pmu_trigger_event(), which doesn't know about the event filter. Add
    the event filter to kvm_pmu_trigger_event(), so events that are
    disallowed do not increment their counters.
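
    A sketch of the resulting loop in kvm_pmu_trigger_event() (the filter
    helper's name follows the bracketed note below and may not match the
    final code exactly):

            for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
                    pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i);

                    /* Skip counters whose event is denied by the PMU filter. */
                    if (!pmc || !pmc_event_is_allowed(pmc))
                            continue;

                    if (pmc_matches_eventsel(pmc, eventsel))  /* illustrative */
                            kvm_pmu_incr_counter(pmc);
            }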
    
    Fixes: 9cd803d ("KVM: x86: Update vPMCs when retiring instructions")
    Signed-off-by: Aaron Lewis <aaronlewis@google.com>
    Reviewed-by: Like Xu <likexu@tencent.com>
    Link: https://lore.kernel.org/r/20230307141400.1486314-2-aaronlewis@google.com
    [sean: prepend "pmc" to the new function]
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    suomilewis authored and sean-jc committed Apr 7, 2023
  2. KVM: x86/pmu: Fix a typo in kvm_pmu_request_counter_reprogam()

    Fix a "reprogam" => "reprogram" typo in kvm_pmu_request_counter_reprogam().
    
    Fixes: 68fb475 ("KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()")
    Signed-off-by: Like Xu <likexu@tencent.com>
    Link: https://lore.kernel.org/r/20230310113349.31799-1-likexu@tencent.com
    [sean: trim the changelog]
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Like Xu authored and sean-jc committed Apr 7, 2023

Commits on Apr 6, 2023

  1. KVM: x86/pmu: Rewrite reprogram_counters() to improve performance

    A valid pmc is always tested before using pmu->reprogram_pmi. Eliminate
    this part of the redundancy by setting the counter's bitmask directly,
    and in addition, trigger KVM_REQ_PMU only once to save more CPU cycles.
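
    A sketch of the rewritten helper as described above:

            static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
            {
                    int bit;

                    if (!diff)
                            return;

                    /* Mark every affected counter, then raise one request. */
                    for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX)
                            set_bit(bit, pmu->reprogram_pmi);

                    kvm_make_request(KVM_REQ_PMU, pmu_to_vcpu(pmu));
            }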
    
    Signed-off-by: Like Xu <likexu@tencent.com>
    Link: https://lore.kernel.org/r/20230214050757.9623-4-likexu@tencent.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Like Xu authored and sean-jc committed Apr 6, 2023
  2. KVM: VMX: Refactor intel_pmu_{g,s}et_msr() to align with other helpers

    Invert the flows in intel_pmu_{g,s}et_msr()'s case statements so that
    they follow the kernel's preferred style of:
    
            if (<not valid>)
                    return <error>
    
            <commit change>
            return <success>
    
    which is also the style used by every other {g,s}et_msr() helper (except
    AMD's PMU variant, which doesn't use a switch statement).
    
    Modify the "set" paths with costly side effects, i.e. that reprogram
    counters, to skip only the side effects, i.e. to perform reserved bits
    checks even if the value is unchanged.  None of the reserved bits checks
    are expensive, so there's no strong justification for skipping them, and
    guarding only the side effect makes it slightly more obvious what is being
    skipped and why.
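
    Applied to a hypothetical MSR, the resulting shape is (illustrative only,
    the MSR and field names are made up):

            case MSR_EXAMPLE:
                    if (data & ~pmu->example_rsvd_bits)
                            return 1;       /* reserved-bit check always runs */

                    if (data != pmu->example) {
                            pmu->example = data;
                            /* Only the costly side effect is skipped when the
                             * value is unchanged. */
                            reprogram_counters(pmu, data);
                    }
                    return 0;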
    
    No functional change intended (assuming no reserved bit bugs).
    
    Link: https://lkml.kernel.org/r/Y%2B6cfen%2FCpO3%2FdLO%40google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 6, 2023
  3. KVM: x86/pmu: Rename pmc_is_enabled() to pmc_is_globally_enabled()

    The name of function pmc_is_enabled() is a bit misleading. A PMC can
    be disabled either by PERF_GLOBAL_CTRL or by its corresponding EVTSEL.
    Append global semantics to its name.
    
    Suggested-by: Jim Mattson <jmattson@google.com>
    Signed-off-by: Like Xu <likexu@tencent.com>
    Link: https://lore.kernel.org/r/20230214050757.9623-2-likexu@tencent.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Like Xu authored and sean-jc committed Apr 6, 2023
  4. KVM: x86/pmu: Zero out LBR capabilities during PMU refresh

    Zero out the LBR capabilities during PMU refresh to avoid exposing LBRs
    to the guest against userspace's wishes. If userspace modifies the
    guest's CPUID model or invokes KVM_CAP_PMU_CAPABILITY to disable vPMU
    after an initial KVM_SET_CPUID2, but before the first KVM_RUN, KVM will
    retain the previous LBR info due to bailing before refreshing the LBR
    descriptor.
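
    A minimal sketch of the idea (field names approximate): zero the cached
    LBR info at the top of the Intel PMU refresh path, before any early bail.

            /* At the top of intel_pmu_refresh(): */
            struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);

            memset(&lbr_desc->records, 0, sizeof(lbr_desc->records));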
    
    Note, this is a very theoretical bug, there is no known use case where a
    VMM would deliberately enable the vPMU via KVM_SET_CPUID2, and then later
    disable the vPMU.
    
    Link: https://lore.kernel.org/r/20230311004618.920745-9-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 6, 2023
  5. KVM: x86/pmu: WARN and bug the VM if PMU is refreshed after vCPU has run

    Now that KVM disallows changing feature MSRs, i.e. PERF_CAPABILITIES,
    after running a vCPU, WARN and bug the VM if the PMU is refreshed after
    the vCPU has run.
    
    Note, KVM has disallowed CPUID updates after running a vCPU since commit
    feb627e ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN"), i.e.
    PERF_CAPABILITIES was the only remaining way to trigger a PMU refresh
    after KVM_RUN.
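
    The guard is essentially the following sketch, using the
    kvm_vcpu_has_run() helper added earlier in this series:

            /* In kvm_pmu_refresh(): */
            if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
                    return;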
    
    Cc: Like Xu <like.xu.linux@gmail.com>
    Link: https://lore.kernel.org/r/20230311004618.920745-8-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 6, 2023
  6. KVM: x86: Disallow writes to immutable feature MSRs after KVM_RUN

    Disallow writes to feature MSRs after KVM_RUN to prevent userspace from
    changing the vCPU model after running the vCPU.  Similar to guest CPUID,
    KVM uses feature MSRs to configure intercepts, determine what operations
    are/aren't allowed, etc.  Changing the capabilities while the vCPU is
    active will at best yield unpredictable guest behavior, and at worst
    could be dangerous to KVM.
    
    Allow writing the current value, e.g. so that userspace can blindly set
    all MSRs when emulating RESET, and unconditionally allow writes to
    MSR_IA32_UCODE_REV so that userspace can emulate patch loads.
    
    Special case the VMX MSRs to keep the generic list small, i.e. so that
    KVM can do a linear walk of the generic list without incurring meaningful
    overhead.
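
    Conceptually, host-initiated writes gain a check along these lines
    (sketch, helper names are approximations):

            if (kvm_vcpu_has_run(vcpu) && kvm_is_immutable_feature_msr(index)) {
                    u64 cur;

                    /* Re-writing the current value stays allowed, e.g. so
                     * userspace can blindly restore all MSRs on RESET. */
                    if (do_get_msr(vcpu, index, &cur) || cur != data)
                            return -EINVAL;
            }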
    
    Cc: Like Xu <like.xu.linux@gmail.com>
    Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
    Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
    Link: https://lore.kernel.org/r/20230311004618.920745-7-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 6, 2023
  7. KVM: x86: Generate set of VMX feature MSRs using first/last definitions

    Add VMX MSRs to the runtime list of feature MSRs by iterating over the
    range of emulated MSRs instead of manually defining each MSR in the "all"
    list.  Using the range definition reduces the cost of emulating a new VMX
    MSR, e.g. prevents forgetting to add an MSR to the list.
    
    Extracting the VMX MSRs from the "all" list, which is a compile-time
    constant, also shrinks the list to the point where the compiler can
    heavily optimize code that iterates over the list.
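
    With the first/last macros added in the following entry, the runtime
    setup reduces to a loop along these lines (sketch, the probe helper's
    name is illustrative):

            for (i = KVM_FIRST_EMULATED_VMX_MSR; i <= KVM_LAST_EMULATED_VMX_MSR; i++)
                    kvm_probe_feature_msr(i);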
    
    No functional change intended.
    
    Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
    Link: https://lore.kernel.org/r/20230311004618.920745-5-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 6, 2023
  8. KVM: x86: Add macros to track first...last VMX feature MSRs

    Add macros to track the range of VMX feature MSRs that are emulated by
    KVM to reduce the maintenance cost of extending the set of emulated MSRs.
    
    Note, KVM doesn't necessarily emulate all known/consumed VMX MSRs, e.g.
    PROCBASED_CTLS3 is consumed by KVM to enable IPI virtualization, but is
    not emulated as KVM doesn't emulate/virtualize IPI virtualization for
    nested guests.
    
    No functional change intended.
    
    Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
    Link: https://lore.kernel.org/r/20230311004618.920745-4-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 6, 2023
  9. KVM: x86: Add a helper to query whether or not a vCPU has ever run

    Add a helper to query if a vCPU has run so that KVM doesn't have to open
    code the check on last_vmentry_cpu being set to a magic value.
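
    The helper is a one-liner along these lines (sketch; -1 is the sentinel
    for "never entered the guest" in last_vmentry_cpu):

            static inline bool kvm_vcpu_has_run(struct kvm_vcpu *vcpu)
            {
                    /* Set to a valid CPU on the first successful VM-Enter. */
                    return vcpu->arch.last_vmentry_cpu != -1;
            }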
    
    No functional change intended.
    
    Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
    Cc: Like Xu <like.xu.linux@gmail.com>
    Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
    Link: https://lore.kernel.org/r/20230311004618.920745-3-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 6, 2023
  10. KVM: x86: Rename kvm_init_msr_list() to clarify it inits multiple lists

    Rename kvm_init_msr_list() to kvm_init_msr_lists() to clarify that it
    initializes multiple lists: MSRs to save, emulated MSRs, and feature MSRs.
    
    No functional change intended.
    
    Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
    Link: https://lore.kernel.org/r/20230311004618.920745-2-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    sean-jc committed Apr 6, 2023
  11. KVM: SVM: Return the local "r" variable from svm_set_msr()

    Rename "r" to "ret" and actually return it from svm_set_msr() to reduce
    the probability of repeating the mistake of commit 723d5fb ("kvm:
    svm: Add IA32_FLUSH_CMD guest support"), which set "r" thinking that it
    would be propagated to the caller.
    
    Alternatively, the declaration of "r" could be moved into the handling of
    MSR_TSC_AUX, but that risks variable shadowing in the future.  A wrapper
    for kvm_set_user_return_msr() would allow eliding a local variable, but
    that feels like delaying the inevitable.
    
    No functional change intended.
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20230322011440.2195485-7-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Apr 6, 2023
  12. KVM: x86: Virtualize FLUSH_L1D and passthrough MSR_IA32_FLUSH_CMD

    Virtualize FLUSH_L1D so that the guest can use the performant L1D flush
    if one of the many mitigations might require a flush in the guest, e.g.
    Linux provides an option to flush the L1D when switching mms.
    
    Passthrough MSR_IA32_FLUSH_CMD for write when it's supported in hardware
    and exposed to the guest, i.e. always let the guest write it directly if
    FLUSH_L1D is fully supported.
    
    Forward writes to hardware in host context on the off chance that KVM
    ends up emulating a WRMSR, or in the really unlikely scenario where
    userspace wants to force a flush.  Restrict these forwarded WRMSRs to
    the known command out of an abundance of caution.  Passing through the
    MSR means the guest can throw any and all values at hardware, but doing
    so in host context is arguably a bit more dangerous.
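
    The host-context fallback described above is restricted to the
    architecturally defined command, roughly (sketch of the common MSR
    handling):

            case MSR_IA32_FLUSH_CMD:
                    if (!msr_info->host_initiated &&
                        !guest_cpuid_has(vcpu, X86_FEATURE_FLUSH_L1D))
                            return 1;

                    /* Only forward the known L1D_FLUSH command to hardware. */
                    if (!boot_cpu_has(X86_FEATURE_FLUSH_L1D) || (data & ~L1D_FLUSH))
                            return 1;

                    wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
                    break;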
    
    Link: https://lkml.kernel.org/r/CALMp9eTt3xzAEoQ038bJQ9LN0ZOXrSWsN7xnNUD%2B0SS%3DWwF7Pg%40mail.gmail.com
    Link: https://lore.kernel.org/all/20230201132905.549148-2-eesposit@redhat.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20230322011440.2195485-6-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Apr 6, 2023