Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging

virtio,pc,pci: features, cleanups, fixes

vhost-user enabled on non-linux systems
beginning of nvme sriov support
bigger tx queue for vdpa
virtio iommu bypass
pci tests for arm

Fixes, cleanups all over the place

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

# gpg: Signature made Fri 04 Mar 2022 13:31:14 GMT
# gpg:                using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469
# gpg:                issuer "mst@redhat.com"
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg:                 aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17  0970 C350 3912 AFBE 8E67
#      Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA  8A0D 281F 0DB8 D28D 5469

* remotes/mst/tags/for_upstream: (45 commits)
  docs: vhost-user: add subsection for non-Linux platforms
  configure, meson: allow enabling vhost-user on all POSIX systems
  vhost: use wfd on functions setting vring call fd
  event_notifier: add event_notifier_get_wfd()
  x86: cleanup unused compat_apic_id_mode
  vhost-vsock: detach the virqueue element in case of error
  pc: add option to disable PS/2 mouse/keyboard
  acpi: pcihp: pcie: set power on cap on parent slot
  pci: expose TYPE_XIO3130_DOWNSTREAM name
  pci: show id info when pci BDF conflict
  hw/misc/pvpanic: Use standard headers instead
  headers: Add pvpanic.h
  pci-bridge/xio3130_downstream: Fix error handling
  pci-bridge/xio3130_upstream: Fix error handling
  pcie: Add 1.2 version token for the Power Management Capability
  pcie: Add a helper to the SR/IOV API
  pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
  pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  virtio-net: Unlimit tx queue size if peer is vdpa
  hw/pci-bridge/pxb: Fix missing swizzle
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
pm215 committed Mar 4, 2022
2 parents 3d1fbc5 + 74bc2c5 commit 0890184
Showing 70 changed files with 1,581 additions and 206 deletions.
1 change: 0 additions & 1 deletion MAINTAINERS
@@ -1819,7 +1819,6 @@ F: docs/specs/acpi_hw_reduced_hotplug.rst

ACPI/VIOT
M: Jean-Philippe Brucker <jean-philippe@linaro.org>
-R: Ani Sinha <ani@anisinha.ca>
S: Supported
F: hw/acpi/viot.c
F: hw/acpi/viot.h
4 changes: 2 additions & 2 deletions configure
@@ -1659,8 +1659,8 @@ fi
# vhost interdependencies and host support

# vhost backends
-if test "$vhost_user" = "yes" && test "$linux" != "yes"; then
-  error_exit "vhost-user is only available on Linux"
+if test "$vhost_user" = "yes" && test "$mingw32" = "yes"; then
+  error_exit "vhost-user is not available on Windows"
fi
test "$vhost_vdpa" = "" && vhost_vdpa=$linux
if test "$vhost_vdpa" = "yes" && test "$linux" != "yes"; then
8 changes: 8 additions & 0 deletions docs/about/deprecated.rst
@@ -324,6 +324,14 @@ machine is hardly emulated at all (e.g. neither the LCD nor the USB part had
been implemented), so there is not much value added by this board. Use the
``ref405ep`` machine instead.

``pc-i440fx-1.4`` up to ``pc-i440fx-1.7`` (since 7.0)
'''''''''''''''''''''''''''''''''''''''''''''''''''''

These old machine types are quite neglected nowadays and thus might have
various pitfalls with regards to live migration. Use a newer machine type
instead.


Backend options
---------------

20 changes: 20 additions & 0 deletions docs/interop/vhost-user.rst
@@ -38,6 +38,26 @@ conventions <backend_conventions>`.
*Master* and *slave* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Support for platforms other than Linux
--------------------------------------

While vhost-user was initially developed targeting Linux, nowadays it
is supported on any platform that provides the following features:

- A way for requesting shared memory represented by a file descriptor
so it can be passed over a UNIX domain socket and then mapped by the
other process.

- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
exchange messages through it, including ancillary data when needed.

- Either eventfd or pipe/pipe2. On platforms where eventfd is not
available, QEMU will automatically fall back to pipe2 or, as a last
resort, pipe. Each file descriptor will be used for receiving or
sending events by reading or writing (respectively) an 8-byte value
on the corresponding file descriptor. The 8-byte value itself has no
meaning and should not be interpreted.
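
The event mechanism in the last item can be illustrated with a minimal sketch. It uses a plain pipe as the stand-in for eventfd; the helper names (`make_notifier`, `signal_event`, `wait_event`) are hypothetical and are not part of QEMU or the vhost-user protocol.

```python
import os
import struct


def make_notifier():
    """Return (read_fd, write_fd); a pipe stands in where eventfd is unavailable."""
    return os.pipe()


def signal_event(write_fd):
    # The payload is an arbitrary 8-byte value; receivers must not interpret it.
    os.write(write_fd, struct.pack("=Q", 1))


def wait_event(read_fd):
    # Consume one 8-byte event; only its arrival matters, not its contents.
    data = os.read(read_fd, 8)
    return len(data) == 8


read_fd, write_fd = make_notifier()
signal_event(write_fd)
received = wait_event(read_fd)
os.close(read_fd)
os.close(write_fd)
```

With eventfd a single descriptor serves both directions; with a pipe the two ends are distinct, which is why each descriptor here is used only for reading or only for writing.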

Message Specification
=====================

115 changes: 115 additions & 0 deletions docs/pcie_sriov.txt
@@ -0,0 +1,115 @@
PCI SR/IOV EMULATION SUPPORT
============================

Description
===========
SR/IOV (Single Root I/O Virtualization) is an optional extended capability
of a PCI Express device. It allows a single physical function (PF) to appear as multiple
virtual functions (VFs) for the main purpose of eliminating software
overhead in I/O from virtual machines.

QEMU now implements the basic common functionality to enable an emulated device
to support SR/IOV. No fully implemented device exists in QEMU yet, but a
proof-of-concept hack of the Intel igb can be found here:

git://github.com/knuto/qemu.git sriov_patches_v5

Implementation
==============
Implementing emulation of an SR/IOV capable device typically consists of
implementing support for two types of device classes: the "normal" physical device
(PF) and the virtual device (VF). From QEMU's perspective, the VFs are just
like other devices, except that some of their properties are derived from
the PF.

A virtual function is different from a physical function in that the BAR
space for all VFs is defined by the BAR registers in the PF's SR/IOV
capability. All VFs have the same BARs and BAR sizes.

The address of an access to one of these virtual BARs is then computed as

   <VF BAR start> + <VF number> * <BAR sz> + <offset>

From our emulation perspective this means that there is a separate call for
setting up a BAR for a VF.
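
The address formula above can be checked with a small worked example. This is an illustrative sketch only: the function name and the sample values are made up, and the VF number is assumed to be counted from 0.

```python
def vf_bar_address(vf_bar_start, vf_number, bar_size, offset):
    """Address of `offset` inside the BAR of VF `vf_number` (0-based, assumed)."""
    if not 0 <= offset < bar_size:
        raise ValueError("offset must fall inside a single VF BAR")
    return vf_bar_start + vf_number * bar_size + offset


# Hypothetical values: VF BARs start at 0xFE000000 and are 16 KiB each.
addr = vf_bar_address(0xFE000000, 2, 0x4000, 0x10)
```

Because every VF has the same BAR size, the VF BAR region is a flat array of equally sized windows, which is why one base register in the PF is enough to place them all.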

1) To enable SR/IOV support in the PF, it must be a PCI Express device, so
   you need to add a PCI Express capability to the normal PCI
   capability list. You might also want to add an ARI (Alternative
   Routing-ID Interpretation) capability to indicate that your device
   supports functions beyond its "own" function space (0-7).
   This is necessary to support more than 7 functions, or
   if functions extend beyond offset 7 because they are placed at an
   offset > 1 or have stride > 1.

   ...
   #include "hw/pci/pcie.h"
   #include "hw/pci/pcie_sriov.h"

   pci_your_pf_dev_realize( ... )
   {
      ...
      int ret = pcie_endpoint_cap_init(d, 0x70);
      ...
      pcie_ari_init(d, 0x100, 1);
      ...

      /* Add and initialize the SR/IOV capability */
      pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
                         vf_devid, initial_vfs, total_vfs,
                         fun_offset, stride);

      /* Set up individual VF BARs (parameters as for normal BARs) */
      pcie_sriov_pf_init_vf_bar( ... )
      ...
   }

For cleanup, you simply call:

pcie_sriov_pf_exit(device);

which will delete all the virtual functions and associated resources.
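
To see when the ARI capability from step 1 becomes necessary, the following sketch computes where the VFs land for a given `fun_offset` and `stride`. It assumes the PF sits at function 0 and that VF i (counted from 0) is placed at `fun_offset + i * stride`, following the First VF Offset and VF Stride fields of the SR/IOV capability. The helper names are hypothetical.

```python
def vf_function_numbers(fun_offset, stride, total_vfs):
    """Function numbers of the VFs, relative to a PF assumed to be function 0."""
    return [fun_offset + i * stride for i in range(total_vfs)]


def needs_ari(fun_offset, stride, total_vfs):
    # Without ARI only function numbers 0-7 are addressable on a device.
    return any(fn > 7 for fn in vf_function_numbers(fun_offset, stride, total_vfs))


few_vfs = needs_ari(fun_offset=1, stride=1, total_vfs=4)    # VFs at 1, 2, 3, 4
many_vfs = needs_ari(fun_offset=1, stride=1, total_vfs=8)   # last VF at 8
strided = needs_ari(fun_offset=2, stride=2, total_vfs=4)    # VFs at 2, 4, 6, 8
```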

2) Similarly, in the implementation of the virtual function, you need to
   make it a PCI Express device and add a similar set of capabilities
   except for the SR/IOV capability. Then you need to set up the VF BARs as
   subregions of the PF's SR/IOV VF BARs by calling
   pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call:

   pci_your_vf_dev_realize( ... )
   {
      ...
      int ret = pcie_endpoint_cap_init(d, 0x60);
      ...
      pcie_ari_init(d, 0x100, 1);
      ...
      memory_region_init(mr, ... )
      pcie_sriov_vf_register_bar(d, bar_nr, mr);
      ...
   }

Testing on Linux guest
======================
Testing is easiest if your device driver supports sysfs-based SR/IOV
enabling. Support for this was added in kernel v3.8, so not all drivers
support it yet.

To enable 4 VFs for a device at 01:00.0:

modprobe yourdriver
echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

You should now see 4 VFs with lspci.
To turn SR/IOV off again (the standard requires you to turn it off before
you can enable another VF count, and the emulation enforces this):

echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
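
The "turn it off before changing the count" rule can be wrapped in a small helper. This is a sketch under stated assumptions: `set_num_vfs` is a hypothetical name, and the demonstration runs against a temporary directory that imitates the sysfs attribute rather than a real device.

```python
import tempfile
from pathlib import Path


def set_num_vfs(device_dir, count):
    """Write `count` to sriov_numvfs, resetting to 0 first when required.

    Returns the list of values written, in order.
    """
    attr = Path(device_dir) / "sriov_numvfs"
    current = int(attr.read_text().strip())
    writes = []
    if current == count:
        return writes
    if current != 0 and count != 0:
        # A nonzero count cannot be changed directly to another nonzero count.
        attr.write_text("0\n")
        writes.append(0)
    attr.write_text(f"{count}\n")
    writes.append(count)
    return writes


# Demonstration against a fake device directory that already has 2 VFs enabled.
with tempfile.TemporaryDirectory() as fake_device:
    (Path(fake_device) / "sriov_numvfs").write_text("2\n")
    writes = set_num_vfs(fake_device, 4)
    final = (Path(fake_device) / "sriov_numvfs").read_text()
```

On a real guest the directory would be something like /sys/bus/pci/devices/0000:01:00.0, and the writes require root.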

Older drivers typically provide a max_vfs module parameter
to enable it at load time:

modprobe yourdriver max_vfs=4

To disable the VFs again, simply unload the driver:

rmmod yourdriver
200 changes: 200 additions & 0 deletions docs/specs/acpi_erst.rst
@@ -0,0 +1,200 @@
ACPI ERST DEVICE
================

The ACPI ERST device is utilized to support the ACPI Error Record
Serialization Table, ERST, functionality. This feature is designed for
storing error records in persistent storage for future reference
and/or debugging.

The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces
(APEI)", and specifically subsection "Error Serialization", outlines a
method for storing error records into persistent storage.

The format of error records is described in the UEFI specification[2],
in Appendix N "Common Platform Error Record".

While the ACPI specification allows for an NVRAM "mode" (see
GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is
directly exposed for direct access by the OS/guest, this device
implements the non-NVRAM "mode". This non-NVRAM "mode" is what is
implemented by most BIOS (since flash memory requires programming
operations in order to update its contents). Furthermore, as of the
time of this writing, Linux only supports the non-NVRAM "mode".


Background/Motivation
---------------------

Linux uses the persistent storage filesystem, pstore, to record
information (e.g. dmesg tail) upon panics and shutdowns. Pstore is
independent of, and runs before, kdump. In certain scenarios (i.e.
hosts/guests with root filesystems on NFS/iSCSI where networking
software and/or hardware fails, and thus kdump fails), pstore may
contain information available for post-mortem debugging.

Two common storage backends for the pstore filesystem are ACPI ERST
and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in all
guests. With QEMU supporting ACPI ERST, it becomes a viable pstore
storage backend for virtual machines (as it is now for bare metal
machines).

Enabling support for ACPI ERST facilitates a consistent method to
capture kernel panic information in a wide range of guests: from
resource-constrained microvms to very large guests, and in particular,
in direct-boot environments (which would lack UEFI run-time services).

Note that Microsoft Windows also utilizes the ACPI ERST for certain
crash information, if available[3].


Configuration|Usage
-------------------

To use ACPI ERST, a memory-backend-file object and acpi-erst device
can be created, for example:

   qemu ... \
   -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing,size=0x10000,share=on \
   -device acpi-erst,memdev=erstnvram

For proper operation, the ACPI ERST device needs a memory-backend-file
object with the following parameters:

- id: The id of the memory-backend-file object is used to associate
this memory with the acpi-erst device.
- size: The size of the ACPI ERST backing storage. This parameter is
required.
- mem-path: The location of the ACPI ERST backing storage file. This
parameter is also required.
- share: The share=on parameter is required so that updates to the
ERST backing store are written to the file.

and an acpi-erst device with the following parameters:

- memdev: The object id of the memory-backend-file.
- record_size: Specifies the size of the records (or slots) in the
backend storage. Must be a power-of-two value greater than or
equal to 4096 (PAGE_SIZE).
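
The record_size constraint can be expressed as a one-line check. This is only a sketch of the rule as stated above; the function name is hypothetical, and any upper bound the device may enforce is not modeled.

```python
def is_valid_record_size(record_size):
    """True if record_size is a power of two and at least 4096."""
    is_power_of_two = record_size > 0 and (record_size & (record_size - 1)) == 0
    return is_power_of_two and record_size >= 4096


checks = {size: is_valid_record_size(size) for size in (2048, 4096, 6144, 8192)}
```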


PCI Interface
-------------

The ERST device is a PCI device with two BARs, one for accessing the
programming registers, and the other for accessing the record exchange
buffer.

BAR0 contains the programming interface consisting of ACTION and VALUE
64-bit registers. All ERST actions/operations/side effects happen on
the write to the ACTION, by design. Any data needed by the action must
be placed into VALUE prior to writing ACTION. Reading the VALUE
simply returns the register contents, which can be updated by a
previous ACTION.

BAR1 contains the 8KiB record exchange buffer, which is the
implemented maximum record size.


Backend Storage Format
----------------------

The backend storage is divided into fixed size "slots", 8KiB in
length, with each slot storing a single record. Not all slots need to
be occupied, and they need not be occupied in a contiguous fashion.
The ability to clear/erase specific records allows for the formation
of unoccupied slots.

Slot 0 contains a backend storage header that identifies the contents
as ERST and also facilitates efficient access to the records.
Depending upon the size of the backend storage, additional slots will
be designated to be a part of the slot 0 header. For example, at 8KiB,
the slot 0 header can accommodate 1021 records. Thus a storage size
of 8MiB (8KiB * 1024) requires an additional slot for use by the
header. In this scenario, slot 0 and slot 1 form the backend storage
header, and records can be stored starting at slot 2.
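
The figure of 1021 records, and the need for a second header slot at 8MiB, follow from simple arithmetic. The sketch below assumes 8KiB slots, a fixed header part of 0x18 bytes (the offset of record_id[0]), and one 8-byte record_id entry per slot in the storage; the names are illustrative, not taken from the implementation.

```python
SLOT_SIZE = 8 * 1024          # bytes per slot
HEADER_FIXED_BYTES = 0x18     # magic, offsets, counts; record_id[0] follows
RECORD_ID_BYTES = 8           # each record_id is 64 bits


def ids_in_first_slot():
    """How many record_id entries fit in slot 0 after the fixed header part."""
    return (SLOT_SIZE - HEADER_FIXED_BYTES) // RECORD_ID_BYTES


def header_slots(storage_size):
    """Slots consumed by the header for a backing store of `storage_size` bytes."""
    total_slots = storage_size // SLOT_SIZE
    header_bytes = HEADER_FIXED_BYTES + total_slots * RECORD_ID_BYTES
    return -(-header_bytes // SLOT_SIZE)  # ceiling division


capacity = ids_in_first_slot()
small = header_slots(0x10000)            # the 64KiB example store
large = header_slots(8 * 1024 * 1024)    # 8MiB, i.e. 1024 slots
```

An 8MiB store has 1024 slots, three more than slot 0 can index, so the record_id array spills into slot 1.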

Below is an example layout of the backend storage format (for storage
size less than 8MiB). The size of the storage is a multiple of 8KiB,
and contains N number of slots to store records. The example below
shows two records (in CPER format) in the backend storage, while the
remaining slots are empty/available.

::

   Slot   Record
          <------------------ 8KiB -------------------->
          +--------------------------------------------+
      0   | storage header                             |
          +--------------------------------------------+
      1   | empty/available                            |
          +--------------------------------------------+
      2   | CPER                                       |
          +--------------------------------------------+
      3   | CPER                                       |
          +--------------------------------------------+
    ...   |                                            |
          +--------------------------------------------+
      N   | empty/available                            |
          +--------------------------------------------+

The storage header consists of some basic information and an array
of CPER record_id's to efficiently access records in the backend
storage.

All fields in the header are stored in little endian format.

::

  +--------------------------------------------+
  | magic                                      | 0x0000
  +--------------------------------------------+
  | record_offset        | record_size         | 0x0008
  +--------------------------------------------+
  | record_count         | reserved | version  | 0x0010
  +--------------------------------------------+
  | record_id[0]                               | 0x0018
  +--------------------------------------------+
  | record_id[1]                               | 0x0020
  +--------------------------------------------+
  | record_id[...]                             |
  +--------------------------------------------+
  | record_id[N]                               | 0x1FF8
  +--------------------------------------------+

The 'magic' field contains the value 0x524F545354535245.

The 'record_size' field contains the value 0x2000, 8KiB.

The 'record_offset' field points to the first record_id in the array,
0x0018.

The 'version' field contains 0x0100, the first version.

The 'record_count' field contains the number of valid records in the
backend storage.

The 'record_id' array fields are the 64-bit record identifiers of the
CPER record in the corresponding slot. Stated differently, the
location of a CPER record_id in the record_id[] array provides the
slot index for the corresponding record in the backend storage.

Note that, for example, with a backend storage less than 8MiB, slot 0
contains the header, so record_id[0] will never contain a valid
CPER record_id. Instead slot 1 is the first available slot and thus
record_id[1] may contain a CPER record_id.

A 'record_id' of all 0s or all 1s indicates an invalid record (i.e. the
slot is available).
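
As a rough illustration of how the record_id array maps to slots, the sketch below checks the magic value and lists the occupied slots of a header held in memory. It relies only on facts stated above (little-endian fields, magic at offset 0, record_id[0] at offset 0x18, all-0s or all-1s meaning "available"); the function name and the sample header are made up.

```python
import struct

MAGIC = 0x524F545354535245            # reads "ERSTSTOR" as little-endian bytes
RECORD_ID_ARRAY_OFFSET = 0x18
INVALID_IDS = (0, 0xFFFFFFFFFFFFFFFF)


def occupied_slots(header, num_slots):
    """Map slot index -> record_id for every slot holding a valid record."""
    (magic,) = struct.unpack_from("<Q", header, 0)
    if magic != MAGIC:
        raise ValueError("not an ERST backing store")
    ids = struct.unpack_from(f"<{num_slots}Q", header, RECORD_ID_ARRAY_OFFSET)
    return {slot: rid for slot, rid in enumerate(ids) if rid not in INVALID_IDS}


# Hypothetical 4-slot store: only slot 2 holds a record.
sample = (
    struct.pack("<Q", MAGIC)
    + bytes(0x10)  # remaining fixed header fields, zeroed for this sketch
    + struct.pack("<4Q", 0, 0, 0x1234, 0xFFFFFFFFFFFFFFFF)
)
slots = occupied_slots(sample, 4)
```

The index in the array is the slot number, so no separate lookup table is needed to find a record once its record_id is located.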


References
----------

[1] "Advanced Configuration and Power Interface Specification",
version 4.0, June 2009.

[2] "Unified Extensible Firmware Interface Specification",
version 2.1, October 2008.

[3] "Windows Hardware Error Architecture", specifically
"Error Record Persistence Mechanism".
1 change: 1 addition & 0 deletions docs/specs/index.rst
@@ -18,3 +18,4 @@ guest hardware that is specific to QEMU.
acpi_mem_hotplug
acpi_pci_hotplug
acpi_nvdimm
acpi_erst
1 change: 1 addition & 0 deletions docs/specs/pci-ids.txt
@@ -65,6 +65,7 @@ PCI devices (other than virtio):
1b36:000f mdpy (mdev sample device), linux/samples/vfio-mdev/mdpy.c
1b36:0010 PCIe NVMe device (-device nvme)
1b36:0011 PCI PVPanic device (-device pvpanic-pci)
1b36:0012 PCI ACPI ERST device (-device acpi-erst)

All these devices are documented in docs/specs.

