Merge tag 'pull-vfio-20230630' of https://github.com/legoater/qemu into staging

vfio queue:

* migration: New switchover ack to reduce downtime
* VFIO migration pre-copy support
* Removal of the VFIO migration experimental flag
* Alternate offset for GPUDirect Cliques
* Misc fixes

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEoPZlSPBIlev+awtgUaNDx8/77KEFAmSeVHYACgkQUaNDx8/7
# 7KHeZw/+LRe9QQpx8hU//vKBvLet2QvI3WUaXGHiHbblbRT6HhiHjWHB2/8j6jji
# QhAGJ6w9yoKODyY0kGpVFEnkmXOKyqwWssBheV219ntZs09pFGxZr/ldUhT22aBN
# kH8mHU9BZ3J+zF/kKphpcIC1sPxVu/DlrtnJu5vDGuRAOu8+3kFV217JC1yGs1Vh
# n+KOho8a8oP9qxtzfvQ9iZ4dpBOOKpE9vscS12wJAlen93AGB6esR7VaLxDjExRP
# yL1pguQ8ZZ1gEXXbXO62djKo3IViobtD08KmCXTzQ6TVquLleJzqgjp+A0THnYAe
# J9Rlja7LpsO9MYSxmRE9WcQccC+sAGn/t/ufB0tL8zR43FvfhbF5H0PzBBY0H7YA
# JlzN+fgrKEEHJwMhXANNvSddhWCwvrkjNxo/80u3ySYMQR1Hav/tsXYBlk16e5nS
# fmtrFGTwhsVdy1Q6ZqEOyTni1eiYt5stEQMZFODdUNj6b9FugSZ0BK+2WN/M0CzU
# 6mKmJQgZAG/nBoRJm/XCO5OKQ6wm/4tm6F4HSH5EJ6mDT+DqETAk4GRUWTbYa2/G
# yAAOlhTMu8Xc/NhMeJ7Z99dyq0SM8pi/XpVEIv7p9yBak8ix60iCWZtDE8vlDv3M
# UfMVMTAvTS30kbS6FDN2Yyl6l8/ETdcwVIN4l02ipGzpMCtn9EQ=
# =dKUj
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 30 Jun 2023 06:05:10 AM CEST
# gpg:                using RSA key A0F66548F04895EBFE6B0B6051A343C7CFFBECA1
# gpg: Good signature from "Cédric Le Goater <clg@kaod.org>" [undefined]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: A0F6 6548 F048 95EB FE6B  0B60 51A3 43C7 CFFB ECA1

* tag 'pull-vfio-20230630' of https://github.com/legoater/qemu:
  vfio/pci: Free leaked timer in vfio_realize error path
  vfio/pci: Fix a segfault in vfio_realize
  MAINTAINERS: Promote Cédric to VFIO co-maintainer
  vfio/migration: Make VFIO migration non-experimental
  vfio/migration: Reset bytes_transferred properly
  vfio/pci: Call vfio_prepare_kvm_msi_virq_batch() in MSI retry path
  hw/vfio/pci-quirks: Support alternate offset for GPUDirect Cliques
  vfio: Implement a common device info helper
  vfio/migration: Add support for switchover ack capability
  vfio/migration: Add VFIO migration pre-copy support
  vfio/migration: Store VFIO migration flags in VFIOMigration
  vfio/migration: Refactor vfio_save_block() to return saved data size
  tests: Add migration switchover ack capability test
  migration: Enable switchover ack capability
  migration: Implement switchover ack logic
  migration: Add switchover ack capability

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
rth7680 committed Jun 30, 2023
2 parents f788416 + 0cc889c commit 408015a
Showing 20 changed files with 600 additions and 118 deletions.
2 changes: 1 addition & 1 deletion MAINTAINERS
@@ -2051,7 +2051,7 @@ F: hw/usb/dev-serial.c

VFIO
M: Alex Williamson <alex.williamson@redhat.com>
R: Cédric Le Goater <clg@redhat.com>
M: Cédric Le Goater <clg@redhat.com>
S: Supported
F: hw/vfio/*
F: include/hw/vfio/
45 changes: 35 additions & 10 deletions docs/devel/vfio-migration.rst
@@ -7,12 +7,21 @@ the guest is running on source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.

Migration of VFIO devices currently consists of a single stop-and-copy phase.
During the stop-and-copy phase the guest is stopped and the entire VFIO device
data is transferred to the destination.

The pre-copy phase of migration is currently not supported for VFIO devices.
Support for VFIO pre-copy will be added later on.
Migration of VFIO devices consists of two phases: the optional pre-copy phase,
and the stop-and-copy phase. The pre-copy phase is iterative and allows to
accommodate VFIO devices that have a large amount of data that needs to be
transferred. The iterative pre-copy phase of migration allows for the guest to
continue whilst the VFIO device state is transferred to the destination, this
helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
VFIO_DEVICE_FEATURE_MIGRATION ioctl.

When pre-copy is supported, it's possible to further reduce downtime by
enabling "switchover-ack" migration capability.
VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
and recommends that the initial bytes are sent and loaded in the destination
before stopping the source VM. Enabling this migration capability will
guarantee that and thus, can potentially reduce downtime even further.

Note that currently VFIO migration is supported only for a single device. This
is due to VFIO migration's lack of P2P support. However, P2P support is planned
@@ -29,10 +38,23 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``load_setup`` function that sets the VFIO device on the destination in
_RESUMING state.

* A ``state_pending_estimate`` function that reports an estimate of the
remaining pre-copy data that the vendor driver has yet to save for the VFIO
device.

* A ``state_pending_exact`` function that reads pending_bytes from the vendor
driver, which indicates the amount of data that the vendor driver has yet to
save for the VFIO device.

* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
active only when the VFIO device is in pre-copy states.

* A ``save_live_iterate`` function that reads the VFIO device's data from the
vendor driver during iterative pre-copy phase.

* A ``switchover_ack_needed`` function that checks if the VFIO device uses
"switchover-ack" migration capability when this capability is enabled.

* A ``save_state`` function to save the device config space if it is present.

* A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -111,8 +133,10 @@ Flow of state changes during Live migration
===========================================

Below is the flow of state change during live migration.
The values in the brackets represent the VM state, the migration state, and
The values in the parentheses represent the VM state, the migration state, and
the VFIO device state, respectively.
The text in the square brackets represents the flow if the VFIO device supports
pre-copy.

Live migration save path
------------------------
@@ -124,11 +148,12 @@ Live migration save path
|
migrate_init spawns migration_thread
Migration thread then calls each device's .save_setup()
(RUNNING, _SETUP, _RUNNING)
(RUNNING, _SETUP, _RUNNING [_PRE_COPY])
|
(RUNNING, _ACTIVE, _RUNNING)
If device is active, get pending_bytes by .state_pending_exact()
(RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
If device is active, get pending_bytes by .state_pending_{estimate,exact}()
If total pending_bytes >= threshold_size, call .save_live_iterate()
[Data of VFIO device for pre-copy phase is copied]
Iterate till total pending bytes converge and are less than threshold
|
On migration completion, vCPU stops and calls .save_live_complete_precopy for
37 changes: 5 additions & 32 deletions hw/s390x/s390-pci-vfio.c
@@ -289,38 +289,11 @@ static void s390_pci_read_pfip(S390PCIBusDevice *pbdev,
memcpy(pbdev->zpci_fn.pfip, cap->pfip, CLP_PFIP_NR_SEGMENTS);
}

static struct vfio_device_info *get_device_info(S390PCIBusDevice *pbdev,
uint32_t argsz)
static struct vfio_device_info *get_device_info(S390PCIBusDevice *pbdev)
{
struct vfio_device_info *info = g_malloc0(argsz);
VFIOPCIDevice *vfio_pci;
int fd;
VFIOPCIDevice *vfio_pci = container_of(pbdev->pdev, VFIOPCIDevice, pdev);

vfio_pci = container_of(pbdev->pdev, VFIOPCIDevice, pdev);
fd = vfio_pci->vbasedev.fd;

/*
* If the specified argsz is not large enough to contain all capabilities
* it will be updated upon return from the ioctl. Retry until we have
* a big enough buffer to hold the entire capability chain. On error,
* just exit and rely on CLP defaults.
*/
retry:
info->argsz = argsz;

if (ioctl(fd, VFIO_DEVICE_GET_INFO, info)) {
trace_s390_pci_clp_dev_info(vfio_pci->vbasedev.name);
g_free(info);
return NULL;
}

if (info->argsz > argsz) {
argsz = info->argsz;
info = g_realloc(info, argsz);
goto retry;
}

return info;
return vfio_get_device_info(vfio_pci->vbasedev.fd);
}

/*
@@ -335,7 +308,7 @@ bool s390_pci_get_host_fh(S390PCIBusDevice *pbdev, uint32_t *fh)

assert(fh);

info = get_device_info(pbdev, sizeof(*info));
info = get_device_info(pbdev);
if (!info) {
return false;
}
@@ -356,7 +329,7 @@ void s390_pci_get_clp_info(S390PCIBusDevice *pbdev)
{
g_autofree struct vfio_device_info *info = NULL;

info = get_device_info(pbdev, sizeof(*info));
info = get_device_info(pbdev);
if (!info) {
return;
}
68 changes: 53 additions & 15 deletions hw/vfio/common.c
@@ -381,7 +381,7 @@ static unsigned int vfio_migratable_device_num(void)
return device_num;
}

int vfio_block_multiple_devices_migration(Error **errp)
int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
{
int ret;

@@ -390,6 +390,12 @@ int vfio_block_multiple_devices_migration(Error **errp)
return 0;
}

if (vbasedev->enable_migration == ON_OFF_AUTO_ON) {
error_setg(errp, "Migration is currently not supported with multiple "
"VFIO devices");
return -EINVAL;
}

error_setg(&multiple_devices_migration_blocker,
"Migration is currently not supported with multiple "
"VFIO devices");
@@ -427,7 +433,7 @@ static bool vfio_viommu_preset(void)
return false;
}

int vfio_block_giommu_migration(Error **errp)
int vfio_block_giommu_migration(VFIODevice *vbasedev, Error **errp)
{
int ret;

@@ -436,6 +442,12 @@ int vfio_block_giommu_migration(Error **errp)
return 0;
}

if (vbasedev->enable_migration == ON_OFF_AUTO_ON) {
error_setg(errp,
"Migration is currently not supported with vIOMMU enabled");
return -EINVAL;
}

error_setg(&giommu_migration_blocker,
"Migration is currently not supported with vIOMMU enabled");
ret = migrate_add_blocker(giommu_migration_blocker, errp);
@@ -492,7 +504,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
}

if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
return false;
}
}
@@ -537,7 +550,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
return false;
}

if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
continue;
} else {
return false;
@@ -2844,11 +2858,35 @@ void vfio_put_group(VFIOGroup *group)
}
}

struct vfio_device_info *vfio_get_device_info(int fd)
{
struct vfio_device_info *info;
uint32_t argsz = sizeof(*info);

info = g_malloc0(argsz);

retry:
info->argsz = argsz;

if (ioctl(fd, VFIO_DEVICE_GET_INFO, info)) {
g_free(info);
return NULL;
}

if (info->argsz > argsz) {
argsz = info->argsz;
info = g_realloc(info, argsz);
goto retry;
}

return info;
}
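
Editor's note: the helper above follows the usual `argsz` convention, where the caller passes a buffer size and, if the kernel reports a larger one, reallocates and asks again. Below is a self-contained sketch of that grow-and-retry pattern that runs without a VFIO device. `InfoHeader`, `query_fn`, `FakeProvider`, `fake_query()` and `query_info_grow()` are hypothetical names, and the callback stands in for `ioctl(fd, VFIO_DEVICE_GET_INFO, info)`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header using the same argsz convention as the ioctl. */
typedef struct {
    uint32_t argsz; /* in: buffer size, out: size the provider needs */
    uint32_t flags; /* example payload field */
} InfoHeader;

/* Stand-in for the ioctl call; returns 0 on success, non-zero on error. */
typedef int (*query_fn)(InfoHeader *info, void *opaque);

/* Fake provider used to exercise the loop without real hardware. */
typedef struct {
    uint32_t needed; /* size the provider wants, 0 simulates a failure */
    unsigned calls;  /* number of times the query was issued */
} FakeProvider;

static int fake_query(InfoHeader *info, void *opaque)
{
    FakeProvider *p = opaque;

    p->calls++;
    if (p->needed == 0) {
        return -1;
    }
    if (info->argsz >= p->needed) {
        info->flags = 0x1; /* payload is only valid once the buffer fits */
    }
    info->argsz = p->needed;
    return 0;
}

/* Query, growing the buffer until it holds everything the provider reports. */
static InfoHeader *query_info_grow(query_fn query, void *opaque)
{
    uint32_t argsz = sizeof(InfoHeader);
    InfoHeader *info = calloc(1, argsz);

    assert(query != NULL);
    if (!info) {
        return NULL;
    }

    for (;;) {
        InfoHeader *bigger;

        info->argsz = argsz;
        if (query(info, opaque) != 0) {
            free(info);
            return NULL;
        }
        if (info->argsz <= argsz) {
            return info; /* buffer was large enough */
        }

        /* Provider asked for more room: grow and retry. */
        argsz = info->argsz;
        bigger = realloc(info, argsz);
        if (!bigger) {
            free(info);
            return NULL;
        }
        info = bigger;
    }
}
```

The caller owns the returned buffer and frees it, which matches how the callers in this diff hold the result in a `g_autofree` pointer.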

int vfio_get_device(VFIOGroup *group, const char *name,
VFIODevice *vbasedev, Error **errp)
{
struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
int ret, fd;
g_autofree struct vfio_device_info *info = NULL;
int fd;

fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
if (fd < 0) {
@@ -2860,11 +2898,11 @@ int vfio_get_device(VFIOGroup *group, const char *name,
return fd;
}

ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info);
if (ret) {
info = vfio_get_device_info(fd);
if (!info) {
error_setg_errno(errp, errno, "error getting device info");
close(fd);
return ret;
return -1;
}

/*
@@ -2892,14 +2930,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
vbasedev->group = group;
QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);

vbasedev->num_irqs = dev_info.num_irqs;
vbasedev->num_regions = dev_info.num_regions;
vbasedev->flags = dev_info.flags;
vbasedev->num_irqs = info->num_irqs;
vbasedev->num_regions = info->num_regions;
vbasedev->flags = info->flags;

trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);

trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
dev_info.num_irqs);
vbasedev->reset_works = !!(info->flags & VFIO_DEVICE_FLAGS_RESET);

vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
return 0;
}
