Merge tag 'pull-vfio-20230911' of https://github.com/legoater/qemu into staging

vfio queue:

* Small downtime optimisation for VFIO migration
* P2P support for VFIO migration
* Introduction of a save_prepare() handler to fail VFIO migration
* Fix for DMA logging ranges calculation for OVMF enabling dynamic window

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEoPZlSPBIlev+awtgUaNDx8/77KEFAmT+uZQACgkQUaNDx8/7
# 7KGFSw//UIqSet6MUxZZh/t7yfNFUTnxx6iPdChC3BphBaDDh99FCQrw5mPZ8ImF
# 4rz0cIwSaHXraugEsC42TDaGjEmcAmYD0Crz+pSpLU21nKtYyWtZy6+9kyYslMNF
# bUq0UwD0RGTP+ZZi6GBy1hM30y/JbNAGeC6uX8kyJRuK5Korfzoa/X5h+B2XfouW
# 78G1mARHq5eOkGy91+rAJowdjqtkpKrzkfCJu83330Bb035qAT/PEzGs5LxdfTla
# ORNqWHy3W+d8ZBicBQ5vwrk6D5JIZWma7vdXJRhs1wGO615cuyt1L8nWLFr8klW5
# MJl+wM7DZ6UlSODq7r839GtSuWAnQc2j7JKc+iqZuBBk1v9fGXv2tZmtuTGkG2hN
# nYXSQfuq1igu1nGVdxJv6WorDxsK9wzLNO2ckrOcKTT28RFl8oCDNSPPTKpwmfb5
# i5RrGreeXXqRXIw0VHhq5EqpROLjAFwE9tkJndO8765Ag154plxssaKTUWo5wm7/
# kjQVuRuhs5nnMXfL9ixLZkwD1aFn5fWAIaR0psH5vGD0fnB1Pba+Ux9ZzHvxp5D8
# Kg3H6dKlht6VXdQ/qb0Up1LXCGEa70QM6Th2iO924ydZkkmqrSj+CFwGHvBsINa4
# 89fYd77nbRbdwWurj3JIznJYVipau2PmfbjZ/jTed4RxjBQ+fPA=
# =44e0
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 11 Sep 2023 02:54:12 EDT
# gpg:                using RSA key A0F66548F04895EBFE6B0B6051A343C7CFFBECA1
# gpg: Good signature from "Cédric Le Goater <clg@redhat.com>" [unknown]
# gpg:                 aka "Cédric Le Goater <clg@kaod.org>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: A0F6 6548 F048 95EB FE6B  0B60 51A3 43C7 CFFB ECA1

* tag 'pull-vfio-20230911' of https://github.com/legoater/qemu:
  vfio/common: Separate vfio-pci ranges
  vfio/migration: Block VFIO migration with background snapshot
  vfio/migration: Block VFIO migration with postcopy migration
  migration: Add .save_prepare() handler to struct SaveVMHandlers
  migration: Move more initializations to migrate_init()
  vfio/migration: Fail adding device with enable-migration=on and existing blocker
  migration: Add migration prefix to functions in target.c
  vfio/migration: Allow migration of multiple P2P supporting devices
  vfio/migration: Add P2P support for VFIO migration
  vfio/migration: Refactor PRE_COPY and RUNNING state checks
  qdev: Add qdev_add_vm_change_state_handler_full()
  sysemu: Add prepare callback to struct VMChangeStateEntry
  vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup()

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Stefan Hajnoczi committed Sep 11, 2023
2 parents cb6c406 + a31fe5d commit 9ef4977
Showing 14 changed files with 377 additions and 99 deletions.
93 changes: 57 additions & 36 deletions docs/devel/vfio-migration.rst
@@ -23,9 +23,21 @@ and recommends that the initial bytes are sent and loaded in the destination
before stopping the source VM. Enabling this migration capability will
guarantee that and thus can potentially reduce downtime even further.

Note that currently VFIO migration is supported only for a single device. This
is due to VFIO migration's lack of P2P support. However, P2P support is planned
to be added later on.
To support migration of multiple devices that might do P2P transactions between
themselves, VFIO migration uAPI defines an intermediate P2P quiescent state.
While in the P2P quiescent state, P2P DMA transactions cannot be initiated by
the device, but the device can respond to incoming ones. Additionally, all
outstanding P2P transactions are guaranteed to have been completed by the time
the device enters this state.

All the devices that support P2P migration are first transitioned to the P2P
quiescent state and only then are they stopped or started. This keeps the
migration P2P-safe even though the devices themselves are not started and
stopped atomically as a group.

Thus, migration of multiple VFIO devices is allowed only if all of them
support P2P migration. Migration of a single VFIO device is allowed regardless
of P2P migration support.
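
The stop ordering described above can be sketched as follows. This is an
illustration only, not code from this series: the helper
``vfio_set_mig_state()`` and the device array are assumptions, while the state
values come from the VFIO migration uAPI in ``linux/vfio.h``::

   /*
    * Illustration only (assumed helper vfio_set_mig_state(); the real logic
    * lives in hw/vfio/migration.c): stop a set of P2P-capable VFIO devices.
    */
   static int vfio_stop_all_devices(VFIODevice *devs[], int n)
   {
       int i, ret;

       /* Phase 1: no device may initiate P2P DMA, but all still respond. */
       for (i = 0; i < n; i++) {
           ret = vfio_set_mig_state(devs[i], VFIO_DEVICE_STATE_RUNNING_P2P);
           if (ret) {
               return ret;
           }
       }

       /* Phase 2: all P2P traffic is quiesced, so stopping order is irrelevant. */
       for (i = 0; i < n; i++) {
           ret = vfio_set_mig_state(devs[i], VFIO_DEVICE_STATE_STOP);
           if (ret) {
               return ret;
           }
       }

       return 0;
   }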

A detailed description of the UAPI for VFIO device migration can be found in
the comment for the ``vfio_device_mig_state`` structure in the header file
@@ -132,54 +144,63 @@ will be blocked.
Flow of state changes during Live migration
===========================================

Below is the flow of state change during live migration.
Below is the state change flow during live migration for a VFIO device that
supports both precopy and P2P migration. The flow for devices that lack
precopy and/or P2P support is similar, except that the corresponding states
are skipped.
The values in the parentheses represent the VM state, the migration state, and
the VFIO device state, respectively.
The text in the square brackets represents the flow if the VFIO device supports
pre-copy.

Live migration save path
------------------------

::

QEMU normal running state
(RUNNING, _NONE, _RUNNING)
|
migrate_init spawns migration_thread
Migration thread then calls each device's .save_setup()
(RUNNING, _SETUP, _RUNNING [_PRE_COPY])
|
(RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
If device is active, get pending_bytes by .state_pending_{estimate,exact}()
If total pending_bytes >= threshold_size, call .save_live_iterate()
[Data of VFIO device for pre-copy phase is copied]
Iterate till total pending bytes converge and are less than threshold
|
On migration completion, vCPU stops and calls .save_live_complete_precopy for
each active device. The VFIO device is then transitioned into _STOP_COPY state
(FINISH_MIGRATE, _DEVICE, _STOP_COPY)
|
For the VFIO device, iterate in .save_live_complete_precopy until
pending data is 0
(FINISH_MIGRATE, _DEVICE, _STOP)
|
(FINISH_MIGRATE, _COMPLETED, _STOP)
Migration thread schedules cleanup bottom half and exits
Migration thread then calls each device's .save_setup()
(RUNNING, _SETUP, _PRE_COPY)
|
(RUNNING, _ACTIVE, _PRE_COPY)
If device is active, get pending_bytes by .state_pending_{estimate,exact}()
If total pending_bytes >= threshold_size, call .save_live_iterate()
Data of VFIO device for pre-copy phase is copied
Iterate till total pending bytes converge and are less than threshold
|
On migration completion, the vCPUs and the VFIO device are stopped
The VFIO device is first put in P2P quiescent state
(FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
|
Then the VFIO device is put in _STOP_COPY state
(FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
.save_live_complete_precopy() is called for each active device
For the VFIO device, iterate in .save_live_complete_precopy() until
pending data is 0
|
(POSTMIGRATE, _COMPLETED, _STOP_COPY)
Migration thread schedules cleanup bottom half and exits
|
.save_cleanup() is called
(POSTMIGRATE, _COMPLETED, _STOP)

Live migration resume path
--------------------------

::

Incoming migration calls .load_setup for each device
(RESTORE_VM, _ACTIVE, _STOP)
|
For each device, .load_state is called for that device section data
(RESTORE_VM, _ACTIVE, _RESUMING)
|
At the end, .load_cleanup is called for each device and vCPUs are started
(RUNNING, _NONE, _RUNNING)
Incoming migration calls .load_setup() for each device
(RESTORE_VM, _ACTIVE, _STOP)
|
For each device, .load_state() is called for that device section data
(RESTORE_VM, _ACTIVE, _RESUMING)
|
At the end, .load_cleanup() is called for each device and vCPUs are started
The VFIO device is first put in P2P quiescent state
(RUNNING, _ACTIVE, _RUNNING_P2P)
|
(RUNNING, _NONE, _RUNNING)

Postcopy
========
14 changes: 13 additions & 1 deletion hw/core/vm-change-state-handler.c
@@ -55,8 +55,20 @@ static int qdev_get_dev_tree_depth(DeviceState *dev)
VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
VMChangeStateHandler *cb,
void *opaque)
{
return qdev_add_vm_change_state_handler_full(dev, cb, NULL, opaque);
}

/*
* Exactly like qdev_add_vm_change_state_handler() but passes a prepare_cb
* argument too.
*/
VMChangeStateEntry *qdev_add_vm_change_state_handler_full(
DeviceState *dev, VMChangeStateHandler *cb,
VMChangeStateHandler *prepare_cb, void *opaque)
{
int depth = qdev_get_dev_tree_depth(dev);

return qemu_add_vm_change_state_handler_prio(cb, opaque, depth);
return qemu_add_vm_change_state_handler_prio_full(cb, prepare_cb, opaque,
depth);
}
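
As a usage note (illustrative only: the device type and callback names below
are made up, not taken from this series), a device that needs the two-phase
sequencing would register both callbacks at realize time. Per the sysemu patch
in this series, the intent is that the prepare callbacks of all registered
entries run before any of the main callbacks:

/* Illustration only: hypothetical device using the new API. */
static void mydev_vm_state_prepare(void *opaque, bool running, RunState state)
{
    /* Phase 1, runs for every device first, e.g. enter the P2P quiescent state. */
}

static void mydev_vm_state_change(void *opaque, bool running, RunState state)
{
    /* Phase 2, called in device-tree-depth order, e.g. enter RUNNING or STOP. */
}

static void mydev_realize(DeviceState *dev, Error **errp)
{
    qdev_add_vm_change_state_handler_full(dev, mydev_vm_state_change,
                                          mydev_vm_state_prepare, dev);
}
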
126 changes: 102 additions & 24 deletions hw/vfio/common.c
@@ -27,6 +27,7 @@

#include "hw/vfio/vfio-common.h"
#include "hw/vfio/vfio.h"
#include "hw/vfio/pci.h"
#include "exec/address-spaces.h"
#include "exec/memory.h"
#include "exec/ram_addr.h"
@@ -363,41 +364,54 @@ bool vfio_mig_active(void)

static Error *multiple_devices_migration_blocker;

static unsigned int vfio_migratable_device_num(void)
/*
* Multiple devices migration is allowed only if all devices support P2P
* migration. Single device migration is allowed regardless of P2P migration
* support.
*/
static bool vfio_multiple_devices_migration_is_supported(void)
{
VFIOGroup *group;
VFIODevice *vbasedev;
unsigned int device_num = 0;
bool all_support_p2p = true;

QLIST_FOREACH(group, &vfio_group_list, next) {
QLIST_FOREACH(vbasedev, &group->device_list, next) {
if (vbasedev->migration) {
device_num++;

if (!(vbasedev->migration->mig_flags & VFIO_MIGRATION_P2P)) {
all_support_p2p = false;
}
}
}
}

return device_num;
return all_support_p2p || device_num <= 1;
}

int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
{
int ret;

if (multiple_devices_migration_blocker ||
vfio_migratable_device_num() <= 1) {
if (vfio_multiple_devices_migration_is_supported()) {
return 0;
}

if (vbasedev->enable_migration == ON_OFF_AUTO_ON) {
error_setg(errp, "Migration is currently not supported with multiple "
"VFIO devices");
error_setg(errp, "Multiple VFIO devices migration is supported only if "
"all of them support P2P migration");
return -EINVAL;
}

if (multiple_devices_migration_blocker) {
return 0;
}

error_setg(&multiple_devices_migration_blocker,
"Migration is currently not supported with multiple "
"VFIO devices");
"Multiple VFIO devices migration is supported only if all of "
"them support P2P migration");
ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
if (ret < 0) {
error_free(multiple_devices_migration_blocker);
@@ -410,7 +424,7 @@ int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
void vfio_unblock_multiple_devices_migration(void)
{
if (!multiple_devices_migration_blocker ||
vfio_migratable_device_num() > 1) {
!vfio_multiple_devices_migration_is_supported()) {
return;
}

@@ -437,6 +451,22 @@ static void vfio_set_migration_error(int err)
}
}

bool vfio_device_state_is_running(VFIODevice *vbasedev)
{
VFIOMigration *migration = vbasedev->migration;

return migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P;
}

bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
{
VFIOMigration *migration = vbasedev->migration;

return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
}

static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
{
VFIOGroup *group;
@@ -457,8 +487,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
}

if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
(vfio_device_state_is_running(vbasedev) ||
vfio_device_state_is_precopy(vbasedev))) {
return false;
}
}
@@ -503,8 +533,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
return false;
}

if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
if (vfio_device_state_is_running(vbasedev) ||
vfio_device_state_is_precopy(vbasedev)) {
continue;
} else {
return false;
@@ -1371,6 +1401,8 @@ typedef struct VFIODirtyRanges {
hwaddr max32;
hwaddr min64;
hwaddr max64;
hwaddr minpci64;
hwaddr maxpci64;
} VFIODirtyRanges;

typedef struct VFIODirtyRangesListener {
@@ -1379,6 +1411,31 @@ typedef struct VFIODirtyRangesListener {
MemoryListener listener;
} VFIODirtyRangesListener;

static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
VFIOContainer *container)
{
VFIOPCIDevice *pcidev;
VFIODevice *vbasedev;
VFIOGroup *group;
Object *owner;

owner = memory_region_owner(section->mr);

QLIST_FOREACH(group, &container->group_list, container_next) {
QLIST_FOREACH(vbasedev, &group->device_list, next) {
if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
continue;
}
pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
if (OBJECT(pcidev) == owner) {
return true;
}
}
}

return false;
}

static void vfio_dirty_tracking_update(MemoryListener *listener,
MemoryRegionSection *section)
{
@@ -1395,19 +1452,32 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
}

/*
* The address space passed to the dirty tracker is reduced to two ranges:
* one for 32-bit DMA ranges, and another one for 64-bit DMA ranges.
* The address space passed to the dirty tracker is reduced to three ranges:
* one for 32-bit DMA ranges, one for 64-bit DMA ranges and one for the
* PCI 64-bit hole.
*
* The underlying reports of dirty will query a sub-interval of each of
* these ranges.
*
* The purpose of the dual range handling is to handle known cases of big
* holes in the address space, like the x86 AMD 1T hole. The alternative
* would be an IOVATree but that has a much bigger runtime overhead and
* unnecessary complexity.
* The purpose of the three range handling is to handle known cases of big
* holes in the address space, like the x86 AMD 1T hole, and firmware (like
* OVMF) which may relocate the pci-hole64 to the end of the address space.
* The latter would otherwise generate large ranges for tracking, stressing
* the limits of supported hardware. The pci-hole32 will always be below 4G
* (overlapping or not) so it doesn't need special handling and is part of
* the 32-bit range.
*
* The alternative would be an IOVATree but that has a much bigger runtime
* overhead and unnecessary complexity.
*/
min = (end <= UINT32_MAX) ? &range->min32 : &range->min64;
max = (end <= UINT32_MAX) ? &range->max32 : &range->max64;

if (vfio_section_is_vfio_pci(section, dirty->container) &&
iova >= UINT32_MAX) {
min = &range->minpci64;
max = &range->maxpci64;
} else {
min = (end <= UINT32_MAX) ? &range->min32 : &range->min64;
max = (end <= UINT32_MAX) ? &range->max32 : &range->max64;
}
if (*min > iova) {
*min = iova;
}
@@ -1432,6 +1502,7 @@ static void vfio_dirty_tracking_init(VFIOContainer *container,
memset(&dirty, 0, sizeof(dirty));
dirty.ranges.min32 = UINT32_MAX;
dirty.ranges.min64 = UINT64_MAX;
dirty.ranges.minpci64 = UINT64_MAX;
dirty.listener = vfio_dirty_tracking_listener;
dirty.container = container;

@@ -1502,7 +1573,8 @@
* DMA logging uAPI guarantees to support at least a number of ranges that
* fits into a single host kernel base page.
*/
control->num_ranges = !!tracking->max32 + !!tracking->max64;
control->num_ranges = !!tracking->max32 + !!tracking->max64 +
!!tracking->maxpci64;
ranges = g_try_new0(struct vfio_device_feature_dma_logging_range,
control->num_ranges);
if (!ranges) {
@@ -1521,11 +1593,17 @@
if (tracking->max64) {
ranges->iova = tracking->min64;
ranges->length = (tracking->max64 - tracking->min64) + 1;
ranges++;
}
if (tracking->maxpci64) {
ranges->iova = tracking->minpci64;
ranges->length = (tracking->maxpci64 - tracking->minpci64) + 1;
}

trace_vfio_device_dirty_tracking_start(control->num_ranges,
tracking->min32, tracking->max32,
tracking->min64, tracking->max64);
tracking->min64, tracking->max64,
tracking->minpci64, tracking->maxpci64);

return feature;
}
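
To make the three-range split concrete, a purely illustrative result (the
addresses are invented, not taken from this patch) for a guest whose firmware
relocated the 64-bit PCI hole near the top of the address space could be:

/* Illustration only: example IOVAs, not taken from this series. */
struct vfio_device_feature_dma_logging_range example[3] = {
    { .iova = 0x0,              .length = 0x100000000ULL  },  /* 32-bit range */
    { .iova = 0x100000000ULL,   .length = 0x4000000000ULL },  /* RAM above 4G */
    { .iova = 0x38000000000ULL, .length = 0x800000000ULL  },  /* pci-hole64   */
};
/*
 * Without the dedicated PCI range, the 64-bit range would have to span
 * 0x100000000 .. 0x387ffffffff, stressing the limits of what some devices
 * can track.
 */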
