Skip to content

Commit

Permalink
migration/doc: We broke backwards compatibility
Browse files Browse the repository at this point in the history
When we detect that we have broken backwards compatibility in a
released version, we can't do anything for that version.  But once we
fix that bug on the next released version, we can "mitigate" that
problem when migrating to new versions to give a way out of that
machine until it does a hard reboot.

Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <20231018112827.1325-5-quintela@redhat.com>
  • Loading branch information
Juan Quintela committed Oct 30, 2023
1 parent 593c28c commit e773261
Showing 1 changed file with 202 additions and 0 deletions.
202 changes: 202 additions & 0 deletions docs/devel/migration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1235,3 +1235,205 @@ In this section we have considered that we are using the same QEMU
binary in both sides of the migration. If we use different QEMU
versions process, then we need to have into account all other
differences and the examples become even more complicated.
How to mitigate when we have a backward compatibility error
-----------------------------------------------------------
We broke migration for old machine types continuously during
development. But as soon as we find that there is a problem, we fix
it. The problem is what happens when we detect after we have done a
release that something has gone wrong.
Let see how it worked with one example.
After the release of qemu-8.0 we found a problem when doing migration
of the machine type pc-7.2.
- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2
This migration works
- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2
This migration works
- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2
This migration fails
- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2
This migration fails
So clearly something fails when migration between qemu-7.2 and
qemu-8.0 with machine type pc-7.2. The error messages, and git bisect
pointed to this commit.
In qemu-8.0 we got this commit::
commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Thu Mar 2 13:37:02 2023 +0000
hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register
The relevant bits of the commit for our example are this ones::
--- a/hw/pci/pcie_aer.c
+++ b/hw/pci/pcie_aer.c
@@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
PCI_ERR_UNC_SUPPORTED);
+ pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
+ PCI_ERR_UNC_MASK_DEFAULT);
+ pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
+ PCI_ERR_UNC_SUPPORTED);
pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
PCI_ERR_UNC_SEVERITY_DEFAULT);
The patch changes how we configure PCI space for AER. But QEMU fails
when the PCI space configuration is different between source and
destination.
The following commit shows how this got fixed::
commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
Author: Leonardo Bras <leobras@redhat.com>
Date: Tue May 2 21:27:02 2023 -0300
hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0
[...]
The relevant parts of the fix in QEMU are as follow:
First, we create a new property for the device to be able to configure
the old behaviour or the new behaviour::
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 8a87ccc8b0..5153ad63d6 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -79,6 +79,8 @@ static Property pci_props[] = {
DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
failover_pair_id),
DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0),
+ DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
+ QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
DEFINE_PROP_END_OF_LIST()
};
Notice that we enable the feature for new machine types.
Now we see how the fix is done. This is going to depend on what kind
of breakage happens, but in this case it is quite simple::
diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
index 103667c368..374d593ead 100644
--- a/hw/pci/pcie_aer.c
+++ b/hw/pci/pcie_aer.c
@@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
uint16_t offset,
pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
PCI_ERR_UNC_SUPPORTED);
- pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
- PCI_ERR_UNC_MASK_DEFAULT);
- pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
- PCI_ERR_UNC_SUPPORTED);
+
+ if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
+ pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
+ PCI_ERR_UNC_MASK_DEFAULT);
+ pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
+ PCI_ERR_UNC_SUPPORTED);
+ }
pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
PCI_ERR_UNC_SEVERITY_DEFAULT);
I.e. If the property bit is enabled, we configure it as we did for
qemu-8.0. If the property bit is not set, we configure it as it was in 7.2.
And now, everything that is missing is disabling the feature for old
machine types::
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 47a34841a5..07f763eb2e 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
{ "e1000e", "migrate-timadj", "off" },
{ "virtio-mem", "x-early-migration", "false" },
{ "migration", "x-preempt-pre-7-2", "true" },
+ { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
};
const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
And now, when qemu-8.0.1 is released with this fix, all combinations
are going to work as supposed.
- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)
So the normality has been restored and everything is ok, no?
Not really, now our matrix is much bigger. We started with the easy
cases, migration from the same version to the same version always
works:
- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2
- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
Now the interesting ones. When the QEMU processes versions are
different. For the 1st set, their fail and we can do nothing, both
versions are released and we can't change anything.
- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2
- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2
This two are the ones that work. The whole point of making the
change in qemu-8.0.1 release was to fix this issue:
- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2
But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
qemu-8.0.1.
- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0 -M pc-7.2
So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
anything except to qemu-8.0.
Can we do better?
Yeap. If we know that we are going to do this migration:
- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
We can launch the appropriate devices with::
--device...,x-pci-e-err-unc-mask=on
And now we can receive a migration from 8.0. And from now on, we can
do that migration to new machine types if we remember to enable that
property for pc-7.2. Notice that we need to remember, it is not
enough to know that the source of the migration is qemu-8.0. Think of
this example:
$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
In the second migration, the source is not qemu-8.0, but we still have
that "problem" and have that property enabled. Notice that we need to
continue having this mark/property until we have this machine
rebooted. But it is not a normal reboot (that don't reload QEMU) we
need the machine to poweroff/poweron on a fixed QEMU. And from now
on we can use the proper real machine.

0 comments on commit e773261

Please sign in to comment.