Skip to content

Add support for virtio-pci transport #5240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: feature/pcie
Choose a base branch
from

Conversation

bchalios
Copy link
Contributor

@bchalios bchalios commented Jun 2, 2025

Changes

This PR adds support for using a PCI transport for VirtIO devices. The PR adds PCI support for x86 and Aarch64 systems both for booted and snapshot-restored microVMs. The VirtIO-PCI transport is used whenever Firecracker is launched with PCI enabled. Otherwise, we fall back to the MMIO transport.

This PR also adds support for MSI-X interrupts for VirtIO devices. When using the PCI transport VirtIO devices always use MSI-X type of interrupts. MMIO devices still use the IRQ based ones.

Since we needed to change various things in order to support PCI along with MMIO transport the PR brings a few quality of life changes in the code base:

  • PCI devices allow that the guest moves the MMIO ranges allocated to a PCI device. This operation requires access to the ResourceAllocator, so that we free old memory and allocate new one, the MMIO and Port IO buses, so that we register the device to its new allocated address range and the KVM Vm file descriptor so that we correctly set the new IO event fds. As a result, we move buses, and resource allocator inside the Vm object and make the latter responsible for the relocation when needed. This moves simplify the signatures of a lot of functions that were using a combination of ResourceAllocator, Vm and Bus objects. In the future, we might want to even move DeviceManager inside the Vm object. This should simplify things even further.
  • Clean-up the MMIO restore logic. There was a lot of repetition for every device type we support. We now keep a generic type for (re)storing the state of VirtIO devices. In the future, we might even be able to further generalize, so that the restoration logic is common across MMIO and PCI devices if needed.
  • We make use of the serde support in the new version of vm-allocator crate, so that we don't have to perform allocations with fixed addresses upon snapshot restore.

Reason

Support VirtIO devices with PCI transport

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

Copy link

codecov bot commented Jun 2, 2025

Codecov Report

Attention: Patch coverage is 72.63830% with 643 lines in your changes missing coverage. Please review.

Project coverage is 80.26%. Comparing base (ca4f1e7) to head (8fefa5b).

Files with missing lines Patch % Lines
src/vmm/src/devices/virtio/transport/pci/device.rs 55.83% 318 Missing ⚠️
.../src/devices/virtio/transport/pci/common_config.rs 52.15% 89 Missing ⚠️
src/vmm/src/device_manager/mod.rs 77.41% 42 Missing ⚠️
src/vmm/src/builder.rs 75.00% 33 Missing ⚠️
src/vmm/src/device_manager/pci_mngr.rs 91.66% 33 Missing ⚠️
src/vmm/src/devices/virtio/net/device.rs 25.00% 21 Missing ⚠️
src/vmm/src/vstate/resources.rs 52.77% 17 Missing ⚠️
src/vmm/src/devices/virtio/vsock/device.rs 0.00% 12 Missing ⚠️
src/vmm/src/devices/virtio/balloon/persist.rs 8.33% 11 Missing ⚠️
src/vmm/src/devices/virtio/block/device.rs 0.00% 11 Missing ⚠️
... and 15 more
Additional details and impacted files
@@               Coverage Diff                @@
##           feature/pcie    #5240      +/-   ##
================================================
+ Coverage         79.23%   80.26%   +1.02%     
================================================
  Files               262      265       +3     
  Lines             29347    30967    +1620     
================================================
+ Hits              23254    24856    +1602     
- Misses             6093     6111      +18     
Flag Coverage Δ
5.10-c5n.metal 80.02% <68.93%> (+0.97%) ⬆️
5.10-m5n.metal 80.01% <68.93%> (+0.96%) ⬆️
5.10-m6a.metal 79.21% <68.93%> (+1.03%) ⬆️
5.10-m6g.metal 76.62% <71.92%> (+1.43%) ⬆️
5.10-m6i.metal 80.01% <68.93%> (+0.97%) ⬆️
5.10-m7a.metal-48xl 79.21% <68.93%> (+1.05%) ⬆️
5.10-m7g.metal 76.62% <71.92%> (+1.43%) ⬆️
5.10-m7i.metal-24xl 79.98% <68.93%> (+0.96%) ⬆️
5.10-m7i.metal-48xl 79.98% <68.93%> (+0.96%) ⬆️
5.10-m8g.metal-24xl 76.62% <71.92%> (+1.43%) ⬆️
5.10-m8g.metal-48xl 76.62% <71.92%> (+1.43%) ⬆️
6.1-c5n.metal 80.05% <68.93%> (+0.96%) ⬆️
6.1-m5n.metal 80.05% <68.93%> (+0.96%) ⬆️
6.1-m6a.metal 79.25% <68.93%> (+1.04%) ⬆️
6.1-m6g.metal 76.62% <71.92%> (+1.43%) ⬆️
6.1-m6i.metal 80.05% <68.93%> (+0.96%) ⬆️
6.1-m7a.metal-48xl 79.25% <68.93%> (+1.05%) ⬆️
6.1-m7g.metal 76.62% <71.92%> (+1.43%) ⬆️
6.1-m7i.metal-24xl 80.06% <68.93%> (+0.96%) ⬆️
6.1-m7i.metal-48xl 80.07% <68.93%> (+0.96%) ⬆️
6.1-m8g.metal-24xl 76.62% <71.92%> (+1.43%) ⬆️
6.1-m8g.metal-48xl 76.62% <71.92%> (+1.43%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bchalios bchalios force-pushed the virtio-pci branch 7 times, most recently from a80faa9 to 15ed3eb Compare June 4, 2025 13:49
@bchalios bchalios force-pushed the virtio-pci branch 6 times, most recently from 61c6be1 to 9d18f11 Compare June 11, 2025 11:25
@bchalios bchalios force-pushed the virtio-pci branch 15 times, most recently from c0ada38 to 507cbce Compare June 19, 2025 09:30
@bchalios bchalios force-pushed the virtio-pci branch 2 times, most recently from 8436358 to a82f14c Compare June 30, 2025 10:06
bchalios added 25 commits June 30, 2025 12:07
We need the new KvmIrqRouting FamStruct wrapper from kvm-bindings, which
though forces us to update vmm-sys-util to 0.14.0 and also bump all
downstream dependencies of vmm-sys-util to use that version.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Define thiserror::Error and displaydoc::Display for various error types
in the vended PCI crate. This way we can embed them in our error types
downstream. Also export a few types and struct fields that were private
and we will be needing them.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Instead of returning an `EventFd` type, which will actually force us to
clone the file descriptor in the Firecracker side.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
This is code we are not going to use in Firecracker. Remove it, so we
can keep the crates we vend as minimal as possible, including only
things we are actually using.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
We'd like to be able to store Vm within an atomic reference so we can
pass it around and share it with other components. The main issue with
doing this change is that we need Vm to be `mut` during initialization
and the builder.rs code was creating Vmm with Vm embedded in it.

To solve this, we break down the initialization of the Vmm object. We
first create its individual parts (Vm, Kvm and DeviceManager), perform
any necessary initialization logic on Vm and once this done add it
within an Arc.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Add logic to track the device interrupts used by the microVM. This is
not strictly needed right now, but we will need it when adding support
for MSI-X interrupts. MSI-X interrupts are configured at runtime and we
need to interact with KVM to set the interruput routes. To do it, we
need to keep track all of the interrupts the VM is using.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Enable Vm to vend and manage MSI/MSI-X interrupts. This adds the logic
to create a set of MSI vectors and then handle their lifetime.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Vm object is now maintaining information about the interrupts (both
traditional IRQs and MSI-X vectors) that are being used by microVM
devices. Derive Serialize/Deserialize add logic for recreating objects
for relevant types.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Commit d8c2714 (refactor: use VirtioInterrupt in VirtIO devices) which
refactored devices to use new VirtioInterrupt type introduced a bug
with the index used to trigger a queue interrupt. Instead of using the
actual queue index, we were using the index of the used descriptor.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Apparently, PCI needs Queue::size to be initialized to the maximum
possible size supported by the device, otherwise initialization fails.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Remove the flags in FADT that were declaring we do not support MSI and
PCI ASPM.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Merge the device-related errors that DeviceManager might return. This
way, we can avoid adding yet another error type for PCI devices and
reduce some the variants of StartMicrovmError.

Suggested-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Signed-off-by: Babis Chalios <bchalios@amazon.es>
Add a VirtIO PCI transport implementation. When a Firecracker microVM is
launched with --enable-pci, we will create all VirtIO devices using the
PCI transport layer. Snapshotting of VirtIO PCI devices is not supported
and we will add this functionality in later commit.

Add a couple of tests that ensure that PCI configuration space is what
expected. We read common fields and make sure the BAR we allocate for
the VirtIO device is what expected.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
We are now calling KVM_CHECK_EXTENSION for checking the
KVM_CAP_MSI_DEVID capability. We are also calling KVM_SET_GSI_ROUTING to
set the interrupts routes and KVM_IRQFD to set/unset interrupt lines.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Add some unit tests to PciSegment. We now test that the
next_device_bdf() method and the initialization logic work as expected.
We also check that the configuration space of the PCI segment is
correctly registered with the MMIO and, on x86, PIO bus.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
vm-allocator now allows us to (De)serialize IdAllocator and
AddressAllocator types. Add ResourceAllocator in DeviceManager snapshot
state and restore it when loading a snapshot. Like this we can avoid
doing the ExactMatch allocations during snapshot resumes for reserving
the exact same MMIO ranges.

Moreover, change DeviceManager and PciDevices to provide save/restore
functionality via the Persist trait. Like that we can avoid first
creating the objects and then restoring their state, overwriting their
fields.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
VirtIO MMIO restore logic activates the device the moment we restore the
device state, if the device was activated when snapshotted. Move the
activation responsibility to the logic the restores the MMIO transport.
The reason for this change is that that's how it will be done for the
PCI transport. Unifying this will allow us reusing the same types for
restoring the non-transport state of devices.

Note that we needed to change the way Net devices are saved/restored.
RxBuffer type of Net devices holds RX descriptors that we have parsed
from the Queue ahead of time. The way we restored this info was
manipulating the queue to re-parse the RX descriptors during the restore
phase. However, we need the device to be activated to do so, which now
isn't. So, instead of storing this info inside the snapshot make sure we
have flushed everything before taking the snapshot.

Also, simplify a bit the types that we use for serializing/deserializing
the state of a device.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Support serializing the device-specific and transport state of a VirtIO
device that uses the PCI transport.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
ResourceAllocator object was part of DeviceManager since it is (mainly)
devices that use it. ResourceAllocator is as well the object that
implements (in a dummy way, for the moment) the DeviceRelocation trait
which PciDevices use to move the address space of a PciDevice when
triggered from the guest.

Problem with DeviceRelocation is that it also needs the Vm file
descriptor to perform the relocation, because we need to move register
the new IO event fd for VirtIO devices.

To make things simpler, move ResourceAllocator inside the Vm object. In
subsequent commit we will remove the DeviceRelocation from
ResourceAllocator and move it to Vm instead.

This has the nice secondary effect that we were able to simplify the
signature of many device-related methods that received Vm and
ResourceAllocator arguments.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
We had previously added MMIO and Port IO buses inside ResourceAllocator
so that we could implement DeviceRelocation for the type. Now, we will
delegate device relocation responsibilities to ArchVm instead. That is
because device relocation requires access to the Vm file descriptor as
well.

As a result, we can move buses to the Vm object itself. Add MMIO bus to
VmCommon as both architectures use it. Add PortIO bus for x86
architecture only.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Implement PCI device relocation logic for Vm type. Up until now, we were
only implementing it in a dummy way. Make sure we correctly deallocate
previously allocate ranges and unregister them from the corresponding
bus. Try to allocate new addresses and register them with the bus.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Add support for ITS device which provides support for MSI interrupts on
ARM architecture. This is currently supported only on systems with GICv3
interrupt controller.

In order to make saving/restore of ITS state work properly, we need to
change the order in which we restore redistributor register GICR_CTLR.
We need to make sure that this register is restored last. Otherwise,
restoring GICR_PROPBASER doesn't have any effect and ITS depends on it
in order to save/restore ITS tables to/from guest memory.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Refactor the test code that inserts VirtIO devices in a Vmm object and
then add a test which creates a Vmm with PCI devices and then serializes
and deserializes the device manager and ensures that everything is as
restored as expected.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Use pci_enabled fixture for boot time, block, and network tests to
create PCI microVM variants as well.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
We only pass pci=off if PCI is disabled in Firecracker. Adapt tests and
comments to reflect that.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
So that we don't have to downcast VirtioDevice trait objects to the
actual device type before calling the logic to process events for each
device.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
impl MsiVector {
/// Enable vector
fn enable(&self, vmfd: &VmFd) -> Result<(), errno::Error> {
if !self.enabled.load(Ordering::Acquire) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be an compare_exchange (with a rollback if register_irqfd fails, although I guess we dont really expect it to fail) instead to prevent the function from racing with itself? Can it even race with itself?


impl InterruptSourceGroup for MsiVectorGroup {
fn enable(&self) -> vm_device::interrupt::Result<()> {
for (_, route) in self.irq_routes.iter() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
for (_, route) in self.irq_routes.iter() {
for route in self.irq_routes.values() {

(there's a few more cases of this)

&self,
index: vm_device::interrupt::InterruptIndex,
) -> vm_device::interrupt::Result<()> {
if let Some(route) = self.irq_routes.get(&index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: reuse the notifier function from below here maybe?


if self.common.fd.check_extension(kvm_ioctls::Cap::MsiDevid) {
// On AArch64, there is limitation on the range of the 'devid',
// it cannot be greater than 65536 (the max of u16).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

65535?

for (i, gsi) in resource_allocator
.allocate_gsi(count as u32)?
.iter()
.enumerate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.zip(0u32..) here instead of enumerate() lets you avoid the cast below

}

Err(std::io::Error::other(format!(
"MSI-X update: invalid vector index {index}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as an aside (irrelevant to this PR because its vendored code): I really dislike this error handling using std::io::Error::other

Comment on lines 433 to 434
let (device_activated, interrupt_status, cap_pci_cfg_info) =
(false, 0, VirtioPciCfgCapInfo::default());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I dont quite get why we keep these 3 in local variables instead of just inlining them into the struct initializer below

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants