
QEMU fails to open VFIO devices when run within nspawn #6648

Closed
crawford opened this issue Aug 19, 2017 · 3 comments

Comments

@crawford
Contributor

Submission type

Bug report

systemd version the issue has been seen with

234

Used distribution

Container Linux (CoreOS)

Expected behaviour you didn't see

qemu-system-x86_64 is able to start successfully (as it does with systemd-nspawn 233)

Unexpected behaviour you saw

qemu-system-x86_64: -device vfio-pci,host=01:00.0: vfio: error opening /dev/vfio/1: Operation not permitted
qemu-system-x86_64: -device vfio-pci,host=01:00.0: vfio: failed to get group 1
qemu-system-x86_64: -device vfio-pci,host=01:00.0: Device initialization failed

Steps to reproduce the problem

Here is the service unit I am using to start my QEMU instance:

[Service]
ExecStartPre=/usr/sbin/modprobe vfio-pci
ExecStartPre=/bin/sh -c "echo '10de 0640' > /sys/bus/pci/drivers/vfio-pci/new_id"
ExecStartPre=/bin/sh -c "echo '8086 a121' > /sys/bus/pci/drivers/vfio-pci/new_id"
ExecStartPre=/bin/sh -c "echo '8086 a123' > /sys/bus/pci/drivers/vfio-pci/new_id"
ExecStartPre=/bin/sh -c "echo '8086 a143' > /sys/bus/pci/drivers/vfio-pci/new_id"
ExecStartPre=/bin/sh -c "echo '8086 a170' > /sys/bus/pci/drivers/vfio-pci/new_id"
ExecStartPre=/bin/sh -c "echo '8086 a12f' > /sys/bus/pci/drivers/vfio-pci/new_id"
ExecStartPre=/bin/sh -c "echo '8086 a131' > /sys/bus/pci/drivers/vfio-pci/new_id"

ExecStart=/usr/bin/systemd-inhibit \
        --what=shutdown \
        --who="Windows Virtual Machine" \
        /usr/bin/systemd-nspawn \
                --directory=/srv/windows \
                --capability=all \
                --share-system \
                --bind=/dev/kvm:/dev/kvm \
                --bind=/dev/vfio:/dev/vfio \
                --bind=/sys:/sys \
                /usr/bin/qemu-kvm \
                        -m 2G \
                        -cpu host,kvm=off \
                        -smp cores=1,threads=1,sockets=1 \
                        -netdev tap,id=t0,ifname=tap-windows,script=no,downscript=no \
                        -device virtio-net,netdev=t0 \
                        -drive if=virtio,file=/srv/windows.bin,format=raw \
                        -device vfio-pci,host=01:00.0 \
                        -device vfio-pci,host=00:1f.3 \
                        -device vfio-pci,host=00:14.0 \
                        -device vfio-pci,host=00:14.2 \
                        -nographic

This has worked fine for quite a while, but after updating to a release of Container Linux with systemd 234, I've started seeing the failures listed above. If I use the systemd-nspawn binary and libsystemd-shared.so library from systemd 233, it begins working again. This makes it seem like there is something going on with nspawn itself.

I glanced through the git history for src/nspawn but nothing jumped out at me. I also wouldn't be surprised if this was related to my use of the deprecated --share-system flag.

@euank

euank commented Aug 20, 2017

I was curious so I dug into this a little bit.

I believe this is being denied by the device cgroup.

Previously you were avoiding DevicePolicy=closed: --share-system implies --register=no, and before this change DevicePolicy=closed was only set with --register=yes.

Starting with #6166, however, DevicePolicy=closed is set both in the register path and in the new allocate_scope code.

I think you can solve your problem in two ways:

  1. Adding --keep-unit will avoid calling allocate_scope, so you'll be back to the previous state.
  2. Switching to bind-mounting /dev/vfio/vfio specifically, rather than the directory, will let nspawn's logic (which generates a DeviceAllow= for each bind-mounted device) actually do the right thing.

Mind checking if either / both of those work? I specifically think option 2 is the correct thing since it properly expresses your intent.
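Roughly, option 2 would mean replacing the /dev/vfio directory bind with binds of the individual nodes, something like this (untested sketch; the group number 1 is taken from your error output, so adjust it to your actual IOMMU groups and add one --bind per group node you pass through):

/usr/bin/systemd-nspawn \
        --directory=/srv/windows \
        --capability=all \
        --share-system \
        --bind=/dev/kvm:/dev/kvm \
        --bind=/dev/vfio/vfio:/dev/vfio/vfio \
        --bind=/dev/vfio/1:/dev/vfio/1 \
        --bind=/sys:/sys \
        /usr/bin/qemu-kvm \
                ...

Option 1 would just mean adding --keep-unit to your existing systemd-nspawn arguments.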

@poettering
Member

Uh, most of /dev is not properly virtualized for containers I fear, and that needs fixing on the kernel side. There are hacky ways you can expose device nodes of the host inside a container, but YMMV: /sys and /dev might deviate from each other, no dynamic device node propagation is done, and no concept of device ownership related to containers exists, or anything else like this. If you do decide to expose host device nodes in a container, you void your warranty (and that's the same with any container manager, regardless of what the other ones claim; we just don't lie about it). If you want to do that anyway with nspawn, make sure to pass the same --property=DeviceAllow= parameters as well as --bind=/dev/xyz parameters on the nspawn command line, so that access to the device nodes is permitted and the device nodes are made available.
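For your setup that would be something along these lines (just a sketch, reusing the device paths and the group number from the error output above; adjust to your actual IOMMU groups):

systemd-nspawn \
        --directory=/srv/windows \
        --property=DeviceAllow='/dev/vfio/vfio rwm' \
        --property=DeviceAllow='/dev/vfio/1 rwm' \
        --bind=/dev/vfio/vfio \
        --bind=/dev/vfio/1 \
        ...

i.e. one DeviceAllow= plus one --bind= per device node you want to hand to the payload.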

Sorry, but that's all I can suggest. Closing this here now. I don't think there's anything we can fix here in nspawn, this needs some kernel work first. As soon as the kernel properly supports device virtualization/namespacing we can support it in nspawn, but without it it's always going to be a mess...

Hope that makes sense... Closing.

@crawford
Contributor Author

Huh, I don't know how I missed the notifications on this one. I ended up going with @euank's suggestion to explicitly mount all of the VFIO nodes. I'm not too worried about voiding my warranty since I just use this machine to watch Netflix. :)
