stackhpc · oneswig · Nov 30, 2020 · Nov 15, 2020 · Nov 16, 2020 · Nov 16, 2020
@@ -0,0 +1,341 @@
+.. include:: vars.rst
+
+=============================
+Support for GPUs in OpenStack
+=============================
+
+This guide has been developed for Nvidia GPUs and CentOS 8.
+
+See `Kayobe Ops <https://github.com/stackhpc/kayobe-ops>`_ for
+a playbook implementation of host setup for GPU.
+
+BIOS Configuration Requirements
+-------------------------------
+
+On an Intel system:
+
+* Enable `VT-x` in the BIOS for virtualisation support.
+* Enable `VT-d` in the BIOS for IOMMU support.
+
+Hypervisor Configuration Requirements
+-------------------------------------
+
+Find the GPU device IDs
+^^^^^^^^^^^^^^^^^^^^^^^
+
+From the host OS, use ``lspci -nn`` to find the PCI vendor ID and
+device ID for the GPU device and supporting components.  These are
+4-digit hex numbers.
+
+For example:
+
+.. code-block:: text
+
+   01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204M [GeForce GTX 980M] [10de:13d7] (rev a1) (prog-if 00 [VGA controller])
+   01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
+
+In this case the vendor ID is ``10de``, display ID is ``13d7`` and audio ID is ``0fbb``.
+
+Alternatively, for an Nvidia Quadro RTX 6000:
+
+.. code-block:: yaml
+
+   # NVIDIA Quadro RTX 6000/8000 PCI device IDs
+   vendor_id: "10de"
+   display_id: "1e30"
+   audio_id: "10f7"
+   usba_id: "1ad6"
+   usba_class: "0c0330"
+   usbc_id: "1ad7"
+   usbc_class: "0c8000"
+
+These parameters will be used for device-specific configuration.
+
+Kernel Ramdisk Reconfiguration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ramdisk loaded during kernel boot can be extended to include the
+vfio PCI drivers and ensure they are loaded early in system boot.
+
+.. code-block:: yaml
+
+   - name: Template dracut config
+     blockinfile:
+       path: /etc/dracut.conf.d/gpu-vfio.conf
+       block: |
+         add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd"
+       owner: root
+       group: root
+       mode: 0660
+       create: true
+     become: true
+     notify:
+       - Regenerate initramfs
+       - reboot
+
+The handler for regenerating the Dracut initramfs is:
+
+.. code-block:: yaml
+
+   - name: Regenerate initramfs
+     shell: |-
+       #!/bin/bash
+       set -eux
+       dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r)
+     become: true
+
+Kernel Boot Parameters
+^^^^^^^^^^^^^^^^^^^^^^
+
+Set the following kernel parameters by adding to
+``GRUB_CMDLINE_LINUX_DEFAULT`` or ``GRUB_CMDLINE_LINUX`` in
+``/etc/default/grub.conf``.  We can use the
+`stackhpc.grubcmdline <https://galaxy.ansible.com/stackhpc/grubcmdline>`_
+role from Ansible Galaxy:
+
+.. code-block:: yaml
+
+   - name: Add vfio-pci.ids kernel args
+     include_role:
+       name: stackhpc.grubcmdline
+     vars:
+       kernel_cmdline:
+         - intel_iommu=on
+         - iommu=pt
+         - "vfio-pci.ids={{ vendor_id }}:{{ display_id }},{{ vendor_id }}:{{ audio_id }}"
+       kernel_cmdline_remove:
+         - iommu
+         - intel_iommu
+         - vfio-pci.ids
+
+Kernel Device Management
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the hypervisor, we must prevent kernel device initialisation of
+the GPU and prevent drivers from loading for binding the GPU in the
+host OS.  We do this using ``udev`` rules:
+
+.. code-block:: yaml
+
+   - name: Template udev rules to blacklist GPU usb controllers
+     blockinfile:
+       # We want this to execute as soon as possible
+       path: /etc/udev/rules.d/99-gpu.rules
+       block: |
+         #Remove NVIDIA USB xHCI Host Controller Devices, if present
+         ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usba_class }}", ATTR{remove}="1"
+         #Remove NVIDIA USB Type-C UCSI devices, if present
+         ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usbc_class }}", ATTR{remove}="1"
+       owner: root
+       group: root
+       mode: 0644
+       create: true
+      become: true
+
+Kernel Drivers
+^^^^^^^^^^^^^^
+
+Prevent the ``nouveau`` kernel driver from loading by
+blacklisting the module:
+
+.. code-block:: yaml
+
+   - name: Blacklist nouveau
+     blockinfile:
+       path: /etc/modprobe.d/blacklist-nouveau.conf
+       block: |
+         blacklist nouveau
+         options nouveau modeset=0
+       mode: 0664
+       owner: root
+       group: root
+       create: true
+     become: true
+     notify:
+       - reboot
+       - Regenerate initramfs
+
+Ensure that the ``vfio`` drivers are loaded into the kernel on boot:
+
+.. code-block:: yaml
+
+   - name: Add vfio to modules-load.d
+     blockinfile:
+       path: /etc/modules-load.d/vfio.conf
+       block: |
+         vfio
+         vfio_iommu_type1
+         vfio_pci
+         vfio_virqfd
+       owner: root
+       group: root
+       mode: 0664
+       create: true
+     become: true
+     notify: reboot
+
+Once this code has taken effect (after a reboot), the VFIO kernel drivers should be loaded on boot:
+
+.. code-block:: text
+
+   # lsmod | grep vfio
+   vfio_pci               49152  0
+   vfio_virqfd            16384  1 vfio_pci
+   vfio_iommu_type1       28672  0
+   vfio                   32768  2 vfio_iommu_type1,vfio_pci
+   irqbypass              16384  5 vfio_pci,kvm
+
+   # lspci -nnk -s 3d:00.0
+   3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Tesla M10] [10de:13bd] (rev a2)
+	 Subsystem: NVIDIA Corporation Tesla M10 [10de:1160]
+	 Kernel driver in use: vfio-pci
+	 Kernel modules: nouveau
+
+IOMMU should be enabled at kernel level as well - we can verify that on the compute host:
+
+.. code-block:: text
+
+   # docker exec -it nova_libvirt virt-host-validate | grep IOMMU
+   QEMU: Checking for device assignment IOMMU support                         : PASS
+   QEMU: Checking if IOMMU is enabled by kernel                               : PASS
+
+OpenStack Nova configuration
+----------------------------
+
+Configure nova-scheduler
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The nova-scheduler service must be configured to enable the ``PciPassthroughFilter``
+To enable it add it to the list of filters to Kolla-Ansible configuration file:
+``etc/kayobe/kolla/config/nova.conf``, for instance:
+
+.. code-block:: yaml
+
+   [filter_scheduler]
+   available_filters = nova.scheduler.filters.all_filters
+   enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter
+
+Configure nova-compute
+^^^^^^^^^^^^^^^^^^^^^^
+
+Configuration can be applied in flexible ways using Kolla-Ansible's
+methods for `inventory-driven customisation of configuration
+<https://docs.openstack.org/kayobe/latest/configuration/reference/kolla-ansible.html#service-configuration>`_.
+The following configuration could be added to
+``etc/kayobe/kolla/config/nova/nova-compute.conf`` to enable PCI
+passthrough of GPU devices for hosts in a group named ``compute_gpu``.
+Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci
+-nn`` can be used here to specify the GPU device(s).
+
+.. code-block:: jinja
+
+   [pci]
+   {% raw %}
+   {% if inventory_hostname in groups['compute_gpu'] %}
+   # We could support multiple models of GPU.
+   # This can be done more selectively using different inventory groups.
+   # GPU models defined here:
+   # NVidia Tesla V100 16GB
+   # NVidia Tesla V100 32GB
+   # NVidia Tesla P100 16GB
+   passthrough_whitelist = [{ "vendor_id":"10de", "product_id":"1db4" },
+                            { "vendor_id":"10de", "product_id":"1db5" },
+                            { "vendor_id":"10de", "product_id":"15f8" }]
+   alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" }
+   alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" }
+   alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" }
+   {% endif %}
+   {% endraw %}
+
+Configure nova-api
+^^^^^^^^^^^^^^^^^^
+
+pci.alias also needs to be configured on the controller. 
+This configuration should match the configuration found on the compute nodes.
+Add it to Kolla-Ansible configuration file:
+``etc/kayobe/kolla/config/nova/nova-api.conf``, for instance:
+
+.. code-block:: yaml
+
+   [pci]
+   alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" }
+   alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" }
+   alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" }
+
+Reconfigure nova service
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: text
+
+   kayobe overcloud service reconfigure --kolla-tags nova --kolla-skip-tags common --skip-prechecks
+
+Configure a flavor
+^^^^^^^^^^^^^^^^^^
+
+For example, to request two of the GPUs with alias gpu-p100
+
+.. code-block:: text
+
+   openstack flavor set m1.medium --property "pci_passthrough:alias"="gpu-p100:2"
+
+
+This can be also defined in the |project_config| repository:
+|project_config_source_url|
+
+add extra_specs to flavor in etc/|project_config|/|project_config|.yml:
+
+.. code-block:: console
+   :substitutions:
+
+   admin# cd |base_path|/src/|project_config|
+   admin# vim etc/|project_config|/|project_config|.yml
+
+    name: "m1.medium"
+    ram: 4096
+    disk: 40
+    vcpus: 2
+    extra_specs:
+      "pci_passthrough:alias": "gpu-p100:2"
+
+Invoke configuration playbooks afterwards:
+
+.. code-block:: console
+   :substitutions:
+
+   admin# source |base_path|/src/|kayobe_config|/etc/kolla/public-openrc.sh
+   admin# source |base_path|/venvs/|project_config|/bin/activate
+   admin# tools/|project_config| --vault-password-file |vault_password_file_path|
+
+Create instance with GPU passthrough
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: text
+
+   openstack server create --flavor m1.medium --image ubuntu2004 --wait test-pci
+
+Testing GPU in a Guest VM
+-------------------------
+
+The Nvidia drivers must be installed first.  For example, on an Ubuntu guest:
+
+.. code-block:: text
+
+   sudo apt install nvidia-headless-440 nvidia-utils-440 nvidia-compute-utils-440
+
+The ``nvidia-smi`` command will generate detailed output if the driver has loaded
+successfully.
+
+Further Reference
+-----------------
+
+For PCI Passthrough and GPUs in OpenStack:
+
+* Consumer-grade GPUs: https://gist.github.com/claudiok/890ab6dfe76fa45b30081e58038a9215
+* https://www.jimmdenton.com/gpu-offloading-openstack/
+* https://docs.openstack.org/nova/latest/admin/pci-passthrough.html
+* https://docs.openstack.org/nova/latest/admin/virtual-gpu.html (vGPU only)
+* Tesla models in OpenStack: https://egallen.com/openstack-nvidia-tesla-gpu-passthrough/
+* https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF
+* https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
+* https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough
+* https://www.gresearch.co.uk/article/utilising-the-openstack-placement-service-to-schedule-gpu-and-nvme-workloads-alongside-general-purpose-instances/
@@ -25,6 +25,7 @@ Contents
    managing_users_and_projects
    operations_and_monitoring
    customising_deployment
+   gpus_in_openstack
 
 Indices and search
 ==================