VM hangs at boot #811

Closed
pykello opened this issue Oct 31, 2023 · 4 comments

@pykello
Contributor

pykello commented Oct 31, 2023

Serial log

The VM is stuck, and the last few lines of the serial log are:

[    0.129822] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.129822] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[    0.129822] Freeing SMP alternatives memory: 44K
[    0.129822] pid_max: default: 32768 minimum: 301
[    0.129822] LSM: initializing lsm=lockdown,capability,landlock,yama,integrity,apparmor
[    0.129822] landlock: Up and running.
[    0.129822] Yama: becoming mindful.
[    0.129822] AppArmor: AppArmor initialized
[    0.129822] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
[    0.129822] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)

In a successful boot, the next line should be:

[    0.173927] smpboot: CPU0: AMD EPYC 7502P 32-Core Processor (family: 0x17, model: 0x31, stepping: 0x0)
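To tell a hung boot from a healthy one without reading the log by eye, a captured serial log can be scanned for the smpboot line. This is only a rough sketch: `boot_progress` and `KERNEL_LINE` are hypothetical names, and the input is assumed to be a plain-text dump of the serial console.

```python
import re

# Matches kernel log lines such as "[    0.129822] AppArmor: AppArmor initialized".
KERNEL_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s(.*)$")


def boot_progress(serial_log: str):
    """Return (reached_smpboot, last_message) for a serial log dump.

    reached_smpboot is True once a "smpboot: CPU0:" line has been seen;
    last_message is the text of the last kernel log line in the dump.
    """
    reached = False
    last_message = None
    for line in serial_log.splitlines():
        match = KERNEL_LINE.match(line)
        if not match:
            continue  # skip firmware output and other non-kernel lines
        last_message = match.group(2)
        if last_message.startswith("smpboot: CPU0:"):
            reached = True
    return reached, last_message
```

For the hung VM above, this would return False together with the Mountpoint-cache line as the last message.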

gdb

Also, the gdb backtrace doesn't show anything useful. Maybe we should install cloud-hypervisor debug symbols:

Reading symbols from /opt/cloud-hypervisor/v31.0/cloud-hypervisor...
(No debugging symbols found in /opt/cloud-hypervisor/v31.0/cloud-hypervisor)

strace output

strace -p [vcpu thread id] prints the following and then hangs:

ioctl(27, KVM_RUN

and strace -p [cloud hypervisor pid] prints the following and then hangs:

futex(0x7fc5e006e910, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 3008632, NULL, FUTEX_BITSET_MATCH_ANY
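The thread IDs to pass to strace -p can be read from /proc/[pid]/task on Linux. The sketch below lists every thread of a process with its name; `list_threads` is a hypothetical helper, and the expectation that vCPU threads carry names starting with "vcpu" is an assumption about cloud-hypervisor, not something shown in this issue.

```python
import os


def list_threads(pid: int):
    """Return {tid: thread_name} for every thread of `pid`, read from /proc."""
    threads = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "comm")) as f:
                threads[int(tid)] = f.read().strip()
        except FileNotFoundError:
            pass  # the thread exited between listdir() and open()
    return threads


def vcpu_thread_ids(pid: int, prefix: str = "vcpu"):
    """Return the TIDs whose name starts with `prefix` (assumed naming)."""
    return sorted(tid for tid, name in list_threads(pid).items()
                  if name.startswith(prefix))
```

Each TID returned by `vcpu_thread_ids` could then be passed to strace -p to see whether that vCPU is sitting in KVM_RUN.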

SPDK

Last lines of SPDK log:

Oct 25 23:44:41 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) read message VHOST_USER_SET_VRING_BASE
Oct 25 23:44:41 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) vring base idx:0 last_used_idx:0 last_avail_idx:0.
Oct 25 23:44:41 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) read message VHOST_USER_SET_VRING_KICK
Oct 25 23:44:41 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) vring kick idx:0 file:105
Oct 25 23:44:41 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) read message VHOST_USER_SET_VRING_ENABLE
Oct 25 23:44:41 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) set queue enable: 1 to qp idx: 0
Oct 25 23:44:42 Ubuntu-2204-jammy-amd64-base vhost[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) read message VHOST_USER_SET_VRING_ENABLE
Oct 25 23:44:42 Ubuntu-2204-jammy-amd64-base vhost[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) set queue enable: 0 to qp idx: 0
Oct 25 23:44:42 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) read message VHOST_USER_SET_VRING_ENABLE
Oct 25 23:44:42 Ubuntu-2204-jammy-amd64-base vhost[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) read message VHOST_USER_GET_VRING_BASE
Oct 25 23:44:42 Ubuntu-2204-jammy-amd64-base vhost[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) vring base idx:0 file:2021
Oct 25 23:44:42 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) set queue enable: 0 to qp idx: 0
Oct 25 23:44:42 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) read message VHOST_USER_GET_VRING_BASE
Oct 25 23:44:42 Ubuntu-2204-jammy-amd64-base spdk[1494]: VHOST_CONFIG: (/var/storage/vhost/vm6zye9j_0) vring base idx:0 file:2021

VM's cpu, memory, and disk config

{
  "cpus": {
    "boot_vcpus": 4,
    "max_vcpus": 4,
    "topology": {
      "threads_per_core": 2,
      "cores_per_die": 2,
      "dies_per_package": 1,
      "packages": 1
    },
    "kvm_hyperv": false,
    "max_phys_bits": 46,
    "affinity": null,
    "features": {
      "amx": false
    }
  },
  "memory": {
    "size": 17179869184,
    "mergeable": false,
    "hotplug_method": "Acpi",
    "hotplug_size": null,
    "hotplugged_size": null,
    "shared": false,
    "hugepages": true,
    "hugepage_size": 1073741824,
    "prefault": false,
    "zones": null,
    "thp": true
  },
  "payload": {
    "firmware": null,
    "kernel": "/opt/fw/edk2-stable202302/x64/CLOUDHV.fd",
    "cmdline": null,
    "initramfs": null
  },
  "disks": [
    {
      "path": null,
      "readonly": false,
      "direct": false,
      "iommu": false,
      "num_queues": 1,
      "queue_size": 256,
      "vhost_user": true,
      "vhost_socket": "/var/storage/vmwanwm8/0/vhost.sock",
      "rate_limiter_config": null,
      "id": "_disk0",
      "disable_io_uring": false,
      "pci_segment": 0
    },
    {
      "path": "/vm/vmwanwm8/cloudinit.img",
      "readonly": false,
      "direct": false,
      "iommu": false,
      "num_queues": 1,
      "queue_size": 128,
      "vhost_user": false,
      "vhost_socket": null,
      "rate_limiter_config": null,
      "id": "_disk1",
      "disable_io_uring": false,
      "pci_segment": 0
    }
  ],
...
}
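As a quick sanity check on the config above, the topology product and the number of huge pages backing guest memory can be computed directly. This is a sketch over a hand-copied fragment of the config (the elided part is left out), and the rule that the topology product should equal max_vcpus is stated here as an assumption about how cloud-hypervisor validates topology.

```python
from math import prod

# Fragment of the VM config from this issue; elided fields are omitted.
config = {
    "cpus": {
        "boot_vcpus": 4,
        "max_vcpus": 4,
        "topology": {
            "threads_per_core": 2,
            "cores_per_die": 2,
            "dies_per_package": 1,
            "packages": 1,
        },
    },
    "memory": {
        "size": 17179869184,
        "hugepages": True,
        "hugepage_size": 1073741824,
    },
}


def topology_vcpus(topology):
    """Number of vCPUs implied by the topology (product of all levels)."""
    return prod(topology.values())


def hugepages_needed(memory):
    """Huge pages required to back guest memory, rounded up."""
    return -(-memory["size"] // memory["hugepage_size"])


print(topology_vcpus(config["cpus"]["topology"]))  # 2 * 2 * 1 * 1 = 4
print(hugepages_needed(config["memory"]))          # 16 GiB / 1 GiB = 16
```

Both values are consistent with the config: the topology accounts for all 4 vCPUs, and the 16 GiB guest needs sixteen 1 GiB huge pages on the host.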
@pykello
Contributor Author

pykello commented Oct 31, 2023

Code references

Building with guest debug

General build instructions: building.md

$ cargo build --release --target=x86_64-unknown-linux-musl --all --features guest_debug
$ cargo build --target=x86_64-unknown-linux-musl --all --features guest_debug

Core dump

sudo ~/projects/cloud-hypervisor/target/x86_64-unknown-linux-musl/debug/ch-remote \
   --api-socket /tmp/ch.sock coredump file:///tmp/coredump

/opt/cloud-hypervisor/v31.0/ch-remote --api-socket /vm/vmvmhxf8/ch-api.sock pause

/opt/cloud-hypervisor/v31.0/ch-remote --api-socket /vm/vmvmhxf8/ch-api.sock coredump file:///tmp/vmvmhxf8.coredump

Crash Utility

mnt/usr/src/linux-headers-5.15.0-86/scripts/extract-vmlinux mnt/boot/vmlinuz > /tmp/vmlinux
crash /tmp/vmlinux /tmp/vmvmhxf8.coredump

...
crash: /tmp/vmlinux: no .gnu_debuglink section

Debug symbols

Installing: https://hadibrais.wordpress.com/2017/03/13/installing-ubuntu-kernel-debugging-symbols/

apt install linux-image-5.15.0-86-generic-dbgsym
crash /usr/lib/debug/boot/vmlinux-5.15.0-86-generic /tmp/vmvmhxf8.coredump

@pykello
Contributor Author

pykello commented Nov 6, 2023

Linux kernel code that prints the expected message:

https://github.com/torvalds/linux/blob/d2f51b3516dade79269ff45eae2a7668ae711b25/arch/x86/kernel/smpboot.c#L1231-L1232

        pr_info("CPU0: ");
        print_cpu_info(&cpu_data(0));

https://stackoverflow.com/questions/60803129/kernel-debugging-vmlinux-gdb-py-fails-to-run-on-gdb

git clone --branch v5.15 --depth 1 https://github.com/torvalds/linux
cd linux
make oldconfig
make scripts_gdb
sudo apt install linux-image-5.15.0-83-generic-dbgsym

sudo gdb -q

(gdb) add-auto-load-safe-path ~/projects/linux/
(gdb) source vmlinux-gdb.py
(gdb) target remote /tmp/gdb.sock
(gdb) c

https://www.kernel.org/doc/html/latest/dev-tools/gdb-kernel-debugging.html
https://wiki.ubuntu.com/DebuggingKernelWithQEMU
cloud-hypervisor/cloud-hypervisor#3213
https://mhcerri.github.io/posts/debugging-the-ubuntu-kernel-with-gdb-and-qemu/

@pykello
Contributor Author

pykello commented Nov 9, 2023

sudo apt install linux-image-6.2.0-1014-azure-dbgsym

@fdr
Collaborator

fdr commented Dec 21, 2023

Ah, we forgot to close this. It was indeed https://gitlab.com/qemu-project/qemu/-/issues/1696. After identifying that the default kernel shipped on the GitHub Action image does not have that kernel patch applied, I inspected the Azure 6.5 variant of the kernel package and found that the tick patch was applied. We tried it, and the problem went away.

@fdr fdr closed this as completed Dec 21, 2023