Describe the bug
We currently have issues executing CAPO tests on the OpenStack from CityNetwork. Depending on which VMs are used for the test, it either always succeeds or always fails. My guess is that it depends on the CPUs used.
Related Job information
- job name: cluster-api-provider-openstack-current-acceptance-test-v1.18
- failed jobs:
- successful jobs:
Steps to reproduce the issue
Describe how to reproduce the issue:
Additional context
The CAPO tests install a devstack and then spin up a Kubernetes cluster via CAPO. In failed runs the Kubernetes master node doesn't come up (the server boots, but the Kubernetes master components are in CrashLoop because of CPU starvation). I guess that's caused either by over-provisioning or by some problem with nested virtualization. Load average on the master node is either 20-40 (failed runs) or 2-3 (successful runs). I don't really think it's over-provisioning, because the overall performance on the host (not the master node) looks very similar (14 min vs 16 min to install devstack).
I compared the runs. This is what I found:
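To make the load comparison repeatable, something like the following could be run on the master node. It is only a rough sketch: the helper name `cpu_starved` and the threshold of twice the CPU count are my own assumptions, not something the job currently does.

```python
import os


def cpu_starved(load1: float, ncpus: int, factor: float = 2.0) -> bool:
    """Return True when the 1-minute load average exceeds factor * ncpus.

    The factor of 2.0 is an arbitrary assumption chosen for illustration.
    """
    return load1 > factor * ncpus


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    load1, _, _ = os.getloadavg()
    ncpus = os.cpu_count() or 1
    print(f"load1={load1:.1f} cpus={ncpus} starved={cpu_starved(load1, ncpus)}")
```

With a hypothetical 4 vCPU master node, a load of 20-40 would be flagged and a load of 2-3 would not.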
Failed runs always use: (https://logs.openlabtesting.org/logs/33/533/cd08a0a397ca17bb36aefab2902a31d3fe1eb206/check/cluster-api-provider-openstack-current-acceptance-test-v1.18/a113326/zuul-info/host-info.ubuntu-bionic-large.yaml)
ansible_processor:
- '0'
- GenuineIntel
- Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)
ansible_product_name: OpenStack Compute
Successful runs always use: (https://logs.openlabtesting.org/logs/33/533/cd08a0a397ca17bb36aefab2902a31d3fe1eb206/check/cluster-api-provider-openstack-current-acceptance-test-v1.18-1/202b9bc/zuul-info/host-info.ubuntu-bionic-large.yaml)
ansible_processor:
- '0'
- GenuineIntel
- Intel Xeon E3-12xx v2 (Ivy Bridge)
ansible_product_name: OpenStack Nova
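The only visible difference in the processor facts is the "IBRS" suffix in the model name. A small check like the one below could tell the two host types apart early in a job. This is a sketch under assumptions: the function name `is_ibrs_variant` is made up for illustration, and it simply looks for the string "IBRS" in the `ansible_processor` entries shown above.

```python
from typing import Sequence


def is_ibrs_variant(ansible_processor: Sequence[str]) -> bool:
    """Return True if any processor entry mentions IBRS in its model name."""
    return any("IBRS" in entry for entry in ansible_processor)


# Values copied from the two host-info files linked above.
failed_host = ["0", "GenuineIntel", "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)"]
successful_host = ["0", "GenuineIntel", "Intel Xeon E3-12xx v2 (Ivy Bridge)"]

if __name__ == "__main__":
    print("failed host is IBRS variant:", is_ibrs_variant(failed_host))
    print("successful host is IBRS variant:", is_ibrs_variant(successful_host))
```

This only distinguishes the reported model string. It does not prove that the IBRS variant is the cause of the slowdown.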
I already tried the following libvirt configurations. None of them made any difference:
[libvirt]
cpu_mode = host-model
cpu_model_extra_flags = pcid
[libvirt]
cpu_mode = custom
cpu_model = IvyBridge
cpu_model_extra_flags = pcid
[libvirt]
cpu_mode = host-passthrough
I'm not sure if there are different CPUs or just different microcode patch levels used. Does anybody have an idea what could cause this or if I can solve this by (libvirt) configuration? I'm also curious if there are different OpenStack distros or versions used on CityNetwork side (OpenStack Nova vs OpenStack Compute).
Solutions could be:
- pinning the job to the servers which work
- configuring libvirt some other way so that nested virt is performant
- "fixing" the CPUs on which the job fails (not sure if there is anything that could be done)