Flaky ClusterAPI OpenStack tests 

### Describe the bug
We currently have issues executing CAPO tests on the OpenStack from CityNetwork. Depending on which VMs are used for the test, it either always succeeds or always fails. My guess is that it depends on the CPUs used.

### Related Job information

* job name: cluster-api-provider-openstack-current-acceptance-test-v1.18
* failed jobs:
  * https://logs.openlabtesting.org/logs/33/533/9fd1b8854981adced3b997f0eb7413c3d1d9b1ea/check/cluster-api-provider-openstack-current-acceptance-test-v1.18/846b1c0/
  * https://logs.openlabtesting.org/logs/33/533/9fd1b8854981adced3b997f0eb7413c3d1d9b1ea/check/cluster-api-provider-openstack-current-acceptance-test-v1.18-1/a657070/
  * https://logs.openlabtesting.org/logs/33/533/75ca3a7c0b1b93d694232cd288169c9b2969f58b/check/cluster-api-provider-openstack-current-acceptance-test-v1.18/480b183/
* successful jobs:
  * https://logs.openlabtesting.org/logs/33/533/cd08a0a397ca17bb36aefab2902a31d3fe1eb206/check/cluster-api-provider-openstack-current-acceptance-test-v1.18-1/202b9bc/

### Steps to reproduce the issue
Describe how to reproduce the issue:
* comment "recheck" on this PR: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/533


### Additional context

The CAPO tests install a devstack and then spin up a Kubernetes cluster via CAPO. In failed runs the Kubernetes master node doesn't come up (the server boots, but the Kubernetes master components are in CrashLoop because of CPU starvation). I guess that's caused either by over-provisioning or by some problem with nested virtualization. Load average on the master node is either 20-40 (failed runs) or 2-3 (successful runs). I don't really think it's over-provisioning, because the overall performance on the host (not the master node) looks very similar (14 min vs. 16 min to install devstack).
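
To make the load-average comparison repeatable, something like the following could be run on the master node. This is only a sketch: the function name `is_cpu_starved` and the threshold `factor = 2.0` are my own assumptions, and the vCPU count of the master node is not recorded in this issue, so the script simply uses whatever the OS reports.

```python
import os


def is_cpu_starved(load_1min: float, cpu_count: int, factor: float = 2.0) -> bool:
    """Return True when the 1-minute load average exceeds factor * cpu_count.

    The factor is an arbitrary illustration value, not a measured threshold.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_1min > factor * cpu_count


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"load1={load_1min:.2f} cpus={cpus} starved={is_cpu_starved(load_1min, cpus)}")
```

Assuming a node with 4 vCPUs (my assumption), a load of 20-40 as seen in the failed runs would be flagged, while 2-3 as seen in the successful runs would not.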

I compared the runs. That's what I found:
Failed runs always use: (https://logs.openlabtesting.org/logs/33/533/cd08a0a397ca17bb36aefab2902a31d3fe1eb206/check/cluster-api-provider-openstack-current-acceptance-test-v1.18/a113326/zuul-info/host-info.ubuntu-bionic-large.yaml)
```
ansible_processor:
  - '0'
  - GenuineIntel
  - Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)
ansible_product_name: OpenStack Compute
```

Successful runs always use: (https://logs.openlabtesting.org/logs/33/533/cd08a0a397ca17bb36aefab2902a31d3fe1eb206/check/cluster-api-provider-openstack-current-acceptance-test-v1.18-1/202b9bc/zuul-info/host-info.ubuntu-bionic-large.yaml)
```
ansible_processor:
  - '0'
  - GenuineIntel
  - Intel Xeon E3-12xx v2 (Ivy Bridge)
ansible_product_name: OpenStack Nova
```
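
The comparison above can be sketched as a small diff over the two sets of facts. The values are copied from the two host-info excerpts; the helper name `diff_facts` is hypothetical, and the facts are embedded as dicts instead of being parsed from the YAML files to keep the example free of third-party dependencies.

```python
def diff_facts(failed: dict, passed: dict) -> dict:
    """Return {key: (failed_value, passed_value)} for every key that differs."""
    keys = sorted(set(failed) | set(passed))
    return {
        key: (failed.get(key), passed.get(key))
        for key in keys
        if failed.get(key) != passed.get(key)
    }


# Facts from a failed run (copied from the host-info excerpt).
failed_host = {
    "ansible_processor": ["0", "GenuineIntel", "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)"],
    "ansible_product_name": "OpenStack Compute",
}

# Facts from a successful run (copied from the host-info excerpt).
passed_host = {
    "ansible_processor": ["0", "GenuineIntel", "Intel Xeon E3-12xx v2 (Ivy Bridge)"],
    "ansible_product_name": "OpenStack Nova",
}

for key, (bad, good) in diff_facts(failed_host, passed_host).items():
    print(f"{key}: failed={bad!r} passed={good!r}")
```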
I already tried the following libvirt configurations, one at a time. None of them made any difference:
```
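# Attempt 1: host-model with the pcid flag added
[libvirt]
cpu_mode = host-model
cpu_model_extra_flags = pcid

# Attempt 2: custom IvyBridge model with the pcid flag added
[libvirt]
cpu_mode = custom
cpu_model = IvyBridge
cpu_model_extra_flags = pcid

# Attempt 3: pass the host CPU through unchanged
[libvirt]
cpu_mode = host-passthrough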
```
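
To see what these settings actually expose, it may help to check the CPU flags and the KVM nested setting on the devstack host. The sketch below assumes an Intel host with the `kvm_intel` module loaded; the helper names `cpu_flags` and `nested_enabled` are hypothetical, and I have not run this on the CityNetwork nodes.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the CPU flags from the first 'flags' line of /proc/cpuinfo content."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            return set(value.split())
    return set()


def nested_enabled(param_text: str) -> bool:
    """Interpret the kvm_intel 'nested' parameter, which reads 'Y' or '1' when enabled."""
    return param_text.strip() in {"Y", "1"}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    nested = Path("/sys/module/kvm_intel/parameters/nested")
    if cpuinfo.exists():
        flags = cpu_flags(cpuinfo.read_text())
        # vmx is needed for hardware-accelerated nested guests on Intel.
        print("vmx:", "vmx" in flags, "pcid:", "pcid" in flags)
    if nested.exists():
        print("nested:", nested_enabled(nested.read_text()))
```

If `vmx` is missing inside the test VM, the nested guests would presumably fall back to software emulation, which could explain the high load, but that is a guess and not something confirmed by the logs above.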

I'm not sure whether the hosts have different CPUs or just different microcode patch levels. Does anybody have an idea what could cause this, or whether I can solve it through (libvirt) configuration? I'm also curious whether different OpenStack distros or versions are used on the CityNetwork side (the product name is reported as "OpenStack Nova" vs. "OpenStack Compute").

Solutions could be:
* pinning the job to the servers which work
* configuring libvirt some other way so that nested virt is performant
* "fixing" the CPUs on which the job fails (not sure if there is anything that could be done)
