
Flaky ClusterAPI OpenStack tests  #504

Description

@sbueringer

Describe the bug

We currently have issues executing CAPO tests on the OpenStack cloud from CityNetwork. Depending on which VMs are used for the test, it either always succeeds or always fails. My guess is that it depends on the CPUs used.

Related Job information

Steps to reproduce the issue

Describe how to reproduce the issue:

Additional context

The CAPO tests install a devstack and then spin up a Kubernetes cluster via CAPO. In failed runs the Kubernetes master node doesn't come up: the server boots, but the Kubernetes master components are in CrashLoop because of CPU starvation. I guess that's caused either by over-provisioning or by some problem with nested virtualization. Load average on the master node is either 20-40 (failed runs) or 2-3 (successful runs). I don't really think it's over-provisioning, because the overall performance on the host (not the master node) looks very similar (14 min vs 16 min to install devstack).
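To tell the two cases apart automatically, the load on the master node could be sampled and labelled. This is only a rough sketch: the function name `classify_load` and both thresholds are assumptions picked to sit between the 2-3 and 20-40 ranges reported above, not values taken from the job.

```python
import os

# Assumed thresholds, loosely based on the numbers in this report
# (2-3 on successful runs, 20-40 on failed runs). Tune them per node size.
HEALTHY_MAX = 5.0
STARVED_MIN = 15.0


def classify_load(load1: float) -> str:
    """Map a 1-minute load average to a coarse label."""
    if load1 >= STARVED_MIN:
        return "starved"
    if load1 <= HEALTHY_MAX:
        return "healthy"
    return "unclear"


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load1, load5, load15 = os.getloadavg()
    print(f"load averages: {load1:.2f} {load5:.2f} {load15:.2f} -> {classify_load(load1)}")
```

Run on the master node shortly after boot, this would print the three load averages and one label, which could be attached to the job logs.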

I compared the runs. That's what I found:
Failed runs always use: (https://logs.openlabtesting.org/logs/33/533/cd08a0a397ca17bb36aefab2902a31d3fe1eb206/check/cluster-api-provider-openstack-current-acceptance-test-v1.18/a113326/zuul-info/host-info.ubuntu-bionic-large.yaml)

ansible_processor:
  - '0'
  - GenuineIntel
  - Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)
ansible_product_name: OpenStack Compute

Successful runs always use: (https://logs.openlabtesting.org/logs/33/533/cd08a0a397ca17bb36aefab2902a31d3fe1eb206/check/cluster-api-provider-openstack-current-acceptance-test-v1.18-1/202b9bc/zuul-info/host-info.ubuntu-bionic-large.yaml)

ansible_processor:
  - '0'
  - GenuineIntel
  - Intel Xeon E3-12xx v2 (Ivy Bridge)
ansible_product_name: OpenStack Nova
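The comparison above was done by eye. A small helper along these lines could diff the host facts of any two runs. It is a sketch: `diff_facts` is a hypothetical name, and the two dictionaries only contain the facts quoted above, typed in by hand rather than loaded from the `host-info` YAML files.

```python
def diff_facts(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Facts copied from the two host-info files linked above.
failed = {
    "ansible_processor": ["0", "GenuineIntel", "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)"],
    "ansible_product_name": "OpenStack Compute",
}
succeeded = {
    "ansible_processor": ["0", "GenuineIntel", "Intel Xeon E3-12xx v2 (Ivy Bridge)"],
    "ansible_product_name": "OpenStack Nova",
}

for key, (bad, good) in diff_facts(failed, succeeded).items():
    print(f"{key}:\n  failed:     {bad}\n  successful: {good}")
```

With the full fact files loaded into dictionaries instead, the same function would also surface differences that are easy to miss when reading the files side by side.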

I already tried the following libvirt configurations. None of them made any difference:

[libvirt]
cpu_mode = host-model
cpu_model_extra_flags = pcid

[libvirt]
cpu_mode = custom
cpu_model = IvyBridge
cpu_model_extra_flags = pcid

[libvirt]
cpu_mode = host-passthrough
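To see whether settings like these actually reach the nested guest, the CPU flags inside the VM could be inspected. The sketch below parses `/proc/cpuinfo`-style text; the helper names `cpu_flags` and `report` are hypothetical, and the list of flags to look for (`pcid`, `vmx`, `ibrs`) is an assumption about what might be relevant here.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the flags of the first CPU listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def report(flags: set, wanted=("pcid", "vmx", "ibrs")) -> dict:
    """Say for each flag of interest whether the CPU exposes it."""
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(report(cpu_flags(cpuinfo.read_text())))
    # On Intel hosts the kvm_intel module exposes whether nesting is enabled.
    nested = Path("/sys/module/kvm_intel/parameters/nested")
    if nested.exists():
        print("kvm_intel nested:", nested.read_text().strip())
```

Running this both on the devstack host and inside the master node would show whether a flag such as `pcid` gets lost on the way into the nested guest.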

I'm not sure whether these are different CPUs or just different microcode patch levels. Does anybody have an idea what could cause this, or whether I can solve it through (libvirt) configuration? I'm also curious whether different OpenStack distros or versions are used on the CityNetwork side (OpenStack Nova vs OpenStack Compute).

Solutions could be:

  • pinning the job to the servers which work
  • configuring libvirt some other way so that nested virt is performant
  • "fixing" the CPUs on which the job fails (not sure if there is anything that could be done)
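As a stopgap for the first option, the job could check the CPU model up front and bail out early on hosts that look like the failing ones. This is a sketch under a strong assumption: that the `IBRS` suffix in the model string marks the problematic hosts. So far that is only a correlation seen in the two runs compared above. The name `is_suspect_host` is hypothetical.

```python
# Assumption: the "IBRS" suffix in the CPU model string marks the hosts on
# which the job failed. This is a correlation from two runs, not a known cause.
SUSPECT_MARKERS = ("IBRS",)


def is_suspect_host(ansible_processor) -> bool:
    """Return True if any entry of the processor fact contains a suspect marker."""
    return any(
        marker in str(entry)
        for entry in ansible_processor
        for marker in SUSPECT_MARKERS
    )


if __name__ == "__main__":
    facts = ["0", "GenuineIntel", "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)"]
    print("suspect host" if is_suspect_host(facts) else "host looks fine")
```

A check like this would not fix anything, but it would turn a long, confusing failure into a fast and clearly labelled one until proper pinning is in place.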
