tests: Bump DevStack to Dalmatian (2024.2) #2742

Open
wants to merge 9 commits into
base: master

Conversation

stephenfin
Member

@stephenfin stephenfin commented Dec 9, 2024

What this PR does / why we need it:

Bump the version of DevStack used in CI from Bobcat (2023.2), which is now EOL, to Dalmatian (2024.2). A future change will bump this further to Epoxy (2025.1).

Which issue this PR fixes(if applicable):

(none)

Special notes for reviewers:

(none)

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 9, 2024
@k8s-ci-robot k8s-ci-robot requested review from dulek and zetaab December 9, 2024 12:56
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Dec 9, 2024
@stephenfin
Member Author

stephenfin commented Dec 9, 2024

/hold

This is the second attempt after the first was reverted (#2730). I need to see how this performs. fwiw though, I saw no performance issues locally.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 9, 2024
@kayrus
Contributor

kayrus commented Dec 9, 2024

@stephenfin see #2730

@stephenfin
Member Author

@stephenfin see #2730

Yup, see my comment right above 😄

@EmilienM
Contributor

I wonder if #2747 would help.

@kayrus
Contributor

kayrus commented Dec 12, 2024

/retest

@kayrus
Contributor

kayrus commented Dec 12, 2024

/test openstack-cloud-csi-manila-e2e-test
previously manila tests took 49m29s
cinder tests took 1h50m18s and failed due to timeout

@kayrus
Contributor

kayrus commented Dec 12, 2024

/test openstack-cloud-csi-manila-e2e-test

@kayrus
Contributor

kayrus commented Dec 13, 2024

@EmilienM looks like #2747 doesn't help

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2025
@kayrus
Contributor

kayrus commented Mar 13, 2025

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2025
@stephenfin stephenfin changed the title tests: Bump DevStack to Dalmatian (2024.2) tests: Bump DevStack to Expoxy (2025.1) May 8, 2025
@stephenfin stephenfin changed the title tests: Bump DevStack to Expoxy (2025.1) tests: Bump DevStack to Epoxy (2025.1) May 8, 2025
@stephenfin
Member Author

Error due to missing zpool module param:

+ lib/host:configure_zswap:45              :   sudo tee /sys/module/zswap/parameters/zpool
z3fold
tee: /sys/module/zswap/parameters/zpool: No such file or directory 

However, once again we appear to have ended up with a Jammy image despite requesting Noble 😕 Investigating.
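The trace above shows the write failing because the `zpool` parameter file does not exist on the host. A minimal sketch of a guard for that case, assuming we would rather skip a missing knob than abort the run; `set_module_param` and its `root` argument are hypothetical names for illustration, not part of DevStack:

```python
from pathlib import Path

# Default location of the zswap parameters, as seen in the trace above.
ZSWAP_PARAMS = Path("/sys/module/zswap/parameters")


def set_module_param(param: str, value: str, root: Path = ZSWAP_PARAMS) -> bool:
    """Write `value` to a module parameter file if the kernel exposes it.

    Returns True if the value was written, False if the parameter file
    is absent (in which case nothing is written).
    """
    target = root / param
    if not target.exists():
        # The running kernel has no such knob: skip instead of failing.
        return False
    target.write_text(value + "\n")
    return True
```

Writing under `/sys` needs root, so in practice something like `set_module_param("zpool", "z3fold")` would only work in a privileged context. The `root` argument exists mainly so the function can be exercised against a temporary directory.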

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zetaab for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mnaser
Contributor

mnaser commented May 8, 2025

@stephenfin thanks for picking this up, fwiw..

https://review.opendev.org/c/openstack/devstack/+/942755

also, expect to see some failures because of:

#2884

@k8s-ci-robot k8s-ci-robot removed the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 8, 2025
@stephenfin stephenfin changed the title tests: Bump DevStack to Epoxy (2025.1) tests: Bump DevStack to Dalmatian (2024.2) May 8, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 8, 2025
@stephenfin
Member Author

also, expect to see some failures because of:
#2884

Thanks. It might make sense to stick with 2024.2, fix that, then bump to 2025.1 so. Will think on it 🤔

I've done this.

@stephenfin
Member Author

Turns out we were never running against Ubuntu 24.04. While Boskos reaps networks, instances, disks etc., it doesn't reap images. We've likely been using the same (Ubuntu 22.04) image for who knows how long at this point 😅

https://github.com/kubernetes-sigs/boskos/blob/5993cef5a1c719c33c0936d416b7d935058e1204/cmd/janitor/gcp_janitor.py#L38
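One way to guard against a leftover image being silently reused is to put the Ubuntu release in the image name, so a request for 24.04 can never match a stale 22.04 image. A rough sketch of the idea only; the naming scheme and both helper names are hypothetical and not taken from this repo or from Boskos:

```python
from typing import List, Optional


def image_name(prefix: str, ubuntu_version: str) -> str:
    """Build an image name that pins the Ubuntu release, e.g. 24.04 -> 2404."""
    return f"{prefix}-ubuntu-{ubuntu_version.replace('.', '')}"


def pick_image(existing: List[str], prefix: str, ubuntu_version: str) -> Optional[str]:
    """Return the matching image name, or None if it must be created first."""
    wanted = image_name(prefix, ubuntu_version)
    return wanted if wanted in existing else None
```

With only a `devstack-ubuntu-2204` image present, `pick_image(..., "devstack", "24.04")` returns `None`, which forces a fresh image rather than quietly falling back to the old one.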

stephenfin added 7 commits May 9, 2025 13:17
While boskos will reap most resources for us, it doesn't reap images
[1]. This has resulted in us using the same image for who knows how long
at this point.

Encode the Ubuntu version to prevent us picking up another version by
mistake.

[1] https://github.com/kubernetes-sigs/boskos/blob/5993cef5a1c719c33c0936d416b7d935058e1204/cmd/janitor/gcp_janitor.py#L46-L88

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Use the Ubuntu 24.04 version, rather than the 22.04 version.
This aligns with what we're using for DevStack itself.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
It's all Python 3 now, baby.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Per the Ansible 2.19 porting guide [1].

[1] https://ansible.readthedocs.io/projects/ansible-core/devel/porting_guides/porting_guide_core_2.19.html

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
So that we actually get test results.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Add a timeout to the Manila job and otherwise move some lines around.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 12, 2025
@stephenfin
Member Author

stephenfin commented May 12, 2025

Investigating the performance degradation by comparing two recent builds: the last passing one and this failing one.

DevStack is about 40% slower to deploy, at 652 seconds (10m52s) versus 467 seconds (7m47s), but that difference is so small and so variable (based on other failures in between) as to be irrelevant. Looks like it's the tests themselves that take longer. I'm going to rework things so we actually get a response back from ginkgo if the test run fails.
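As a quick sanity check on the two deploy times quoted above (467 s and 652 s), the relative slowdown can be computed directly. `percent_change` is just an illustrative helper:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100


faster_deploy = 7 * 60 + 47   # 467 seconds
slower_deploy = 10 * 60 + 52  # 652 seconds

print(f"{percent_change(faster_deploy, slower_deploy):.1f}% slower")  # prints "39.6% slower"
```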

@k8s-ci-robot
Contributor

@stephenfin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
openstack-cloud-csi-cinder-e2e-test d827feb link true /test openstack-cloud-csi-cinder-e2e-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@stephenfin
Member Author

stephenfin commented May 12, 2025

Looks like there are some very significant changes in runtime for tests across the board. Now to figure out why. I've been using the below script to compare results from JUnit files (specifically, the JUnit files from the last success and the most recent failure). The result can be seen in results.csv.

#!/usr/bin/env python3

import csv

from lxml import etree


def diff(before: str, after: str):
    with open(before) as fh:
        passing = etree.parse(fh)

    with open(after) as fh:
        failing = etree.parse(fh)

    passing_results = {}
    results_diff = {}

    for testcase in passing.findall('.//testcase'):
        passing_results[testcase.get('name')] = (
            testcase.get('status'), testcase.get('time')
        )

    for testcase in failing.findall('.//testcase'):
        name = testcase.get('name')
        if name not in passing_results:
            raise Exception('tests missing from runs: this should not happen')

        if (
            testcase.get('status') != passing_results[name][0] or
            testcase.get('status') != 'skipped'
        ):
            results_diff[testcase.get('name')] = {
                'before': passing_results[name],
                'after': (testcase.get('status'), testcase.get('time')),
            }

    with open('results.csv', 'w', newline='') as fh:
        writer = csv.writer(fh)

        for name, result in results_diff.items():
            # skip suite-level setup/teardown/report nodes
            if name in {
                '[ReportBeforeSuite]',
                '[SynchronizedBeforeSuite]',
                '[SynchronizedAfterSuite]',
                '[ReportAfterSuite] Kubernetes e2e suite report',
            }:
                continue

            if result['before'][0] != result['after'][0]:
                # we might want to look at this later
                continue

            before_sec = float(result['before'][1])
            after_sec = float(result['after'][1])

            if before_sec == 0:
                # no baseline to compare against; avoid dividing by zero
                continue

            # relative change in runtime, as a percentage of the baseline
            change_pct = ((after_sec - before_sec) / before_sec) * 100
            print(f'{name}')
            print(f'\tbefore: {before_sec:0.2f} seconds')
            print(f'\tafter:  {after_sec:0.2f} seconds')
            print(f'\tchange: {change_pct:0.2f}%')

            writer.writerow([name, before_sec, after_sec, change_pct])


def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'before',
        help='Before result (passing)',
    )
    parser.add_argument(
        'after',
        help='After result (failing)',
    )
    args = parser.parse_args()
    diff(args.before, args.after)


if __name__ == '__main__':
    main()
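For anyone who wants to try the comparison without real artifacts or lxml, here is a stdlib-only sketch of the same idea run against two tiny inline JUnit documents. The test names and timings are made up for illustration, and `timings` / `slowdowns` are hypothetical helpers, not part of the script above:

```python
import xml.etree.ElementTree as ET

# Made-up sample data standing in for the "last success" and "latest failure" files.
BEFORE = """<testsuite>
  <testcase name="volume attach" status="passed" time="10.0"/>
  <testcase name="snapshot" status="skipped" time="0"/>
</testsuite>"""

AFTER = """<testsuite>
  <testcase name="volume attach" status="passed" time="25.0"/>
  <testcase name="snapshot" status="skipped" time="0"/>
</testsuite>"""


def timings(xml_text):
    """Map each testcase name to a (status, seconds) tuple."""
    root = ET.fromstring(xml_text)
    return {
        tc.get("name"): (tc.get("status"), float(tc.get("time")))
        for tc in root.iter("testcase")
    }


def slowdowns(before_xml, after_xml):
    """Percentage runtime change for tests that ran with the same status in both."""
    before = timings(before_xml)
    changes = {}
    for name, (status, after_sec) in timings(after_xml).items():
        if name not in before:
            continue
        before_status, before_sec = before[name]
        # Ignore status changes, skipped tests, and zero baselines.
        if status != before_status or status == "skipped" or before_sec == 0:
            continue
        changes[name] = (after_sec - before_sec) / before_sec * 100
    return changes


print(slowdowns(BEFORE, AFTER))  # prints {'volume attach': 150.0}
```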

@stephenfin
Member Author

Let's see if we get the same performance issues in Caracal, since that cuts our diff in half. Proposed here.

@gouthampacha
Contributor

Let's see if we get the same performance issues in Caracal, since that cuts our diff in half. Proposed #2888.

Tangentially, since we have limited resources, i think in this repo, we should only test with SLURP releases.. i.e., 2024.1 and 2025.1 are more appropriate/relevant than the .2 releases due to their popularity .. we could override this for individual test jobs if necessary to a .2 release..

@stephenfin
Member Author

Let's see if we get the same performance issues in Caracal, since that cuts our diff in half. Proposed #2888.

Tangentially, since we have limited resources, i think in this repo, we should only test with SLURP releases.. i.e., 2024.1 and 2025.1 are more appropriate/relevant than the .2 releases due to their popularity .. we could override this for individual test jobs if necessary to a .2 release..

I agree.

@stephenfin
Member Author

stephenfin commented Jun 5, 2025

I'm currently deploying Bobcat locally using the below local.conf, generated with changes from #2905.

[[local|localrc]]
RECLONE=False
#HOST_IP={{ local_ip_address }}
DEST=/opt/stack
DATA_DIR=${DEST}/data
USE_PYTHON3=True
LOGFILE=$DEST/logs/stack.sh.log
VERBOSE=True
LOG_COLOR=False
LOGDAYS=1
SERVICE_TIMEOUT=300

DATABASE_PASSWORD=password
ADMIN_PASSWORD=password
SERVICE_PASSWORD=password
SERVICE_TOKEN=password
RABBIT_PASSWORD=password

GIT_BASE=https://github.com
TARGET_BRANCH=2023.2-eol

ENABLED_SERVICES=rabbit,mysql,key

# Host tuning
# From: https://opendev.org/openstack/devstack/src/commit/05f7d302cfa2da73b2887afcde92ef65b1001194/.zuul.yaml#L645-L662
# Tune the host to optimize memory usage and hide io latency
# these setting will configure the kernel to treat the host page
# cache and swap with equal priority, and prefer deferring writes
# changing the default swappiness, dirty_ratio and
# the vfs_cache_pressure
ENABLE_SYSCTL_MEM_TUNING=true
# The net tuning optimizes ipv4 tcp fast open and configures the default
# qdisc policy to pfifo_fast, which effectively disables all qos.
# this minimizes the cpu load of the host network stack
ENABLE_SYSCTL_NET_TUNING=true
# zswap allows the kernel to compress pages in memory before swapping
# them to disk. this can reduce the amount of swap used and improve
# performance. effectively this trades a small amount of cpu for an
# increase in swap performance by reducing the amount of data
# written to disk. the overall speedup is proportional to the
# compression ratio and the speed of the swap device.
ENABLE_ZSWAP=false

# Nova
enable_service n-api
enable_service n-cpu
enable_service n-cond
enable_service n-sch
enable_service n-api-meta

enable_service placement-api
enable_service placement-client

# Glance
enable_service g-api
enable_service g-reg

# Cinder
enable_service cinder
enable_service c-api
enable_service c-vol
enable_service c-sch

# Neutron
enable_plugin neutron ${GIT_BASE}/openstack/neutron.git 2023.2-eol
enable_service q-svc
enable_service q-ovn-metadata-agent
enable_service q-trunk
enable_service q-qos
enable_service ovn-controller
enable_service ovn-northd
enable_service ovs-vswitchd
enable_service ovsdb-server

ML2_L3_PLUGIN="ovn-router,trunk,qos"
OVN_L3_CREATE_PUBLIC_NETWORK="True"
PUBLIC_BRIDGE_MTU="1430"

IP_VERSION=4
IPV4_ADDRS_SAFE_TO_USE=10.1.0.0/26
FIXED_RANGE=10.1.0.0/26
NETWORK_GATEWAY=10.1.0.1
FLOATING_RANGE=172.24.5.0/24
PUBLIC_NETWORK_GATEWAY=172.24.5.1

# Add a pre-install script to upgrade pip and setuptools
[[local|pre-install]]
# Activate the virtual environment and upgrade pip and setuptools
if [ -f /opt/stack/data/venv/bin/activate ]; then
    source /opt/stack/data/venv/bin/activate
    pip install --upgrade pip setuptools
    deactivate
fi

[[post-config|$GLANCE_API_CONF]]
[glance_store]
default_store = file

[[post-config|$NEUTRON_CONF]]
[DEFAULT]
global_physnet_mtu = 1430

Sharing in case it helps anyone else.
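If you adapt the network settings in that local.conf, a small check that each gateway actually falls inside its range can save a failed deploy. A sketch using only the values shown above; `gateway_in_range` is an illustrative helper, not something DevStack provides:

```python
import ipaddress


def gateway_in_range(cidr: str, gateway: str) -> bool:
    """Return True if `gateway` is an address inside the `cidr` network."""
    return ipaddress.ip_address(gateway) in ipaddress.ip_network(cidr)


# FIXED_RANGE / NETWORK_GATEWAY and FLOATING_RANGE / PUBLIC_NETWORK_GATEWAY
# as they appear in the local.conf above.
print(gateway_in_range("10.1.0.0/26", "10.1.0.1"))      # prints True
print(gateway_in_range("172.24.5.0/24", "172.24.5.1"))  # prints True
```

Note that a /26 only covers 64 addresses (10.1.0.0 to 10.1.0.63), so a gateway such as 10.1.0.65 would fail this check.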

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
7 participants