fix(ci): improve K8s nightly stability - readiness check and CrashLoopBackOff detection #4790

Draft

zdrapela wants to merge 3 commits into redhat-developer:main from zdrapela:fix/k8s-nightly-crashloop-readiness

Conversation

@zdrapela
Member

Summary

Fixes the root causes of consistent AKS/EKS/GKE nightly E2E job failures (100% failure rate across all 3 K8s platforms for the last 2+ weeks).

Root Causes Identified

  1. RBAC phase aborted by CrashLoopBackOff fast-fail: The lightspeed-core sidecar (shipped in chart 1.10-114-CI with lightspeed-stack:0.5.0) crashes on all platforms including GKE. On GKE, the backstage-backend happens to become ready before the detection fires, so tests run. On AKS/EKS, the detection fires first and aborts the deployment — 0 RBAC tests ever ran.

  2. Showcase phase test failures (guest sign-in / 503s): The CI health check used curl -I <root_url> (HEAD to the frontend), which returns 200 as soon as the ingress serves the SPA — before the backend API (including auth) is initialized. Tests started too early and hit 503s on /api/auth/guest/refresh. Additionally, the ALB/nginx ingress intermittently returns 502 during backend health propagation even after the first successful response.

Changes

  • Readiness endpoint: Use /.backstage/health/v1/readiness instead of the root URL. This endpoint returns 503 until all backend plugins (including auth) complete initialization.
  • Stabilization check: Require 3 consecutive successful readiness responses (at 5s intervals) before declaring Backstage ready, preventing ingress propagation race conditions (see the sketch after this list).
  • CrashLoopBackOff exclusion: Exclude lightspeed-core sidecar from the fast-fail detection. The sidecar is non-essential for E2E tests (GKE proves this — all 63 RBAC tests pass with lightspeed-core in CrashLoopBackOff). The fallback check is narrowed to Init:CrashLoopBackOff only (still catches init container crashes like the install-dynamic-plugins zip bomb).
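
For illustration, a minimal sketch of the readiness-plus-stabilization loop described above. Variable names, the attempt count, and the sleep intervals are placeholders (`BASE_URL` stands in for the ingress root URL); the actual implementation lives in `.ci/pipelines/lib/testing.sh`.

```bash
# Sketch only: poll the Backstage readiness endpoint and require several
# consecutive 200s before declaring the instance ready. Values are illustrative.
readiness_url="${BASE_URL}/.backstage/health/v1/readiness"
required_ok=3        # consecutive 200 responses required
stabilize_wait=5     # seconds between stabilization checks
consecutive_ok=0

for attempt in $(seq 1 60); do
  http_status=$(curl --insecure -s -o /dev/null -w "%{http_code}" "${readiness_url}" || echo "000")
  echo "Readiness attempt ${attempt}: HTTP ${http_status}"
  if [[ "${http_status}" -eq 200 ]]; then
    consecutive_ok=$((consecutive_ok + 1))
    if [[ ${consecutive_ok} -ge ${required_ok} ]]; then
      echo "Backstage is up and running!"
      exit 0
    fi
    sleep "${stabilize_wait}"
  else
    consecutive_ok=0   # any 502/503 during propagation resets the streak
    sleep 10
  fi
done
echo "Backstage did not become ready in time" >&2
exit 1
```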

Verified on EKS

| Test | Before | After |
| --- | --- | --- |
| RBAC deployment | Aborted (0 tests ran) | Pod recovered, 62 tests started |
| Instance health check | Failed (HTML error) | Passed (576ms) |
| Guest sign-in | Failed | Passed (10.9s) |
| Settings page | Failed | Passed |
| Home page customization | Failed | Passed |

@openshift-ci

openshift-ci Bot commented May 11, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@zdrapela
Member Author

/qodo

@zdrapela
Member Author

/test e2e-eks-helm-nightly

@zdrapela
Member Author

/test e2e-aks-helm-nightly

@zdrapela
Member Author

/agentic_review

@rhdh-qodo-merge

rhdh-qodo-merge Bot commented May 11, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)



Action required

1. Crashloop filter hides failures 🐞 Bug ☼ Reliability
Description
CrashLoopBackOff detection filters out any pod line containing the string "lightspeed-core", which
can hide CrashLoopBackOff for essential containers if the pod also has a lightspeed-core sidecar.
This defeats the intended “fail fast on essential pod crashes” behavior and can cause the job to
wait until the overall timeout even though the backend/postgresql is already crash-looping.
Code

.ci/pipelines/lib/testing.sh[R205-209]

      crash_pods=$(oc get pods -n "${namespace}" -l "app.kubernetes.io/instance in (${release_name},redhat-developer-hub,developer-hub,${release_name}-postgresql)" \
-        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{range .status.containerStatuses[*]}{.state.waiting.reason}{end}{range .status.initContainerStatuses[*]}{.state.waiting.reason}{end}{"\n"}{end}' 2> /dev/null | grep -E "CrashLoopBackOff" || true)
+        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{range .status.containerStatuses[*]}{.containerID}{"|"}{.name}{"|"}{.state.waiting.reason}{","}{end}{range .status.initContainerStatuses[*]}{.containerID}{"|"}{.name}{"|"}{.state.waiting.reason}{","}{end}{"\n"}{end}' 2> /dev/null \
+        | grep -E "CrashLoopBackOff" \
+        | grep -v "lightspeed-core" \
+        || true)
Evidence
The jsonpath prints *all* container names for each pod on a single line, then `grep -v
"lightspeed-core"` drops the *entire pod line* (not just the crashing container entry). If any
essential container in that pod is CrashLoopBackOff, the line still contains "lightspeed-core"
(because .name is always printed), so the pod is excluded and the crash is missed.

.ci/pipelines/lib/testing.sh[200-214]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
CrashLoopBackOff detection currently excludes any *pod* line that contains `lightspeed-core`, which can hide CrashLoopBackOff for other containers in the same pod.

### Issue Context
The jsonpath output is formatted as one line per pod and includes every container name (`{.name}`) regardless of state, so `grep -v lightspeed-core` is an unsafe filter.

### Fix Focus Areas
- .ci/pipelines/lib/testing.sh[200-214]

### Implementation direction
- Change the pipeline to filter CrashLoopBackOff at the **container entry** level, not the whole pod line.
- Options:
 - Use `-o json` + `jq` to emit only crashing container statuses and exclude only `.name=="lightspeed-core"`.
 - Or adjust the jsonpath to emit one line per container (pod + container), then filter out only `lightspeed-core` entries while preserving other containers from the same pod.
- Ensure that if `backstage-backend`/`postgresql` containers crash-loop, they are still detected even when a non-essential sidecar exists in the same pod.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
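
A rough sketch of the jq-based option mentioned in the implementation direction above. The `${namespace}` and `${release_name}` variables are reused from the existing snippet; the output format and the overall approach are illustrative, not the final fix.

```bash
# Sketch only: emit one "pod container reason" line per crash-looping container,
# then drop only the lightspeed-core entries so other containers in the same pod
# are still detected.
crash_pods=$(oc get pods -n "${namespace}" \
  -l "app.kubernetes.io/instance in (${release_name},redhat-developer-hub,developer-hub,${release_name}-postgresql)" \
  -o json 2> /dev/null \
  | jq -r '.items[]
      | .metadata.name as $pod
      | ((.status.containerStatuses // []) + (.status.initContainerStatuses // []))[]
      | select(.state.waiting.reason == "CrashLoopBackOff" and .name != "lightspeed-core")
      | "\($pod) \(.name) \(.state.waiting.reason)"' \
  || true)
```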



Remediation recommended

2. Fallback misses crashloops 🐞 Bug ☼ Reliability
Description
The fallback CrashLoopBackOff detection (used when label selection yields no results) now only
matches Init:CrashLoopBackOff, so it can miss regular CrashLoopBackOff in pods not covered by
the primary label selector. This reduces the safety net for the exact cases the fallback is intended
to catch (e.g., pods with different labels/output formats).
Code

.ci/pipelines/lib/testing.sh[R210-214]

      # Also check by name pattern for postgresql pods that may have different labels
+      # and for init container CrashLoopBackOff (e.g., install-dynamic-plugins)
      if [[ -z "${crash_pods}" ]]; then
-        crash_pods=$(oc get pods -n "${namespace}" --no-headers 2> /dev/null | grep -E "(${release_name}|developer-hub|postgresql)" | grep -E "CrashLoopBackOff|Init:CrashLoopBackOff" || true)
+        crash_pods=$(oc get pods -n "${namespace}" --no-headers 2> /dev/null | grep -E "(${release_name}|developer-hub|postgresql)" | grep -E "Init:CrashLoopBackOff" || true)
      fi
Evidence
The code comment states the fallback is for pods that may have different labels, but the fallback
grep was narrowed to init crashloops only; a non-init CrashLoopBackOff in those pods will no longer
be detected by this path.

.ci/pipelines/lib/testing.sh[210-214]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The fallback crash detection only checks `Init:CrashLoopBackOff`, missing `CrashLoopBackOff` for regular containers when the primary label-based query doesn't return pods.

### Issue Context
The fallback exists specifically for pods that might not match the label selector; it should remain broad enough to detect both init and regular container crashloops.

### Fix Focus Areas
- .ci/pipelines/lib/testing.sh[210-214]

### Implementation direction
- Expand the fallback grep back to include both `CrashLoopBackOff` and `Init:CrashLoopBackOff`, **or** add a second fallback branch specifically for regular `CrashLoopBackOff`.
- If the narrowing was intentional for a specific false-positive case, document that and add an alternative reliable detection for regular container crashloops in the fallback scenario.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
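
A sketch of the first option (broadening the fallback back to both patterns), assuming the lightspeed-core exclusion is handled by the primary label-based check rather than here:

```bash
# Sketch only: fallback by name pattern, matching both regular and init
# container crashloops for pods that the label selector may have missed.
if [[ -z "${crash_pods}" ]]; then
  crash_pods=$(oc get pods -n "${namespace}" --no-headers 2> /dev/null \
    | grep -E "(${release_name}|developer-hub|postgresql)" \
    | grep -E "CrashLoopBackOff|Init:CrashLoopBackOff" \
    || true)
fi
```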


3. Curl polling can hang 🐞 Bug ☼ Reliability
Description
The readiness polling uses curl without --connect-timeout/--max-time, so a stalled connection
can hang indefinitely and bypass max_attempts. This risk is amplified because the new
stabilization logic performs multiple additional curl calls per successful attempt.
Code

.ci/pipelines/lib/testing.sh[R165-178]

+    http_status=$(curl --insecure -s -o /dev/null -w "%{http_code}" "${readiness_url}" || echo "000")

    if [[ "${http_status}" -eq 200 ]]; then
-      log::success "Backstage is up and running!"
-      return 0
+      # Require multiple consecutive successful checks to ensure stability.
+      # On K8s platforms, the ingress (ALB/nginx/GCE) may still be propagating
+      # the backend health status and intermittently returning 502/503 even
+      # after the first successful readiness response.
+      local consecutive_ok=1
+      local required_ok=3
+      local stabilize_wait=5
+      while [[ ${consecutive_ok} -lt ${required_ok} ]]; do
+        sleep "${stabilize_wait}"
+        http_status=$(curl --insecure -s -o /dev/null -w "%{http_code}" "${readiness_url}" || echo "000")
+        if [[ "${http_status}" -eq 200 ]]; then
Evidence
Both the initial readiness check and the stabilization checks call curl without any explicit
timeouts; the || echo "000" only helps when curl exits, not when it blocks waiting on network/TCP
timeouts.

.ci/pipelines/lib/testing.sh[162-180]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Readiness polling can block indefinitely because curl has no connect or overall timeout.

### Issue Context
`max_attempts` is intended to bound the wait, but it doesn’t help if a single curl invocation hangs.

### Fix Focus Areas
- .ci/pipelines/lib/testing.sh[162-180]

### Implementation direction
- Add reasonable timeouts to both curl invocations, e.g.:
 - `--connect-timeout 5 --max-time 10`
- Keep the existing `|| echo "000"` fallback, but ensure hung calls terminate quickly so the loop progresses and `max_attempts` remains meaningful.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
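
A sketch of the suggested invocation; the timeout values are the reviewer's example numbers, not tuned values:

```bash
# Sketch only: bound both the TCP connect and the total request time so a hung
# connection cannot stall the polling loop; "000" is still returned on failure.
http_status=$(curl --insecure -s -o /dev/null -w "%{http_code}" \
  --connect-timeout 5 --max-time 10 \
  "${readiness_url}" || echo "000")
```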




@zdrapela
Member Author

/test e2e-eks-helm-nightly

@zdrapela
Member Author

/test e2e-aks-helm-nightly

@github-actions
Contributor

Image was built and published successfully. It is available at:

@codecov

codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.49%. Comparing base (b4b7910) to head (b7534ee).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4790       +/-   ##
===========================================
+ Coverage   40.88%   69.49%   +28.60%     
===========================================
  Files         119      109       -10     
  Lines        2228     4710     +2482     
  Branches      562      513       -49     
===========================================
+ Hits          911     3273     +2362     
- Misses       1311     1437      +126     
+ Partials        6        0        -6     
| Flag | Coverage Δ |
| --- | --- |
| install-dynamic-plugins | 92.44% <ø> (?) |
| rhdh | 38.65% <ø> (-2.24%) ⬇️ |

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4b7910...b7534ee. Read the comment docs.


@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Member Author

/test e2e-aks-helm-nightly

@zdrapela
Member Author

/test e2e-eks-helm-nightly

@zdrapela
Member Author

/test e2e-aks-helm-nightly

@github-actions
Contributor

Image was built and published successfully. It is available at:

@openshift-ci

openshift-ci Bot commented May 12, 2026

@zdrapela: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aks-helm-nightly | 45ff335 | link | false | /test e2e-aks-helm-nightly |
| ci/prow/e2e-eks-helm-nightly | 45ff335 | link | false | /test e2e-eks-helm-nightly |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

zdrapela added 2 commits May 12, 2026 12:30
Add SKIP_TESTS guard to the 3 jobs that bypassed it by calling
testing::run_tests directly (ocp-nightly runtime, ocp-operator runtime,
auth-providers).

Introduce DEPLOYMENT_TYPE env var (showcase, showcase-rbac, all) for K8s
jobs (AKS, EKS, GKE) to allow deploying only one deployment type and
keeping it alive for local test re-runs. When set to a single deployment,
namespace cleanup and DNS cleanup are skipped.

Add -d/--deployment CLI flag to local-run.sh with interactive prompts
for K8s jobs. Update local-test-setup.sh to accept showcase-rbac as
primary argument (rbac kept as alias).

CI behavior is unchanged: DEPLOYMENT_TYPE defaults to 'all' when unset.

Assisted-by: OpenCode
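
For context, a hypothetical local invocation based on this commit message; the exact argument spelling is assumed, not verified against the script's help output:

```bash
# Sketch only: deploy a single deployment type for a K8s job and keep it alive
# for local test re-runs (namespace/DNS cleanup is skipped in this mode).
./local-run.sh -d showcase-rbac

# Equivalent via the environment variable; defaults to 'all' when unset.
export DEPLOYMENT_TYPE=showcase-rbac
```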
Update e2e-deploy-rhdh skill to document the new -d/--deployment flag
for K8s jobs and remove the OCP-only restriction from deploy-only mode.
Update e2e-parse-ci-failure skill to include -d flag in local-run.sh
command output. Update local-test-setup.sh argument references from
'rbac' to 'showcase-rbac' in reproduce-failure, verify-fix, and
diagnose-and-fix skills.

Assisted-by: OpenCode
zdrapela force-pushed the fix/k8s-nightly-crashloop-readiness branch from 45ff335 to ab88355 on May 12, 2026 at 12:30
The Helm chart (1.10-114-CI) enables lightspeed by default, which causes
two deployment failures on K8s platforms:

1. lightspeed-core sidecar crashes with sqlite3.OperationalError
   (attempt to write a readonly database), making the pod not ready
   (1/2 containers) and the ingress return 503
2. lightspeed-backend dynamic plugin triggers 'Zip bomb detected'
   error in the init container, preventing startup entirely

Disable the lightspeed sidecar (global.lightspeed.enabled: false) and
both sets of lightspeed plugin references (chart's registry.access and
catalog index's ghcr.io entries) in all K8s diff-values files
(AKS, EKS, GKE x showcase, showcase-rbac).

Assisted-by: OpenCode
zdrapela force-pushed the fix/k8s-nightly-crashloop-readiness branch from ab88355 to b7534ee on May 12, 2026 at 12:31
@sonarqubecloud

@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Contributor

Image was built and published successfully. It is available at:
