fix(ci): improve K8s nightly stability - readiness check and CrashLoopBackOff detection #4790

Draft

zdrapela wants to merge 3 commits into redhat-developer:main from zdrapela:fix/k8s-nightly-crashloop-readiness

Conversation

@zdrapela
Member

Summary

Fixes the root causes of consistent AKS/EKS/GKE nightly E2E job failures (100% failure rate across all 3 K8s platforms for the last 2+ weeks).

Root Causes Identified

  1. RBAC phase aborted by CrashLoopBackOff fast-fail: The lightspeed-core sidecar (shipped in chart 1.10-114-CI with lightspeed-stack:0.5.0) crashes on all platforms including GKE. On GKE, the backstage-backend happens to become ready before the detection fires, so tests run. On AKS/EKS, the detection fires first and aborts the deployment — 0 RBAC tests ever ran.

  2. Showcase phase test failures (guest sign-in / 503s): The CI health check used curl -I <root_url> (HEAD to the frontend), which returns 200 as soon as the ingress serves the SPA — before the backend API (including auth) is initialized. Tests started too early and hit 503s on /api/auth/guest/refresh. Additionally, the ALB/nginx ingress intermittently returns 502 during backend health propagation even after the first successful response.

Changes

  • Readiness endpoint: Use /.backstage/health/v1/readiness instead of the root URL. This endpoint returns 503 until all backend plugins (including auth) complete initialization.
  • Stabilization check: Require 3 consecutive successful readiness responses (at 5s intervals) before declaring Backstage ready, preventing ingress propagation race conditions (see the sketch after this list).
  • CrashLoopBackOff exclusion: Exclude lightspeed-core sidecar from the fast-fail detection. The sidecar is non-essential for E2E tests (GKE proves this — all 63 RBAC tests pass with lightspeed-core in CrashLoopBackOff). The fallback check is narrowed to Init:CrashLoopBackOff only (still catches init container crashes like the install-dynamic-plugins zip bomb).
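
For illustration, a minimal sketch of the readiness-plus-stabilization loop described above. Variable names, the attempt count, and the sleep intervals are placeholders (`BASE_URL` stands in for the ingress root URL); the actual implementation lives in `.ci/pipelines/lib/testing.sh`.

```bash
# Sketch only: poll the Backstage readiness endpoint and require several
# consecutive 200s before declaring the instance ready. Values are illustrative.
readiness_url="${BASE_URL}/.backstage/health/v1/readiness"
required_ok=3        # consecutive 200 responses required
stabilize_wait=5     # seconds between stabilization checks
consecutive_ok=0

for attempt in $(seq 1 60); do
  http_status=$(curl --insecure -s -o /dev/null -w "%{http_code}" "${readiness_url}" || echo "000")
  echo "Readiness attempt ${attempt}: HTTP ${http_status}"
  if [[ "${http_status}" -eq 200 ]]; then
    consecutive_ok=$((consecutive_ok + 1))
    if [[ ${consecutive_ok} -ge ${required_ok} ]]; then
      echo "Backstage is up and running!"
      exit 0
    fi
    sleep "${stabilize_wait}"
  else
    consecutive_ok=0   # any 502/503 during propagation resets the streak
    sleep 10
  fi
done
echo "Backstage did not become ready in time" >&2
exit 1
```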

Verified on EKS

| Test | Before | After |
| --- | --- | --- |
| RBAC deployment | Aborted (0 tests ran) | Pod recovered, 62 tests started |
| Instance health check | Failed (HTML error) | Passed (576ms) |
| Guest sign-in | Failed | Passed (10.9s) |
| Settings page | Failed | Passed |
| Home page customization | Failed | Passed |

@openshift-ci

openshift-ci Bot commented May 11, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@zdrapela
Member Author

/qodo

@zdrapela
Member Author

/test e2e-eks-helm-nightly

@zdrapela
Member Author

/test e2e-aks-helm-nightly

@zdrapela
Member Author

/agentic_review

@rhdh-qodo-merge

rhdh-qodo-merge Bot commented May 11, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)



Action required

1. Crashloop filter hides failures 🐞 Bug ☼ Reliability
Description
CrashLoopBackOff detection filters out any pod line containing the string "lightspeed-core", which
can hide CrashLoopBackOff for essential containers if the pod also has a lightspeed-core sidecar.
This defeats the intended “fail fast on essential pod crashes” behavior and can cause the job to
wait until the overall timeout even though the backend/postgresql is already crash-looping.
Code

.ci/pipelines/lib/testing.sh[R205-209]

      crash_pods=$(oc get pods -n "${namespace}" -l "app.kubernetes.io/instance in (${release_name},redhat-developer-hub,developer-hub,${release_name}-postgresql)" \
-        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{range .status.containerStatuses[*]}{.state.waiting.reason}{end}{range .status.initContainerStatuses[*]}{.state.waiting.reason}{end}{"\n"}{end}' 2> /dev/null | grep -E "CrashLoopBackOff" || true)
+        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{range .status.containerStatuses[*]}{.containerID}{"|"}{.name}{"|"}{.state.waiting.reason}{","}{end}{range .status.initContainerStatuses[*]}{.containerID}{"|"}{.name}{"|"}{.state.waiting.reason}{","}{end}{"\n"}{end}' 2> /dev/null \
+        | grep -E "CrashLoopBackOff" \
+        | grep -v "lightspeed-core" \
+        || true)
Evidence
The jsonpath prints *all* container names for each pod on a single line, then `grep -v
"lightspeed-core"` drops the *entire pod line* (not just the crashing container entry). If any
essential container in that pod is CrashLoopBackOff, the line still contains "lightspeed-core"
(because .name is always printed), so the pod is excluded and the crash is missed.

.ci/pipelines/lib/testing.sh[200-214]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
CrashLoopBackOff detection currently excludes any *pod* line that contains `lightspeed-core`, which can hide CrashLoopBackOff for other containers in the same pod.

### Issue Context
The jsonpath output is formatted as one line per pod and includes every container name (`{.name}`) regardless of state, so `grep -v lightspeed-core` is an unsafe filter.

### Fix Focus Areas
- .ci/pipelines/lib/testing.sh[200-214]

### Implementation direction
- Change the pipeline to filter CrashLoopBackOff at the **container entry** level, not the whole pod line.
- Options:
 - Use `-o json` + `jq` to emit only crashing container statuses and exclude only `.name=="lightspeed-core"`.
 - Or adjust the jsonpath to emit one line per container (pod + container), then filter out only `lightspeed-core` entries while preserving other containers from the same pod.
- Ensure that if `backstage-backend`/`postgresql` containers crash-loop, they are still detected even when a non-essential sidecar exists in the same pod.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
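
A rough sketch of the jq-based option mentioned in the implementation direction above. The `${namespace}` and `${release_name}` variables are reused from the existing snippet; the output format and the overall approach are illustrative, not the final fix.

```bash
# Sketch only: emit one "pod container reason" line per crash-looping container,
# then drop only the lightspeed-core entries so other containers in the same pod
# are still detected.
crash_pods=$(oc get pods -n "${namespace}" \
  -l "app.kubernetes.io/instance in (${release_name},redhat-developer-hub,developer-hub,${release_name}-postgresql)" \
  -o json 2> /dev/null \
  | jq -r '.items[]
      | .metadata.name as $pod
      | ((.status.containerStatuses // []) + (.status.initContainerStatuses // []))[]
      | select(.state.waiting.reason == "CrashLoopBackOff" and .name != "lightspeed-core")
      | "\($pod) \(.name) \(.state.waiting.reason)"' \
  || true)
```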



Remediation recommended

2. Fallback misses crashloops 🐞 Bug ☼ Reliability
Description
The fallback CrashLoopBackOff detection (used when label selection yields no results) now only
matches Init:CrashLoopBackOff, so it can miss regular CrashLoopBackOff in pods not covered by
the primary label selector. This reduces the safety net for the exact cases the fallback is intended
to catch (e.g., pods with different labels/output formats).
Code

.ci/pipelines/lib/testing.sh[R210-214]

      # Also check by name pattern for postgresql pods that may have different labels
+      # and for init container CrashLoopBackOff (e.g., install-dynamic-plugins)
      if [[ -z "${crash_pods}" ]]; then
-        crash_pods=$(oc get pods -n "${namespace}" --no-headers 2> /dev/null | grep -E "(${release_name}|developer-hub|postgresql)" | grep -E "CrashLoopBackOff|Init:CrashLoopBackOff" || true)
+        crash_pods=$(oc get pods -n "${namespace}" --no-headers 2> /dev/null | grep -E "(${release_name}|developer-hub|postgresql)" | grep -E "Init:CrashLoopBackOff" || true)
      fi
Evidence
The code comment states the fallback is for pods that may have different labels, but the fallback
grep was narrowed to init crashloops only; a non-init CrashLoopBackOff in those pods will no longer
be detected by this path.

.ci/pipelines/lib/testing.sh[210-214]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The fallback crash detection only checks `Init:CrashLoopBackOff`, missing `CrashLoopBackOff` for regular containers when the primary label-based query doesn't return pods.

### Issue Context
The fallback exists specifically for pods that might not match the label selector; it should remain broad enough to detect both init and regular container crashloops.

### Fix Focus Areas
- .ci/pipelines/lib/testing.sh[210-214]

### Implementation direction
- Expand the fallback grep back to include both `CrashLoopBackOff` and `Init:CrashLoopBackOff`, **or** add a second fallback branch specifically for regular `CrashLoopBackOff`.
- If the narrowing was intentional for a specific false-positive case, document that and add an alternative reliable detection for regular container crashloops in the fallback scenario.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
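
A sketch of the first option (broadening the fallback back to both patterns), assuming the lightspeed-core exclusion is handled by the primary label-based check rather than here:

```bash
# Sketch only: fallback by name pattern, matching both regular and init
# container crashloops for pods that the label selector may have missed.
if [[ -z "${crash_pods}" ]]; then
  crash_pods=$(oc get pods -n "${namespace}" --no-headers 2> /dev/null \
    | grep -E "(${release_name}|developer-hub|postgresql)" \
    | grep -E "CrashLoopBackOff|Init:CrashLoopBackOff" \
    || true)
fi
```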


3. Curl polling can hang 🐞 Bug ☼ Reliability
Description
The readiness polling uses curl without --connect-timeout/--max-time, so a stalled connection
can hang indefinitely and bypass max_attempts. This risk is amplified because the new
stabilization logic performs multiple additional curl calls per successful attempt.
Code

.ci/pipelines/lib/testing.sh[R165-178]

+    http_status=$(curl --insecure -s -o /dev/null -w "%{http_code}" "${readiness_url}" || echo "000")

    if [[ "${http_status}" -eq 200 ]]; then
-      log::success "Backstage is up and running!"
-      return 0
+      # Require multiple consecutive successful checks to ensure stability.
+      # On K8s platforms, the ingress (ALB/nginx/GCE) may still be propagating
+      # the backend health status and intermittently returning 502/503 even
+      # after the first successful readiness response.
+      local consecutive_ok=1
+      local required_ok=3
+      local stabilize_wait=5
+      while [[ ${consecutive_ok} -lt ${required_ok} ]]; do
+        sleep "${stabilize_wait}"
+        http_status=$(curl --insecure -s -o /dev/null -w "%{http_code}" "${readiness_url}" || echo "000")
+        if [[ "${http_status}" -eq 200 ]]; then
Evidence
Both the initial readiness check and the stabilization checks call curl without any explicit
timeouts; the || echo "000" only helps when curl exits, not when it blocks waiting on network/TCP
timeouts.

.ci/pipelines/lib/testing.sh[162-180]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Readiness polling can block indefinitely because curl has no connect or overall timeout.

### Issue Context
`max_attempts` is intended to bound the wait, but it doesn’t help if a single curl invocation hangs.

### Fix Focus Areas
- .ci/pipelines/lib/testing.sh[162-180]

### Implementation direction
- Add reasonable timeouts to both curl invocations, e.g.:
 - `--connect-timeout 5 --max-time 10`
- Keep the existing `|| echo "000"` fallback, but ensure hung calls terminate quickly so the loop progresses and `max_attempts` remains meaningful.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
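
A sketch of the suggested invocation; the timeout values are the reviewer's example numbers, not tuned values:

```bash
# Sketch only: bound both the TCP connect and the total request time so a hung
# connection cannot stall the polling loop; "000" is still returned on failure.
http_status=$(curl --insecure -s -o /dev/null -w "%{http_code}" \
  --connect-timeout 5 --max-time 10 \
  "${readiness_url}" || echo "000")
```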




@zdrapela
Member Author

/test e2e-eks-helm-nightly

@zdrapela
Member Author

/test e2e-aks-helm-nightly

@github-actions
Contributor

Image was built and published successfully. It is available at:

@codecov

codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.49%. Comparing base (b4b7910) to head (b7534ee).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4790       +/-   ##
===========================================
+ Coverage   40.88%   69.49%   +28.60%     
===========================================
  Files         119      109       -10     
  Lines        2228     4710     +2482     
  Branches      562      513       -49     
===========================================
+ Hits          911     3273     +2362     
- Misses       1311     1437      +126     
+ Partials        6        0        -6     
| Flag | Coverage Δ |
| --- | --- |
| install-dynamic-plugins | 92.44% <ø> (?) |
| rhdh | 38.65% <ø> (-2.24%) ⬇️ |

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4b7910...b7534ee. Read the comment docs.


@github-actions
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Member Author

/test e2e-aks-helm-nightly

@zdrapela
Member Author

/test e2e-eks-helm-nightly

@zdrapela
Member Author

/test e2e-aks-helm-nightly

@github-actions
Contributor

Image was built and published successfully. It is available at:

@openshift-ci

openshift-ci Bot commented May 12, 2026

@zdrapela: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aks-helm-nightly | 45ff335 | link | false | /test e2e-aks-helm-nightly |
| ci/prow/e2e-eks-helm-nightly | 45ff335 | link | false | /test e2e-eks-helm-nightly |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

zdrapela added 2 commits May 12, 2026 12:30
Add SKIP_TESTS guard to the 3 jobs that bypassed it by calling
testing::run_tests directly (ocp-nightly runtime, ocp-operator runtime,
auth-providers).

Introduce DEPLOYMENT_TYPE env var (showcase, showcase-rbac, all) for K8s
jobs (AKS, EKS, GKE) to allow deploying only one deployment type and
keeping it alive for local test re-runs. When set to a single deployment,
namespace cleanup and DNS cleanup are skipped.

Add -d/--deployment CLI flag to local-run.sh with interactive prompts
for K8s jobs. Update local-test-setup.sh to accept showcase-rbac as
primary argument (rbac kept as alias).

CI behavior is unchanged: DEPLOYMENT_TYPE defaults to 'all' when unset.

Assisted-by: OpenCode
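
For context, a hypothetical local invocation based on this commit message; the exact argument spelling is assumed, not verified against the script's help output:

```bash
# Sketch only: deploy a single deployment type for a K8s job and keep it alive
# for local test re-runs (namespace/DNS cleanup is skipped in this mode).
./local-run.sh -d showcase-rbac

# Equivalent via the environment variable; defaults to 'all' when unset.
export DEPLOYMENT_TYPE=showcase-rbac
```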
Update e2e-deploy-rhdh skill to document the new -d/--deployment flag
for K8s jobs and remove the OCP-only restriction from deploy-only mode.
Update e2e-parse-ci-failure skill to include -d flag in local-run.sh
command output. Update local-test-setup.sh argument references from
'rbac' to 'showcase-rbac' in reproduce-failure, verify-fix, and
diagnose-and-fix skills.

Assisted-by: OpenCode
zdrapela force-pushed the fix/k8s-nightly-crashloop-readiness branch from 45ff335 to ab88355 on May 12, 2026 at 12:30
The Helm chart (1.10-114-CI) enables lightspeed by default, which causes
two deployment failures on K8s platforms:

1. lightspeed-core sidecar crashes with sqlite3.OperationalError
   (attempt to write a readonly database), making the pod not ready
   (1/2 containers) and the ingress return 503
2. lightspeed-backend dynamic plugin triggers 'Zip bomb detected'
   error in the init container, preventing startup entirely

Disable the lightspeed sidecar (global.lightspeed.enabled: false) and
both sets of lightspeed plugin references (chart's registry.access and
catalog index's ghcr.io entries) in all K8s diff-values files
(AKS, EKS, GKE x showcase, showcase-rbac).

Assisted-by: OpenCode
zdrapela force-pushed the fix/k8s-nightly-crashloop-readiness branch from ab88355 to b7534ee on May 12, 2026 at 12:31
@sonarqubecloud

@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Contributor

Image was built and published successfully. It is available at:
