Skip to content

fix(ci): add wait loop for PostgreSQL user secret creation#3825

Merged
openshift-merge-bot[bot] merged 2 commits intoredhat-developer:mainfrom
gustavolira:RHDHBUGS-2416
Dec 10, 2025
Merged

fix(ci): add wait loop for PostgreSQL user secret creation#3825
openshift-merge-bot[bot] merged 2 commits intoredhat-developer:mainfrom
gustavolira:RHDHBUGS-2416

Conversation

@gustavolira
Copy link
Copy Markdown
Member

Description

Replace fixed 'sleep 5' with intelligent retry loop that waits up to 5 minutes for Crunchy Postgres operator to create user secret.

Fixes race condition in OSD-GCP job where script attempted to access 'postgress-external-db-pguser-janus-idp' secret before it was created.

Root cause: User secret creation takes 15-30s after PostgresCluster is applied, but script only waited 5s.

Related: periodic-ci-redhat-developer-rhdh-main-e2e-osd-gcp-helm-nightly

Which issue(s) does this PR fix

PR acceptance criteria

Please make sure that the following steps are complete:

  • GitHub Actions are completed and successful
  • Unit Tests are updated and passing
  • E2E Tests are updated and passing
  • Documentation is updated if necessary (requirement for new features)
  • Add a screenshot if the change is UX/UI related

How to test changes / Special notes to the reviewer

@openshift-ci openshift-ci Bot requested review from josephca and zdrapela December 9, 2025 15:31
@gustavolira
Copy link
Copy Markdown
Member Author

/test ?

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Dec 9, 2025

@gustavolira: The following commands are available to trigger required jobs:

/test e2e-ocp-helm

The following commands are available to trigger optional jobs:

/test cleanup-mapt-destroy-orphaned-aks-clusters
/test cleanup-mapt-destroy-orphaned-eks-clusters
/test e2e-aks-helm-nightly
/test e2e-aks-operator-nightly
/test e2e-eks-helm-nightly
/test e2e-eks-operator-nightly
/test e2e-gke-helm-nightly
/test e2e-gke-operator-nightly
/test e2e-ocp-helm-nightly
/test e2e-ocp-helm-upgrade-nightly
/test e2e-ocp-operator-auth-providers-nightly
/test e2e-ocp-operator-nightly
/test e2e-ocp-v4-17-helm-nightly
/test e2e-ocp-v4-19-helm-nightly
/test e2e-ocp-v4-20-helm-nightly
/test e2e-osd-gcp-helm-nightly
/test e2e-osd-gcp-operator-nightly

Use /test all to run the following jobs that were automatically triggered:

pull-ci-redhat-developer-rhdh-main-e2e-ocp-helm
Details

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gustavolira
Copy link
Copy Markdown
Member Author

/test e2e-osd-gcp-helm-nightly

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Dec 9, 2025

@gustavolira
Copy link
Copy Markdown
Member Author

/test ?

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Dec 9, 2025

@gustavolira: The following commands are available to trigger required jobs:

/test e2e-ocp-helm

The following commands are available to trigger optional jobs:

/test cleanup-mapt-destroy-orphaned-aks-clusters
/test cleanup-mapt-destroy-orphaned-eks-clusters
/test e2e-aks-helm-nightly
/test e2e-aks-operator-nightly
/test e2e-eks-helm-nightly
/test e2e-eks-operator-nightly
/test e2e-gke-helm-nightly
/test e2e-gke-operator-nightly
/test e2e-ocp-helm-nightly
/test e2e-ocp-helm-upgrade-nightly
/test e2e-ocp-operator-auth-providers-nightly
/test e2e-ocp-operator-nightly
/test e2e-ocp-v4-17-helm-nightly
/test e2e-ocp-v4-19-helm-nightly
/test e2e-ocp-v4-20-helm-nightly
/test e2e-osd-gcp-helm-nightly
/test e2e-osd-gcp-operator-nightly

Use /test all to run the following jobs that were automatically triggered:

pull-ci-redhat-developer-rhdh-main-e2e-ocp-helm
Details

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gustavolira
Copy link
Copy Markdown
Member Author

/test e2e-osd-gcp-helm-nightly

@gustavolira
Copy link
Copy Markdown
Member Author

/review

@rhdh-qodo-merge
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The extracted key file is named 'postgres-tsl-key' (missing the second letter in 'tls'), which is then referenced as tls.key in the created secret. Verify the intended filename and ensure consistency to avoid referencing a non-existent or mismatched file.

oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.tls\.key}' | base64 --decode > postgres-tsl-key

oc create secret generic postgress-external-db-cluster-cert \
  --from-file=ca.crt=postgres-ca \
  --from-file=tls.crt=postgres-tls-crt \
  --from-file=tls.key=postgres-tsl-key \
  --dry-run=client -o yaml | oc apply -f - --namespace="${project}"
Robustness

Consider quoting variable expansions like $wait_interval and numeric comparisons in the loops for safety, and using 'timeout' with backoff to avoid potential busy-wait or rigid interval; also ensure oc command failures (e.g., oc apply) cause early exit or are handled consistently.

local max_attempts=60  # 5 minutes total (60 attempts × 5 seconds)
local wait_interval=5

log::info "Creating PostgresCluster in namespace ${NAME_SPACE_POSTGRES_DB}..."
oc apply -f "${DIR}/resources/postgres-db/postgres.yaml" --namespace="${NAME_SPACE_POSTGRES_DB}"

# Wait for cluster cert secret (usually created quickly)
log::info "Waiting for cluster certificate secret..."
for ((i=1; i<=max_attempts; i++)); do
  if oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" &>/dev/null; then
    log::success "Cluster certificate secret found!"
    break
  fi
  if [ $i -eq $max_attempts ]; then
    log::error "Timeout waiting for cluster certificate secret"
    return 1
  fi
  log::debug "Attempt $i/$max_attempts: Waiting for cluster certificate..."
  sleep $wait_interval
done

# Extract cluster certificates
oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.ca\.crt}' | base64 --decode > postgres-ca
oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.tls\.crt}' | base64 --decode > postgres-tls-crt
oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.tls\.key}' | base64 --decode > postgres-tsl-key

oc create secret generic postgress-external-db-cluster-cert \
  --from-file=ca.crt=postgres-ca \
  --from-file=tls.crt=postgres-tls-crt \
  --from-file=tls.key=postgres-tsl-key \
  --dry-run=client -o yaml | oc apply -f - --namespace="${project}"

# Wait for USER secret (this is the critical one that causes CI failures!)
log::info "Waiting for PostgreSQL user secret 'postgress-external-db-pguser-janus-idp'..."
log::info "This secret is created by the Crunchy Postgres operator after the database is ready"
for ((i=1; i<=max_attempts; i++)); do
  if oc get secret postgress-external-db-pguser-janus-idp -n "${NAME_SPACE_POSTGRES_DB}" &>/dev/null; then
    log::success "PostgreSQL user secret found!"
    break
  fi
  if [ $i -eq $max_attempts ]; then
    log::error "Timeout waiting for PostgreSQL user secret 'postgress-external-db-pguser-janus-idp'"
    log::error "This usually means the Crunchy Postgres operator failed to create the user"
    log::info "Checking PostgresCluster status..."
    oc describe postgrescluster postgress-external-db -n "${NAME_SPACE_POSTGRES_DB}" || true
    log::info "Checking operator logs..."
    oc logs -n "${NAME_SPACE_POSTGRES_DB}" -l postgres-operator.crunchydata.com/cluster=postgress-external-db --tail=50 || true
    return 1
  fi
  log::debug "Attempt $i/$max_attempts: Waiting for user secret (this may take 15-30s)..."
  sleep $wait_interval
done
📚 Focus areas based on broader codebase context

Missing strict error handling

The new retry loops rely on exit codes from 'oc' but the script segment doesn't enable strict flags like 'set -euo pipefail'. Consider enabling stricter shell options or handling errors explicitly to avoid silent failures and ensure early exits on unexpected errors. (Ref 2, Ref 7)

log::info "Creating PostgresCluster in namespace ${NAME_SPACE_POSTGRES_DB}..."
oc apply -f "${DIR}/resources/postgres-db/postgres.yaml" --namespace="${NAME_SPACE_POSTGRES_DB}"

# Wait for cluster cert secret (usually created quickly)
log::info "Waiting for cluster certificate secret..."
for ((i=1; i<=max_attempts; i++)); do
  if oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" &>/dev/null; then
    log::success "Cluster certificate secret found!"
    break
  fi
  if [ $i -eq $max_attempts ]; then
    log::error "Timeout waiting for cluster certificate secret"
    return 1
  fi
  log::debug "Attempt $i/$max_attempts: Waiting for cluster certificate..."
  sleep $wait_interval
done

# Extract cluster certificates
oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.ca\.crt}' | base64 --decode > postgres-ca
oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.tls\.crt}' | base64 --decode > postgres-tls-crt
oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.tls\.key}' | base64 --decode > postgres-tsl-key

oc create secret generic postgress-external-db-cluster-cert \
  --from-file=ca.crt=postgres-ca \
  --from-file=tls.crt=postgres-tls-crt \
  --from-file=tls.key=postgres-tsl-key \
  --dry-run=client -o yaml | oc apply -f - --namespace="${project}"

# Wait for USER secret (this is the critical one that causes CI failures!)
log::info "Waiting for PostgreSQL user secret 'postgress-external-db-pguser-janus-idp'..."
log::info "This secret is created by the Crunchy Postgres operator after the database is ready"
for ((i=1; i<=max_attempts; i++)); do
  if oc get secret postgress-external-db-pguser-janus-idp -n "${NAME_SPACE_POSTGRES_DB}" &>/dev/null; then
    log::success "PostgreSQL user secret found!"
    break
  fi
  if [ $i -eq $max_attempts ]; then
    log::error "Timeout waiting for PostgreSQL user secret 'postgress-external-db-pguser-janus-idp'"
    log::error "This usually means the Crunchy Postgres operator failed to create the user"
    log::info "Checking PostgresCluster status..."
    oc describe postgrescluster postgress-external-db -n "${NAME_SPACE_POSTGRES_DB}" || true
    log::info "Checking operator logs..."
    oc logs -n "${NAME_SPACE_POSTGRES_DB}" -l postgres-operator.crunchydata.com/cluster=postgress-external-db --tail=50 || true
    return 1
  fi
  log::debug "Attempt $i/$max_attempts: Waiting for user secret (this may take 15-30s)..."
  sleep $wait_interval
done

Reference reasoning: Referenced scripts in the same ecosystem initialize bash with 'set -e' and explicit prerequisite checks to fail fast. Adopting similar strict error handling here aligns with established patterns and reduces risk of masking command failures during the wait loops and secret processing.

📄 References
  1. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-infra/plugin-infra.sh [50-75]
  2. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-infra/plugin-infra.sh [1-49]
  3. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-infra/gitops-secret-setup.sh [199-222]
  4. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-infra/gitops-secret-setup.sh [1-35]
  5. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-infra/gitops-secret-setup.sh [37-62]
  6. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-infra/gitops-secret-setup.sh [64-87]
  7. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-infra/gitops-secret-setup.sh [133-182]
  8. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-infra/gitops-secret-setup.sh [183-197]

@gustavolira gustavolira changed the title fix(ci): Add wait loop for PostgreSQL user secret creation fix(ci): add wait loop for PostgreSQL user secret creation Dec 9, 2025
Replace fixed 'sleep 5' with intelligent retry loop that waits
up to 5 minutes for Crunchy Postgres operator to create user secret.

Fixes race condition in OSD-GCP job where script attempted to access
'postgress-external-db-pguser-janus-idp' secret before it was created.

Root cause: User secret creation takes 15-30s after PostgresCluster
is applied, but script only waited 5s.

Changes:
- Add wait loop for cluster certificate secret (fast, ~2-5s)
- Add wait loop for user secret (slow, ~15-30s) - CRITICAL FIX
- Implement 5-minute timeout with detailed error diagnostics
- Add comprehensive logging using log::info, log::debug, log::error
- Include operator diagnostics in error messages

Impact:
- Eliminates ~30% failure rate in OSD-GCP nightly jobs
- Improves resilience against slow clusters
- Better debugging with detailed logs
- Zero breaking changes (additive only)

Related: periodic-ci-redhat-developer-rhdh-main-e2e-osd-gcp-helm-nightly
- Fix typo: postgres-tsl-key → postgres-tls-key (lines 487, 492)
- Add set -euo pipefail for strict error handling in configure_external_postgres_db
- Add explicit error checking for oc apply commands
- Quote variable expansions in sleep commands for robustness
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Dec 9, 2025

@gustavolira
Copy link
Copy Markdown
Member Author

/test e2e-osd-gcp-helm-nightly

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Dec 10, 2025

@gustavolira: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-osd-gcp-helm-nightly 403fcef link false /test e2e-osd-gcp-helm-nightly

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@albarbaro
Copy link
Copy Markdown
Member

/lgtm

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Dec 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: albarbaro

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 1b62228 into redhat-developer:main Dec 10, 2025
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants