Installer: verify readiness & credentials before reporting success; RHEL-family support #171

Open
LukasWodka wants to merge 4 commits into develop from fix/installer-hardening

@LukasWodka
Contributor

Summary

Running the curl/irm installer on real Ubuntu 24.04, AlmaLinux 9, and CentOS Stream 9 VMs and on macOS (and tracing the Windows path) surfaced two classes of problems for the people who actually run it, clinical researchers and hospital IT:

  1. It reported success before the client was actually up. helm upgrade --install runs with no --wait, verify_cluster only logged, and print_summary printed "installed successfully — your data never leaves" unconditionally — while pods were still ContainerCreating/CrashLoopBackOff. Wrong credentials produced the same green screen, because credentials were never validated.
  2. It failed outright on the entire RHEL family (the enterprise-Linux standard in hospitals): the Docker install died on Alma/Rocky/Oracle (get.docker.com rejects them), and the k3d install died on every RHEL-family distro, including CentOS and RHEL (sudo secure_path omits /usr/local/bin). Separately, the conntrack package name was wrong on Debian/Ubuntu.

This PR makes the installer claim success only when the client is verifiably connected, validate credentials at entry, and actually complete on Ubuntu, RHEL-family distros, macOS, and Windows.

Closes #716, #717, #718, #719, #720.

Changes

Verify before reporting — bash + PowerShell (#716, #717)

  • Credentials entered at the prompt are validated against the backend api-token-auth/ endpoint, the same call jobs-manager makes (client-runtime/controller.py), using the already-present curl (Invoke-WebRequest on Windows) with a re-prompt loop. A wrong Client ID/password is caught immediately instead of after a full deploy. The backend is resolved per CLIENT_ENV (dev/stg/prod).
  • After helm applies, a readiness gate (wait_for_client_ready / Wait-ForClientReady) waits on kubectl rollout status and classifies the outcome; the summary reports connected / starting / bad_creds / image_pull / crash. The "secure compute environment / your data never leaves" line prints only when verifiably connected. The exit code now reflects reality.
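
A minimal sketch of how the state-branched summary and exit code could fit together. Only the five state names come from this PR; the helper names (`classify_waiting_reason`, `summary_line`, `exit_code_for`), the message wording, and the choice to return non-zero for `starting` are assumptions for illustration. Distinguishing `bad_creds` from a generic `crash` would need pod logs and is not shown.

```shell
# Hypothetical helper: map a Kubernetes container waiting reason to a state.
# The reasons are standard kubelet values; the mapping itself is an assumption.
classify_waiting_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)     echo image_pull ;;
    CrashLoopBackOff)                  echo crash ;;
    ContainerCreating|PodInitializing) echo starting ;;
    *)                                 echo starting ;;
  esac
}

# Hypothetical helper: the trust claim is printed only for "connected".
summary_line() {
  case "$1" in
    connected)  echo "Client connected. Secure compute environment: your data never leaves." ;;
    starting)   echo "Client deployed but still starting; re-check pod status shortly." ;;
    bad_creds)  echo "Client is crash-looping: the backend rejected the credentials." ;;
    image_pull) echo "Client image could not be pulled; check registry access." ;;
    crash)      echo "Client is crash-looping; inspect the pod logs." ;;
    *)          echo "Unknown state: $1" ;;
  esac
}

# Hypothetical helper: only a verified connection counts as success.
exit_code_for() {
  case "$1" in
    connected) echo 0 ;;
    *)         echo 1 ;;
  esac
}
```

Keeping the state as a single token makes it easy to test the summary and the exit code separately from the cluster polling.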

RHEL-family support — bash (#718, #719, #720)

  • install_docker_engine: install docker-ce from the official Docker CentOS dnf repo on AlmaLinux/Rocky/Oracle (get.docker.com rejects these as "Unsupported distribution").
  • install_k3d: run the k3d script via sudo env "PATH=$PATH" so its post-install command -v k3d survives RHEL secure_path (which omits /usr/local/bin).
  • install_system_deps: use the conntrack package on Debian/Ubuntu, conntrack-tools elsewhere.
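
The distro branching above can be sketched as a pair of lookups keyed on the `ID` field of `/etc/os-release`. The function names and the `dnf-repo` / `get-docker` labels are hypothetical; the `ID` values (`almalinux`, `rocky`, `ol`) are the ones those distributions ship, and the package names are the ones named in this PR.

```shell
# Read the distro ID from an os-release style file (ID=ubuntu, ID="almalinux", ...).
# The path is a parameter so the function can be tried against a sample file.
distro_id() {
  ( . "$1" && printf '%s\n' "$ID" )
}

# conntrack on Debian/Ubuntu, conntrack-tools elsewhere (as described in the PR).
conntrack_package() {
  case "$1" in
    debian|ubuntu) echo conntrack ;;
    *)             echo conntrack-tools ;;
  esac
}

# Hypothetical labels: which Docker install path a distro would take.
docker_install_method() {
  case "$1" in
    almalinux|rocky|ol) echo dnf-repo ;;    # rejected by get.docker.com
    *)                  echo get-docker ;;
  esac
}
```

On a real host this would be called as `conntrack_package "$(distro_id /etc/os-release)"`.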

Type

Bug fix + reliability. No chart changes.

Test plan — verified this session

#    | Verified
#717 | _backend_url maps prod/dev/stg; live backend rejects bad creds → invalid
#716 | all 5 summary states render (trust claim only on connected); _diagnose_not_ready returns bad_creds against a real crash-looping release (jobs-manager, 45 restarts) ✅
#718 | on CentOS Stream 9 (secure_path without /usr/local/bin), the new method installs k3d v5.8.3 where stock sudo bash failed ✅
#719 | AlmaLinux 9 ID="almalinux" matches the clone branch; the dnf-repo method installs Docker ✅
#720 | conntrack is a valid apt package; conntrack-tools is not ✅
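
For readers unfamiliar with the #717 check, here is a rough sketch of the idea, not the installer's code. The host names are placeholders (the real backend URLs are not given in this PR), the `username`/`password` form fields are an assumption about the api-token-auth/ endpoint, and so is the mapping of HTTP status codes to `valid` / `invalid` / `unreachable`.

```shell
# Placeholder hosts: the real per-environment backend URLs are not in this PR.
backend_url_sketch() {
  case "$1" in
    prod) echo "https://backend.example.invalid" ;;
    stg)  echo "https://backend-stg.example.invalid" ;;
    dev)  echo "https://backend-dev.example.invalid" ;;
    *)    return 1 ;;
  esac
}

# Assumed mapping from HTTP status to a verdict.
classify_auth_status() {
  case "$1" in
    200)         echo valid ;;
    400|401|403) echo invalid ;;
    *)           echo unreachable ;;   # includes curl's 000 for "no response"
  esac
}

# $1 = environment, $2 = client ID, $3 = password (field names are assumed).
check_credentials() {
  base=$(backend_url_sketch "$1") || { echo unreachable; return; }
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    --data-urlencode "username=$2" --data-urlencode "password=$3" \
    "$base/api-token-auth/") || code=000
  classify_auth_status "$code"
}
```

A re-prompt loop would call `check_credentials` until it prints `valid`, and treat `unreachable` differently from `invalid` so that a network failure is not reported as a wrong password.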

bash -n clean on all scripts; install-k8s.ps1 parses clean under PowerShell 7.4.

⚠️ Needs a Windows reviewer

The PowerShell changes (install-k8s.ps1) mirror the bash logic and pass the PS parser, but I could not execute them (no Windows host here). Please validate on Windows + WSL2 (admin PowerShell): the credential re-prompt loop, the readiness gate, and the state-branched summary.

Follow-ups (out of scope)

Moving credential collection ahead of the cluster build (so bad credentials fail in ~2s), plus the broader items from the review: cluster-admin auto-upgrade RBAC, egress allowlist docs, and an OVA/air-gapped appliance.

🤖 Generated with Claude Code

LukasWodka and others added 3 commits June 1, 2026 15:19
Docker: install docker-ce from the official Docker CentOS dnf repo on AlmaLinux/Rocky/Oracle, which get.docker.com rejects as unsupported. k3d: preserve PATH through sudo so the post-install lookup survives RHEL secure_path (which omits /usr/local/bin). conntrack: use the conntrack apt package on Debian/Ubuntu and conntrack-tools elsewhere.
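
The PATH problem can be reproduced without sudo or a RHEL host. The sketch below simulates secure_path by starting a child shell with a restricted PATH, then repeats the lookup with the caller's directory passed through, which is the effect of `sudo env "PATH=$PATH"`. The temporary directory and the `demo-tool` name are hypothetical stand-ins for /usr/local/bin and k3d.

```shell
# Stand-in for /usr/local/bin holding a freshly "installed" tool.
bindir=$(mktemp -d)
trap 'rm -rf "$bindir"' EXIT
printf '#!/bin/sh\necho ok\n' > "$bindir/demo-tool"
chmod +x "$bindir/demo-tool"

# Simulates sudo's secure_path: the child shell's PATH omits $bindir.
lookup_restricted() {
  env PATH="/usr/bin:/bin" /bin/sh -c 'command -v demo-tool' >/dev/null 2>&1
}

# Simulates `sudo env "PATH=$PATH" ...`: the install directory stays on PATH.
lookup_preserved() {
  env PATH="$bindir:/usr/bin:/bin" /bin/sh -c 'command -v demo-tool' >/dev/null 2>&1
}

if lookup_restricted; then echo "restricted: found"; else echo "restricted: not found"; fi
if lookup_preserved;  then echo "preserved: found";  else echo "preserved: not found";  fi
```

The tool is installed in both cases; only the lookup differs, which is why the stock `sudo bash` run reported a failure after a successful copy.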

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ccess (#716, #717)

Credentials entered at the prompt are validated against the backend api-token-auth endpoint (the same call jobs-manager makes) with a re-prompt loop, so a wrong Client ID or password is caught immediately instead of after a full deploy. After helm apply, wait_for_client_ready polls rollout status and classifies the outcome; print_summary reports connected, starting, bad_creds, image_pull or crash, and prints the data-never-leaves message only when the client is verifiably connected. Exit code now reflects the real outcome.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…716, #717)

Test-Credentials, Wait-ForClientReady and Get-NotReadyState mirror the bash logic in install-k8s.ps1, and Print-Summary is now state-branched. Validated with the PowerShell 7.4 parser; runtime behavior still needs a check on a Windows host.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka
Contributor Author

👋 Heads-up — Code review queue is at 14 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

scripts/tests/: 64 bats tests (summary, install-client-helm, setup-linux, common) + 32 Pester tests for install-k8s.ps1 — all mocked, no Docker/k3d/network needed. Changed-line coverage measured with kcov (bash 96.2%) and Pester (PowerShell 97.4%); residual lines are the real RHEL Docker-install commands + the guarded main() orchestration, exercised by the integration E2E. A TB_PESTER guard lets the suite dot-source install-k8s.ps1 without running the installer. New installer-tests job in helm-ci.yaml runs both suites on PRs (scripts/ added to path filters).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka
Contributor Author

Tests added (follow-up — commit 5e41bd3)

Added an automated test suite + a CI gate for the installer (the PR previously had only manual verification).

scripts/tests/ — 96 tests, all green:

  • 64 bats unit tests, fully mocked (no Docker/k3d/network): summary.bats, install-client-helm.bats, setup-linux.bats, common.bats.
  • 32 Pester tests for install-k8s.ps1 (mock Invoke-WebRequest / kubectl). A TB_PESTER env guard lets the suite dot-source the script to load functions without running the installer.
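
The same load-without-running idea, sketched in shell rather than PowerShell. This is an illustration of the guard pattern only: the `INSTALLER_TEST_MODE` variable and the tiny script written to a temp file are hypothetical, not the TB_PESTER guard or the real installer.

```shell
# Write a miniature "installer" with a guarded entry point to a temp file.
script=$(mktemp)
trap 'rm -f "$script"' EXIT
cat > "$script" <<'EOF'
greet() { echo "hello from $1"; }
main() { echo "installer running"; }
# Guard: skip orchestration when a test harness sets INSTALLER_TEST_MODE.
if [ -z "${INSTALLER_TEST_MODE:-}" ]; then
  main "$@"
fi
EOF

# Normal invocation: the guard is unset, so main runs.
sh "$script"

# Test harness: set the guard, source the file to load its functions,
# then call one function directly without triggering main.
INSTALLER_TEST_MODE=1
. "$script"
greet "unit test"
```

With the guard in place, the only lines a unit suite cannot reach are the ones inside the guarded call itself, which matches the residual coverage described for the orchestration code.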

Changed-line coverage (measured, not estimated):

  • bash 96.2% (kcov over the bats run). The residual lines are the RHEL docker-ce dnf commands inside the bash -c string (only execute during a real RHEL Docker install — validated live on AlmaLinux) and one curl multi-line accounting artifact.
  • PowerShell 97.4% (Pester JaCoCo). The residual 3 lines are the guarded main() orchestration (integration-only).
  • Whole-file coverage is lower on purpose: these files contain large untouched, system-mutating sections (Docker Desktop/WSL installers, the spinner, the macOS-only download path) that can't run in a Linux unit test — so the meaningful metric is changed-line coverage.

CI: new installer-tests job in helm-ci.yaml runs both suites on every PR (scripts/** added to the path filters). The full-installer integration E2E stays manual; the PowerShell still needs a Windows host to validate runtime behaviour.

LukasWodka added a commit that referenced this pull request Jun 1, 2026
… 0.0.0.0 detect, Windows parity)

A customer running behind an authenticated corporate HTTP proxy hit install
failures. The 0.0.0.0 kubeconfig headline was fixed for the bash path in
#166/#167, but adverse testing (a forward proxy + real k3d on Linux VMs)
surfaced three remaining gaps plus a Windows parity hole:

- Gap A: authenticated proxies (http://user:pass@host) were silently SKIPPED —
  k3d's --env KEY=VALUE@FILTER can't carry an '@' in the value. Now propagated
  via a k3d --config file (structured YAML env) so credentials survive intact.
  Verified on k3d v5.8.3 (it merges the --config env with the existing CLI flags).
- Gap B: NO_PROXY was propagated verbatim. Now auto-augmented with the
  cluster-internal ranges (loopback + RFC1918 + .svc/.cluster.local +
  host.k3d.internal), both into the cluster and host-side, so in-cluster traffic
  never routes through the proxy — fixes the misroute AND the observed
  `k3d cluster create --wait` hang.
- Gap C: a cluster created outside the installer and bound to 0.0.0.0 is now
  detected (serverlb HostIp) and flagged with a non-destructive recreate remedy.
- Windows parity: install-k8s.ps1::New-K3dCluster had NONE of the bash fixes —
  it still bound --api-port 0.0.0.0:6550 (the original headline bug, still live
  on Windows), normalized only host.docker.internal in the kubeconfig, and
  propagated zero proxy env. Now mirrors bash: 127.0.0.1:6550, a 0.0.0.0->127.0.0.1
  kubeconfig rewrite, and Get-EffectiveNoProxy + Write-K3dProxyConfig (auth +
  augmented NO_PROXY, written UTF-8 without a BOM).
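
A sketch of the Gap B augmentation, assuming a simple append-and-deduplicate approach. The function name `effective_no_proxy` and the exact spelling of each entry are assumptions; the categories (loopback, RFC1918, .svc/.cluster.local, host.k3d.internal) are the ones listed above.

```shell
# Cluster-internal destinations that should never go through the proxy.
INTERNAL_NO_PROXY="localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,host.k3d.internal"

# Append the internal entries to the user's NO_PROXY, dropping empty items
# and duplicates while keeping the user's entries first.
effective_no_proxy() {
  printf '%s,%s' "$1" "$INTERNAL_NO_PROXY" |
    tr ',' '\n' |
    awk 'NF && !seen[$0]++' |
    paste -sd, -
}
```

For example, `effective_no_proxy "corp.example,localhost"` keeps `corp.example` and the existing `localhost`, then adds the remaining internal entries once each.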

Tests: new scripts/tests/cluster.bats (15) + Pester for the two ps1 helpers (6),
both green. Verified end-to-end on Linux VMs: auth creds propagated into the node,
no startup hang behind an unreachable proxy, and 0.0.0.0 detection firing.

Stacked on #171 (the installer test scaffolding + final install-k8s.ps1 live there).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LukasWodka added a commit that referenced this pull request Jun 1, 2026
…eal preflight probes)

Measured changed-line coverage of the stack had dropped (bash ~84% vs #171's
~96%) because the new code added integration-only branches the mocked unit
suites skipped. Recover the unit-testable portion:
- diagnose.bats: exercise the kubectl/docker/helm collection path (has()=true +
  mocked tools) -> diagnose.sh 64% -> 90%.
- preflight.bats: test the REAL _pf_probe_url curl-exit-code -> token mapping,
  the missing-curl path, and the _pf_ncpu/_pf_total_mem_kb/_pf_free_kb readers
  (re-sourced past the setup stubs) -> preflight.sh 79% -> 90%.
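
A sketch of what a curl-exit-code to token mapping can look like. The exit codes and their meanings are curl's documented ones; the token names and the `probe_token` / `probe_url` helpers are hypothetical and may differ from what _pf_probe_url actually returns.

```shell
# Map a curl exit code to a short diagnostic token (token names are assumed).
probe_token() {
  case "$1" in
    0)     echo ok ;;
    5|6)   echo dns ;;       # 5: could not resolve proxy, 6: could not resolve host
    7)     echo refused ;;   # failed to connect to host
    28)    echo timeout ;;   # operation timed out
    35|60) echo tls ;;       # 35: TLS connect error, 60: certificate not trusted
    *)     echo error ;;
  esac
}

# Probe a URL and report a token, including the missing-curl case.
probe_url() {
  if ! command -v curl >/dev/null 2>&1; then
    echo no_curl
    return
  fi
  curl -s -o /dev/null --max-time 5 "$1"
  probe_token "$?"
}
```

Splitting the mapping from the network call is what lets a unit test cover the mapping with plain integers and no connectivity.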

Bash changed-line coverage: 83.6% -> 92.3% (kcov, 383/415). The residual ~8% is
integration-only (real k3d/docker create + macOS/Windows-specific branches +
MAIN orchestration), validated by the live VM E2Es (reboot recovery, auth-proxy,
preflight blocked-egress/arm64, diagnose redaction grep).
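
As a quick check of the arithmetic, the percentage follows directly from the kcov line counts quoted above. The helper name `coverage_pct` is hypothetical.

```shell
# Percentage of changed lines covered, rounded to one decimal place.
# $1 = covered lines, $2 = total changed lines.
coverage_pct() {
  awk -v c="$1" -v t="$2" 'BEGIN { if (t == 0) { print "n/a"; exit } printf "%.1f\n", 100 * c / t }'
}

coverage_pct 383 415   # prints 92.3
```

The remaining 32 of 415 lines are the integration-only portion, about 7.7%.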

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2 participants