Installer: verify readiness & credentials before reporting success; RHEL-family support by LukasWodka · Pull Request #171 · tracebloc/client

LukasWodka · 2026-06-01T13:20:31Z

Summary

Running the curl/irm installer on real Ubuntu 24.04 / AlmaLinux 9 / CentOS Stream 9 VMs + macOS (and tracing the Windows path) surfaced two classes of problems for the people who actually run it — clinical researchers and hospital IT:

It reported success before the client was actually up. helm upgrade --install runs with no --wait, verify_cluster only logged, and print_summary printed "installed successfully — your data never leaves" unconditionally — while pods were still ContainerCreating/CrashLoopBackOff. Wrong credentials produced the same green screen, because credentials were never validated.
It failed outright on the entire RHEL family (the enterprise-Linux standard in hospitals): the Docker install died on Alma/Rocky/Oracle (get.docker.com rejects them), and the k3d install died on every RHEL-family distro, including CentOS/RHEL (sudo secure_path omits /usr/local/bin). It also used a wrong conntrack package name on Debian/Ubuntu.

This PR makes the installer claim success only when the client is verifiably connected, verify credentials at entry, and actually complete on Ubuntu, RHEL-family, macOS, and Windows.

Closes #716, #717, #718, #719, #720.

Changes

Verify before reporting — bash + PowerShell (#716, #717)

Credentials entered at the prompt are validated against the backend api-token-auth/ endpoint — the same call jobs-manager makes (client-runtime/controller.py) — via the already-present curl (Invoke-WebRequest on Windows), with a re-prompt loop. A wrong Client ID/password is caught immediately instead of after a full deploy. Resolves the backend per CLIENT_ENV (dev/stg/prod).
After helm applies, a readiness gate (wait_for_client_ready / Wait-ForClientReady) waits on kubectl rollout status and classifies the outcome; the summary reports connected / starting / bad_creds / image_pull / crash. The "secure compute environment / your data never leaves" line prints only when verifiably connected. The exit code now reflects reality.
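
For reviewers, the credential check in the first item can be sketched roughly as below. This is a minimal sketch and not the code in this PR: the function names, the placeholder hostnames, the `username`/`password` form fields, and the choice to continue unverified when the backend is unreachable are all assumptions. Only the `api-token-auth/` path, the use of curl, the re-prompt loop, and the dev/stg/prod mapping come from the description above.

```bash
#!/usr/bin/env bash
# Sketch only: names, hosts and form fields are assumptions, not the PR's code.

# Map CLIENT_ENV to a backend base URL (placeholder hosts).
backend_url() {
  case "${1:-prod}" in
    dev)  echo "https://dev.backend.example" ;;
    stg)  echo "https://stg.backend.example" ;;
    prod) echo "https://backend.example" ;;
    *)    return 1 ;;
  esac
}

# Classify the HTTP status returned by the token endpoint.
classify_auth_status() {
  case "$1" in
    200)         echo "valid" ;;
    400|401|403) echo "invalid" ;;
    *)           echo "unreachable" ;;   # 000 (no response), 5xx, anything else
  esac
}

# POST the credentials and print only the HTTP status code.
# curl prints 000 when it gets no response; "|| true" keeps that output.
auth_status_code() {
  local url="$1" client_id="$2" password="$3"
  curl -s -o /dev/null -w '%{http_code}' --max-time 15 \
    --data-urlencode "username=${client_id}" \
    --data-urlencode "password=${password}" \
    "${url}/api-token-auth/" || true
}

# Prompt until the backend accepts the credentials.
prompt_credentials() {
  local base id pw result
  base="$(backend_url "${CLIENT_ENV:-prod}")" || return 2
  while true; do
    read -r -p "Client ID: " id
    read -r -s -p "Password: " pw; echo
    result="$(classify_auth_status "$(auth_status_code "$base" "$id" "$pw")")"
    case "$result" in
      valid)   return 0 ;;
      invalid) echo "Client ID or password rejected, please try again." >&2 ;;
      *)       echo "Could not reach the backend; continuing unverified." >&2
               return 3 ;;
    esac
  done
}
```

Keeping the status classification separate from the curl call is what makes the mapping unit-testable without a network.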

RHEL-family support — bash (#718, #719, #720)

install_docker_engine: install docker-ce from the official Docker CentOS dnf repo on AlmaLinux/Rocky/Oracle (get.docker.com rejects these as "Unsupported distribution").
install_k3d: run the k3d script via sudo env "PATH=$PATH" so its post-install command -v k3d survives RHEL secure_path (which omits /usr/local/bin).
install_system_deps: use the conntrack package on Debian/Ubuntu, conntrack-tools elsewhere.
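
The three branches above boil down to a lookup on the distro ID plus one sudo invocation. A minimal sketch, assuming hypothetical helper names (`os_id`, `docker_install_method`, `conntrack_package`, `run_with_caller_path`, `install_docker_from_dnf_repo`); the real functions are the ones named in the list, and the distro IDs used here are the standard `ID=` values from `/etc/os-release`:

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the branching follows the PR text.

# Print the ID field of an os-release file (default: /etc/os-release).
os_id() {
  # Subshell so the sourced variables do not leak into the caller.
  ( . "${1:-/etc/os-release}" && printf '%s\n' "${ID:-unknown}" )
}

# Choose how to install Docker for a given distro ID.
docker_install_method() {
  case "$1" in
    almalinux|rocky|ol) echo "dnf-repo" ;;        # rejected by get.docker.com
    *)                  echo "get.docker.com" ;;
  esac
}

# Choose the conntrack package name for a given distro ID.
conntrack_package() {
  case "$1" in
    debian|ubuntu) echo "conntrack" ;;
    *)             echo "conntrack-tools" ;;
  esac
}

# Install docker-ce from Docker's CentOS repo (needs dnf-plugins-core).
install_docker_from_dnf_repo() {
  sudo dnf -y install dnf-plugins-core
  sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
  sudo dnf -y install docker-ce docker-ce-cli containerd.io
}

# Run an installer script as root while keeping the caller's PATH, so a
# post-install "command -v" still finds binaries in /usr/local/bin even when
# sudo's secure_path omits that directory.
run_with_caller_path() {
  sudo env "PATH=$PATH" bash "$1"
}
```

For example, `docker_install_method "$(os_id)"` prints `dnf-repo` on AlmaLinux 9, whose os-release contains `ID="almalinux"`.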

Type

Bug fix + reliability. No chart changes.

Test plan — verified this session

| # | Verified |
|---|---|
| #717 | `_backend_url` maps prod/dev/stg; live backend rejects bad creds → `invalid` ✅ |
| #716 | all 5 summary states render (trust claim only on `connected`); `_diagnose_not_ready` returns `bad_creds` against a real crash-looping release (jobs-manager, 45 restarts) ✅ |
| #718 | on CentOS Stream 9 (`secure_path` without `/usr/local/bin`), the new method installs k3d v5.8.3 where stock `sudo bash` failed ✅ |
| #719 | AlmaLinux 9 `ID="almalinux"` matches the clone branch; the dnf-repo method installs Docker ✅ |
| #720 | `conntrack` is a valid apt package; `conntrack-tools` is not ✅ |
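
To make the five summary states concrete, here is a rough sketch of the readiness gate and the state-branched summary. It is an illustration under assumptions, not the PR's implementation: the log patterns used to detect `bad_creds`, the message wording, the exit codes, and treating a completed rollout as `connected` are all guesses; only the state names and the use of `kubectl rollout status` come from the description.

```bash
#!/usr/bin/env bash
# Sketch only: heuristics, messages and exit codes are assumptions.

# Classify a not-ready release.
# $1: newline-separated container waiting reasons, $2: recent log text.
diagnose_not_ready() {
  local reasons="$1" logs="${2:-}"
  if printf '%s\n' "$reasons" | grep -Eq 'ImagePullBackOff|ErrImagePull'; then
    echo "image_pull"
  elif printf '%s\n' "$reasons" | grep -q 'CrashLoopBackOff'; then
    # Assumed heuristic: a crash loop whose logs show an auth failure.
    if printf '%s\n' "$logs" | grep -Eiq 'unauthorized|invalid credentials'; then
      echo "bad_creds"
    else
      echo "crash"
    fi
  else
    echo "starting"
  fi
}

# Wait for the rollout, then print one of the five states.
wait_for_client_ready() {
  local ns="$1" deploy="$2" timeout="${3:-300s}" reasons logs
  if kubectl -n "$ns" rollout status "deployment/$deploy" \
       --timeout="$timeout" >/dev/null 2>&1; then
    echo "connected"
    return 0
  fi
  reasons="$(kubectl -n "$ns" get pods -o \
    jsonpath='{range .items[*].status.containerStatuses[*]}{.state.waiting.reason}{"\n"}{end}' \
    2>/dev/null || true)"
  logs="$(kubectl -n "$ns" logs "deployment/$deploy" --tail=50 2>/dev/null || true)"
  diagnose_not_ready "$reasons" "$logs"
}

# Print the summary; the trust claim appears only for "connected".
print_summary() {
  case "$1" in
    connected)  echo "Client connected."
                echo "Secure compute environment: your data never leaves." ;;
    starting)   echo "Client is still starting; check again shortly." ;;
    bad_creds)  echo "Client rejected by the backend: check Client ID and password." ;;
    image_pull) echo "Client image could not be pulled." ;;
    *)          echo "Client is crashing; inspect the pod logs." ;;
  esac
}

# Assumed mapping: only a verified connection exits 0.
state_exit_code() {
  case "$1" in
    connected) echo 0 ;;
    *)         echo 1 ;;
  esac
}
```

A caller would then run something like `state="$(wait_for_client_ready NAMESPACE DEPLOYMENT)"; print_summary "$state"; exit "$(state_exit_code "$state")"`, with the namespace and deployment names filled in.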

bash -n clean on all scripts; install-k8s.ps1 parses clean under PowerShell 7.4.

⚠️ Needs a Windows reviewer

The PowerShell changes (install-k8s.ps1) mirror the bash logic and pass the PS parser, but I could not execute them (no Windows host here). Please validate on Windows + WSL2 (admin PowerShell): the credential re-prompt loop, the readiness gate, and the state-branched summary.

Follow-ups (out of scope)

Moving credential collection ahead of cluster build (fail in ~2s); the broader items from the review (cluster-admin auto-upgrade RBAC, egress allowlist docs, OVA/air-gapped appliance).

🤖 Generated with Claude Code

Docker: install docker-ce from the official Docker CentOS dnf repo on AlmaLinux/Rocky/Oracle, which get.docker.com rejects as unsupported.
k3d: preserve PATH through sudo so the post-install lookup survives RHEL secure_path (which omits /usr/local/bin).
conntrack: use the conntrack apt package on Debian/Ubuntu and conntrack-tools elsewhere.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ccess (#716, #717)
Credentials entered at the prompt are validated against the backend api-token-auth endpoint (the same call jobs-manager makes) with a re-prompt loop, so a wrong Client ID or password is caught immediately instead of after a full deploy.
After helm apply, wait_for_client_ready polls rollout status and classifies the outcome; print_summary reports connected, starting, bad_creds, image_pull or crash, and prints the data-never-leaves message only when the client is verifiably connected. Exit code now reflects the real outcome.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…716, #717)
Test-Credentials, Wait-ForClientReady and Get-NotReadyState mirror the bash logic in install-k8s.ps1, and Print-Summary is now state-branched. Validated with the PowerShell 7.4 parser; runtime behavior still needs a check on a Windows host.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-01T13:21:17Z

👋 Heads-up — Code review queue is at 14 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
data-ingestors#132 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
data-ingestors#133 — docs: fix declarative-ingest path/column drift (issue #131 A-series) · author: @divyasinghds · no reviewer assigned
design-system#19 — fix: un-track coverage/ and node_modules/ from git · author: @LukasWodka · no reviewer assigned
design-system#22 — ci: add Vitest test workflow · author: @LukasWodka · reviewer: @aptracebloc
design-system#23 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
docs#46 — docs: make declarative-ingest staging self-contained (data-ingestors#131 B/C) · author: @divyasinghds · no reviewer assigned
frontend-app#499 — ci: add Vitest + Cypress test workflow · author: @LukasWodka · reviewer: @aptracebloc

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

scripts/tests/: 64 bats tests (summary, install-client-helm, setup-linux, common) + 32 Pester tests for install-k8s.ps1, all mocked, no Docker/k3d/network needed.
Changed-line coverage measured with kcov (bash 96.2%) and Pester (PowerShell 97.4%); residual lines are the real RHEL Docker-install commands + the guarded main() orchestration, exercised by the integration E2E.
A TB_PESTER guard lets the suite dot-source install-k8s.ps1 without running the installer.
New installer-tests job in helm-ci.yaml runs both suites on PRs (scripts/ added to path filters).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-01T14:26:33Z

Tests added (follow-up — commit `5e41bd3`)

Added an automated test suite + a CI gate for the installer (the PR previously had only manual verification).

scripts/tests/ — 96 tests, all green:

64 bats unit tests, fully mocked (no Docker/k3d/network): summary.bats, install-client-helm.bats, setup-linux.bats, common.bats.
32 Pester tests for install-k8s.ps1 (mock Invoke-WebRequest / kubectl). A TB_PESTER env guard lets the suite dot-source the script to load functions without running the installer.
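
On the bash side, the usual way to let a test file load functions without running the installer is a source guard around `main`. The sketch below shows that common idiom; it is an assumption about how a guarded `main()` can look, not a copy of the scripts in this PR, and `say_state` is a hypothetical stand-in for the real functions.

```bash
#!/usr/bin/env bash
# Sketch only: say_state and the body of main are hypothetical.

say_state() {
  echo "state=$1"
}

main() {
  say_state "connected"
}

# Run main only when the file is executed directly. When a bats test sources
# this file, BASH_SOURCE[0] differs from $0, so only the functions are loaded.
if [[ "${BASH_SOURCE[0]:-}" == "${0}" ]]; then
  main "$@"
fi
```

A bats test can then `source` the script in `setup()` and call `say_state` directly, which is the same effect the TB_PESTER guard gives the Pester suite.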

Changed-line coverage (measured, not estimated):

bash 96.2% (kcov over the bats run). The residual lines are the RHEL docker-ce dnf commands inside the bash -c string (only execute during a real RHEL Docker install — validated live on AlmaLinux) and one curl multi-line accounting artifact.
PowerShell 97.4% (Pester JaCoCo). The residual 3 lines are the guarded main() orchestration (integration-only).
Whole-file coverage is lower on purpose: these files contain large untouched, system-mutating sections (Docker Desktop/WSL installers, the spinner, the macOS-only download path) that can't run in a Linux unit test — so the meaningful metric is changed-line coverage.

CI: new installer-tests job in helm-ci.yaml runs both suites on every PR (scripts/** added to the path filters). The full-installer integration E2E stays manual; the PowerShell still needs a Windows host to validate runtime behaviour.

… 0.0.0.0 detect, Windows parity)
A customer running behind an authenticated corporate HTTP proxy hit install failures. The 0.0.0.0 kubeconfig headline was fixed for the bash path in #166/#167, but adverse testing (a forward proxy + real k3d on Linux VMs) surfaced three remaining gaps plus a Windows parity hole:
- Gap A: authenticated proxies (http://user:pass@host) were silently SKIPPED, because k3d's --env KEY=VALUE@FILTER can't carry an '@' in the value. Now propagated via a k3d --config file (structured YAML env) so credentials survive intact. Verified on k3d v5.8.3 (it merges the --config env with the existing CLI flags).
- Gap B: NO_PROXY was propagated verbatim. Now auto-augmented with the cluster-internal ranges (loopback + RFC1918 + .svc/.cluster.local + host.k3d.internal), both into the cluster and host-side, so in-cluster traffic never routes through the proxy. This fixes the misroute AND the observed `k3d cluster create --wait` hang.
- Gap C: a cluster created outside the installer and bound to 0.0.0.0 is now detected (serverlb HostIp) and flagged with a non-destructive recreate remedy.
- Windows parity: install-k8s.ps1::New-K3dCluster had NONE of the bash fixes. It still bound --api-port 0.0.0.0:6550 (the original headline bug, still live on Windows), normalized only host.docker.internal in the kubeconfig, and propagated zero proxy env. Now mirrors bash: 127.0.0.1:6550, a 0.0.0.0->127.0.0.1 kubeconfig rewrite, and Get-EffectiveNoProxy + Write-K3dProxyConfig (auth + augmented NO_PROXY, written UTF-8 without a BOM).
Tests: new scripts/tests/cluster.bats (15) + Pester for the two ps1 helpers (6), both green.
Verified end-to-end on Linux VMs: auth creds propagated into the node, no startup hang behind an unreachable proxy, and 0.0.0.0 detection firing.
Stacked on #171 (the installer test scaffolding + final install-k8s.ps1 live there).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
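
Gaps A and B from this commit can be illustrated with a small sketch. The function names and the exact entry list are assumptions derived from the commit message; the YAML uses the k3d v5 `Simple` config layout (`env` entries with `envVar` and `nodeFilters`), and the real scripts may write it differently.

```bash
#!/usr/bin/env bash
# Sketch only: names and the exact entry list are assumptions.

# Loopback + RFC1918 + cluster-internal suffixes + the k3d host alias.
CLUSTER_NO_PROXY="localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,host.k3d.internal"

# Append the cluster-internal entries to an existing NO_PROXY value, keeping
# the user's entries first and skipping duplicates. The body runs in a
# subshell so the IFS and noglob changes never reach the caller.
effective_no_proxy() (
  IFS=','      # split on commas only
  set -f       # no glob expansion, so an entry like *.corp stays literal
  out=""
  for entry in ${1:-} $CLUSTER_NO_PROXY; do
    [ -n "$entry" ] || continue
    case ",$out," in
      *",$entry,"*) ;;                      # already present
      *) out="${out:+$out,}$entry" ;;
    esac
  done
  printf '%s\n' "$out"
)

# Emit a k3d config that carries the proxy settings as structured env entries,
# so a value containing '@' is not mistaken for a node filter.
# Caveat: a proxy URL containing a double quote or backslash would need escaping.
write_k3d_proxy_config() {
  local proxy="$1" no_proxy="$2"
  cat <<EOF
apiVersion: k3d.io/v1alpha5
kind: Simple
env:
  - envVar: "HTTP_PROXY=${proxy}"
    nodeFilters: ["all"]
  - envVar: "HTTPS_PROXY=${proxy}"
    nodeFilters: ["all"]
  - envVar: "NO_PROXY=${no_proxy}"
    nodeFilters: ["all"]
EOF
}
```

Typical use would be to redirect `write_k3d_proxy_config "$HTTP_PROXY" "$(effective_no_proxy "${NO_PROXY:-}")"` into a temporary file and pass that file to `k3d cluster create` with `--config`.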

…eal preflight probes)
Measured changed-line coverage of the stack had dropped (bash ~84% vs #171's ~96%) because the new code added integration-only branches the mocked unit suites skipped. Recover the unit-testable portion:
- diagnose.bats: exercise the kubectl/docker/helm collection path (has()=true + mocked tools) -> diagnose.sh 64% -> 90%.
- preflight.bats: test the REAL _pf_probe_url curl-exit-code -> token mapping, the missing-curl path, and the _pf_ncpu/_pf_total_mem_kb/_pf_free_kb readers (re-sourced past the setup stubs) -> preflight.sh 79% -> 90%.
Bash changed-line coverage: 83.6% -> 92.3% (kcov, 383/415). The residual ~8% is integration-only (real k3d/docker create + macOS/Windows-specific branches + MAIN orchestration), validated by the live VM E2Es (reboot recovery, auth-proxy, preflight blocked-egress/arm64, diagnose redaction grep).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
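
For context, a curl-exit-code to token mapping of the kind tested here might look like the sketch below. The token names and function names are hypothetical (the real `_pf_probe_url` may use different ones); the exit codes themselves are curl's documented values.

```bash
#!/usr/bin/env bash
# Sketch only: token and function names are hypothetical.

# Map a curl exit code to a short diagnostic token.
probe_token_for_exit() {
  case "$1" in
    0)     echo "ok" ;;
    5|6)   echo "dns" ;;       # 5: proxy not resolved, 6: host not resolved
    7)     echo "refused" ;;   # failed to connect
    28)    echo "timeout" ;;   # operation timed out
    35|60) echo "tls" ;;       # 35: TLS handshake error, 60: untrusted certificate
    *)     echo "error" ;;
  esac
}

# Probe a URL and print the token; report a missing curl separately.
probe_url() {
  local url="$1" rc=0
  if ! command -v curl >/dev/null 2>&1; then
    echo "no_curl"
    return 0
  fi
  curl -sS -o /dev/null --max-time 10 "$url" >/dev/null 2>&1 || rc=$?
  probe_token_for_exit "$rc"
}
```

Because the mapping is a pure function of the exit code, a bats test can cover every branch by calling `probe_token_for_exit` directly and only needs a mocked `curl` for the `probe_url` wrapper.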

LukasWodka and others added 3 commits June 1, 2026 15:19

LukasWodka assigned saadqbal Jun 1, 2026

LukasWodka mentioned this pull request Jun 1, 2026

Merge master into develop + bugbot fixes (unblocks #79) tracebloc/model-zoo#80

Merged

2 tasks

LukasWodka mentioned this pull request Jun 1, 2026

fix(ingest): MLM tokenizer + relax PascalVOC difficult + reserved-id guard (#137, #135) tracebloc/data-ingestors#139

Open

2 tasks

LukasWodka wants to merge 4 commits into develop from fix/installer-hardening