Jakob Munch Overgaard edited this page Jun 21, 2026 · 1 revision

RemotePower v5.0.0 — "CTRLMatters"

Control the control plane. v5.0.0 hardens the path between operators, the server, and agents: stronger agent authentication, dual control for sensitive vault reveals, at-rest encryption for backups, bulk and per-command operator controls, reliability for production upgrades, and the first fleet-scale CVE optimisation, plus a batch of operator-efficiency polish.

Major bump: 4.10.0 → 5.0.0. No breaking config changes; every new control is opt-in and off by default.

Security spine

  • Mutual TLS for agents (opt-in). The server can now require every agent to present a CA-verified client certificate (from the v4.5.0 self-signed CA), optionally pinned per device — defence-in-depth on top of the per-device token, so a stolen token alone can't impersonate a host. Enable the nginx ssl_verify_client optional block, flip Settings → Security → Require agent mTLS, and run agents with RP_CLIENT_CERT/RP_CLIENT_KEY.
  • Encrypted disaster-recovery backups. Set RP_BACKUP_PASSPHRASE and the nightly/manual backup is written as *.tar.gz.enc (streaming AES-256-GCM, PBKDF2-SHA256) with the plaintext removed. Restore decrypts transparently (passphrase via the X-RP-Backup-Passphrase header or the env). The passphrase lives only in the environment — never in the data dir the backup protects.
  • Break-glass vault reveals. Flag a sensitive credential (root / IPMI / DR) as break-glass: revealing it then requires a second admin's approval. The request, the approval, and the reveal are each immutably audit-logged, and a high-severity vault_break_glass alert fires so an approver notices. Approvals expire after 15 minutes; self-approval is rejected.
  • Per-API-key rate limiting. Each API key can carry a requests/minute cap, so a leaked key can't saturate the API (429 once exhausted).
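
The break-glass rules above (second-admin approval, 15-minute expiry, no self-approval) can be sketched as a small state check. This is a minimal illustration, not RemotePower's implementation: `RevealRequest`, `approve`, and `may_reveal` are hypothetical names, and the audit logging and the `vault_break_glass` alert are left out.

```python
from dataclasses import dataclass
from typing import Optional

# From the release notes: approvals expire after 15 minutes.
APPROVAL_TTL_SECONDS = 15 * 60


@dataclass
class RevealRequest:
    """Hypothetical record of one break-glass reveal request."""
    credential_id: str
    requester: str
    approver: Optional[str] = None
    approved_at: Optional[float] = None


def approve(request: RevealRequest, approver: str, now: float) -> None:
    """Record a second admin's approval; the requester cannot approve themselves."""
    if approver == request.requester:
        raise PermissionError("self-approval is rejected")
    request.approver = approver
    request.approved_at = now


def may_reveal(request: RevealRequest, now: float) -> bool:
    """Allow the reveal only while an approval exists and is still fresh."""
    if request.approved_at is None:
        return False
    return now - request.approved_at <= APPROVAL_TTL_SECONDS
```

Passing `now` in explicitly, rather than reading the clock inside the functions, keeps the expiry rule easy to test.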

Reliability & resilience

  • Server disk-space watchdog: raises a server_disk_low alert (with a matching recovery alert) when the controller's own data dir crosses a configurable threshold, before flock-guarded writes start failing.
  • Webhook dead-letter queue — permanently-failed deliveries are kept and shown on the Webhook log with Retry / Retry all / Clear, plus a replay of any past fleet event.
  • Runtime maintenance mode — a one-click toggle that pauses new agent command dispatch during a controller upgrade (heartbeats and browsing keep working, so devices don't flip offline), with a banner shown to everyone.
  • Graceful shutdown for long-poll commands + an OSV circuit breaker that stops a fleet CVE scan from hammering OSV.dev while it's down.
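
For readers unfamiliar with the pattern, a circuit breaker like the one described for OSV lookups can be sketched as follows. The class name, thresholds, and cooldown are illustrative assumptions; the notes do not state the values RemotePower uses.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing upstream until a cooldown has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value, not from the release notes
        cooldown_seconds: float = 60.0,  # assumed value, not from the release notes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a request to the upstream may be attempted now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let a trial request through.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A scan loop would call `allow()` before each upstream query and skip or defer the lookup when it returns False, then report the outcome with `record_success()` or `record_failure()`.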

Fleet power-ops

  • Bulk device delete and bulk tag add/remove from the existing multi-select.
  • Per-command timeout override on the Run-command modal.
  • Agent/server version-compatibility check before an update (blocks a cross-major downgrade unless forced).
  • One-click rollout rollback — a script rollout can carry a rollback script that re-runs on exactly the devices the rollout reached.
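
The version-compatibility rule can be illustrated with a short check. This sketch assumes the comparison is between the currently installed version and the update target, and that versions follow the `major.minor.patch` form used in these notes; `parse_major` and `update_allowed` are hypothetical helpers.

```python
def parse_major(version: str) -> int:
    """Extract the major component from a version such as '5.0.0' or 'v5.0.0'."""
    return int(version.lstrip("v").split(".")[0])


def update_allowed(current: str, target: str, force: bool = False) -> bool:
    """Block an update to a lower major version unless the operator forces it."""
    cross_major_downgrade = parse_major(target) < parse_major(current)
    return force or not cross_major_downgrade
```

For example, `update_allowed("5.0.0", "4.10.0")` is False, while the same call with `force=True` is True.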

Scale

  • Cross-device OSV batching — a fleet CVE scan now deduplicates packages across every device and queries OSV once per ecosystem instead of once per device.
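
The deduplication step can be sketched as grouping every device's packages by ecosystem and keeping each (name, version) pair once. The inventory shape and the `batch_by_ecosystem` name are assumptions for illustration; the actual query to OSV is omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed inventory entry: (ecosystem, package name, version).
Package = Tuple[str, str, str]


def batch_by_ecosystem(
    inventories: Dict[str, Iterable[Package]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Collapse per-device package lists into one deduplicated batch per ecosystem."""
    unique: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for packages in inventories.values():
        for ecosystem, name, version in packages:
            unique[ecosystem].add((name, version))
    # Sort each batch so the output is deterministic.
    return {eco: sorted(pkgs) for eco, pkgs in unique.items()}
```

With this shape, the number of upstream queries depends on the number of ecosystems in the fleet, not on the number of devices.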

Operator polish

Copy-to-clipboard everywhere · webhook delivery green/red dots · Snooze alerts 1h per device · a live pending-commands nav badge · rename/duplicate saved fleet queries · field tooltips · the command palette now searches command history · one-click Run diagnostics (storage / disk / audit chain / agent reachability).

Follow-up sweep

A whole-project hardening + polish pass folded into v5.0.0 (no version bump):

  • Thermal page — expand any host to see every sensor (temperature and its critical limit), a ~24h temperature trend sparkline (the server now keeps a rolling per-host hottest-reading series), and a per-host Thresholds button that opens the warning / critical temperature editor.
  • CMDB — a new Business function field (a fixed list: Application Operation / OS Operation / Server Camp) and a wider asset editor whose properties lay out in two columns so the inputs aren't full-page-width.
  • AI knowledge index — the live-state corpus now also covers mount problems, failing custom checks, running process names and file-descriptor / conntrack saturation, so the assistant can answer those questions from real data.
  • Hardening — the legacy webhook_url is no longer returned by the config API (it embeds a secret in its path; the response carries only a webhook_configured flag, and an admin re-enters the URL to change it). Per-disk SMART, per-GPU and temperature trend samples are written outside the hardware lock so they're durable on the SQLite backend. A six-stream audit plus a live, authorized probe and the usual SAST tooling found no Critical/High/Medium issues — see security-review-5.0.0.
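
The webhook_url change in the Hardening item amounts to redacting one field and exposing a boolean in its place. A minimal sketch, assuming the stored config is a flat dictionary (`public_config` is a hypothetical helper, not the server's actual function):

```python
from typing import Any, Dict


def public_config(stored: Dict[str, Any]) -> Dict[str, Any]:
    """Build the API response: drop the secret-bearing URL, keep only a flag."""
    response = {key: value for key, value in stored.items() if key != "webhook_url"}
    response["webhook_configured"] = bool(stored.get("webhook_url"))
    return response
```

A client can still show whether a webhook is set, but it never receives the URL and so cannot leak the secret embedded in its path.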

Deferred to a focused follow-up: user-configurable timezone (needs a central time-formatting helper threaded through every call site to stay consistent).
