Add KB: Cluster Recovery After Full Power Outage (closes #441) by CarlRodabaugh · Pull Request #442 · verge-io/docs

CarlRodabaugh · 2026-04-28T01:33:50Z

Summary

Adds a new KB article addressing #441: a consolidated how-to / troubleshooting guide for recovering a VergeOS cluster after an unplanned full power loss.

Changes

New file: docs/knowledge-base/posts/cluster-recovery-after-power-outage.md
Patterned after kb-template.md (frontmatter, Key Points, Prerequisites, Steps, Troubleshooting, Prevention, Additional Resources, Feedback, Document Information)
Cross-links to existing docs: Proper Power Sequence, Proper Shutdown Procedure, Journal Walks, Generating System Diagnostics, Repair Server (ioGuardian), vSAN Diagnostics Guide, System Diagnostics, Sizing & Hardware Requirements

Issue coverage (per #441 "Suggested Content")

✅ Expected cluster behavior after simultaneous full power loss (What to Expect subsection)
✅ Recommended host power-on sequence — including the verbatim Waiting for the vSAN to mount console prompt as the operator's go-signal between Node1 and Node2
✅ Rejoin order and how nodes resync vSAN tiers (auto-reconciliation via Journal Walks once quorum is reached)
✅ Step-by-step recovery procedure (pre-checks → power-on → verification)
✅ How to verify vSAN health and tier sync status post-recovery (Status tile fields, Repairs/Bad Drives interpretation, vSAN Diagnostics CLI equivalents)
✅ Recommendations to prevent data inconsistency or corruption — UPS sizing, graceful shutdown automation (UI / API / VRG CLI), On Power Loss VM settings, ioGuardian repair server, off-site snapshots, fencing-handled-internally explainer
✅ Troubleshooting: node fails to rejoin, vSAN won't mount, stuck/growing repairs, split-brain
✅ When to engage VergeIO support, with explicit sysdiag generation steps and support@verge.io for the air-gapped/manual path

Verification

All UI nav paths verified against existing operational KBs (System → vSAN → Tiers, System → vSAN → Drives, System → Nodes, System → vSAN Diagnostics, System → System Diagnostics)
API payload POST /v4/cluster_actions { action: shutdown } matches the existing Proper Shutdown Procedure doc and API Tables reference
Cluster reconciliation, Journal Walk, and quorum behavior cross-checked against the Journal Walks KB and stuck-repairs internal doc
Sysdiag UI label and parent/root requirement cross-checked against both Generating System Diagnostics and System Diagnostics docs
Version footer aligned with currently supported releases (26.0+)
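
For reviewers checking the shutdown payload, here is a minimal sketch of how the request could be assembled. The path `/v4/cluster_actions` and `action: shutdown` come from this description, and a later commit in this PR notes that the payload also needs a cluster id and params. The key names `cluster` and `params`, the `build_shutdown_request` helper, and the example host are assumptions for illustration, not the documented API. Authentication is omitted, and the request is built but never sent.

```python
import json
import urllib.request


def build_shutdown_request(base_url, cluster_id, params=None):
    """Build (but do not send) a cluster shutdown request."""
    # "action": "shutdown" is taken from the PR description.
    # The "cluster" and "params" key names are ASSUMPTIONS for illustration;
    # check the API Tables reference for the real field names.
    payload = {
        "action": "shutdown",
        "cluster": cluster_id,
        "params": params or {},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v4/cluster_actions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host; nothing is transmitted here.
    req = build_shutdown_request("https://vergeos.example", cluster_id=1)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to diff against the Proper Shutdown Procedure doc before anything touches a live cluster.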

Test plan

Render preview in mkdocs and verify all admonitions, code blocks, and internal links resolve
Confirm the Waiting for the vSAN to mount console string matches what customers see on 26.x (verified against an in-the-wild console screenshot)
Tech review by support / engineering for any internal-only details that shouldn't be public

Closes #441

Adds a consolidated how-to / troubleshooting guide for recovering a VergeOS cluster after an unplanned full power loss. Covers expected behavior, pre-power-on checks, the Node1 → Node2 → remaining-nodes sequence (including the "Waiting for the vSAN to mount" prompt as the operator's go-signal), post-recovery verification, troubleshooting (stuck repairs, split-brain, failed rejoin), prevention (UPS sizing, graceful shutdown automation via API/VRG, ioGuardian repair server, fencing handled internally), and when to engage support with sysdiag generation steps.

Patterned after kb-template.md; cross-links to existing power-sequence, shutdown-procedure, journal-walks, repair-server, vSAN-diagnostics, sizing, and system-diagnostics docs.

- draft: false (frontmatter blocker)
- Promote Prevention, When to Engage Support, Generating a System Diagnostic for Support from h3 to h2
- Drop unverified verbatim "Waiting for the vSAN to mount" console string; describe the wait-state behavior instead
- Drop undefined "Kill Mode" term; describe IPMI hard power-off inline
- Link Maintenance Mode to /product-guide/operations/maintenance-mode/
- Align slug to filename (cluster-recovery-after-power-outage)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
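
The two frontmatter fixes in this commit (draft flag and slug matching the filename) could be checked mechanically. The sketch below is one way to do that, assuming the article starts with a `---` delimited block of simple `key: value` lines. The helper names are hypothetical, and the parser is deliberately naive rather than a full YAML reader.

```python
from pathlib import Path


def read_frontmatter(text):
    """Return top-level `key: value` pairs from a leading --- block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Skip indented (nested) lines, list items, and lines without a key.
        if line.startswith((" ", "\t", "-")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


def frontmatter_problems(path, text):
    """List the frontmatter issues this commit addresses, if any remain."""
    fields = read_frontmatter(text)
    problems = []
    # Assumption: the article must state draft: false explicitly.
    if fields.get("draft", "").lower() != "false":
        problems.append("draft is not false")
    expected_slug = Path(path).stem
    if fields.get("slug") != expected_slug:
        problems.append(
            f"slug {fields.get('slug')!r} does not match filename {expected_slug!r}"
        )
    return problems
```

Calling `frontmatter_problems(path, Path(path).read_text())` on the new article returns an empty list when both conditions hold.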

…w fixes

- Replace 'at least 2 nodes' quorum framing with N-1 vSAN nodes (per Jason Yaeger).
- Simplify power-on sequence to 'Node1 first, then power on the rest paced ~1 min apart'; vSAN mounts on its own when N-1 is reached.
- Reframe Bad Drives: count of drives the cluster currently can't see; persistent non-zero is a real fault, not transient walk noise.
- Fix cluster shutdown API payload to include cluster id and params.
- Soften 'no built-in NUT/UPS' claim ('does not currently document').
- Drop unsupported 'often auto-created' claim about ioGuardian.
- Reframe split-brain as a recovery-time network-partition risk, not 'during the outage'.
- Trim Pro Tip; 'stoplights' -> 'status lights'; bump last-updated.
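
The revised quorum framing and power-on pacing can be expressed as a small sketch. It assumes that "N-1" means at least N-1 of the cluster's N vSAN nodes are online, and that the ~1 minute pace also applies between Node1 and the second node. Both function names are hypothetical, and the single-node guard is an added assumption.

```python
def vsan_can_mount(nodes_online, total_vsan_nodes):
    """True once at least N-1 of the N vSAN nodes are online."""
    # Assumption: at least one node must be up, even in a one-node cluster.
    return nodes_online >= max(1, total_vsan_nodes - 1)


def power_on_schedule(node_names, pace_seconds=60):
    """Return (offset_seconds, node) pairs: Node1 at 0, the rest paced apart."""
    return [(i * pace_seconds, name) for i, name in enumerate(node_names)]


if __name__ == "__main__":
    for offset, node in power_on_schedule(["node1", "node2", "node3", "node4"]):
        print(f"t+{offset:>3}s  power on {node}")
```

In a four-node cluster this treats three nodes online as sufficient for the vSAN to mount, which is the behavior the commit describes as happening on its own.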

CarlRodabaugh and others added 3 commits April 27, 2026 21:32

CarlRodabaugh wants to merge 3 commits into main from carl/cluster-recovery-after-power-outage
