feat(scenarios): major-upgrade Chaos Workflow + runbook (PR 5/5) by bdchatham · Pull Request #288 · sei-protocol/sei-k8s-controller

bdchatham · 2026-05-19T18:49:25Z

Final slice of the SeiNodeTask MVP implementation. Adds the major-upgrade Chaos Mesh Workflow that composes the 5 MVP SeiNodeTask kinds via the seitask-runner image (PR 4) to express sei-chain/integration_test/upgrade_module/major_upgrade_test.yaml end-to-end. This is the MVP acceptance scenario.

Status: schema-validated, not yet runnable end-to-end. See "Status" section of scenarios/README.md and the bridge-gap callout below.

Workstream

Council workstream — System tier — for the SeiNodeTask MVP (design merged at #277):

PR 1: v1alpha1 CRD types + CEL + manifests (feat(seinodetask): v1alpha1 CRD types (PR 1/5) #281, merged)
PR 2: reconciler + UpdateNodeImage (feat(seinodetask): reconciler + UpdateNodeImage task (PR 2/5) #283, merged)
PR 3: sidecar-backed kinds wiring (feat(seinodetask): wire sidecar-backed kinds (PR 3/5) #285, merged)
PR 4: seitask-runner image + per-kind templates (feat(seitask-runner): orchestration binary + image + templates (PR 4/5) #286, merged)
PR 5 (this PR): major-upgrade Chaos Workflow + runbook
Hotfix #287: fix(crd): add seinodetasks to config/crd/kustomization.yaml (merged, prod recovered)

What ships

scenarios/major-upgrade.yaml — 599-line, 26-template Chaos Mesh Workflow
scenarios/testnet-deployment.yaml — reference 4-validator SeiNodeDeployment the Workflow can target
scenarios/README.md — runbook + known-limitations

The 12-step flow

#	Step	Primitive
1	compute-target-height	bash → emit `TARGET_HEIGHT`, `POST_UPGRADE_HEIGHT`, `PANIC_BOUNDARY`
2	submit-upgrade-proposal	`GovSoftwareUpgrade` SeiNodeTask
3	vote-yes-all-validators	Parallel of 4× `GovVote` SeiNodeTask
4	wait-for-proposal-to-pass	bash polls gov queries (self-resolves proposal by content)
5	early-upgrade-node-0	`UpdateNodeImage` (expects CrashLoop)
6	wait-for-target-height-nodes-1-2-3	Parallel of 3× `AwaitCondition` height
7	upgrade-nodes-1-2-3	Parallel of 3× `UpdateNodeImage`
8	downgrade-node-0	`UpdateNodeImage` to v1
9	await-post-upgrade-progress-nodes-1-2-3	Parallel `AwaitCondition` height-advance past `POST_UPGRADE_HEIGHT`
10	wait-for-target-height-node-0	`AwaitCondition` for `PANIC_BOUNDARY` (= TARGET_HEIGHT-1; node-0 panics AT TARGET_HEIGHT)
11	upgrade-node-0	`UpdateNodeImage` to v2
12	await-post-upgrade-progress-node-0	`AwaitCondition` height-advance
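The height arithmetic behind step 1 can be sketched as follows. The commit message for this PR gives `TARGET_HEIGHT = current + 100` and `PANIC_BOUNDARY = TARGET_HEIGHT - 1`; the formula for `POST_UPGRADE_HEIGHT` is not stated here, so the `post_upgrade_margin` parameter below is a hypothetical placeholder, as are the function names. The real step is a bash script in scenarios/major-upgrade.yaml; this is only an illustration of the relationships between the variables.

```python
def compute_heights(current_height, target_offset=100, post_upgrade_margin=5):
    """Derive the cross-step height variables from the current chain height.

    target_offset=100 follows the commit message (current + 100).
    post_upgrade_margin is an assumed value, not taken from the PR.
    """
    if current_height < 0:
        raise ValueError("current_height must be non-negative")
    target = current_height + target_offset
    return {
        "TARGET_HEIGHT": target,
        # node-0 panics AT TARGET_HEIGHT on the pre-upgrade binary,
        # so the last height it can reach is one block earlier.
        "PANIC_BOUNDARY": target - 1,
        "POST_UPGRADE_HEIGHT": target + post_upgrade_margin,
    }


def render_env(variables):
    """Render the variables as shell export lines, in the style of env.sh."""
    return "".join(f"export {key}={value}\n" for key, value in sorted(variables.items()))


if __name__ == "__main__":
    print(render_env(compute_heights(1000)), end="")
```

Waiting on `PANIC_BOUNDARY` rather than `TARGET_HEIGHT` in step 10 matters because a node still on the old binary never reports `TARGET_HEIGHT`, so a wait on that value could only time out.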

Scope cuts (cross-review with product-engineer)

Cut from the original 36-step draft, per the product-engineer's north-star walkthrough:

verify_running × 4 (pre-upgrade) — vestigial. The bash framework needed these because it had no preceding signal; we have wait-for-proposal-to-pass which is a strictly stronger proof of chain liveness.
verify_panic × 4 — semantics ("chain stopped at target-1 OR process exited") cannot be cleanly expressed without sidecar/runner extensions we deliberately don't ship. The chain reaching PANIC_BOUNDARY and stopping is sufficient regression-detection signal.
verify_upgrade_needed_log — would require pod log access (pods/log verb). Liveness via height-advance covers the assertion intent.

Net: 36-step bash translation → 12-step Workflow that tests the same upgrade flow.
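The height-advance liveness signal that stands in for the cut verifiers can be pictured as a simple polling loop. This is a minimal sketch, not the `AwaitCondition` implementation: `get_height` is a hypothetical callable that returns the node's latest block height, and the timeout and interval defaults are arbitrary.

```python
import time


def await_height_past(get_height, threshold, timeout_s=300.0, interval_s=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the reported height exceeds `threshold`, or raise TimeoutError.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        height = get_height()
        if height > threshold:
            # The chain produced a block past the threshold: it is alive.
            return height
        if clock() >= deadline:
            raise TimeoutError(
                f"height {height} did not advance past {threshold} within {timeout_s}s"
            )
        sleep(interval_s)


if __name__ == "__main__":
    samples = iter([1099, 1100, 1106])
    print(await_height_past(lambda: next(samples), 1105, sleep=lambda _: None))
```

A node that is stuck, crash-looping, or halted at the upgrade height never satisfies the condition, so the same check covers the regressions the removed `verify_*` steps were meant to catch.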

Known gap: cross-step variable bridge

Bug in the LLD I authored: Chaos Mesh's Task template launches each step in its own Pod. The task.volumes field is per-Pod-scoped, so an emptyDir declared in two steps does NOT share storage. The /workflow/vars/env.sh bridge the LLD specified doesn't work as designed.

The Workflow YAML retains /workflow/vars/env.sh references throughout for forward-compatibility with a real bridge implementation. The README's Status block is unambiguous: schema-validate only, not runnable end-to-end yet. Follow-up issue (filed separately) tracks a ConfigMap-based bridge as a focused PR 6.
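One way to picture the ConfigMap-based bridge is shown below. This is a sketch under assumptions, since the follow-up PR's design is not described here: the ConfigMap name `workflow-vars`, the namespace, and both function names are hypothetical. The only Kubernetes facts relied on are that a ConfigMap's `data` field is a map of string keys to string values and that manifests may be written as JSON.

```python
import json
import shlex


def vars_to_configmap(name, namespace, variables):
    """Build a ConfigMap manifest holding the cross-step variables.

    ConfigMap data values must be strings, so every value is stringified.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {key: str(value) for key, value in variables.items()},
    }


def configmap_to_env(configmap):
    """Render a ConfigMap's data as shell export lines for a later step to source."""
    data = configmap.get("data", {})
    return "".join(
        f"export {key}={shlex.quote(value)}\n" for key, value in sorted(data.items())
    )


if __name__ == "__main__":
    # Hypothetical values; an early step would write these, later steps read them.
    cm = vars_to_configmap(
        "workflow-vars", "default",
        {"TARGET_HEIGHT": 1100, "PANIC_BOUNDARY": 1099, "PROPOSAL_ID": 1},
    )
    print(json.dumps(cm, indent=2))
    print(configmap_to_env(cm), end="")
```

Unlike an emptyDir, the ConfigMap lives in the API server rather than in any one Pod, which is why it can span steps that each run in their own Pod. It also explains why the runner would need configmaps RBAC.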

Cross-review (local, pre-push)

Item	Provider (platform-engineer)	Audit (product-engineer)	Status
12 surviving steps map to north-star	All steps wired to runner templates or bash	Confirmed via walkthrough	COMPATIBLE
Verifier cuts applied	10 templates removed, 4 height-advance steps added	Recommended exact cuts	COMPATIBLE
`wait-for-target-height-node-0` waits for `PANIC_BOUNDARY` not `TARGET_HEIGHT`	Fixed in this PR (1-line var addition + 1 --var override)	Real bug flagged in audit	RESOLVED
Cross-step bridge gap documented	Status block at top of README	Required honesty	COMPATIBLE
YAML schema valid	`yaml.safe_load_all` parses	Required	COMPATIBLE

Zero MISMATCH after the bridge-gap call-out + the PANIC_BOUNDARY fix.

Test plan

yaml.safe_load_all parses cleanly
Manual walkthrough vs major_upgrade_test.yaml confirms each step maps
kubectl apply --dry-run=client against a cluster with chaos-mesh.org/v1alpha1 installed (reviewer verification, requires Chaos Mesh CRDs)
End-to-end run blocked on: cross-step variable bridge (follow-up PR), seitask-runner image published to a registry the cluster can pull from (follow-up — currently requires manual make runner-image && make runner-push)
Harbor cluster e2e dry-run once the above two land

Follow-up issues (filing separately)

PR 6: ConfigMap-based cross-step variable bridge — replaces the broken emptyDir mechanism; runner gains configmaps RBAC.
Runner image release pipeline — .github/workflows/ecr.yml builds only the controller today; needs a parallel runner-image build/publish step.
CI: assert config/crd/kustomization.yaml lists every config/crd/*.yaml. This is the gap that caused the prod outage fixed by #287 (fix(crd): add seinodetasks to config/crd/kustomization.yaml). Run it as a quick check in the lint stage.
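The proposed CI assertion could look roughly like the sketch below. It is illustrative only: the function names are hypothetical, the `resources:` parser is a deliberately minimal line scanner rather than a full YAML parser, and files are matched by basename because the exact directory layout under config/crd is an assumption.

```python
import posixpath


def listed_resources(kustomization_text):
    """Extract entries of the top-level `resources:` list from kustomization text."""
    resources = []
    in_resources = False
    for raw in kustomization_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue
        if not line.startswith((" ", "-")):
            # A new top-level key either opens or closes the resources block.
            in_resources = line.strip() == "resources:"
            continue
        item = line.strip()
        if in_resources and item.startswith("- "):
            resources.append(item[2:].strip().strip("'\""))
    return resources


def missing_from_kustomization(kustomization_text, crd_files):
    """Return CRD files (sorted) whose basename is not listed as a resource."""
    listed = {posixpath.basename(path) for path in listed_resources(kustomization_text)}
    return sorted(f for f in crd_files if posixpath.basename(f) not in listed)


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    text = "resources:\n- bases/seinodes.yaml\n"
    missing = missing_from_kustomization(
        text, ["bases/seinodes.yaml", "bases/seinodetasks.yaml"]
    )
    print(missing)  # a non-empty list would fail the lint stage
```

In a lint job, the file list would come from globbing the CRD directory and the job would exit non-zero when the returned list is non-empty.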

🤖 Generated with Claude Code

Final slice of the SeiNodeTask MVP implementation. Adds the major-upgrade Chaos Mesh Workflow that composes the 5 MVP SeiNodeTask kinds via the seitask-runner image (PR 4) to express sei-chain/integration_test/upgrade_module/major_upgrade_test.yaml end-to-end as the MVP acceptance scenario.

Status: schema-validated, not yet runnable end-to-end. See "Status" section of scenarios/README.md.

Files:
- scenarios/major-upgrade.yaml (599 LOC, 26 Chaos Mesh templates)
- scenarios/testnet-deployment.yaml (reference 4-validator SeiNodeDeployment the Workflow targets)
- scenarios/README.md (runbook + known-limitations)

The Workflow expresses:
1. compute-target-height (bash; queries node-0 RPC, computes TARGET_HEIGHT = current + 100, UPGRADE_HEIGHT, POST_UPGRADE_HEIGHT, PANIC_BOUNDARY = TARGET_HEIGHT - 1)
2. submit-upgrade-proposal (GovSoftwareUpgrade)
3. vote-yes-all-validators (Parallel of 4 GovVote)
4. wait-for-proposal-to-pass (bash polls gov queries)
5. early-upgrade-node-0 (UpdateNodeImage)
6. wait-for-target-height-nodes-1-2-3 (Parallel of 3 AwaitCondition height)
7. upgrade-nodes-1-2-3 (Parallel of 3 UpdateNodeImage)
8. downgrade-node-0 (UpdateNodeImage)
9. await-post-upgrade-progress-nodes-1-2-3 (Parallel, liveness via height-advance past POST_UPGRADE_HEIGHT)
10. wait-for-target-height-node-0 (AwaitCondition; waits for PANIC_BOUNDARY = TARGET_HEIGHT-1 since node-0 panics AT TARGET_HEIGHT on the pre-upgrade binary)
11. upgrade-node-0 (UpdateNodeImage)
12. await-post-upgrade-progress-node-0 (AwaitCondition height-advance)

Scope discipline applied during cross-review:
- CUT: verify_running pre-upgrade × 4 (vestigial from bash framework; passed-proposal already proves chain liveness)
- CUT: verify_panic × 4 (semantics need AwaitCondition extensions we defer; the chain reaching TARGET_HEIGHT-1 and stopping is sufficient evidence)
- CUT: verify_upgrade_needed_log (would require pods/log access; the liveness check covers regression detection without it)
- KEPT: post-upgrade height-advance checks (real liveness signal)

Known gap (documented in README's Status block): Chaos Mesh's Task template launches each step in its own Pod, so the emptyDir-based /workflow/vars/env.sh bridge described in the LLD does not span steps. The cross-step variables (TARGET_HEIGHT, UPGRADE_HEIGHT, POST_UPGRADE_HEIGHT, PANIC_BOUNDARY, PROPOSAL_ID) computed by early steps cannot be read by subsequent steps until a real bridge is in place. ConfigMap-based bridge will land in a follow-up issue. The YAML retains /workflow/vars references throughout for forward-compatibility with that bridge.

Validation: yaml.safe_load_all parses cleanly; full kubectl --dry-run schema validation requires the chaos-mesh.org/v1alpha1 CRD on the target cluster.

Cross-review (local, pre-push):
- platform-engineer (primary author): Chaos Mesh syntax, runbook DX, cross-step bridge investigation (discovered the emptyDir gap).
- product-engineer (scope-cutter / north-star walkthrough): line-by-line mapping from major_upgrade_test.yaml, identified the verifier cuts, validated the 5 MVP kinds cover the scenario.

bdchatham added 2 commits May 19, 2026 11:48

docs(scenarios): list PANIC_BOUNDARY among the cross-step vars in the Status block

a9a4fb7

bdchatham merged commit f349d5f into main May 19, 2026
2 checks passed

bdchatham deleted the feat/seinodetask-pr5-scenario branch May 19, 2026 19:13

This was referenced May 19, 2026

feat(scenarios): ConfigMap ownerReference + Workflow name templating (PR 7) #290

Merged

fix(dockerignore): re-include runner/templates so the runner image builds #293

Merged
