feat(scenarios): major-upgrade Chaos Workflow + runbook (PR 5/5) #288

Merged: bdchatham merged 2 commits into main from feat/seinodetask-pr5-scenario on May 19, 2026

@bdchatham bdchatham commented May 19, 2026

Final slice of the SeiNodeTask MVP implementation. Adds the major-upgrade Chaos Mesh Workflow that composes the 5 MVP SeiNodeTask kinds via the seitask-runner image (PR 4) to express sei-chain/integration_test/upgrade_module/major_upgrade_test.yaml end-to-end. This is the MVP acceptance scenario.

Status: schema-validated, not yet runnable end-to-end. See "Status" section of scenarios/README.md and the bridge-gap callout below.

Workstream

Council workstream — System tier — for the SeiNodeTask MVP (design merged at #277):

What ships

  • scenarios/major-upgrade.yaml — 599-line, 26-template Chaos Mesh Workflow
  • scenarios/testnet-deployment.yaml — reference 4-validator SeiNodeDeployment the Workflow can target
  • scenarios/README.md — runbook + known-limitations

The 12-step flow

| # | Step | Primitive |
|---|------|-----------|
| 1 | compute-target-height | bash → emit TARGET_HEIGHT, POST_UPGRADE_HEIGHT, PANIC_BOUNDARY |
| 2 | submit-upgrade-proposal | GovSoftwareUpgrade SeiNodeTask |
| 3 | vote-yes-all-validators | Parallel of 4× GovVote SeiNodeTask |
| 4 | wait-for-proposal-to-pass | bash polls gov queries (self-resolves proposal by content) |
| 5 | early-upgrade-node-0 | UpdateNodeImage (expects CrashLoop) |
| 6 | wait-for-target-height-nodes-1-2-3 | Parallel of 3× AwaitCondition height |
| 7 | upgrade-nodes-1-2-3 | Parallel of 3× UpdateNodeImage |
| 8 | downgrade-node-0 | UpdateNodeImage to v1 |
| 9 | await-post-upgrade-progress-nodes-1-2-3 | Parallel AwaitCondition height-advance past POST_UPGRADE_HEIGHT |
| 10 | wait-for-target-height-node-0 | AwaitCondition for PANIC_BOUNDARY (= TARGET_HEIGHT-1; node-0 panics AT TARGET_HEIGHT) |
| 11 | upgrade-node-0 | UpdateNodeImage to v2 |
| 12 | await-post-upgrade-progress-node-0 | AwaitCondition height-advance |
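
The height arithmetic behind steps 1 and 10 can be sketched as follows. The commit message gives TARGET_HEIGHT = current + 100 and PANIC_BOUNDARY = TARGET_HEIGHT - 1. The margin used for POST_UPGRADE_HEIGHT is not stated in this PR, so `POST_UPGRADE_MARGIN` below is a placeholder assumption, and the Python names are illustrative only (the actual step is a bash script).

```python
from dataclasses import dataclass

# From the commit message: TARGET_HEIGHT = current height + 100.
UPGRADE_OFFSET = 100
# Assumption: the real margin is not given in the PR; 10 is a placeholder.
POST_UPGRADE_MARGIN = 10


@dataclass(frozen=True)
class UpgradeHeights:
    target_height: int        # height at which the upgrade plan activates
    post_upgrade_height: int  # height a node must pass to count as live after upgrading
    panic_boundary: int       # last height the pre-upgrade binary can reach


def compute_heights(current_height: int,
                    post_upgrade_margin: int = POST_UPGRADE_MARGIN) -> UpgradeHeights:
    """Derive the cross-step height variables from the current chain height."""
    if current_height < 0:
        raise ValueError("current_height must be non-negative")
    target = current_height + UPGRADE_OFFSET
    return UpgradeHeights(
        target_height=target,
        post_upgrade_height=target + post_upgrade_margin,
        # The old binary panics AT target, so it never reports target itself.
        panic_boundary=target - 1,
    )


heights = compute_heights(1000)
print(heights.target_height, heights.panic_boundary)
```

Waiting on `panic_boundary` rather than `target_height` in step 10 matters because a node on the pre-upgrade binary stops one block short of the target, so an AwaitCondition on TARGET_HEIGHT itself would never be satisfied.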

Scope cuts (cross-review with product-engineer)

Cut from the original 36-step draft, per the product-engineer's north-star walkthrough:

  • verify_running × 4 (pre-upgrade): vestigial. The bash framework needed these because it had no preceding signal; we have wait-for-proposal-to-pass, which is a strictly stronger proof of chain liveness.
  • verify_panic × 4: the semantics ("chain stopped at target-1 OR process exited") cannot be cleanly expressed without sidecar/runner extensions that we deliberately don't ship. The chain reaching PANIC_BOUNDARY and stopping is a sufficient regression-detection signal.
  • verify_upgrade_needed_log: would require pod log access (the pods/log verb). Liveness via height-advance covers the intent of the assertion.

Net: 36-step bash translation → 12-step Workflow that tests the same upgrade flow.

Known gap: cross-step variable bridge

Bug in the LLD I authored: Chaos Mesh's Task template launches each step in its own Pod. The task.volumes field is per-Pod-scoped, so an emptyDir declared in two steps does NOT share storage. The /workflow/vars/env.sh bridge the LLD specified doesn't work as designed.

The Workflow YAML retains /workflow/vars/env.sh references throughout for forward-compatibility with a real bridge implementation. The README's Status block is unambiguous: schema-validate only, not runnable end-to-end yet. Follow-up issue (filed separately) tracks a ConfigMap-based bridge as a focused PR 6.
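
One way the ConfigMap-based bridge could look, as a minimal sketch: the follow-up design is not part of this PR, so the ConfigMap name `workflow-vars`, the namespace, and both helper functions are hypothetical. The idea is that an early step stores the computed variables in a ConfigMap, which lives in the API server rather than in a per-Pod emptyDir, and a later step renders them back into the env.sh form the Workflow already references.

```python
import json
import re
import shlex

# Conservative shell variable name pattern for the keys we accept.
_VAR_NAME = re.compile(r"[A-Z_][A-Z0-9_]*")


def render_configmap(name: str, namespace: str, variables: dict) -> dict:
    """Build a ConfigMap manifest holding the cross-step variables.

    ConfigMap data values must be strings, so every value is stringified.
    """
    for key in variables:
        if not _VAR_NAME.fullmatch(key):
            raise ValueError(f"not a valid variable name: {key!r}")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {key: str(value) for key, value in variables.items()},
    }


def to_env_sh(configmap: dict) -> str:
    """Render ConfigMap data as the body of an env.sh file a later step can source."""
    lines = [
        f"export {key}={shlex.quote(value)}"
        for key, value in sorted(configmap["data"].items())
    ]
    return "\n".join(lines) + "\n"


# Hypothetical names; the real bridge may choose different ones.
manifest = render_configmap(
    "workflow-vars", "default",
    {"TARGET_HEIGHT": 1100, "PANIC_BOUNDARY": 1099},
)
print(json.dumps(manifest, indent=2))
print(to_env_sh(manifest), end="")
```

The manifest is emitted as JSON, which kubectl accepts alongside YAML. A step that writes it would need the configmaps RBAC mentioned in the follow-up list.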

Cross-review (local, pre-push)

| Item | Provider (platform-engineer) | Audit (product-engineer) | Status |
|------|------------------------------|--------------------------|--------|
| 12 surviving steps map to north-star | All steps wired to runner templates or bash | Confirmed via walkthrough | COMPATIBLE |
| Verifier cuts applied | 10 templates removed, 4 height-advance steps added | Recommended exact cuts | COMPATIBLE |
| wait-for-target-height-node-0 waits for PANIC_BOUNDARY not TARGET_HEIGHT | Fixed in this PR (1-line var addition + 1 --var override) | Real bug flagged in audit | RESOLVED |
| Cross-step bridge gap documented | Status block at top of README | Required honesty | COMPATIBLE |
| YAML schema valid | yaml.safe_load_all parses | Required | COMPATIBLE |

Zero MISMATCH after the bridge-gap call-out + the PANIC_BOUNDARY fix.

Test plan

  • yaml.safe_load_all parses cleanly
  • Manual walkthrough vs major_upgrade_test.yaml confirms each step maps
  • kubectl apply --dry-run=client against a cluster with chaos-mesh.org/v1alpha1 installed (reviewer verification, requires Chaos Mesh CRDs)
  • End-to-end run blocked on: cross-step variable bridge (follow-up PR), seitask-runner image published to a registry the cluster can pull from (follow-up — currently requires manual make runner-image && make runner-push)
  • Harbor cluster e2e dry-run once the above two land

Follow-up issues (filing separately)

  1. PR 6: ConfigMap-based cross-step variable bridge — replaces the broken emptyDir mechanism; runner gains configmaps RBAC.
  2. Runner image release pipeline: .github/workflows/ecr.yml builds only the controller today; it needs a parallel runner-image build/publish step.
  3. CI: assert config/crd/kustomization.yaml lists every config/crd/*.yaml, the gap that caused the prod outage (fix(crd): add seinodetasks to config/crd/kustomization.yaml #287). Run a quick check in the lint stage.
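
The lint check in item 3 could be as small as the sketch below. It is an illustration under stated assumptions, not the check that will ship: it scans kustomization.yaml line by line for `- <file>` entries instead of using a YAML parser, it assumes every other *.yaml file directly under the directory should be listed, and the function names are hypothetical.

```python
from pathlib import Path


def listed_resources(kustomization_text: str) -> set:
    """Collect file names from '- entry' list items (naive line scan, not a YAML parser)."""
    entries = set()
    for raw in kustomization_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("- "):
            entry = line[2:].strip().strip("'\"")
            entries.add(Path(entry).name)
    return entries


def missing_crd_entries(crd_dir) -> list:
    """Return the *.yaml files in crd_dir that kustomization.yaml does not list."""
    crd_dir = Path(crd_dir)
    listed = listed_resources((crd_dir / "kustomization.yaml").read_text())
    on_disk = {
        path.name
        for path in crd_dir.glob("*.yaml")
        if path.name != "kustomization.yaml"
    }
    return sorted(on_disk - listed)


if __name__ == "__main__":
    import sys

    crd_path = Path(sys.argv[1] if len(sys.argv) > 1 else "config/crd")
    if not (crd_path / "kustomization.yaml").exists():
        print(f"no kustomization.yaml under {crd_path}; nothing to check")
    else:
        missing = missing_crd_entries(crd_path)
        if missing:
            print("not listed in kustomization.yaml:", ", ".join(missing))
            sys.exit(1)
```

Run from the repository root, the script exits non-zero when a CRD file exists on disk but is absent from the kustomization, which is the condition described for #287.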

🤖 Generated with Claude Code

bdchatham added 2 commits May 19, 2026 11:48
Final slice of the SeiNodeTask MVP implementation. Adds the major-upgrade
Chaos Mesh Workflow that composes the 5 MVP SeiNodeTask kinds via the
seitask-runner image (PR 4) to express
sei-chain/integration_test/upgrade_module/major_upgrade_test.yaml
end-to-end as the MVP acceptance scenario.

Status: schema-validated, not yet runnable end-to-end. See "Status"
section of scenarios/README.md.

Files:
- scenarios/major-upgrade.yaml (599 LOC, 26 Chaos Mesh templates)
- scenarios/testnet-deployment.yaml (reference 4-validator
  SeiNodeDeployment the Workflow targets)
- scenarios/README.md (runbook + known-limitations)

The Workflow expresses:
1. compute-target-height (bash; queries node-0 RPC, computes
   TARGET_HEIGHT = current + 100, UPGRADE_HEIGHT, POST_UPGRADE_HEIGHT,
   PANIC_BOUNDARY = TARGET_HEIGHT - 1)
2. submit-upgrade-proposal (GovSoftwareUpgrade)
3. vote-yes-all-validators (Parallel of 4 GovVote)
4. wait-for-proposal-to-pass (bash polls gov queries)
5. early-upgrade-node-0 (UpdateNodeImage)
6. wait-for-target-height-nodes-1-2-3 (Parallel of 3 AwaitCondition height)
7. upgrade-nodes-1-2-3 (Parallel of 3 UpdateNodeImage)
8. downgrade-node-0 (UpdateNodeImage)
9. await-post-upgrade-progress-nodes-1-2-3 (Parallel, liveness via
   height-advance past POST_UPGRADE_HEIGHT)
10. wait-for-target-height-node-0 (AwaitCondition; waits for
    PANIC_BOUNDARY = TARGET_HEIGHT-1 since node-0 panics AT TARGET_HEIGHT
    on the pre-upgrade binary)
11. upgrade-node-0 (UpdateNodeImage)
12. await-post-upgrade-progress-node-0 (AwaitCondition height-advance)

Scope discipline applied during cross-review:
- CUT: verify_running pre-upgrade × 4 (vestigial from bash framework;
  passed-proposal already proves chain liveness)
- CUT: verify_panic × 4 (semantics need AwaitCondition extensions we
  defer; the chain reaching TARGET_HEIGHT-1 and stopping is sufficient
  evidence)
- CUT: verify_upgrade_needed_log (would require pods/log access; the
  liveness check covers regression detection without it)
- KEPT: post-upgrade height-advance checks (real liveness signal)

Known gap (documented in README's Status block):
Chaos Mesh's Task template launches each step in its own Pod, so the
emptyDir-based /workflow/vars/env.sh bridge described in the LLD does
not span steps. The cross-step variables (TARGET_HEIGHT, UPGRADE_HEIGHT,
POST_UPGRADE_HEIGHT, PANIC_BOUNDARY, PROPOSAL_ID) computed by early
steps cannot be read by subsequent steps until a real bridge is in
place. ConfigMap-based bridge will land in a follow-up issue. The YAML
retains /workflow/vars references throughout for forward-compatibility
with that bridge.

Validation: yaml.safe_load_all parses cleanly; full kubectl --dry-run
schema validation requires the chaos-mesh.org/v1alpha1 CRD on the
target cluster.

Cross-review (local, pre-push):
- platform-engineer (primary author): Chaos Mesh syntax, runbook DX,
  cross-step bridge investigation (discovered the emptyDir gap).
- product-engineer (scope-cutter / north-star walkthrough): line-by-line
  mapping from major_upgrade_test.yaml, identified the verifier cuts,
  validated the 5 MVP kinds cover the scenario.

@bdchatham bdchatham merged commit f349d5f into main May 19, 2026
2 checks passed
@bdchatham bdchatham deleted the feat/seinodetask-pr5-scenario branch May 19, 2026 19:13