feat(scenarios): major-upgrade Chaos Workflow + runbook (PR 5/5) #288
Merged
Final slice of the SeiNodeTask MVP implementation. Adds the major-upgrade
Chaos Mesh Workflow that composes the 5 MVP SeiNodeTask kinds via the
seitask-runner image (PR 4) to express
sei-chain/integration_test/upgrade_module/major_upgrade_test.yaml
end-to-end as the MVP acceptance scenario.
Status: schema-validated, not yet runnable end-to-end. See "Status"
section of scenarios/README.md.
Files:
- scenarios/major-upgrade.yaml (599 LOC, 26 Chaos Mesh templates)
- scenarios/testnet-deployment.yaml (reference 4-validator
SeiNodeDeployment the Workflow targets)
- scenarios/README.md (runbook + known-limitations)
The Workflow expresses:
1. compute-target-height (bash; queries node-0 RPC, computes
TARGET_HEIGHT = current + 100, UPGRADE_HEIGHT, POST_UPGRADE_HEIGHT,
PANIC_BOUNDARY = TARGET_HEIGHT - 1)
2. submit-upgrade-proposal (GovSoftwareUpgrade)
3. vote-yes-all-validators (Parallel of 4 GovVote)
4. wait-for-proposal-to-pass (bash polls gov queries)
5. early-upgrade-node-0 (UpdateNodeImage)
6. wait-for-target-height-nodes-1-2-3 (Parallel of 3 AwaitCondition height)
7. upgrade-nodes-1-2-3 (Parallel of 3 UpdateNodeImage)
8. downgrade-node-0 (UpdateNodeImage)
9. await-post-upgrade-progress-nodes-1-2-3 (Parallel, liveness via
height-advance past POST_UPGRADE_HEIGHT)
10. wait-for-target-height-node-0 (AwaitCondition; waits for
PANIC_BOUNDARY = TARGET_HEIGHT-1 since node-0 panics AT TARGET_HEIGHT
on the pre-upgrade binary)
11. upgrade-node-0 (UpdateNodeImage)
12. await-post-upgrade-progress-node-0 (AwaitCondition height-advance)
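The height arithmetic in step 1 can be sketched as below. This is a minimal illustration, not the bash in scenarios/major-upgrade.yaml: the function and class names are hypothetical, and only the two relations stated above (TARGET_HEIGHT = current + 100, PANIC_BOUNDARY = TARGET_HEIGHT - 1) are modeled. UPGRADE_HEIGHT and POST_UPGRADE_HEIGHT are left out because their offsets are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeHeights:
    """Heights derived from the chain height observed on node-0 (hypothetical type)."""
    target_height: int
    panic_boundary: int


def compute_heights(current_height: int, lead_blocks: int = 100) -> UpgradeHeights:
    """Derive TARGET_HEIGHT and PANIC_BOUNDARY from the current height.

    lead_blocks defaults to 100, matching "current + 100" in step 1.
    """
    if current_height < 0:
        raise ValueError("current_height must be non-negative")
    target = current_height + lead_blocks
    # node-0 panics AT TARGET_HEIGHT on the pre-upgrade binary, so the last
    # height it can reach is one block earlier.
    return UpgradeHeights(target_height=target, panic_boundary=target - 1)


if __name__ == "__main__":
    heights = compute_heights(current_height=1000)
    print(heights.target_height, heights.panic_boundary)
```

Keeping PANIC_BOUNDARY as a derived value rather than a second RPC query means steps 1 and 10 cannot disagree about where node-0 is expected to stop.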
Scope discipline applied during cross-review:
- CUT: verify_running pre-upgrade × 4 (vestigial from bash framework;
passed-proposal already proves chain liveness)
- CUT: verify_panic × 4 (semantics need AwaitCondition extensions we
defer; the chain reaching TARGET_HEIGHT-1 and stopping is sufficient
evidence)
- CUT: verify_upgrade_needed_log (would require pods/log access; the
liveness check covers regression detection without it)
- KEPT: post-upgrade height-advance checks (real liveness signal)
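The retained liveness signal, height advancing past a threshold, can be sketched as a polling loop. This is an illustration of the idea only and does not reflect how AwaitCondition is implemented in the controller: the function name, the timeout and interval defaults, and the injected get_height callable are all assumptions.

```python
import time
from typing import Callable


def await_height_past(
    get_height: Callable[[], int],
    threshold: int,
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once get_height() exceeds threshold, False on timeout.

    get_height is injected so the caller decides how the height is read
    (for example, a node RPC query); it is hypothetical here.
    """
    deadline = clock() + timeout_s
    while True:
        if get_height() > threshold:
            return True  # chain advanced past the threshold: node is live
        if clock() >= deadline:
            return False  # chain stalled or node never caught up
        sleep(poll_interval_s)


if __name__ == "__main__":
    observed = iter([1198, 1199, 1200, 1201])
    ok = await_height_past(lambda: next(observed), threshold=1200, sleep=lambda _: None)
    print("live" if ok else "stalled")
```

A strict "greater than" comparison is used because the check is for progress past POST_UPGRADE_HEIGHT, not merely reaching it.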
Known gap (documented in README's Status block):
Chaos Mesh's Task template launches each step in its own Pod, so the
emptyDir-based /workflow/vars/env.sh bridge described in the LLD does
not span steps. The cross-step variables (TARGET_HEIGHT, UPGRADE_HEIGHT,
POST_UPGRADE_HEIGHT, PANIC_BOUNDARY, PROPOSAL_ID) computed by early
steps cannot be read by subsequent steps until a real bridge is in
place. A ConfigMap-based bridge is tracked in a follow-up issue. The YAML
retains /workflow/vars references throughout for forward-compatibility
with that bridge.
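One possible shape for the ConfigMap-based bridge is sketched below, purely as an assumption about the follow-up design, which is not specified in this PR. It renders the cross-step variables into an env.sh payload and wraps it in a plain v1 ConfigMap manifest. The ConfigMap name "workflow-vars" and both function names are hypothetical.

```python
import re
import shlex
from typing import Mapping

_VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def render_env_sh(variables: Mapping[str, object]) -> str:
    """Render variables as shell 'export NAME=value' lines, sorted by name."""
    lines = []
    for name in sorted(variables):
        if not _VALID_NAME.fullmatch(name):
            raise ValueError(f"invalid shell variable name: {name!r}")
        # shlex.quote keeps the file safe to source even for odd values.
        lines.append(f"export {name}={shlex.quote(str(variables[name]))}")
    return "\n".join(lines) + "\n"


def configmap_manifest(name: str, namespace: str, variables: Mapping[str, object]) -> dict:
    """Build a v1 ConfigMap manifest carrying env.sh as a single data key."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {"env.sh": render_env_sh(variables)},
    }


if __name__ == "__main__":
    manifest = configmap_manifest(
        "workflow-vars",  # hypothetical name
        "default",
        {"TARGET_HEIGHT": 1100, "PANIC_BOUNDARY": 1099, "PROPOSAL_ID": 7},
    )
    print(manifest["data"]["env.sh"], end="")
```

Unlike a per-Pod emptyDir, a ConfigMap lives in the API server, so a later step's Pod could read what an earlier step wrote. How steps would write and mount it is left to the follow-up.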
Validation: yaml.safe_load_all parses cleanly; full kubectl apply
--dry-run=client schema validation requires the chaos-mesh.org/v1alpha1
CRDs to be installed on the target cluster.
Cross-review (local, pre-push):
- platform-engineer (primary author): Chaos Mesh syntax, runbook DX,
cross-step bridge investigation (discovered the emptyDir gap).
- product-engineer (scope-cutter / north-star walkthrough): line-by-line
mapping from major_upgrade_test.yaml, identified the verifier cuts,
validated the 5 MVP kinds cover the scenario.
Final slice of the SeiNodeTask MVP implementation. Adds the major-upgrade Chaos Mesh Workflow that composes the 5 MVP SeiNodeTask kinds via the seitask-runner image (PR 4) to express sei-chain/integration_test/upgrade_module/major_upgrade_test.yaml end-to-end. This is the MVP acceptance scenario.
Workstream
Council workstream — System tier — for the SeiNodeTask MVP (design merged at #277):
What ships
- scenarios/major-upgrade.yaml: 599-line, 26-template Chaos Mesh Workflow
- scenarios/testnet-deployment.yaml: reference 4-validator SeiNodeDeployment the Workflow can target
- scenarios/README.md: runbook + known-limitations
The 12-step flow
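1. compute-target-height: computes TARGET_HEIGHT, POST_UPGRADE_HEIGHT, PANIC_BOUNDARY
2. submit-upgrade-proposal: GovSoftwareUpgrade SeiNodeTask
3. vote-yes-all-validators: GovVote SeiNodeTask
4. wait-for-proposal-to-pass
5. early-upgrade-node-0: UpdateNodeImage (expects CrashLoop)
6. wait-for-target-height-nodes-1-2-3: AwaitCondition height
7. upgrade-nodes-1-2-3: UpdateNodeImage
8. downgrade-node-0: UpdateNodeImage to v1
9. await-post-upgrade-progress-nodes-1-2-3: AwaitCondition height-advance past POST_UPGRADE_HEIGHT
10. wait-for-target-height-node-0: AwaitCondition for PANIC_BOUNDARY (= TARGET_HEIGHT-1; node-0 panics AT TARGET_HEIGHT)
11. upgrade-node-0: UpdateNodeImage to v2
12. await-post-upgrade-progress-node-0: AwaitCondition height-advance
Scope cuts (cross-review with product-engineer)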
Cut from the original 36-step draft, per the product-engineer's north-star walkthrough:
- verify_running × 4 (pre-upgrade): vestigial. The bash framework needed these because it had no preceding signal; we have wait-for-proposal-to-pass, which is a strictly stronger proof of chain liveness.
- verify_panic × 4: semantics ("chain stopped at target-1 OR process exited") cannot be cleanly expressed without sidecar/runner extensions we deliberately don't ship. The chain reaching PANIC_BOUNDARY and stopping is sufficient regression-detection signal.
- verify_upgrade_needed_log: would require pod log access (pods/log verb). Liveness via height-advance covers the assertion intent.
Net: 36-step bash translation → 12-step Workflow that tests the same upgrade flow.
Known gap: cross-step variable bridge
Bug in the LLD I authored: Chaos Mesh's Task template launches each step in its own Pod. The task.volumes field is per-Pod-scoped, so an emptyDir declared in two steps does NOT share storage. The /workflow/vars/env.sh bridge the LLD specified doesn't work as designed.
The Workflow YAML retains /workflow/vars/env.sh references throughout for forward-compatibility with a real bridge implementation. The README's Status block is unambiguous: schema-validate only, not runnable end-to-end yet. Follow-up issue (filed separately) tracks a ConfigMap-based bridge as a focused PR 6.
Cross-review (local, pre-push)
- wait-for-target-height-node-0 waits for PANIC_BOUNDARY, not TARGET_HEIGHT
- yaml.safe_load_all parses
Zero MISMATCH after the bridge-gap call-out + the PANIC_BOUNDARY fix.
Test plan
- yaml.safe_load_all parses cleanly
- major_upgrade_test.yaml confirms each step maps
- kubectl apply --dry-run=client against a cluster with chaos-mesh.org/v1alpha1 installed (reviewer verification, requires Chaos Mesh CRDs)
- seitask-runner image published to a registry the cluster can pull from (follow-up; currently requires manual make runner-image && make runner-push)
Follow-up issues (filing separately)
- .github/workflows/ecr.yml builds only the controller today; needs a parallel runner-image build/publish step.
🤖 Generated with Claude Code