Stabilize FlatKV EVM migration stop point #3592
The FlatKV EVM migration test could flake while switching validators from memiavl_only to migrate_evm. The script only checked that validators had the same committed height before flipping modes, but a validator can already have signed a vote for the next height without committing it. After the mode flip, consensus WAL replay could ask that validator to sign different data for the same height, triggering Tendermint's double-sign guard with "conflicting data" and leaving the cluster stuck.

Fix the migration choreography by stopping validators, configuring a temporary future halt-height, restarting them in memiavl_only, and waiting for all nodes to halt themselves at the same durable commit boundary. The script also records priv_validator_state heights and refuses to flip if last-sign state advanced past the committed halt height. Reset halt-height when switching to migrate_evm so the migrated restart can make progress.
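The last-sign guard described above can be sketched as a small shell function. This is an illustration only, not the code from this PR: the name `last_sign_is_safe` and its argument order are hypothetical, and it assumes the caller has already read each validator's last-signed height (for example from the `height` field of its priv_validator_state file).

```shell
# Hypothetical helper, not taken from the PR's script.
# Usage: last_sign_is_safe HALT_HEIGHT SIGN_HEIGHT...
# Returns 0 only if every last-sign height is <= HALT_HEIGHT.
last_sign_is_safe() {
  local halt_height=$1
  shift
  # With no heights to check there is nothing to vouch for: refuse.
  [ "$#" -gt 0 ] || return 2
  local h
  for h in "$@"; do
    # An empty or non-numeric value means the state could not be read.
    case "$h" in
      ''|*[!0-9]*)
        echo "unreadable last-sign height: '$h'" >&2
        return 2
        ;;
    esac
    if [ "$h" -gt "$halt_height" ]; then
      echo "last-sign height $h is past halt height $halt_height" >&2
      return 1
    fi
  done
  return 0
}
```

A caller would flip modes only when this returns 0, for example `last_sign_is_safe "$HALT_HEIGHT" "$h0" "$h1" "$h2" "$h3" || exit 1`, where the `h*` variables are placeholders for the recorded heights.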
PR Summary
Low Risk
Overview: The script stops all validators, sets a temporary [...] New helpers factor out coordinated barrier-synchronized starts, parallel TERM stops, and halt-height writes with verification on node 0. The old docker pause / multi-attempt freeze path and [...]
Reviewed by Cursor Bugbot for commit 26ddc6a.
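A write-then-verify step for halt-height might look roughly like the sketch below. It is not the helper from this PR: the function name `set_halt_height` is hypothetical, and it assumes the setting lives on a single top-level `halt-height = N` line in the node's app config file, as in a standard Cosmos SDK app.toml.

```shell
# Hypothetical helper; the config path and key layout are assumptions.
# Usage: set_halt_height APP_TOML HEIGHT
# Rewrites the halt-height line, then reads it back to confirm the write.
set_halt_height() {
  local app_toml=$1
  local height=$2
  local tmp
  tmp=$(mktemp) || return 1
  # Replace only the `halt-height = ...` line; every other line is copied as-is.
  if ! sed -E "s/^halt-height[[:space:]]*=.*/halt-height = ${height}/" \
      "$app_toml" > "$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  # Write through the existing file so its ownership and mode are preserved.
  cat "$tmp" > "$app_toml"
  rm -f "$tmp"
  # Verification: the value read back must equal the value requested.
  local actual
  actual=$(sed -nE 's/^halt-height[[:space:]]*=[[:space:]]*([0-9]+).*/\1/p' "$app_toml")
  [ "$actual" = "$height" ]
}
```

Calling it with `0` restores the default of no halt, which matches the reset that the description says is needed before restarting in migrate_evm. If the key is missing from the file, the read-back fails and the function returns non-zero rather than silently doing nothing.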
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dc65617ab5
if [ -n "$PRE_FLIP_HEIGHT" ] && [ "$PRE_FLIP_HEIGHT" -gt "$halt_start_height" ]; then
  halt_start_height=$PRE_FLIP_HEIGHT
fi
HALT_HEIGHT=$((halt_start_height + PRE_FLIP_HALT_BLOCKS))
Wait for all validators before starting halt countdown
If one validator restarts more slowly than the other three, this fixed 10-block window can be consumed before the lagging node is up and caught up. In the local 4-validator devnet, any three equal-power validators can form a quorum, and the config allows very fast blocks (unsafe-commit-timeout-override = "50ms" in docker/localnode/config/config.toml:358). Those three validators can then halt at HALT_HEIGHT, leaving the fourth without live peers from which to commit the halt block, so wait_for_all_seid_stop times out instead of stabilizing the migration. Choose the halt target only after all validators are observed running and caught up, or make the window large enough for the slowest restart path.
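One way to implement the first suggestion is to compute the halt target only once every validator has reported a committed height. The sketch below is an assumption-laden illustration, not code from the PR: `pick_halt_height` is a hypothetical name, and the caller is assumed to pass an empty string for any validator that is not answering yet. It checks only that every node reports a height; a stricter version could also bound the gap between the slowest and fastest node.

```shell
# Hypothetical helper. Usage: pick_halt_height WINDOW HEIGHT...
# Each HEIGHT is one validator's latest committed height, or "" if it is
# not reachable yet. Prints max(HEIGHT...) + WINDOW once all have reported.
pick_halt_height() {
  local window=$1
  shift
  [ "$#" -gt 0 ] || return 2
  local max=0
  local h
  for h in "$@"; do
    # A missing or non-numeric height means a validator is not up yet,
    # so no target is chosen and the caller is expected to retry.
    case "$h" in
      ''|*[!0-9]*) return 1 ;;
    esac
    if [ "$h" -gt "$max" ]; then
      max=$h
    fi
  done
  echo $((max + window))
}
```

The caller would poll until the function succeeds and only then write the result as the halt-height, so the countdown starts after the slowest restart rather than before it.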
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3592      +/-   ##
==========================================
- Coverage   59.22%   58.36%   -0.87%
==========================================
  Files        2213     2139      -74
  Lines      183115   174568    -8547
==========================================
- Hits       108453   101878    -6575
+ Misses      64914    63642    -1272
+ Partials     9748     9048     -700
Flaked for unrelated changes.