Stabilize FlatKV crash recovery CI test #3457
The `verify_flatkv_crash_recovery.sh` script was treating survivor progress while sei-node-3 remained down as a hard FlatKV recovery invariant. The CI flake showed pre_kill=360 during_kill=360: the SIGKILL landed after the victim had participated in the next Tendermint height, and the devnet did not commit another block until the validator was restarted. That temporary consensus stall is orthogonal to FlatKV crash recovery.

Make the no-progress-while-down case diagnostic instead of fatal, then restart the victim and require every validator to advance past pre_kill + COMPARE_BUFFER before selecting the FlatKV comparison height. This keeps the storage check just as strong: the digest comparison is still over a height committed after the SIGKILL/restart cycle by all validators, so corruption or divergence introduced by FlatKV crash recovery will still fail the test.

Flaked in unrelated changes here:
* https://github.com/sei-protocol/sei-chain/actions/runs/26035091204/job/76530891935?pr=3456
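The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from `verify_flatkv_crash_recovery.sh`: the function names `report_progress_while_down`, `all_past_target` and `min_height` are hypothetical, the default `COMPARE_BUFFER` value is made up, and choosing the lowest validator height as the comparison height is an assumption (the PR only says the height must be committed by all validators after the restart).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical default; the real script's COMPARE_BUFFER value is not shown in this PR.
COMPARE_BUFFER="${COMPARE_BUFFER:-5}"

# Diagnostic only: report whether survivors advanced while the victim was down.
# Always returns 0, so a temporary consensus stall never fails the test.
report_progress_while_down() {
  local pre_kill="$1" during_kill="$2"
  if (( during_kill > pre_kill )); then
    echo "INFO: survivors advanced while victim was down (${pre_kill} -> ${during_kill})"
  else
    echo "WARN: no progress while victim was down (pre_kill=${pre_kill} during_kill=${during_kill}); continuing"
  fi
}

# Hard requirement after restart: every validator height given as an argument
# must be strictly greater than pre_kill + COMPARE_BUFFER.
all_past_target() {
  local pre_kill="$1"; shift
  local target=$(( pre_kill + COMPARE_BUFFER ))
  local h
  for h in "$@"; do
    if (( h <= target )); then
      return 1
    fi
  done
  return 0
}

# Assumption: compare digests at the lowest current height, which every
# validator has already committed.
min_height() {
  local min="$1"; shift
  local h
  for h in "$@"; do
    if (( h < min )); then
      min="$h"
    fi
  done
  echo "$min"
}

# Example with the values from the CI flake (heights after restart are invented).
report_progress_while_down 360 360
if all_past_target 360 366 367 366 368; then
  echo "comparing FlatKV digests at height $(min_height 366 367 366 368)"
fi
```

In a real run the heights would come from polling each validator rather than from literals, and `all_past_target` would sit inside a retry loop with a timeout.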
PR Summary: Low Risk

Overview: After restarting the killed node and confirming catch-up, the test now waits for all validators to advance past

Reviewed by Cursor Bugbot for commit 2a14b78.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #3457 +/- ##
==========================================
+ Coverage 59.24% 59.27% +0.02%
==========================================
Files 2126 2126
Lines 175715 175699 -16
==========================================
+ Hits 104097 104137 +40
+ Misses 62538 62480 -58
- Partials 9080 9082 +2