
fix(scripts): preserve node_key.json across state sync (PLT-415)#3527

Merged
monty-sei merged 1 commit into main from monty/plt-415-preserve-node-key-on-state-sync
Jun 17, 2026

Conversation

@monty-sei

Contributor

Summary

  • scripts/state_sync.sh backed up priv_validator_key.json before wiping ~/.sei/config, but not node_key.json — which lives in the same directory. The whole config/ dir is moved to ~/.sei_backup and only the validator key is copied back, so node_key.json never returns.
  • On the next start, LoadOrGenNodeKey finds no key and generates a fresh Ed25519 keypair, giving the node a brand-new NodeID (hex(sha256(pubkey)[:20])). Any peer that has the old NodeID@host:port in persistent_peers then silently fails to connect, because Tendermint rejects the handshake when the advertised NodeID doesn't match.
  • This bites hardest during upgrades / incident recovery — exactly when operators reach for state sync — and churns the NodeID on every run for teams using state sync for disk management. State sync has no dependency on a fresh node key, so there's no reason to regenerate it.
  • Fix: back up and restore node_key.json alongside priv_validator_key.json, preserving node identity across a sync.
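
The symmetric backup and restore described in the last bullet can be sketched roughly as follows. This is an illustration, not the literal diff in this PR: the helper names `backup_keys` and `restore_keys` and the `SEI_HOME` and `KEY_BACKUP` variables are hypothetical, and the default paths are assumed from the ones named in this description.

```shell
# Sketch only: backup_keys, restore_keys, SEI_HOME and KEY_BACKUP are
# hypothetical names, not taken from scripts/state_sync.sh.
SEI_HOME="${SEI_HOME:-$HOME/.sei}"
KEY_BACKUP="${KEY_BACKUP:-$HOME/key_backup}"

# Both files live in the config dir, so both must survive the wipe.
KEY_FILES="priv_validator_key.json node_key.json"

# Copy every key file out of config/ before the directory is moved away.
backup_keys() {
  mkdir -p "$KEY_BACKUP" || return 1
  for f in $KEY_FILES; do
    cp "$SEI_HOME/config/$f" "$KEY_BACKUP/$f" || return 1
  done
}

# Copy every key file back once the fresh config/ directory exists.
restore_keys() {
  mkdir -p "$SEI_HOME/config" || return 1
  for f in $KEY_FILES; do
    cp "$KEY_BACKUP/$f" "$SEI_HOME/config/$f" || return 1
  done
}
```

Driving both directions from one file list is a design choice in this sketch: a key added to the backup step cannot be forgotten in the restore step, which is the asymmetry that caused the bug.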

Test plan

  • bash -n scripts/state_sync.sh — syntax clean
  • Verified backup/restore symmetry: node_key.json is now copied to $HOME/key_backup before the config wipe and copied back to $HOME/.sei/config/ afterward, mirroring the existing priv_validator_key.json handling
  • Operator validation on a non-prod node: run state sync on an existing node and confirm the NodeID (seid tendermint show-node-id) is unchanged afterward
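
The operator validation step could be scripted along these lines. `node_id_unchanged` is a hypothetical helper written for this sketch; the only command assumed to exist is `seid tendermint show-node-id`, as named above.

```shell
# Sketch only: node_id_unchanged is a hypothetical helper. It compares a
# NodeID recorded before the sync with the one the node reports now.
node_id_unchanged() {
  before="$1"
  after="$(seid tendermint show-node-id)" || return 2
  if [ "$before" = "$after" ]; then
    echo "NodeID unchanged: $after"
  else
    echo "NodeID changed: $before -> $after" >&2
    return 1
  fi
}

# Usage on a non-prod node (commented out so the sketch has no side effects):
#   before="$(seid tendermint show-node-id)"
#   ./scripts/state_sync.sh
#   node_id_unchanged "$before"
```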

Notes

  • The copy of this script deployed on some EC2 instances is reportedly out of date with the version in sei-infra. This PR fixes the in-repo script; the sei-infra copy should be patched and redeployed separately so existing nodes actually pick up the fix.

state_sync.sh backed up and restored priv_validator_key.json but not
node_key.json, even though both live in ~/.sei/config. Moving config to
the backup dir and restoring only the validator key left node_key.json
behind, so LoadOrGenNodeKey generated a fresh key on next start and the
node's NodeID changed. Any peer with the old NodeID@host:port in
persistent_peers then silently failed to connect.

Back up and restore node_key.json alongside the validator key so node
identity is preserved across a state sync.
@cursor

cursor Bot commented Jun 1, 2026


PR Summary

Low Risk
Operator-only shell script change with no application or chain logic impact; reduces operational risk from accidental NodeID rotation.

Overview
scripts/state_sync.sh now backs up and restores node_key.json the same way it already does for priv_validator_key.json, so wiping and rebuilding ~/.sei during state sync no longer drops the Tendermint node key.

Without that file, the node would mint a new NodeID on restart and persistent_peers entries keyed to the old ID would fail to connect—especially painful during upgrades or recovery when operators rely on this script.

The change is symmetric copy-in / copy-out around the existing key backup flow; state sync behavior itself is unchanged.

Reviewed by Cursor Bugbot for commit 44712ef.

@github-actions

github-actions Bot commented Jun 1, 2026


The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build      Format     Lint       Breaking   Updated (UTC)
✅ passed  ✅ passed  ✅ passed  ✅ passed  Jun 1, 2026, 12:35 AM

@codecov

codecov Bot commented Jun 1, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.30%. Comparing base (3d3df7e) to head (44712ef).

@@            Coverage Diff             @@
##             main    #3527      +/-   ##
==========================================
- Coverage   59.15%   58.30%   -0.85%     
==========================================
  Files        2213     2140      -73     
  Lines      182710   174341    -8369     
==========================================
- Hits       108077   101650    -6427     
+ Misses      64930    63667    -1263     
+ Partials     9703     9024     -679     

Flag             Coverage Δ
sei-db           70.41% <ø> (-0.22%) ⬇️
sei-db-state-db  ?

Flags with carried forward coverage won't be shown.
see 74 files with indirect coverage changes

@monty-sei monty-sei requested review from masih and pompon0 June 1, 2026 01:20
@monty-sei

Contributor Author

Hey hey! Followed up on masih's note that the script usually run is the one in sei-infra, so I went and checked that one too — good news is it doesn't have this bug, so we don't need a parallel fix over there!

The short version is the two scripts have drifted apart. The sei-infra one (common/scripts/state_sync.sh) stopped calling seid init back in sei-protocol/sei-infra#989 and switched to a much simpler data-only wipe — it only does rm -rf /root/.sei/data/* and rm -rf /root/.sei/wasm, so it never touches /root/.sei/config/ where node_key.json lives, and the key gets preserved in place. The giveaway is that it actually reads node_key.json after the wipe to filter itself out of the peer list, which only works because the key survives — so the NodeID stays stable across a sync.
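
For illustration, the data-only wipe and the self-filtering step described above might look roughly like this. The sei-infra script is not reproduced in this thread, so `wipe_data_only` and `filter_self` are hypothetical helpers, and passing the node's own ID in as an argument is an assumption about how that script uses node_key.json.

```shell
# Sketch only: wipe_data_only and filter_self are hypothetical helpers,
# not copied from sei-infra's common/scripts/state_sync.sh.

# Remove chain data and wasm state but leave config/ (and node_key.json) alone.
wipe_data_only() {
  sei_home="${1:-/root/.sei}"
  rm -rf "${sei_home:?}/data/"* "${sei_home:?}/wasm"
}

# Drop this node's own NodeID@host:port entry from a comma-separated peer list.
filter_self() {
  self_id="$1"
  peers="$2"
  printf '%s\n' "$peers" | tr ',' '\n' | grep -v "^${self_id}@" | paste -sd, -
}
```

Because `wipe_data_only` never touches `config/`, the ID passed to `filter_self` after the wipe is the same one the node had before it.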

I also checked the automated recycle path (state-syncer/deploy/scripts/recycle_node.sh) just to be safe, and that one's fine too — it deletes genesis + priv_validator_key but leaves node_key alone, then seid init just loads the existing key via LoadOrGenNodeKey rather than regenerating it.

So the node key loss is really specific to this in-repo scripts/state_sync.sh (the one that wipes the whole config dir), which is exactly what this PR fixes. On the stale-EC2-copies note from the description — sei-infra ships these scripts via the release zip that's pulled at ec2_init time, so a node ends up with whatever release was current when it was last provisioned, but since no sei-infra version (current or pre-#989) ever deleted node_key.json, those copies preserve the NodeID regardless.

Let me know if I've missed anything or you'd like me to dig in further!

@monty-sei monty-sei added this pull request to the merge queue Jun 17, 2026
Merged via the queue into main with commit f79b5fd Jun 17, 2026
56 of 57 checks passed
@monty-sei monty-sei deleted the monty/plt-415-preserve-node-key-on-state-sync branch June 17, 2026 01:27

4 participants