
feat: validator signing-key configuration on SeiNode#136

Merged
bdchatham merged 3 commits into main from feat/validator-signing-key
Apr 28, 2026

Conversation

@bdchatham bdchatham commented Apr 28, 2026

Summary

Migrate Sei validators from EC2 to Kubernetes without double-signing. This PR adds spec.validator.signingKey.secret.secretName to the SeiNode CRD so an operator can deploy a K8s validator that takes over signing from a stopped EC2 instance using its existing consensus identity.

Implements docs/design-seinode-validator-signing-key-lld.md (PR #135). Direction doc: .tide/validator-migration.md.

Scope

v1 supports single-shot deployment: deploy a SeiNode with validator.signingKey.secret.secretName set from creation. The bootstrap Job runs without the Secret (a load-bearing safety property: bootstrap pods are physically incapable of signing). The production StatefulSet starts with the Secret mounted; seid then catches up to the chain tip and starts signing.

Mid-life patching is NOT supported — adding signingKey to a Running validator is a no-op today because buildRunningPlan only detects image drift. Documented in LLD §11 with the implementation sketch for the deferred zero-downtime variant.

What's in this PR

CRD additions (api/v1alpha1/)

  • SigningKeySource discriminated union with one v1 variant Secret *SecretSigningKeySource. Future TMKMS / RemoteSigner / Vault variants slot in additively under the same exactly-one CEL rule.
  • secretName immutable (self == oldSelf).
  • ConditionSigningKeyReady + Validated / NotReady / Invalid reason taxonomy (mirrors ImportPVCReady).
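
A minimal sketch of how the union above might look in Go. The field names follow the spec path in the summary; the exact marker text and the validateExactlyOne helper are assumptions for illustration, not the code merged in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// SecretSigningKeySource names a Secret holding priv_validator_key.json.
type SecretSigningKeySource struct {
	// Assumed marker spelling for the immutability rule (self == oldSelf).
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="secretName is immutable"
	SecretName string `json:"secretName"`
}

// SigningKeySource is a discriminated union: exactly one variant may be set.
// v1 has a single variant; later variants would add sibling pointer fields.
type SigningKeySource struct {
	Secret *SecretSigningKeySource `json:"secret,omitempty"`
}

// validateExactlyOne is a hypothetical Go-side mirror of the CEL exactly-one rule.
func validateExactlyOne(s SigningKeySource) error {
	set := 0
	if s.Secret != nil {
		set++
	}
	if set != 1 {
		return errors.New("exactly one signing-key source must be set")
	}
	return nil
}

func main() {
	fmt.Println(validateExactlyOne(SigningKeySource{}))
	fmt.Println(validateExactlyOne(SigningKeySource{
		Secret: &SecretSigningKeySource{SecretName: "validator-key"},
	}))
}
```

Counting non-nil variants keeps the check unchanged in shape when a second variant is added: only one more `if` is needed.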

Pod-spec mutation (internal/noderesource/, production StatefulSet only)

  • One Secret-backed volume scoped to priv_validator_key.json (defaultMode 0400, items-filtered).
  • subPath mount on the seid main container at /sei/config/priv_validator_key.json. subPath is the safety property — kubelet does not auto-refresh subPath mounts, so a kubectl edit secret cannot hot-swap the consensus key under a running seid.
  • Sidecar container has no signing mount.
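
The volume and mount described above can be sketched as follows. To stay self-contained, this uses local stand-in structs that mirror only the relevant fields of corev1.Volume and corev1.VolumeMount; the volume name and helper names are hypothetical.

```go
package main

import "fmt"

// Stand-ins for the corev1 types (assumption: real code would use k8s.io/api/core/v1).
type KeyToPath struct{ Key, Path string }

type SecretVolumeSource struct {
	SecretName  string
	Items       []KeyToPath
	DefaultMode *int32
}

type Volume struct {
	Name   string
	Secret *SecretVolumeSource
}

type VolumeMount struct{ Name, MountPath, SubPath string }

const (
	signingKeyVolumeName    = "signing-key" // hypothetical volume name
	privValidatorKeyDataKey = "priv_validator_key.json"
	signingKeyMountPath     = "/sei/config/priv_validator_key.json"
)

// signingKeyVolume projects only the key file from the Secret, mode 0400.
func signingKeyVolume(secretName string) Volume {
	mode := int32(0o400)
	return Volume{
		Name: signingKeyVolumeName,
		Secret: &SecretVolumeSource{
			SecretName:  secretName,
			Items:       []KeyToPath{{Key: privValidatorKeyDataKey, Path: privValidatorKeyDataKey}},
			DefaultMode: &mode,
		},
	}
}

// signingKeyMount uses subPath, so the file is not refreshed when the Secret changes.
func signingKeyMount() VolumeMount {
	return VolumeMount{
		Name:      signingKeyVolumeName,
		MountPath: signingKeyMountPath,
		SubPath:   privValidatorKeyDataKey,
	}
}

func main() {
	vol := signingKeyVolume("validator-key")
	fmt.Printf("%s -> %s (mode %o)\n", vol.Secret.SecretName, signingKeyMount().MountPath, *vol.Secret.DefaultMode)
}
```

Only the seid container would receive the mount; the sidecar's mount list stays untouched.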

Bootstrap-Job safety invariant (internal/task/bootstrap_resources.go)

  • assertNoSigningKeyOnBootstrapPod runtime guard fails Job generation if the bootstrap pod-spec ever references the validator's signing-key Secret. Belt-and-suspenders against a future refactor that couples the volume helpers; the existing test TestTaskGenerateBootstrapJob_NeverHasSigningKeyVolume checks the same invariant at test time.
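
A rough sketch of what such a guard could look like. The signature and the simplified Volume type are assumptions; this version inspects volumes only, whereas the merged guard may check more of the pod-spec.

```go
package main

import "fmt"

// Volume is a simplified stand-in for a pod volume (assumption, not corev1.Volume).
type Volume struct {
	Name       string
	SecretName string // empty when the volume is not Secret-backed
}

// assertNoSigningKeyOnBootstrapPod returns an error if any bootstrap volume
// references the signing-key Secret. Signature is hypothetical.
func assertNoSigningKeyOnBootstrapPod(volumes []Volume, signingSecret string) error {
	if signingSecret == "" {
		return nil // no signing key configured, nothing to guard
	}
	for _, v := range volumes {
		if v.SecretName == signingSecret {
			return fmt.Errorf("bootstrap pod volume %q references signing-key Secret %q", v.Name, signingSecret)
		}
	}
	return nil
}

func main() {
	clean := []Volume{{Name: "data"}, {Name: "tls", SecretName: "node-tls"}}
	fmt.Println(assertNoSigningKeyOnBootstrapPod(clean, "validator-key"))

	leaked := append(clean, Volume{Name: "signing-key", SecretName: "validator-key"})
	fmt.Println(assertNoSigningKeyOnBootstrapPod(leaked, "validator-key"))
}
```

Returning an error from the generator, rather than logging, means a regression stops the Job from being created at all.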

Pre-flight validation (internal/task/validate_signing_key.go)

  • New validate-signing-key task: existence, deletionTimestamp, key data present, JSON parse, Tendermint shape (address, pub_key.{type,value}, priv_key.{type,value}). Direct condition mutation from the task (matches the merged ensure-data-pvc precedent).
  • Inserted into both buildBootstrapPlan (after EnsureDataPVC, before DeployBootstrapSvc) and buildBasePlan (after EnsureDataPVC, before ApplyStatefulSet) when SigningKey is set.
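
The data-level checks listed above (key present, non-empty, JSON parse, Tendermint shape) could look roughly like this. It is a sketch using only the standard library: the existence and deletionTimestamp checks need an API client and are omitted, and validateSigningKeyData and keyMaterial are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const privValidatorKeyDataKey = "priv_validator_key.json"

type keyMaterial struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

// tendermintValidatorKey mirrors the on-disk shape named in the PR description.
type tendermintValidatorKey struct {
	Address string      `json:"address"`
	PubKey  keyMaterial `json:"pub_key"`
	PrivKey keyMaterial `json:"priv_key"`
}

// validateSigningKeyData checks a Secret's data map (hypothetical helper).
func validateSigningKeyData(data map[string][]byte) error {
	raw, ok := data[privValidatorKeyDataKey]
	if !ok {
		return fmt.Errorf("secret has no %q key", privValidatorKeyDataKey)
	}
	if len(raw) == 0 {
		return fmt.Errorf("secret key %q is empty", privValidatorKeyDataKey)
	}
	var k tendermintValidatorKey
	if err := json.Unmarshal(raw, &k); err != nil {
		return fmt.Errorf("parse %s: %w", privValidatorKeyDataKey, err)
	}
	switch {
	case k.Address == "":
		return errors.New("missing address")
	case k.PubKey.Type == "" || k.PubKey.Value == "":
		return errors.New("missing pub_key.type or pub_key.value")
	case k.PrivKey.Type == "" || k.PrivKey.Value == "":
		return errors.New("missing priv_key.type or priv_key.value")
	}
	return nil
}

func main() {
	// Placeholder values only; not real key material.
	valid := []byte(`{"address":"PLACEHOLDER","pub_key":{"type":"tendermint/PubKeyEd25519","value":"cGxhY2Vob2xkZXI="},"priv_key":{"type":"tendermint/PrivKeyEd25519","value":"cGxhY2Vob2xkZXI="}}`)
	fmt.Println(validateSigningKeyData(map[string][]byte{privValidatorKeyDataKey: valid}))
	fmt.Println(validateSigningKeyData(map[string][]byte{privValidatorKeyDataKey: []byte("{not json")}))
}
```

The sketch checks only that the fields are present; it does not verify that the address matches the public key.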

Validator planner cross-field validation

  • SigningKey is mutually exclusive with GenesisCeremony.
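
The mutual-exclusion rule and the conditional plan insertion described above can be illustrated with a small sketch. The spec struct is reduced to two booleans, and the task names other than validate-signing-key and ensure-data-pvc are assumed spellings.

```go
package main

import (
	"errors"
	"fmt"
)

// validatorSpec is a reduced stand-in for the validator section of the SeiNode spec.
type validatorSpec struct {
	HasSigningKey      bool
	HasGenesisCeremony bool
}

// validate rejects specs that set both SigningKey and GenesisCeremony.
func validate(v validatorSpec) error {
	if v.HasSigningKey && v.HasGenesisCeremony {
		return errors.New("signingKey is mutually exclusive with genesisCeremony")
	}
	return nil
}

// basePlan places validate-signing-key between the PVC and StatefulSet tasks
// only when a signing key is configured.
func basePlan(v validatorSpec) []string {
	plan := []string{"ensure-data-pvc"}
	if v.HasSigningKey {
		plan = append(plan, "validate-signing-key")
	}
	return append(plan, "apply-statefulset") // assumed task name
}

func main() {
	fmt.Println(validate(validatorSpec{HasSigningKey: true, HasGenesisCeremony: true}))
	fmt.Println(basePlan(validatorSpec{HasSigningKey: true}))
	fmt.Println(basePlan(validatorSpec{}))
}
```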

RBAC

  • +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch on the SeiNode reconciler.

Notable scope decisions

  • priv_validator_state.json is not injected. CometBFT auto-creates it on first start. The runbook's "wait M blocks past last-signed height before deploying" provides the slashing-protection envelope.
  • Mid-life SigningKey patch on a Running validator is deferred. v1 ships single-shot deployment only.
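
The M-block wait is plain arithmetic, sketched here as a helper an operator script might use. The function name and the example heights are hypothetical; the last-signed height would come from the stopped EC2 validator, and M is whatever the runbook specifies.

```go
package main

import "fmt"

// blocksRemaining returns how many more blocks the chain must advance before
// the tip is at least m blocks past the last height the old validator signed.
func blocksRemaining(lastSigned, tip, m int64) int64 {
	remaining := lastSigned + m - tip
	if remaining < 0 {
		return 0
	}
	return remaining
}

func main() {
	// Illustrative numbers only.
	fmt.Println(blocksRemaining(100, 105, 10))
	fmt.Println(blocksRemaining(100, 110, 10))
}
```

Deployment would proceed only once the helper reports zero remaining blocks.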

What's deferred (LLD §11)

Mid-life SigningKey patch (drift detection); TMKMS / Horcrux / RemoteSigner variants; automated cutover orchestration; double-sign detection; sentry-node topology; consensus-key rotation; cross-namespace Secret references; HSM integration.

Test plan

  • make test — all packages green (internal/task 41.9% / internal/noderesource 96.2% / internal/planner 69.3% / internal/controller/node 85.6%)
  • make lint — 0 issues
  • make manifests generate — regenerated CRD YAML, deepcopy, role.yaml
  • Task unit tests (9): valid Secret; missing Secret transient; terminating Secret; missing data key; empty data key; malformed JSON; missing address; missing pub_key; tendermintValidatorKey serdes round-trip fixture
  • Pod-spec tests (4): volume present, subPath mount on seid, sidecar mount absent, regression guard for unset SigningKey
  • Bootstrap-Job invariant test: pod-spec must NEVER carry signing volume even when SigningKey is set on the SeiNode (LLD §3); plus the new runtime guard assertNoSigningKeyOnBootstrapPod
  • Plan-builder tests (5): Validate cross-field rules; plan ordering for bootstrap path; plan ordering for base path; absence when SigningKey unset
  • Controller integration tests (4): SigningKeyReady=True on valid Secret; missing-Secret transient → applied → converges; malformed-Secret → plan Failed + condition False/Invalid; SeiNode deletion preserves the referenced Secret

Operational dry-run before first cutover

The runbook in .tide/validator-migration.md is the long pole. Plan: walk the single-shot deployment against a throwaway arctic-1 testnet identity (lowest-staked validator first), confirm the M-block wait works, confirm seid auto-creates priv_validator_state.json and starts signing past the wait window. Document downtime envelope before applying to pacific-1.

🤖 Generated with Claude Code

bdchatham and others added 2 commits April 28, 2026 07:56
Implements docs/design-seinode-validator-signing-key-lld.md (#135).

Adds spec.validator.signingKey to SeiNode, enabling migration of an
existing external validator identity onto a Kubernetes-managed SeiNode
without double-signing risk.

CRD changes:
- New SigningKeySource discriminated union with one v1 variant (Secret).
  Future TMKMS / RemoteSigner / Vault variants slot in additively under
  the same XValidation exactly-one rule.
- New SecretSigningKeySource with immutable secretName.
- New ConditionSigningKeyReady status condition; reasons follow the
  coarse Validated/NotReady/Invalid taxonomy used by ImportPVCReady.

Pod-spec mutation (production StatefulSet only):
- Single Secret volume scoped to priv_validator_key.json (defaultMode
  0400, items-filtered).
- subPath mount on the seid main container at
  /sei/config/priv_validator_key.json. subPath is deliberate — kubelet
  does not auto-refresh subPath mounts, so a kubectl edit cannot
  hot-swap the consensus key under a running seid.
- Sidecar container has no signing mount — no business reading consensus
  material.
- Bootstrap-Job pod-spec is untouched; load-bearing safety comment on
  task.GenerateBootstrapJob pins the invariant against future refactors.

Pre-flight validation:
- New validate-signing-key task validates the Secret before pod creation
  (existence, deletionTimestamp, key data present, JSON parse, Tendermint
  shape). Surfaces failures via SigningKeyReady condition.
- Inserted into both buildBootstrapPlan (between Phase 0 and Phase 1) and
  buildBasePlan (between EnsureDataPVC and ApplyStatefulSet) when
  SigningKey is set.
- Cross-field validation in validatorPlanner.Validate: SigningKey is
  mutually exclusive with GenesisCeremony.

Other:
- New RBAC marker: secrets get;list;watch on the SeiNode controller.
- Finalizer code comment noting Secrets are externally managed.

priv_validator_state.json is intentionally not injected — CometBFT
auto-creates it on first start, and the cutover runbook's "wait M blocks
past last-signed height" provides the operational protection against
re-orgs at the cutover boundary. Documented in the LLD §11 and §8.

Tests: 8 task unit tests, 4 pod-spec generator tests, 1 bootstrap-Job
invariant test, 5 plan-builder tests (Validate + plan ordering), 4
controller integration tests covering happy path, missing-Secret
convergence, malformed-Secret terminal failure, and Secret preservation
on SeiNode deletion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the no-WHAT, no-breadcrumb comment standard:
- Drop trivial WHAT-comments on self-describing identifiers
  (privValidatorKeyDataKey, ValidateSigningKeyParams, Execute,
  signingKeySecretSource, needsValidateSigningKey, validateSigningKeyParams).
- Trim references to PR numbers, runbook phases, and LLD section pointers
  that belong in the PR description, not the source.
- Keep load-bearing WHY-comments on the bootstrap-Job safety invariant,
  the subPath-vs-hot-swap rationale, the Terminal/transient error contract,
  and the externally-managed-Secrets finalizer note.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/planner/bootstrap.go Outdated
Comment thread internal/task/bootstrap_resources.go Outdated
Comment thread internal/task/validate_signing_key.go Outdated
PR review:
- Drop the Phase 0.5 inline comment in buildBootstrapPlan (the function's
  task ordering is self-evident).
- Replace the safety-invariant prose comment on GenerateBootstrapJob with
  a runtime guard, assertNoSigningKeyOnBootstrapPod, that fails the Job
  generation if the bootstrap pod-spec ever references the validator's
  signing-key Secret. The existing test covering this invariant remains.
- Move tendermintValidatorKey to the top of validate_signing_key.go and
  add a serdes round-trip test against the validKeyJSON fixture, locking
  the on-disk Tendermint shape against future cosmos-sdk changes.

Scope:
- v1 supports SigningKey set from SeiNode creation only. Mid-life
  patching of SigningKey onto a Running validator is a no-op today
  because buildRunningPlan only detects image drift; the LLD §8
  cutover-via-patch flow assumed otherwise.
- LLD §8 rewritten as single-shot deployment: stop EC2 → wait M blocks
  past last-signed → create Secret → apply SeiNode with bootstrapImage +
  signingKey → seid syncs to tip and starts signing.
- LLD §11 adds a "mid-life SigningKey patch (drift detection)" deferred
  entry with the implementation sketch.
- .tide/validator-migration.md runbook rewritten to match: single
  Execution sequence, no Phase 1 pre-sync, no priv_validator_state.json
  transfer, slashing protection via M-block wait.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham marked this pull request as ready for review April 28, 2026 16:13
@bdchatham bdchatham merged commit f8baed0 into main Apr 28, 2026
2 checks passed