Skip to content

feat(noderesource): lock seid invocation on HOME env var#234

Merged
bdchatham merged 1 commit into
mainfrom
feat/seid-home-env-var
May 13, 2026
Merged

feat(noderesource): lock seid invocation on HOME env var#234
bdchatham merged 1 commit into
mainfrom
feat/seid-home-env-var

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

@bdchatham bdchatham commented May 13, 2026

Summary

Single source of truth for the seid data directory across every seid-running container:

  • container.env declares HOME=dataDir (and TMPDIR=dataDir/tmp)
  • container.args use $(HOME) (K8s VariableReference) for the canonical seid start --home $(HOME)
  • Bash-context paths (seid-init, sidecarWaitCommand) use "$HOME" (shell expansion) against the same env var

spec.Entrypoint is now silently ignored and marked Deprecated; the canonical invocation is injected unconditionally. The field stays in v1alpha1 for backwards compat — manifests don't need updates today — and is slated for removal in v1alpha2.

The failure mode this fixes

Smoke-testing #233 on harbor surfaced mkdir /.sei: permission denied from eth_accounts and a handful of other seid subcommands. Root cause: distroless base images don't carry an /etc/passwd entry for uid 65532, so HOME is unset for the seid process, and Cosmos SDK's ~/.sei default resolves to literal /.sei — on a read-only root filesystem.

Setting HOME at the container level closes that gap: every subcommand that consults os.UserHomeDir() / ~/.sei now sees dataDir, the same path --home is pointing at.

Why $(HOME) (with parens, not braces)

K8s defines two unrelated substitution mechanisms:

Where Syntax Expanded by
container.args[], container.command[], env.value \$(VAR_NAME) kubelet, before exec
Inside sh -c "<script>" or bash -c "<script>" \$VAR or \${VAR} shell, at runtime

The kubelet form uses parens to avoid collision with shell \$VAR — args[] entries are passed directly via execve with no shell in the loop. Ref: https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/#use-environment-variables-to-define-arguments.

Both codepaths in this PR ultimately resolve from the same HOME env var; only the substitution layer differs.

What this unlocks

The 3-PR sequence we're working toward:

  1. This PR — every container references HOME, not a literal /sei. Behavior unchanged (still resolves to /sei).
  2. PR 2 — remove spec.entrypoint from any manifest that still sets it (no-op since it's now ignored, but cleans up state).
  3. PR 3 — flip platform.DataDir from /sei to the conventional Cosmos SDK layout (~/.sei / /.sei). Single-line change; every container picks it up via HOME. Blocked on Trigger NodeUpdate plan on pod template hash drift (not just spec.image) #235 (pod template hash drift detection).

Rollout — corrected

This is a pod-spec change with no image change. Two semantics matter:

  • buildRunningPlan (internal/planner/planner.go:708) builds a NodeUpdate plan only when spec.image != status.currentImage. Env/args changes alone do not trigger a plan.
  • StatefulSets use OnDelete update strategy (internal/noderesource/noderesource.go:204), so even if the StatefulSet template is patched, existing pods don't roll.

Net effect on existing chains:

Filed as #235 for the long-term fix: trigger NodeUpdate plans on pod template hash drift, not just spec.image drift. PR 3 is blocked on #235.

Cross-review (Coral trio)

All three follow-ups are non-blocking for this PR.

Test plan

  • go build ./... clean
  • go vet ./... clean
  • gofmt -l . clean
  • go test ./... green
  • go test -race ./... green
  • golangci-lint run --new-from-rev=origin/main clean
  • make manifests generate — CRD doc updates reflect the deprecation
  • Coral trio cross-review (platform-engineer + kubernetes-specialist + security-specialist)
  • Smoke test on harbor after merge + controller bump: confirm eth_accounts resolves on a freshly-rolled SeiNode, no permission denied regressions — verified indirectly via HOME=/.sei env on rolled pods + zero permission denied errors in seid logs + chain producing blocks. Literal eth_accounts curl wasn't executed (seid bound port 8545 to a non-loopback address; port-forward returned connection refused). Fix is confirmed working in practice.

Follow-up issues filed

🤖 Generated with Claude Code

The controller now injects a canonical `seid start --home $(HOME)`
invocation across every seid-running container (main, init, bootstrap
init). HOME is set on container.env to dataDir; the K8s VariableReference
$(HOME) resolves at pod-create. Shell-context seid-init and the sidecar
wait wrapper use "$HOME" expansion against the same env var.

This removes the silent gap that started failing nodes after #233:
distroless base images don't carry an /etc/passwd entry for uid 65532,
so HOME is unset for the seid process, and `~/.sei` resolves to `/.sei`
on a read-only root filesystem -> ENOENT/EACCES on any subcommand that
touches the home dir (e.g. eth_accounts via mkdir /.sei).

Tying all four codepaths to the same env var also unblocks the planned
move to the conventional Cosmos SDK layout (~/.sei) in a follow-up:
flipping dataDir there will Just Work because every container references
HOME, not a literal path.

spec.Entrypoint is now silently ignored and marked Deprecated. It was
the only way to override seid args before; with HOME-based resolution
the override is no longer needed (and dangerous, since it bypasses the
canonical invocation). Field stays in v1alpha1 for backwards-compat;
slated for removal in v1alpha2.

Rollout: this is a pod-spec change. After merge + image bump, harbor
controller will roll new SeiNodes through the existing NodeUpdate plan
(apply-statefulset -> replace-pod -> observe-image -> mark-ready);
existing chains keep their PVCs and dataDir layout untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 13, 2026

You have used all Bugbot PR reviews included in your free trial for your GitHub account on this workspace.

To continue using Bugbot reviews, enable Bugbot for your team in the Cursor dashboard.

@bdchatham bdchatham merged commit 0d3a156 into main May 13, 2026
2 checks passed
bdchatham added a commit that referenced this pull request May 13, 2026
Default image used by clusters that consume `config/default?ref=main`
without an overlay-level image pin. This bump propagates to dev and
prod on the next Flux reconcile after merge. Harbor is unaffected —
it has its own overlay-level pin in the platform repo (already at
0d3a156 via platform#521).

Cumulative changes since the previous pin (e99c04c, PR #216):

- #220 — keyring containment + non-root sidecar + ReadOnlyRootFilesystem
- #233 — non-root seid (uid 65532) + kubelet fsGroup migration
- #234 — HOME env var declared on seid containers

The controller bump itself is passive: existing dev/prod pods don't
restart until each chain's spec.image is bumped to trigger a NodeUpdate
plan. The kubelet fsGroup migration from #233 only fires when a pod
restarts and the pod template includes fsGroup with
FSGroupChangePolicy=OnRootMismatch — chown'ing the PVC root tree at
mount time. On fresh harbor PVCs this is instant; on existing multi-TB
prod archive PVCs it could take longer, so per-chain image bumps on
prod should be sequenced with that latency in mind.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request May 14, 2026
Same cleanup shape as sei-protocol/platform#522 applied to the fleet
manifests, now for the in-tree samples. Six sample files; all had
`spec.entrypoint: { command: ["seid"], args: ["start", "--home", "/sei"] }`
which the controller silently ignores as of #234 — the canonical
invocation is injected unconditionally.

Without this cleanup, copy-paste users would land manifests that:
- look functionally meaningful but are silently ignored, and
- pin the literal `/sei` arg value, which post-flip is misleading
  (the data dir is `/.sei`).

One sample (`pacific-1-shadow-replayer.yaml`) carried an extra arg —
`--skip-app-hash-validation` — that was the only real functional
override in this set. #234 removed the surface to pass extra seid args
via the CRD; tracked as a follow-up issue for cases that legitimately
need non-default args (shadow replayer being the candidate use case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request May 14, 2026
* feat(platform): flip dataDir to conventional /.sei home layout

`platform.DataDir` moves from `/sei` to `/.sei` — the conventional
Cosmos SDK home layout (`~/.sei`). Single source of truth in
`internal/platform/platform.go`; both consumers (`noderesource` for
the StatefulSet pod template, `task/bootstrap_resources` for the
ceremony pod) route through it, so this is a one-line flip plus
test-literal fixups.

What propagates from the flip:

- seid main container: `mountPath`, `HOME` env, and `--home $(HOME)`
  K8s VariableReference all become `/.sei`. Same value resolves via
  three layers (mount, env, args) automatically.
- sidecar container: `mountPath` and `SEI_HOME` env both become
  `/.sei`. `seictl serve` reads `SEI_HOME` via its `--home` flag's
  env source (`seictl/main.go:54-66`), so all task handlers operate
  against `/.sei` with no coordinated sidecar change required —
  verified via cross-repo audit.
- bootstrap container path: same shape.

Tests updated:

- `noderesource_test.go:1119` — substring assertion was
  `NotTo ContainSubstring("/sei")`. `/.sei` contains `/sei` as a
  substring, so the assertion would falsely match post-flip. Routed
  through the `dataDir` constant so it auto-tracks any future flip.
- Two cosmetic literals (`keyringTestMountPath`, an assertion
  message) updated for hygiene.

PVC semantics: the mount path is a kernel-side façade. Existing PVC
content lives at the volume root inode, not under `/sei`. Pods rolling
to the new template see the same `config/`, `data/`, `keyring-file/`
subtree, just at `/.sei` instead of `/sei`. No data movement.

Round-trip safe: reverting this constant + re-rolling pods restores
the old mountpath against the same PVC contents.

Rollout: this is a pod-spec change with no image change.
`buildRunningPlan` only triggers NodeUpdate plans on `spec.image` drift
(per `internal/planner/planner.go:708`), so existing SeiNodes stay on
their current template until each chain's `spec.image` is bumped.
That's the same propagation mechanism PR 1 (#234) relied on.
#235 tracks extending drift detection
to pod-template-hash, which would let this flip propagate automatically
to the fleet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(samples): strip deprecated spec.entrypoint blocks

Same cleanup shape as sei-protocol/platform#522 applied to the fleet
manifests, now for the in-tree samples. Six sample files; all had
`spec.entrypoint: { command: ["seid"], args: ["start", "--home", "/sei"] }`
which the controller silently ignores as of #234 — the canonical
invocation is injected unconditionally.

Without this cleanup, copy-paste users would land manifests that:
- look functionally meaningful but are silently ignored, and
- pin the literal `/sei` arg value, which post-flip is misleading
  (the data dir is `/.sei`).

One sample (`pacific-1-shadow-replayer.yaml`) carried an extra arg —
`--skip-app-hash-validation` — that was the only real functional
override in this set. #234 removed the surface to pass extra seid args
via the CRD; tracked as a follow-up issue for cases that legitimately
need non-default args (shadow replayer being the candidate use case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant