Skip to content

feat(controller/node): resolve label peers to NLB addresses + controller-side node_id (#368)#369

Merged
bdchatham merged 7 commits into
mainfrom
feat/label-peers-controller-side-nodeid
May 29, 2026
Merged

feat(controller/node): resolve label peers to NLB addresses + controller-side node_id (#368)#369
bdchatham merged 7 commits into
mainfrom
feat/label-peers-controller-side-nodeid

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Summary

First slice of the peer-discovery harmonization (#368). Moves node_id resolution for the LabelPeerSource flow from the sidecar's :26657/status query to the controller's existing per-peer sidecar gRPC. Status.ResolvedPeers shifts to fully-composed <node_id>@<host>:<port> strings.

Why

The sidecar's :26657/status lookup path is in-cluster-only — NLBs only forward 26656, so it can't reach a publishable peer over the public endpoint. After PR #365, a peer's Spec.ExternalAddress is the publishable vanity hostname; if the controller writes that into Status.ResolvedPeers and the sidecar tries :26657/status, the lookup fails permanently.

The controller already has in-cluster access to each peer's sidecar gRPC (port 7777) — the same path the genesis CollectAndSetPeers task uses to read node_id from node_key.json. Reusing that path for the Label flow keeps node_id resolution authoritative (it's from the keyfile, not from :26657) and decouples discovery from publishable-vs-private addressing.

Behavior

Peer's Spec.ExternalAddress Resolved entry
set (<vanity>:26656) <node_id>@<vanity>:26656
unset <node_id>@<peer>-0.<peer>.<ns>.svc.cluster.local:26656

node_id is fetched via the existing SidecarClient.GetNodeID (in-cluster gRPC to port 7777). On per-peer failure the whole resolve errors and controller-runtime retries — same atomic semantic as the genesis CollectAndSetPeers task. Seid only reads persistent_peers at boot, so transient peer-sidecar failures during reconcile don't disturb running pods.

Files

  • internal/controller/node/peers.goresolveLabelPeers composes the new wire format; new peerAddress helper picks Spec.ExternalAddress or falls back to headless DNS.
  • internal/planner/planner.go — Label sources map to PeerSourceStatic instead of PeerSourceDNSEndpoints. The sidecar's DNSEndpointsSource is untouched — still serves EC2Tags peers.
  • internal/task/task.go — widens task.SidecarClient interface with GetNodeID(ctx) (string, error). The full *sidecar.SidecarClient already implements it; mocks gain one-line stubs.
  • api/v1alpha1/seinode_types.goResolvedPeers godoc updated for the new composed wire format.

Wire-format change (one-way door)

Status.ResolvedPeers was previously a []string of bare hosts (<peer>-0.<peer>.<ns>.svc.cluster.local). It's now a []string of fully-composed <node_id>@<host>:<port> entries. The only consumer is the planner, which is updated in this same PR. No external consumer is affected.

Backward compatibility

  • Existing SNDs with LabelPeerSource keep working — their ResolvedPeers will repopulate with the new format on next reconcile, and the planner routes them through the same Static sidecar source path that genesis peers already use.
  • Spec.Peers user-facing shape is unchanged.
  • EC2Tags and Static PeerSources are unchanged.
  • The sidecar's DNSEndpointsSource (the :26657/status path) stays in seictl — still serves EC2Tags. A follow-up could retire it once EC2Tags also moves to controller-side; out of scope here.

Test plan

  • Unit: new TestReconcilePeers_PrefersExternalAddress. Existing peers tests updated for the new format.
  • go test ./api/v1alpha1/... ./internal/... — green.
  • make test-integration — envtest green (~64s).
  • golangci-lint run --new-from-rev=main ./... — 0 issues.
  • make manifests generate — clean.
  • CI on this PR.

Refs

🤖 Generated with Claude Code

… node_id controller-side

Move node_id resolution for the LabelPeerSource flow from the sidecar's
:26657/status query to the controller's existing per-peer sidecar gRPC.
Status.ResolvedPeers wire format shifts to fully-composed
<node_id>@<host>:<port> strings.

Why: the sidecar's :26657/status path is in-cluster-only (NLBs only
forward 26656). After PR #365, a peer's Spec.ExternalAddress is the
publishable vanity hostname. Using that address in ResolvedPeers means
the sidecar can no longer query :26657 to learn node_id externally. The
controller already has clean in-cluster access to each peer's sidecar
gRPC (port 7777) — same path the genesis CollectAndSetPeers uses to
learn node_id from node_key.json.

Changes:

- internal/controller/node/peers.go: extend resolveLabelPeers to call
  the peer's sidecar via r.Planner.BuildSidecarClient + GetNodeID, and
  compose <node_id>@<host>:<port>. host is Spec.ExternalAddress when
  set, else headless Service DNS at p2p port. On per-peer failure the
  whole resolve errors and controller-runtime retries — matches the
  genesis path semantics.

- internal/planner/planner.go: Label sources route to
  PeerSourceStatic (same source path the genesis-Static flow already
  uses) instead of PeerSourceDNSEndpoints. The sidecar's
  DNSEndpointsSource is unchanged and still serves EC2Tags peers.

- internal/task/task.go: widen task.SidecarClient interface with
  GetNodeID(ctx) (string, error). The full *sidecar.SidecarClient
  already implements this; mocks in tests get one-line stubs.

- api/v1alpha1/seinode_types.go: ResolvedPeers godoc updated for the
  new composed wire format.

Tests:

- New TestReconcilePeers_PrefersExternalAddress covers the
  Spec.ExternalAddress branch.
- Existing peers tests updated for the new composed format.
- Envtest StubSidecarClient gains a GetNodeID stub returning
  "stub-node-id". Suite still passes (~64s).

Wire-format change to Status.ResolvedPeers is the only one-way door.
The field was previously bare hosts; consumers are the planner only,
which is updated in this same PR.

Refs #368

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 29, 2026

PR Summary

Medium Risk
Changes P2P peer list format and discovery path for label-based peers (networking/consensus adjacency); transient failures can retain stale node_id entries until the sidecar recovers.

Overview
Label-based peer discovery now writes status.resolvedPeers as CometBFT-ready <node_id>@<host>:<port> strings instead of bare cluster DNS hostnames. The SeiNode reconciler fetches each peer’s node_id via in-cluster sidecar gRPC (SidecarClient.GetNodeID), picks spec.externalAddress when set (publishable NLB path) or headless service DNS plus the standard P2P port otherwise, and on per-peer sidecar failure keeps the previous composed entry or omits a new peer so one bad peer does not block reconcile.

The discover-peers planner path for label sources now passes those strings as static addresses (replacing DNS-endpoint resolution in the sidecar). API/CRD docs and tests were updated for the new wire format and failure modes.

Reviewed by Cursor Bugbot for commit 3b71ffc. Bugbot is set up for automated code reviews on this repo. Configure here.

…ecar failure

Per platform-engineer's HOLD on PR #369: atomic resolve semantics
wedge fleet-wide reconciles when any one label-matched peer's sidecar
is transient-unavailable. In pacific-1 (6+ SNDs with mutual label
selectors), a single archive sidecar restart would block every
consumer's reconcile loop — not just peer updates, all of it, because
reconcilePeers runs before the status patch.

Fix:
- On per-peer BuildSidecarClient or GetNodeID error, preserve that
  peer's prior entry from Status.ResolvedPeers if available. Indexed
  by host:port to look up the existing <node_id>@host:port.
- If no prior entry exists (peer never resolved), skip with a
  structured log line, do not fail the reconcile.
- The genesis CollectAndSetPeers path keeps its atomic semantics —
  that's a one-shot bounded-cohort operation where retry is correct.
- New tests: TestReconcilePeers_PreservesPriorEntryOnTransientFailure,
  TestReconcilePeers_SkipsNewPeerOnSidecarFailure.

Addresses the platform-engineer HOLD reason on PR #369. K8s-specialist
+ sei-network-specialist already GO'd the atomic version; the
"silent partial list" concern they raised is mitigated by the
structured log line on every skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Collaborator Author

Three-way cross-review verdict:

  • kubernetes-specialist: GO (atomic semantics correct)
  • sei-network-specialist: GO (CometBFT semantics + node_id correctness verified)
  • platform-engineer: HOLD — atomic resolve wedges fleet-wide reconciles in steady state when any label-matched peer's sidecar is transient-unavailable (pacific-1 scenario: 6+ SNDs with mutual selectors → single sidecar restart blocks every consumer's reconcile, not just peer updates).

Reasoning from first principles per coral discipline: atomic is right for the genesis path (bounded cohort, one-shot, co-bootstrapping) but wrong for the steady-state label flow (every reconcile forever, arbitrary peer lifecycle). Platform-engineer's blocker is the load-bearing one.

Fix pushed in 0fceb7a:

  • On per-peer BuildSidecarClient or GetNodeID error, preserve that peer's prior entry from Status.ResolvedPeers (indexed by host:port). Transients don't drop peers from the list.
  • New peer with no prior entry + sidecar failure: skip with structured log line (peer, err), do not fail the reconcile. Mitigates the k8s-specialist's "silent partial list" debugging concern.
  • Two new unit tests: TestReconcilePeers_PreservesPriorEntryOnTransientFailure, TestReconcilePeers_SkipsNewPeerOnSidecarFailure.

Tests + envtest still green. golangci-lint --new-from-rev=main = 0 issues.

5-9 lines → 2-4 across resolveLabelPeers, indexResolvedPeersByHost,
peerAddress, the planner-side Label→Static branch, the SidecarClient
interface, Status.ResolvedPeers godoc, and the envtest stub. No
behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 10a8cbf. Configure here.

Comment thread internal/controller/node/peers.go Outdated
bdchatham and others added 4 commits May 29, 2026 15:06
…d contract

NodeResolver.BuildSidecarClient is documented as nilable ("Nil factory
skips the sidecar probe; used by tests") and ResolvePlan nil-guards it.
resolveLabelPeers was calling it unconditionally — a panic waiting on a
test or future refactor that leaves the factory unset. Treat nil as a
transient per-peer failure (matches existing fallback semantics: preserve
prior entry or skip with a log line). Caught by Cursor Bugbot review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
K8s-specialist cross-review on the nil-guard fix asked for the missing
test variant: nil factory + prior entry present must route to
preserve-prior, not skip. Locks errNoSidecarFactory into the same
recovery path as runtime transient sidecar errors.

Also: drop the var ( ) block at the call site for a single var + sentinel
init — closer to the boring-clear-code style used elsewhere in the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
errNoSidecarFactory doc 3 lines → 2; new test docstrings shortened to
match the file's existing one-line convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier comment sweep on api/v1alpha1/seinode_types.go didn't carry into
the generated CRD YAMLs; verify-generated caught the drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit 8d4e4b0 into main May 29, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant