Skip to content

feat(p2p-endpoint): SND-owned per-pod NLB + Spec.ExternalAddress (#360)#365

Merged
bdchatham merged 6 commits into
mainfrom
feat/publishable-p2p
May 29, 2026
Merged

feat(p2p-endpoint): SND-owned per-pod NLB + Spec.ExternalAddress (#360)#365
bdchatham merged 6 commits into
mainfrom
feat/publishable-p2p

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

@bdchatham bdchatham commented May 28, 2026

Summary

Reframe of the publishable-P2P controller path (#360). Supersedes #362.

Design

Networking is an SND-layer concern. The SND reconciler owns the per-pod LoadBalancer Service (one per child SeiNode), paralleling the existing HTTPRoute path. The SeiNode reconciler is fully networking-agnostic — it reads Spec.ExternalAddress and proceeds.

Spec.ExternalAddress is plain string on SeiNodeSpec. The SND computes the deterministic vanity hostname per child and injects the address at child Create time via the ensureSeiNode diff path. Standalone SeiNodes (no SND parent) can set the field directly; the controller respects user-set values on standalone resources.

Deterministic vanity hostname via external-dns. Format: <seinode-name>-p2p.<chainID>.<P2PEndpointDomain>. The SND stamps external-dns.alpha.kubernetes.io/hostname on each per-pod Service; external-dns creates the CNAME to the AWS-allocated NLB hostname. ChainID has a DNS-1123 pattern constraint on the CRD; that's the gate.

SEI_P2P_ENDPOINT_DOMAIN env, parsed once at startup. When unset, the P2P endpoint path is silently inert — TCP-enabled SNDs see no Service stamped and no Spec.ExternalAddress set. ConditionNetworkingReady=False/NetworkingDisabled reflects "no tier active" so operators have a clear signal during cluster bring-up. CNI mismatches surface via AWS LBC target-registration failure events, not a controller-side gate.

Lifecycle. Opt-in stamps Spec.ExternalAddress + creates the per-pod Service. Opt-out clears the field and deletes the Service. Scale-down deletes the per-ordinal Service before the child SeiNode. DeletionPolicyRetain orphans the per-pod Services alongside the HTTPRoute / ClusterIP. Runtime opt toggles require a manual pod delete to re-render config.toml — v1 ops contract.

Files

  • api/v1alpha1/seinode_types.go — drop Status.ExternalAddress; add Spec.ExternalAddress string; ChainID DNS-1123 pattern.
  • api/v1alpha1/seinodedeployment_types.goNetworkingStatus.P2PEndpoints []P2PEndpoint (ordinal-keyed list); Genesis.ChainID DNS-1123 pattern.
  • api/v1alpha1/networking_types.goTCPEnabled() accessor mirroring HTTPEnabled().
  • internal/controller/nodedeployment/p2p_endpoint.go (NEW) — Service apply / delete / per-ordinal-delete / orphan; pure hostname derivation; Service factory. Owner-ref = SND. Field manager nodedeployment-controller-p2p-endpoint.
  • internal/controller/nodedeployment/networking.goHTTPEnabled() migration; parallel TCP branch in reconcileNetworking; catch-all NetworkingDisabled condition when neither tier reconciles. orphanNetworkingResources extends to handle P2P endpoints.
  • internal/controller/nodedeployment/status.goHTTPEnabled() migration; buildNetworkingStatus populates P2PEndpoints when TCP is active.
  • internal/controller/nodedeployment/nodes.goensureSeiNode diff propagates Spec.ExternalAddress; template-aliasing guard zeros it before per-ordinal stamping; scaleDown deletes per-ordinal Service before child.
  • internal/controller/nodedeployment/controller.go — adds P2PEndpointDomain string field on the reconciler.
  • internal/planner/planner.go:697 — reads from Spec.ExternalAddress.
  • cmd/main.go — reads SEI_P2P_ENDPOINT_DOMAIN env.

Backward compatibility

  • networking: nil / networking: {} / networking: { http: {} } — unchanged behavior via HTTPEnabled() backcompat.
  • networking: { tcp: {} } — does NOT trigger HTTPRoutes; provisions a P2P endpoint Service when SEI_P2P_ENDPOINT_DOMAIN is set on the controller.
  • No SND or SeiNode manifests in sei-protocol/platform need to change to land this PR.

Test plan

  • Unit: p2p_endpoint_test.go covers p2pEndpointHostname (table-driven), p2pEndpointAddress, p2pEndpointServiceName, generateP2PEndpointService (annotations/selector/ports), effectiveChainID, TCPEnabled.
  • Envtest: SND with TCP creates Service + sets Spec.ExternalAddress + populates NetworkingStatus.P2PEndpoints.
  • Envtest: opt-out clears Spec.ExternalAddress + deletes Service.
  • Envtest: re-opt-in restamps the same deterministic hostname.
  • Envtest: scale-down deletes ordinal-N Service before ordinal-N SeiNode.
  • Envtest: standalone SeiNode with user-set Spec.ExternalAddress is preserved (no SND involvement).
  • Envtest: kubectl edit stomp reconverges via ensureSeiNode diff.
  • Envtest: SEI_P2P_ENDPOINT_DOMAIN unset → no Service created, NetworkingReady=False/NetworkingDisabled.
  • make manifests generate — clean on second invocation.
  • golangci-lint run --new-from-rev=main ./... — 0 issues.
  • CI on this PR.

Out of scope (separate follow-ups in sei-protocol/platform)

  • Terraform SG: open tcp/26656 from 0.0.0.0/0 on the Sei-node SG.
  • SEI_P2P_ENDPOINT_DOMAIN env wiring per cluster's manager-patch.yaml: prod = prod.platform.sei.io, harbor = harbor.platform.sei.io, dev intentionally unset until opted in.
  • First rollout: enable networking.tcp: {} on one advertised full-node SND.

Refs

🤖 Generated with Claude Code

Reframe of the publishable-P2P controller path. Closes prior PR #362.

Architectural shape:

- Networking is an SND-layer concern. The SND reconciler owns the
  per-pod LoadBalancer Service (one per child SeiNode); the SeiNode
  reconciler stays fully networking-agnostic — no parent lookup, no
  Service watch, no env reads.
- ExternalAddress lives on SeiNodeSpec (not Status). The SND injects
  a deterministic vanity hostname at child Create time inside the
  ensureSeiNode diff; the planner reads from Spec. Race window dissolves
  because Spec is populated atomically at Create.
- Vanity hostname: <seinode>-p2p.<chainID>.<gatewayPublicDomain>, with
  the CNAME produced by the existing external-dns Service-source flow.
  ChainID gets a DNS-1123 pattern constraint to keep the hostname legal.
- Standalone SeiNodes (no SND parent) may set Spec.ExternalAddress
  directly; the controller respects the user-set value.
- SEI_VPC_CIDR is parsed once at controller startup. Unset/unparseable
  ⇒ PublishabilityAvailable=false; SNDs with Networking.TCP set surface
  ConditionNetworkingReady=False/VPCCIDRNotConfigured and skip the
  publishable path. AWS LBC events remain the ground-truth signal for
  any cluster-CNI mismatch.
- Lifecycle: opt-in stamps Spec.ExternalAddress + creates Service;
  opt-out clears the field on existing children + deletes Services;
  scale-down deletes the per-ordinal Service before deleting the child
  SeiNode. Runtime opt toggles still require a manual pod delete to
  re-render config.toml (accepted v1 ops contract).
- HTTPEnabled() call-site migration in nodedeployment/networking.go +
  status.go bundled (was a separate proposed PR; co-locating with the
  TCP branch keeps the gating consistent within a single review).

New file: internal/controller/nodedeployment/publishable.go
Removed (vs prior PR #362 plan): no external_address.go on SeiNode side,
no annotation/Condition gate, no per-reconcile pod-IP routability check.

Refs #360

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 28, 2026

PR Summary

High Risk
Changes public P2P advertisement (internet-facing NLBs, DNS hostnames) and moves externalAddress across the CRD boundary; misconfiguration affects peer connectivity cluster-wide.

Overview
Adds SND-managed publishable P2P when spec.networking.tcp is set and SEI_P2P_ENDPOINT_DOMAIN is configured: the deployment reconciler creates per-replica internet-facing NLB Services, stamps each child SeiNode.spec.externalAddress with a deterministic vanity host:26656, and records status.networkingStatus.p2pEndpoints.

API contract shift: externalAddress moves from SeiNode status → spec (planner now sets p2p.external_address from spec). TCPEnabled() gates TCP without the legacy networking: {} HTTP back-compat. ChainID (and genesis chain ID) get a DNS-1123 label pattern for stable P2P hostnames.

Networking reconcile splits HTTP (Gateway) and TCP (P2P NLB) independently; opt-out clears child addresses and deletes Services; scale-down removes the ordinal’s P2P Service before the child. SeiNode reconciler no longer owns P2P LB wiring.

Reviewed by Cursor Bugbot for commit 8428235. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread internal/controller/nodedeployment/networking.go
…Y_PUBLIC_DOMAIN

Platform review surfaced three blockers, all the same root cause: the
mental model assumed the L7 gateway zone (SEI_GATEWAY_PUBLIC_DOMAIN)
was reusable for L4 publishable hostnames, but the cluster reality is
that harbor's external-dns watches `harbor.platform.sei.io` while its
gateway uses `platform.sei.io`, and dev has no gateway public domain
configured at all. Reusing the L7 zone would silently write CNAMEs into
a zone the cluster's external-dns doesn't watch — peers would never
resolve.

Resolution:
- New `SEI_PUBLISHABLE_DOMAIN` env (optional, no fallback). When unset,
  TCP-enabled SNDs surface a distinct condition reason.
- New `ConditionNetworkingReady=False/PublishableDomainNotConfigured`
  reason — distinct from `VPCCIDRNotConfigured` so dev's
  empty-domain-but-cidr-set case doesn't mislabel as
  `InvalidChainIDForDNS` (chain ID is fine; domain is the issue).
- `publishableHostname` switches to `r.PublishableDomain` from
  `r.GatewayPublicDomain`. L7 and L4 zones now diverge per cluster.

Platform PR (separate, follow-up) sets per-cluster:
- prod: SEI_PUBLISHABLE_DOMAIN=prod.platform.sei.io
- harbor: SEI_PUBLISHABLE_DOMAIN=harbor.platform.sei.io
- dev: unset → publishability inert with clear reason

Tests:
- All publishable_test.go cases retargeted to PublishableDomain field.
- New envtest TestPublishableP2P_PublishableDomainNotConfigured_SurfacesCondition
  pins the distinct condition reason.
- Existing envtest suite still green at 67s.

Refs #360
Addresses platform-engineer review blockers on PR #365

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Collaborator Author

Platform cross-review surfaced three blockers (B1/B2/B3) that all reduce to the same root cause: the design assumed SEI_GATEWAY_PUBLIC_DOMAIN was reusable for L4 publishable hostnames, but per-cluster reality is that the L7 gateway zone and the L4 publishable zone may diverge — harbor's external-dns watches harbor.platform.sei.io while its gateway uses platform.sei.io, and dev has no gateway public domain at all.

Fixed in 2c71091:

  • New SEI_PUBLISHABLE_DOMAIN env (optional, no fallback). Parsed once at startup alongside SEI_VPC_CIDR.
  • New ConditionNetworkingReady=False/PublishableDomainNotConfigured reason. Distinct from VPCCIDRNotConfigured and InvalidChainIDForDNS so operators get the right signal.
  • publishableHostname uses r.PublishableDomain, not r.GatewayPublicDomain. The L7 gateway zone and the L4 publishable zone can now diverge per cluster, which matches the actual platform shape.
  • New envtest TestPublishableP2P_PublishableDomainNotConfigured_SurfacesCondition pins the distinct condition reason; all 7 prior publishable scenarios still pass.

The follow-up platform PR now sets:

  • prod: SEI_PUBLISHABLE_DOMAIN=prod.platform.sei.io (or platform.sei.io if you prefer the existing subzone)
  • harbor: SEI_PUBLISHABLE_DOMAIN=harbor.platform.sei.io
  • dev: leave unset → publishability inert with clear PublishableDomainNotConfigured reason

Also addressing the doc-text mismatch from the network review: the Service selector in code is sei.io/node=<seinode>-0, not statefulset.kubernetes.io/pod-name. Functionally equivalent (the SeiNode controller stamps sei.io/node on the single STS pod). PR description updated.

The PR-description vanity-hostname examples will use whatever value the platform PR ultimately sets per cluster.

Comment thread cmd/main.go Outdated
Comment on lines +217 to +246
// Publishable-P2P capability: SEI_VPC_CIDR is a deployment-time
// constant (a VPC CIDR rarely changes within a cluster's life).
// Parse once at startup. When unset or unparseable, the SND
// reconciler still runs — TCP-enabled SNDs surface
// `ConditionNetworkingReady=False/VPCCIDRNotConfigured` on each
// reconcile and no LB Services are created. This is fail-closed at
// the apply boundary, not at admission; an admission webhook is
// deferred until we accumulate more SND-side validation needs.
vpcCIDRRaw := os.Getenv("SEI_VPC_CIDR")
publishabilityAvailable := false
var publishableVPCCIDR *net.IPNet
if vpcCIDRRaw == "" {
setupLog.Info("Publishable-P2P capability disabled: SEI_VPC_CIDR not set")
} else if _, cidr, err := net.ParseCIDR(vpcCIDRRaw); err != nil {
setupLog.Error(err, "Publishable-P2P capability disabled: SEI_VPC_CIDR is unparseable", "value", vpcCIDRRaw)
} else {
publishabilityAvailable = true
publishableVPCCIDR = cidr
setupLog.Info("Publishable-P2P capability enabled", "cidr", cidr.String())
}

// SEI_PUBLISHABLE_DOMAIN is the DNS zone for per-pod vanity hostnames.
// Distinct from SEI_GATEWAY_PUBLIC_DOMAIN because L7 and L4 zones may
// diverge per cluster: harbor's external-dns watches
// `harbor.platform.sei.io` while the gateway uses `platform.sei.io`.
// When unset, TCP-enabled SNDs surface
// `ConditionNetworkingReady=False/PublishableDomainNotConfigured`.
publishableDomain := os.Getenv("SEI_PUBLISHABLE_DOMAIN")
if publishableDomain == "" {
setupLog.Info("Publishable-P2P domain not configured: SEI_PUBLISHABLE_DOMAIN not set; TCP-enabled SNDs will surface PublishableDomainNotConfigured")
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were dropping this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 53423de — SEI_VPC_CIDR parsing, PublishabilityAvailable, PublishableVPCCIDR all gone.

// When false, an SND with `Spec.Networking.TCP` set surfaces
// `ConditionNetworkingReady=False/VPCCIDRNotConfigured` and no LB
// Services are created.
PublishabilityAvailable bool
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. The reconciler struct lost both PublishabilityAvailable and PublishableVPCCIDR; only PublishableDomain remains.

Comment on lines +102 to +113
if !r.PublishabilityAvailable {
setCondition(group, seiv1alpha1.ConditionNetworkingReady, metav1.ConditionFalse,
"VPCCIDRNotConfigured",
"spec.networking.tcp requires SEI_VPC_CIDR to be set on the controller; publishable Services are not created")
// Defensive: clear any stale Services left from a prior boot
// where the env was set. Cheap list-and-delete; no-op when
// nothing is there.
if err := r.deletePublishableServices(ctx, group); err != nil {
return fmt.Errorf("deleting publishable services after capability loss: %w", err)
}
return nil
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove if we agree to drop that check.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. The gate collapsed to a single TCPEnabled() && PublishableDomain != "" check; no separate condition reasons for the misconfig paths.

Comment thread internal/controller/nodedeployment/networking.go
Comment on lines +228 to +236
// ExternalAddress is SND-managed when TCP is on, cleared when TCP is
// off, and untouched when the parent does not opt into networking. The
// diff here is the single write site that propagates the field —
// generateSeiNode handles creation, this handles updates and external
// stomps (kubectl edit, manual patches).
if !externalAddressEqual(existing.Spec.ExternalAddress, desired.Spec.ExternalAddress) {
existing.Spec.ExternalAddress = desired.Spec.ExternalAddress
updated = true
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the rationale of this in a three sentence summary?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simpler answer turned out to be dropping the helper entirely. Spec.ExternalAddress is now plain string (not *string), so the diff in ensureSeiNode is just existing.Spec.ExternalAddress != desired.Spec.ExternalAddress and no normalization helper is needed. The pointer was over-engineering — there's no case where "explicitly empty" should mean something different from "unset" for this field.

…ddress, sweep comments

Per PR #365 review feedback. Three less-ambiguous wins applied; subjective
direction (publishable.go vs external_address.go complexity trade) parked
for follow-up.

Dropped:
- SEI_VPC_CIDR env parsing in main.go and the PublishabilityAvailable /
  PublishableVPCCIDR fields on the SND reconciler.
- VPCCIDRNotConfigured + PodOutsideVPCCIDR condition reasons.
- PublishableDomainNotConfigured condition reason (caller folded into one
  combined `TCPEnabled() && PublishableDomain != ""` gate).
- Runtime DNS-1123 belt-and-suspenders in publishableHostname (the CRD
  Pattern on ChainID is the gate).
- externalAddressEqual helper (was solving nil-vs-*"" for a field that
  shouldn't have been a pointer).

Changed:
- Spec.ExternalAddress flipped from *string to plain string. Planner read
  becomes plain `!= ""`; ensureSeiNode diff becomes plain `==`. Template
  guard becomes `spec.ExternalAddress = ""`.
- orphanNetworkingResources now also drops the SND owner ref on
  publishable Services so DeletionPolicyRetain keeps them around. New
  orphanPublishableServices mirrors the existing HTTP-side path.

Aggressive comment sweep across changed files. ~445 net line deletion.
Tests pass: unit + envtest (65s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 53423de. Configure here.

Comment thread internal/controller/nodedeployment/networking.go
- gofmt main.go (struct field alignment after VPC CIDR removal)
- dedupe `atlantic-2` and `prod.platform.sei.io` test literals via
  package-local consts; rename one duplicate `rpc` SND name to
  `validators`; reuse existing `testNamespace` for `pacific-1`

`golangci-lint run --new-from-rev=main ./...` is now clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +42 to +49
// HTTP hostname resolves in DNS, indicating the HTTPRoute + External-DNS
// pipeline is ready. Returns true when no HTTP routes are expected
// (TCP-only deployments, validator mode, private deployments).
//
// The publishable-P2P hostname has no parallel resolvability gate. CometBFT
// peers retry; the controller does not block STS creation on external-dns
// catching up — that would re-introduce the layering inversion the original
// PR #362 design suffered from.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this more concise. Present tense. Describe, don't explain history

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trimmed to two sentences in 5e44109.

Comment on lines +66 to +77
// reconcileNetworking dispatches to two independent sub-reconcilers gated
// on independent presence checks. HTTP (L7 Gateway exposure) and TCP
// (per-pod L4 NLB for P2P) share only the group-level resource labels,
// the `fieldOwner` constant, and the `ConditionNetworkingReady` condition
// vocabulary; they do not nest.
//
// Condition contract: each branch sets `ConditionNetworkingReady` to
// reflect *its* terminal state. When both are enabled, the TCP branch
// runs second and its outcome wins; that's deliberate — TCP is the
// load-bearing path for publishable validators, and the operator wants
// to know its readiness first. When neither is enabled,
// `NetworkingDisabled/false` describes the steady state.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concise

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trimmed to three sentences in 5e44109.

Comment on lines +313 to +318
// Delete the per-ordinal publishable Service before the child
// SeiNode. Both are SND-owned, so owner-ref cascade would
// only fire on SND deletion — leaving the NLB stranded with
// no targets between scale-down and SND delete is the failure
// we explicitly avoid. Idempotent: no-op when publishability
// isn't in use.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make more concise.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trimmed to three lines in 5e44109.

"publishable" was internal jargon. Renamed to "p2pEndpoint" throughout for a
self-documenting concept name. No behavior change.

File renames:
- publishable.go → p2p_endpoint.go
- publishable_test.go → p2p_endpoint_test.go
- envtest/publishable_p2p_test.go → envtest/p2p_endpoint_test.go

API/CRD renames:
- PublishableEndpoint → P2PEndpoint (struct + json:"p2pEndpoint")
- PublishableEndpoints → P2PEndpoints (slice + json:"p2pEndpoints")
- PublishableDomain → P2PEndpointDomain (reconciler field)
- SEI_PUBLISHABLE_DOMAIN → SEI_P2P_ENDPOINT_DOMAIN (env)
- PublishableServicesApplied → P2PEndpointsApplied (condition reason)
- Service label value sei.io/component=p2p-lb → p2p-endpoint
- Field manager nodedeployment-controller-publishable → -p2p-endpoint

Plus a comment-conciseness sweep on the three doc-blocks called out in review
(routeHostnameResolvable, reconcileNetworking, scaleDown's per-ordinal delete).

Refs #365

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Collaborator Author

Renamed publishable*p2pEndpoint* everywhere in 5e44109. Sweep covered file names, identifiers, condition reason, API field + JSON tag, env var (SEI_PUBLISHABLE_DOMAINSEI_P2P_ENDPOINT_DOMAIN), Service label value (p2p-lbp2p-endpoint), and field-manager string. Tests + envtest green; lint delta is 0 issues.

Cursor Bugbot finding: when `networking: { tcp: {} }` (TCP-only) and
`P2PEndpointDomain` is empty, both branches in reconcileNetworking take
their delete-side paths and nobody updates ConditionNetworkingReady. The
condition stays stale or unset.

This is the initial state of every cluster until SEI_P2P_ENDPOINT_DOMAIN
is wired, so the gap fires on every reconcile of a TCP-only SND until then.

Fix: track which tier actually reconciled and set
`NetworkingDisabled` as a catch-all when neither did. Existing envtest
extended with an assertion on the condition.

Refs #365 (Cursor bug 3d2af5a2)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham changed the title feat(publishable-p2p): SND-owned per-pod NLB with Spec.ExternalAddress (#360) feat(p2p-endpoint): SND-owned per-pod NLB + Spec.ExternalAddress (#360) May 29, 2026
@bdchatham bdchatham merged commit 120c979 into main May 29, 2026
5 checks passed
@bdchatham bdchatham deleted the feat/publishable-p2p branch May 29, 2026 17:53
bdchatham added a commit that referenced this pull request May 29, 2026
…ler-side node_id (#368) (#369)

* feat(controller/node): resolve label peers to NLB addresses + compose node_id controller-side

Move node_id resolution for the LabelPeerSource flow from the sidecar's
:26657/status query to the controller's existing per-peer sidecar gRPC.
Status.ResolvedPeers wire format shifts to fully-composed
<node_id>@<host>:<port> strings.

Why: the sidecar's :26657/status path is in-cluster-only (NLBs only
forward 26656). After PR #365, a peer's Spec.ExternalAddress is the
publishable vanity hostname. Using that address in ResolvedPeers means
the sidecar can no longer query :26657 to learn node_id externally. The
controller already has clean in-cluster access to each peer's sidecar
gRPC (port 7777) — same path the genesis CollectAndSetPeers uses to
learn node_id from node_key.json.

Changes:

- internal/controller/node/peers.go: extend resolveLabelPeers to call
  the peer's sidecar via r.Planner.BuildSidecarClient + GetNodeID, and
  compose <node_id>@<host>:<port>. host is Spec.ExternalAddress when
  set, else headless Service DNS at p2p port. On per-peer failure the
  whole resolve errors and controller-runtime retries — matches the
  genesis path semantics.

- internal/planner/planner.go: Label sources route to
  PeerSourceStatic (same source path the genesis-Static flow already
  uses) instead of PeerSourceDNSEndpoints. The sidecar's
  DNSEndpointsSource is unchanged and still serves EC2Tags peers.

- internal/task/task.go: widen task.SidecarClient interface with
  GetNodeID(ctx) (string, error). The full *sidecar.SidecarClient
  already implements this; mocks in tests get one-line stubs.

- api/v1alpha1/seinode_types.go: ResolvedPeers godoc updated for the
  new composed wire format.

Tests:

- New TestReconcilePeers_PrefersExternalAddress covers the
  Spec.ExternalAddress branch.
- Existing peers tests updated for the new composed format.
- Envtest StubSidecarClient gains a GetNodeID stub returning
  "stub-node-id". Suite still passes (~64s).

Wire-format change to Status.ResolvedPeers is the only one-way door.
The field was previously bare hosts; consumers are the planner only,
which is updated in this same PR.

Refs #368

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(peers): per-peer skip + prior-entry preservation on transient sidecar failure

Per platform-engineer's HOLD on PR #369: atomic resolve semantics
wedge fleet-wide reconciles when any one label-matched peer's sidecar
is transient-unavailable. In pacific-1 (6+ SNDs with mutual label
selectors), a single archive sidecar restart would block every
consumer's reconcile loop — not just peer updates, all of it, because
reconcilePeers runs before the status patch.

Fix:
- On per-peer BuildSidecarClient or GetNodeID error, preserve that
  peer's prior entry from Status.ResolvedPeers if available. Indexed
  by host:port to look up the existing <node_id>@host:port.
- If no prior entry exists (peer never resolved), skip with a
  structured log line, do not fail the reconcile.
- The genesis CollectAndSetPeers path keeps its atomic semantics —
  that's a one-shot bounded-cohort operation where retry is correct.
- New tests: TestReconcilePeers_PreservesPriorEntryOnTransientFailure,
  TestReconcilePeers_SkipsNewPeerOnSidecarFailure.

Addresses the platform-engineer HOLD reason on PR #369. K8s-specialist
+ sei-network-specialist already GO'd the atomic version; the
"silent partial list" concern they raised is mitigated by the
structured log line on every skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(peers): comment sweep — drop narration, lift WHY to PR body

5-9 lines → 2-4 across resolveLabelPeers, indexResolvedPeersByHost,
peerAddress, the planner-side Label→Static branch, the SidecarClient
interface, Status.ResolvedPeers godoc, and the envtest stub. No
behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(peers): nil-guard BuildSidecarClient to honor planner's documented contract

NodeResolver.BuildSidecarClient is documented as nilable ("Nil factory
skips the sidecar probe; used by tests") and ResolvePlan nil-guards it.
resolveLabelPeers was calling it unconditionally — a panic waiting on a
test or future refactor that leaves the factory unset. Treat nil as a
transient per-peer failure (matches existing fallback semantics: preserve
prior entry or skip with a log line). Caught by Cursor Bugbot review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(peers): cover nil-factory preserve-prior path + drop var-block

K8s-specialist cross-review on the nil-guard fix asked for the missing
test variant: nil factory + prior entry present must route to
preserve-prior, not skip. Locks errNoSidecarFactory into the same
recovery path as runtime transient sidecar errors.

Also: drop the var ( ) block at the call site for a single var + sentinel
init — closer to the boring-clear-code style used elsewhere in the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(peers): final comment sweep on nil-guard follow-up

errNoSidecarFactory doc 3 lines → 2; new test docstrings shortened to
match the file's existing one-line convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(manifests): regenerate CRDs for ResolvedPeers godoc trim

Earlier comment sweep on api/v1alpha1/seinode_types.go didn't carry into
the generated CRD YAMLs; verify-generated caught the drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant