Multi-region: one control plane, many datacenters (region → zone model, agent-based transport) #157

HiranAdikari · 2026-06-12T07:06:41Z

HiranAdikari
Jun 12, 2026
Collaborator

Problem

dc-api today manages exactly one site: one Harvester cluster, one Rancher server, one KubeOVN fabric, configured by a single set of DCAPI_* env vars. A second datacenter is coming online, and the product goal is one control plane that deploys and manages resources across many datacenters — the way a public cloud exposes regions — without tenants ever caring where dc-api itself runs.

Two hard requirements shape the design:

Symmetry. The control plane is hosted inside one of the datacenters, but it must talk to its local region exactly the way it talks to every remote region. No in-cluster shortcuts, no privileged local path — relocating the control plane must be a redeploy, not a redesign.
Zones from day one. A region may eventually contain more than one Harvester cluster. The model is region → zone, where a zone is one Harvester (+ its Rancher); naming and schema should assume this even while every region has exactly one zone.

Proposed model

Region/zone as first-class API objects. GET /v1/regions lists regions and their zones with health. Every regional resource (VNet, VM, cluster, bastion, volume) carries an immutable region (and eventually zone) attribute. Tenants and projects stay region-agnostic.

Containment does the validation. A VM's region derives from its VNet; a cluster's from its VNet; a bastion's from its VNet. Creating a VM in region B against a VNet in region A is a 422 with a clear message — no special-casing, the parent resource's region is simply authoritative. Only root resources (VNets, key vaults) take region as a free choice. VNet peering stays same-region until a cross-region fabric exists.
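A minimal sketch of the containment rule, assuming a child resource may either omit its region (and inherit the VNet's) or state one that must match. The names `VNet`, `RegionMismatchError`, `ResolveRegion`, and `StatusFor` are hypothetical, not existing dc-api identifiers:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// VNet is a root resource: its region is a free choice at creation time.
type VNet struct {
	ID     string
	Region string
}

// RegionMismatchError (hypothetical) reports a child placed outside its parent's region.
type RegionMismatchError struct {
	Requested string
	Parent    string
	VNetID    string
}

func (e *RegionMismatchError) Error() string {
	return fmt.Sprintf("region %q does not match region %q of vnet %s",
		e.Requested, e.Parent, e.VNetID)
}

// ResolveRegion returns the region a child resource (VM, cluster, bastion)
// must live in. The parent VNet's region is authoritative; an empty request
// simply inherits it.
func ResolveRegion(requested string, parent VNet) (string, error) {
	if requested != "" && requested != parent.Region {
		return "", &RegionMismatchError{Requested: requested, Parent: parent.Region, VNetID: parent.ID}
	}
	return parent.Region, nil
}

// StatusFor maps a validation error to the HTTP status a handler would return.
func StatusFor(err error) int {
	var mismatch *RegionMismatchError
	if errors.As(err, &mismatch) {
		return http.StatusUnprocessableEntity // 422
	}
	return http.StatusInternalServerError
}

func main() {
	vnet := VNet{ID: "vnet-1", Region: "region-a"}
	_, err := ResolveRegion("region-b", vnet)
	fmt.Println(StatusFor(err), err)
}
```

Because the check lives in one resolver, VM, cluster, and bastion handlers would all get the same behavior without per-resource special-casing.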

Provider registry. The Strategy/Factory layer already isolates handlers from backends. The singleton provider set becomes a registry keyed by region/zone: providers.For(region, zone) returns the same Compute/Cluster/Network interfaces. Handlers resolve the registry through the resource's region; the reconciler runs per-region loops so one region's outage can't stall another's reconciliation.
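One possible shape for the registry, as a sketch only. `Compute` here is a one-method stand-in for the real provider interface, and the error return on `For` is an assumption (the proposal only names `providers.For(region, zone)`):

```go
package main

import (
	"fmt"
	"sync"
)

// Compute stands in for one of the existing provider interfaces.
type Compute interface {
	Backend() string
}

// Providers bundles the per-zone provider set; Cluster and Network
// would sit alongside Compute in the same way.
type Providers struct {
	Compute Compute
}

type zoneKey struct{ region, zone string }

// Registry replaces the singleton provider set with one set per region/zone.
type Registry struct {
	mu   sync.RWMutex
	sets map[zoneKey]Providers
}

func NewRegistry() *Registry {
	return &Registry{sets: make(map[zoneKey]Providers)}
}

// Register installs the provider set for a region/zone.
func (r *Registry) Register(region, zone string, p Providers) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.sets[zoneKey{region, zone}] = p
}

// For returns the provider set for a region/zone, or an error if none exists.
func (r *Registry) For(region, zone string) (Providers, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.sets[zoneKey{region, zone}]
	if !ok {
		return Providers{}, fmt.Errorf("no providers registered for %s/%s", region, zone)
	}
	return p, nil
}

// staticCompute is a trivial Compute used only for this demonstration.
type staticCompute string

func (s staticCompute) Backend() string { return string(s) }

func main() {
	reg := NewRegistry()
	reg.Register("region-a", "zone-1", Providers{Compute: staticCompute("harvester-a")})

	p, err := reg.For("region-a", "zone-1")
	fmt.Println(p.Compute.Backend(), err)

	_, err = reg.For("region-b", "zone-1")
	fmt.Println(err)
}
```

A handler would call `For` with the region and zone stored on the resource, so the backend choice never appears in handler code.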

Transport: regional agents over outbound WSS (443)

Rather than routing the control plane into each datacenter's management network (site-to-site tunnels, O(n²) growth, inbound firewall holes), each region runs a small dc-agent that dials out to the control plane over WSS on 443 — internet-traversable, TLS end to end, nothing inbound to any datacenter. This is the same topology Rancher's cattle-agent, Azure Arc, and GitHub runners use, and it is the easier story to defend with security teams: each datacenter only ever opens an outbound HTTPS connection.

The agent holds the region's credentials (Harvester kubeconfig, Rancher token) locally — they never leave the datacenter, never sit in the control-plane DB. This strengthens the existing "credentials never leave dc-api" principle into "credentials never leave their region".
The control plane sends desired-state operations down the established channel; the agent executes against its local Harvester/Rancher/KubeOVN and streams results/status back. dc-api's async 202-plus-reconciler model already absorbs the eventual consistency.
The local region runs the identical agent, connecting to the control plane's service address like any other region — symmetry by construction.
Per-zone agents (or one agent per region managing its zones) keep the blast radius small; agent liveness doubles as the region/zone health signal surfaced in GET /v1/regions and the dashboard.

Region registration & credentials

Admin-driven and API-managed: POST /v1/admin/regions creates the region and mints a one-time agent bootstrap token; the agent is deployed in the datacenter with that token, dials in, completes a token exchange, and receives its long-lived identity (mTLS cert or rotating token). Decommissioning = revoke the agent identity. No region credentials are ever uploaded to the control plane.
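A sketch of the one-time property of the bootstrap token, assuming the control plane stores only a hash and deletes it on first redemption. `BootstrapTokens`, `Mint`, and `Redeem` are hypothetical names, and issuing the long-lived identity (mTLS cert or rotating token) is left out:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

var ErrInvalidToken = errors.New("bootstrap token invalid or already used")

// BootstrapTokens tracks outstanding one-time tokens by hash, so the
// plaintext token is never stored on the control plane.
type BootstrapTokens struct {
	mu      sync.Mutex
	pending map[[32]byte]string // sha256(token) -> region
}

func NewBootstrapTokens() *BootstrapTokens {
	return &BootstrapTokens{pending: make(map[[32]byte]string)}
}

// Mint creates a random token bound to a region and returns it once.
func (b *BootstrapTokens) Mint(region string) (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	token := hex.EncodeToString(raw)

	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[sha256.Sum256([]byte(token))] = region
	return token, nil
}

// Redeem consumes the token: the first call returns the region,
// every later call fails.
func (b *BootstrapTokens) Redeem(token string) (string, error) {
	sum := sha256.Sum256([]byte(token))

	b.mu.Lock()
	defer b.mu.Unlock()
	region, ok := b.pending[sum]
	if !ok {
		return "", ErrInvalidToken
	}
	delete(b.pending, sum)
	return region, nil
}

func main() {
	tokens := NewBootstrapTokens()
	tok, err := tokens.Mint("region-b")
	if err != nil {
		panic(err)
	}
	fmt.Println(tokens.Redeem(tok)) // first exchange succeeds
	fmt.Println(tokens.Redeem(tok)) // replay is rejected
}
```

A real implementation would presumably also expire unredeemed tokens; that is omitted here.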

Quotas & placement

Per-region tenant caps and project quotas (a region's capacity is physically its own), mirroring how public clouds scope quota by region. Project presence in a region materializes lazily — the per-project namespace/quota mirror is provisioned in a region the first time the project deploys there, not eagerly in every region at project creation.

Other consequences

Images are per-region catalogs (an image must exist on each region's Harvester); whether they're centrally synced or per-region curated is an open question below.
Data-plane independence: if the control plane's host datacenter is lost, workloads in other regions keep running unmanaged. DR for the control plane = DB replica/backup shipped to another region + redeploy-from-GitOps runbook.
The audit/activity framework and RBAC are region-agnostic already (events and grants reference resources, not sites).

Phasing

| Phase | Scope |
| --- | --- |
| 0: schema readiness | `region`/`zone` columns on regional tables (defaulted to the current site), read-only in API/UI. Cheap; protects against painful retrofits. |
| 1: model + registry | Region/zone API objects, per-region provider registry, containment validation, region in create flows (CLI flag, UI selector), per-region reconcilers, per-region quotas. |
| 2: agent transport | dc-agent (WSS dial-out, bootstrap-token registration, credential locality), control plane speaks to ALL regions, including local, through agents. Region health surfaced. |
| 3: second region GA | Bootstrap the new datacenter via the agent, image catalog story, runbooks, dashboards. |
| later | Cross-region VNet peering, placement policies, DR automation, multi-zone scheduling within a region. |

Open questions

Agent protocol: hand-rolled WSS+JSON command stream vs gRPC streams vs reusing existing machinery (e.g. embedding a message queue). Bias: smallest thing that supports request/response + watch semantics, versioned from day one.
One agent per region managing N zones, or one per zone? (Per-zone keeps failure domains honest; per-region is less to operate.)
Image catalogs: central registry with per-region sync jobs, or per-region curation with a naming convention?
Does phase 1 ship with the direct-connection provider registry (over existing private connectivity) so multi-region semantics land before the agent transport, or do we hold regions until the agent exists? Bias: ship semantics first behind the registry; the agent slots in as a transport later without API changes.
How much of GET /v1/regions health is agent-liveness vs deeper probes (Harvester API reachability, capacity headroom)?

Feedback welcome, especially on the agent protocol choice (the first open question) and phase-1-without-agents (the fourth), which gate the implementation order.

HiranAdikari · 2026-06-12T09:22:06Z

HiranAdikari
Jun 12, 2026
Collaborator Author

Addendum: agent scope — generic manifest application and the per-service operators

A scope question came up: since the agent is deployed onto each Harvester at bootstrap, could it also apply Kubernetes manifests on command — and eventually replace the separate per-service operators (key vault, database, …) running on every Harvester? Decision after going through the operator code:

1. Generic manifest primitives: yes, in protocol v1 — they're essentially free. The agent must hold a local Kubernetes client regardless, because every provider operation it executes is a CR manipulation (KubeVirt VMs, KubeOVN VPCs/Subnets, KeyVaultInstance/DBInstance CRs). So the protocol's core verbs should be generic from day one — Apply(manifest), Delete(gvr, ns, name), Get/WatchStatus(gvr, ns, name) — and the typed provider operations ride on them. This costs nothing extra and keeps the protocol stable as services are added.

2. Operator delivery through the agent: yes, near-term win. Today each operator (and its CRDs) is delivered to each Harvester out-of-band. With manifest primitives, the control plane can ship and upgrade the operators through the agent at bootstrap — one delivery channel per datacenter, no per-operator installation plumbing, and operator versions become control-plane-managed facts.

3. Replacing the operators with the agent: deferred, deliberately. "Apply manifests" is not what the operators are for. Reading the key-vault controller (~1,500 lines): it watches CRs in a continuous reconcile loop with requeues, discovers the Raft leader pod among the vault replicas, drives the vault's own API (mounts, policies, AppRoles), handles unseal flows, and runs finalizer cleanup that connects into the service before letting CRs go. The database operator has the same shape per the managed-services framework contract. Replacing them means the agent would have to host controller loops (an embedded controller-runtime manager with per-service reconciler modules) — coherent as a future architecture ("one agent, many controllers" would collapse N operator Deployments per Harvester into one binary), but it couples agent releases to every service's controller code, unions their RBAC into one identity, and makes the agent the largest possible blast radius. That trade is not worth taking while the operator count is small.

Revisit trigger for #3: when the per-Harvester operator count or their upgrade orchestration becomes a real operational burden, the migration path is incremental — the agent already speaks the manifest/status/watch primitives, so individual operators can be absorbed as embedded reconciler modules one service at a time without protocol changes.
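A sketch of how typed provider operations could ride on the generic verbs from point 1. The `GVR`, `Manifest`, and `Primitives` types are hypothetical stand-ins (a real agent would use a Kubernetes client), the `Spec` shown is not the real KubeVirt schema, and WatchStatus is omitted for brevity:

```go
package main

import "fmt"

// GVR identifies a Kubernetes resource type (group/version/resource).
type GVR struct{ Group, Version, Resource string }

// Manifest is a simplified stand-in for an arbitrary namespaced object.
type Manifest struct {
	GVR       GVR
	Namespace string
	Name      string
	Spec      map[string]any
}

// Primitives is the generic verb set the protocol would expose.
type Primitives interface {
	Apply(m Manifest) error
	Delete(gvr GVR, ns, name string) error
	Get(gvr GVR, ns, name string) (Manifest, bool)
}

// fakeCluster is an in-memory Primitives used only to make the sketch runnable.
type fakeCluster struct{ objects map[string]Manifest }

func newFakeCluster() *fakeCluster {
	return &fakeCluster{objects: make(map[string]Manifest)}
}

func objectKey(gvr GVR, ns, name string) string {
	return fmt.Sprintf("%s/%s/%s/%s/%s", gvr.Group, gvr.Version, gvr.Resource, ns, name)
}

func (c *fakeCluster) Apply(m Manifest) error {
	c.objects[objectKey(m.GVR, m.Namespace, m.Name)] = m // create or overwrite
	return nil
}

func (c *fakeCluster) Delete(gvr GVR, ns, name string) error {
	delete(c.objects, objectKey(gvr, ns, name))
	return nil
}

func (c *fakeCluster) Get(gvr GVR, ns, name string) (Manifest, bool) {
	m, ok := c.objects[objectKey(gvr, ns, name)]
	return m, ok
}

// vmGVR is the KubeVirt VirtualMachine resource.
var vmGVR = GVR{Group: "kubevirt.io", Version: "v1", Resource: "virtualmachines"}

// CreateVM is a typed provider operation expressed purely as an Apply.
func CreateVM(p Primitives, ns, name string, cpuCores int) error {
	return p.Apply(Manifest{
		GVR:       vmGVR,
		Namespace: ns,
		Name:      name,
		Spec:      map[string]any{"cpuCores": cpuCores}, // illustrative field only
	})
}

func main() {
	cluster := newFakeCluster()
	if err := CreateVM(cluster, "tenant-a", "vm-1", 2); err != nil {
		panic(err)
	}
	_, found := cluster.Get(vmGVR, "tenant-a", "vm-1")
	fmt.Println("vm present:", found)
}
```

Adding a new service under this shape means adding a typed helper like `CreateVM`, not a new protocol verb.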


HiranAdikari · 2026-06-13T07:35:08Z

HiranAdikari
Jun 13, 2026
Collaborator Author

Milestone breakdown: from rails (foundation) to agent-executed provisioning

The foundation PR (#160) lands the rails — the region/zone model, the outbound WebSocket channel, agent tokens, and derived health. It is dark and non-breaking: an agent can connect and show up as healthy, but nothing routes provisioning through it yet, and dc-api still calls Harvester/Rancher directly. This comment maps the path from there to the actual goal.

Design principle: every cluster is a region — uniformly

There is no privileged "home" region and no direct-to-Harvester path anywhere. The Harvester cluster the control plane happens to run on is reached exactly like every other one: through that cluster's agent, dialing out to the control plane. dc-api never holds a Harvester kubeconfig for any region — not even the one it sits on.

The payoff is that the control plane becomes location-independent. Its only hard runtime dependency is that agents can reach its WSS endpoint over 443. Move it to a different Harvester, a different cloud, or a workstation, and nothing changes for the regions — the agents simply reconnect to the new endpoint. One code path for one and for a hundred clusters.

End-state credential model

dc-api stops holding cluster credentials entirely. In their place:

A region-local, RBAC-scoped Kubernetes ServiceAccount per agent. The agent runs inside its cluster and talks to the local kube-apiserver with its mounted SA token, scoped to exactly what it provisions (KubeVirt VMs, KubeOVN resources, tenant namespaces). That identity never leaves the region.
A capability-scoped channel token held by the control plane. dc-api keeps only the dcagent_ bearer — it authorizes "this agent may connect as region X / zone Y and receive intent," not "admin on region X's cluster."

Blast radius shrinks from "compromise the control plane → every region's clusters" to "compromise the control plane → the channel tokens" (an attacker could send malicious intent, which is bounded and auditable, but does not get the raw cluster keys). And because every agent dials out, no region exposes an inbound API.

Milestones

M-A — Protocol v1: operation verbs. Extend the v0 envelope (same JSON-over-WS framing, forward-compatible) with Apply / Delete / GetStatus / WatchStatus, plus a manifest primitive (apply an arbitrary namespaced manifest, within an allowlist). Adds correlation IDs, an async ack, a status-update frame, and a structured error model. v0 agents keep working (unknown frames are already tolerated).

M-B — Agent executor. Give the agent an in-cluster client that applies intent idempotently against the local cluster, reconciles, and reports status back over the channel. Defines the RBAC-scoped ServiceAccount and the reconcile/retry semantics. The manifest primitive here is also what lets us deliver per-service operators (DB, KV, …) through the agent instead of running a separate operator install per cluster.

M-C — Provider routing in dc-api. Implement the existing compute/network provider interfaces as a channel-backed provider selected per region/zone via the provider registry — it serializes intent to the right agent instead of calling Harvester directly. This replaces the direct driver for all regions (no special case). Status flows back into the resources table through the reconciler.

M-D — Credential isolation + control-plane portability. Remove Harvester kubeconfigs from dc-api entirely; agents run with in-cluster SAs. Document the bootstrap path (how the first agent and the control plane itself come up before any channel exists) and a "move the control plane" runbook.

Open questions to resolve as we go

Bootstrap / chicken-and-egg. Steady state is uniform, but the very first agent and the control plane itself need some initial access to be deployed. How is the first agent's token seeded, and how is the control plane stood up on a fresh cluster?
Rancher's place. The agent model cleanly covers Harvester (per-region IaaS). Where does Rancher (RKE2 management) sit in the uniform model — one per region behind its agent, or central?
Intent authorization. Does the agent trust whatever the control plane sends, or validate intent against a local allowlist (operations, namespaces, image sources) as defense-in-depth? A compromised control plane can still send bad intent otherwise.
Delivery semantics. Status streaming vs poll; ordering, backpressure, at-least-once vs exactly-once for Apply; idempotency keys.
Version skew. Agent upgrade story and protocol negotiation when control plane and agents are on different versions.


HiranAdikari · 2026-06-15T06:21:31Z

HiranAdikari
Jun 15, 2026
Collaborator Author

Foundation is live — and an agent has now connected to the deployed control plane, end to end

Status update on the rails from the milestone breakdown above: #160 is merged and deployed, and as of today the loop is proven for real. A dc-agent running on a laptop dialed the deployed dc-api outbound over the WebSocket channel, authenticated with a minted dcagent_ token, sent its hello frame, and zone-1 flipped from agent: null / unknown to status: up with a live agent record. No Harvester kubeconfig was involved on the dc-api side for that handshake — the only credential that travelled was the agent's own bearer token, exactly as designed.

The honest, no-magic version of how it was wired: the agent reached dc-api through a kubectl port-forward for this first proof, which conveniently sidesteps the bot/JS challenge on the public API host. That challenge is precisely why #161 (below) adds a dedicated, unchallenged connect ingress for programmatic agents. The token was minted through the deployed admin endpoint (POST /v1/admin/regions/{region}/zones/{zone}/agent-token) using a real dc-admin OIDC token, so the full operator path works against the deployed stack and not a local stub: admin authenticates → mints → agent connects → zone goes healthy.

Agent log (laptop dc-agent → deployed dc-api):

Region card in cloud-ui reading GET /v1/regions:

Supporting PRs now open

Add dc-agent token minting, CLI, and connect ingress #161 — register-region/zone-on-mint (a freshly bootstrapped cluster comes into being by minting its first token), the dcctl admin agent-token CLI, the dedicated connect ingress (placeholder host in the base; the per-env overlay patches the real host + TLS secret), and sourcing DCAPI_OIDC_AUDIENCE from a single Terraform output so a client's audience can no longer be silently dropped.
(consumer) the dcctl app now embeds groups in its access token (access_token_attributes), which is the claim dc-api authorizes against. Without it, a public PKCE client like dcctl receives no dc-admin / tenant groups and is rejected from every scoped call.

Next: M-A — turn the heartbeat into a command channel (protocol v1)

The channel today speaks four frames (hello / hello_ack / ping / pong) — pure presence. M-A makes it carry work, without yet moving any provider off the direct path (still dark and non-breaking). The plan:

Frame vocabulary. Add request/response frames carrying a correlation id so dc-api can issue a command and match its reply, plus a streaming *_progress variant for long operations. Errors are structured (code + message); every request is idempotent under an op id the agent dedupes on.
First slice — get_inventory (read-only). dc-api asks an agent; the agent reads its local cluster (nodes, allocatable vs. used CPU/mem, VM count) and replies. Low blast radius, and immediately visible: the region card goes from "up" to "up · 3 nodes · 48 vCPU free." This validates the whole request/response + correlation design before any mutation touches a cluster.
Then apply / delete (mutating). dc-api hands the agent a Kubernetes manifest — a KubeVirt VM CR first — and the agent applies it locally and reports status. This is the slice where dc-api stops needing that cluster's kubeconfig for that operation, and the manifest primitive doubles as how the per-service operators (key vault, database) get delivered through the agent, per the addendum above.
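The "idempotent under an op id the agent dedupes on" rule from the frame vocabulary can be sketched as a small agent-side cache. `Deduper` and `Result` are hypothetical names, and the cache here is unbounded, whereas a real agent would need the dedupe window the open questions mention:

```go
package main

import (
	"fmt"
	"sync"
)

// Result is a simplified outcome of one operation.
type Result struct {
	OK     bool
	Detail string
}

// Deduper remembers the result of each op id so a redelivered request
// (at-least-once delivery) is answered without executing twice.
type Deduper struct {
	mu   sync.Mutex
	done map[string]Result
}

func NewDeduper() *Deduper {
	return &Deduper{done: make(map[string]Result)}
}

// Do runs the operation at most once per op id and replays the stored
// result for duplicates. Holding the lock while running serializes
// operations, which keeps the sketch simple at the cost of concurrency.
func (d *Deduper) Do(opID string, run func() Result) Result {
	d.mu.Lock()
	defer d.mu.Unlock()
	if res, ok := d.done[opID]; ok {
		return res
	}
	res := run()
	d.done[opID] = res
	return res
}

func main() {
	d := NewDeduper()
	applied := 0
	apply := func() Result {
		applied++
		return Result{OK: true, Detail: "vm applied"}
	}

	d.Do("op-1", apply)
	d.Do("op-1", apply) // redelivery of the same request
	fmt.Println("executions:", applied)
}
```

With this in place the control plane can retry a request after a dropped connection without risking a double apply.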

Open questions I'll settle in the design pass before coding: delivery semantics (at-least-once + the agent-side dedupe window); how WatchStatus maps onto the streaming frame vs. dc-api polling; the error taxonomy (transient vs. terminal) and who owns retries; and intent authorization — whether the agent blindly trusts dc-api or enforces a capability allowlist on what it will apply.

Starting the M-A design now — I'll post the protocol sketch here for review before implementation.


HiranAdikari · 2026-06-15T06:45:20Z

HiranAdikari
Jun 15, 2026
Collaborator Author

M-A design sketch — the frame envelope and the regions visibility model

Two design points are now concrete (the full design lands as docs/multi-region-protocol-v1.md in the M-A PR; the protocol frames are already implemented and tested).

1. Frame envelope — one generic request/response, not a type per verb. Rather than a distinct frame type per operation, v1 uses a single envelope with a correlation id and an op name, so the wire grammar stays fixed (six frame types, forever) and verbs grow in a registry instead of the grammar:

req       → { "type":"req",      "id":"<uuid>", "op":"get_inventory", "params":{…} }
res (ok)  → { "type":"res",      "id":"<uuid>", "ok":true,  "result":{…} }
res (err) → { "type":"res",      "id":"<uuid>", "ok":false, "error":{ "code":"…", "message":"…" } }
progress  → { "type":"progress", "id":"<uuid>", "stage":"…", "detail":"…" }   // advisory, 0+

The agent advertises the ops it can serve on hello ("ops":[…], omitted by a v0 agent), so a verb an older agent can't do returns a clean OP_UNSUPPORTED rather than a timeout. Dark and non-breaking: v0 frames are byte-identical, and both ends still tolerate unknown frame types in either direction.
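The envelope maps naturally onto a single Go struct plus an op registry. This is a sketch under assumptions, not the implemented frames: the field set mirrors the req/res shapes above (progress is omitted), and `Dispatch`, `Handler`, `defaultOps`, and the `INTERNAL` error code are hypothetical. Only `OP_UNSUPPORTED` comes from the design:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FrameError is the structured error carried by a failed res frame.
type FrameError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// Frame covers req and res in one struct; unused fields are omitted on the wire.
type Frame struct {
	Type   string          `json:"type"`
	ID     string          `json:"id,omitempty"`
	Op     string          `json:"op,omitempty"`
	Params json.RawMessage `json:"params,omitempty"`
	OK     *bool           `json:"ok,omitempty"` // pointer so ok:false is still emitted
	Result json.RawMessage `json:"result,omitempty"`
	Error  *FrameError     `json:"error,omitempty"`
}

// Handler serves one op; new verbs are added to the registry, not the grammar.
type Handler func(params json.RawMessage) (any, error)

func fail(id, code, msg string) Frame {
	ok := false
	return Frame{Type: "res", ID: id, OK: &ok, Error: &FrameError{Code: code, Message: msg}}
}

// Dispatch routes a req frame to its handler and builds the matching res frame.
func Dispatch(req Frame, ops map[string]Handler) Frame {
	h, found := ops[req.Op]
	if !found {
		return fail(req.ID, "OP_UNSUPPORTED", "unknown op "+req.Op)
	}
	out, err := h(req.Params)
	if err != nil {
		return fail(req.ID, "INTERNAL", err.Error())
	}
	body, err := json.Marshal(out)
	if err != nil {
		return fail(req.ID, "INTERNAL", err.Error())
	}
	ok := true
	return Frame{Type: "res", ID: req.ID, OK: &ok, Result: body}
}

// defaultOps returns a registry with a stubbed get_inventory.
func defaultOps() map[string]Handler {
	return map[string]Handler{
		"get_inventory": func(json.RawMessage) (any, error) {
			return map[string]int{"nodes": 3}, nil
		},
	}
}

func main() {
	raw := `{"type":"req","id":"42","op":"get_inventory","params":{}}`
	var req Frame
	if err := json.Unmarshal([]byte(raw), &req); err != nil {
		panic(err)
	}
	out, err := json.Marshal(Dispatch(req, defaultOps()))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Here the agent answers an unknown op itself; the design above also lets dc-api short-circuit using the ops list advertised on hello.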

2. Regions visibility — admin vs. tenant. Node counts and capacity (from get_inventory) are Harvester-internal, so the rich region card is admin-only. Tenants never get a regions dashboard — to a tenant a region is only a placement target: create flows list only placeable (up) regions, and it is enforced server-side (handlers reject placement into a down/unknown region with 409/422), not merely filtered in the dropdown. Existing resources in a region that goes down stay listed and simply report down — never hidden or deleted.

Sequencing note worth flagging: today provisioning still runs direct-to-Harvester, so a region's live status means "an agent is connected", not "this region can provision". The live-agent placement gate therefore switches on with M-C (when the agent becomes the provisioning path); until then placeability rides an admin "region enabled" flag, so nothing that works today breaks. In short: placeable = enabled AND healthy-path.
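The gate can be written down as a tiny predicate. This sketch assumes "healthy-path" means "always true while provisioning is direct" before M-C and "agent is up" afterwards; the `Zone` fields and the `agentIsProvisioningPath` switch are hypothetical names:

```go
package main

import "fmt"

// Zone carries the two inputs to the placement decision.
type Zone struct {
	Enabled bool // admin "region enabled" flag
	AgentUp bool // live agent connection
}

// Placeable implements placeable = enabled AND healthy-path.
// Before M-C the agent is not the provisioning path, so agent liveness
// does not gate placement; after M-C it does.
func Placeable(z Zone, agentIsProvisioningPath bool) bool {
	healthyPath := true
	if agentIsProvisioningPath {
		healthyPath = z.AgentUp
	}
	return z.Enabled && healthyPath
}

func main() {
	z := Zone{Enabled: true, AgentUp: false}
	fmt.Println("before M-C:", Placeable(z, false))
	fmt.Println("after M-C: ", Placeable(z, true))
}
```

The server-side create handlers would call the same predicate the dropdown uses, which is what makes the rule enforced rather than merely filtered.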

Proceeding to build the RPC machinery now — server Session/registry and serve() routing → agent dispatch + a local-cluster Executor → the admin-only inventory endpoint → the admin region card.

