Repo Radius: externalize control-plane and Terraform state (rad startup / rad shutdown)#12214
Dependency Review
✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.
Scanned Files: None
Pull request overview
Adds Repo Radius “state externalization” support by enabling PostgreSQL-backed control-plane state and introducing rad shutdown / rad startup to back up/restore both PostgreSQL databases and Terraform recipe state (Kubernetes Secrets) via a git orphan branch.
Changes:
- Enable database.enabled=true end-to-end in the Helm chart (PostgreSQL provider wiring, init-db ConfigMap, image pinning, chart tests).
- Add Terraform-state Secret backup/restore (pkg/cli/tfstate) and PostgreSQL dump/restore (pkg/cli/pgbackup), persisted through a git orphan-branch worktree (pkg/cli/gitstate).
- Add rad startup / rad shutdown commands plus unit tests and an opt-in destructive functional lifecycle test.
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test/functional-portable/statestore/noncloud/statestore_lifecycle_test.go | Adds opt-in destructive E2E lifecycle test for shutdown/startup preserving Terraform state across reinstall. |
| pkg/components/database/databaseprovider/storageprovider_test.go | Adds unit tests for ${VAR} env substitution in PostgreSQL connection URLs. |
| pkg/components/database/databaseprovider/factory.go | Implements ${VAR} expansion for PostgreSQL URL config via environment variables. |
| pkg/cli/tfstate/tfstate.go | New package to back up/restore Terraform Kubernetes-backend state Secrets to/from the state directory. |
| pkg/cli/tfstate/tfstate_test.go | Unit tests for tfstate backup/restore (fake clientset). |
| pkg/cli/pgbackup/pgbackup.go | New helpers to pg_dump / psql the control-plane DBs via kubectl exec, plus readiness wait. |
| pkg/cli/gitstate/gitstate.go | New orphan-branch worktree manager for persisting state outside the main working tree. |
| pkg/cli/gitstate/gitstate_test.go | Unit tests for orphan-branch worktree and push behavior using real temp git repos. |
| pkg/cli/cmd/startup/stateclient.go | Startup command wrapper interface over pg/tf restore operations (testable). |
| pkg/cli/cmd/startup/startup.go | Implements rad startup to open worktree, wait for DB, restore DB + Terraform state. |
| pkg/cli/cmd/startup/startup_test.go | Unit tests ensuring restore order and failure short-circuiting for rad startup. |
| pkg/cli/cmd/shutdown/stateclient.go | Shutdown command wrapper interface over pg/tf backup operations (testable). |
| pkg/cli/cmd/shutdown/shutdown.go | Implements rad shutdown to back up DB + Terraform state then commit/push to orphan branch. |
| pkg/cli/cmd/shutdown/shutdown_test.go | Unit tests ensuring both stores are backed up and commit/push happens (or fails correctly). |
| eng/design-notes/2026-06-repo-radius-state-storage.md | Adds technical design note for Repo Radius state storage approach and decisions. |
| deploy/Chart/values.yaml | Pins PostgreSQL image tag default to 16-alpine. |
| deploy/Chart/tests/database_test.yaml | Adds helm-unittest coverage for database-enabled/disabled rendering and wiring. |
| deploy/Chart/templates/ucp/deployment.yaml | Injects POSTGRES_PASSWORD env var into UCP when database.enabled=true. |
| deploy/Chart/templates/ucp/configmaps.yaml | Switches UCP database provider to PostgreSQL when enabled (URL uses ${POSTGRES_PASSWORD}). |
| deploy/Chart/templates/rp/deployment.yaml | Injects POSTGRES_PASSWORD into applications-rp when enabled. |
| deploy/Chart/templates/rp/configmaps.yaml | Switches applications-rp database provider to PostgreSQL when enabled. |
| deploy/Chart/templates/dynamic-rp/deployment.yaml | Injects POSTGRES_PASSWORD into dynamic-rp when enabled. |
| deploy/Chart/templates/dynamic-rp/configmaps.yaml | Switches dynamic-rp database provider to PostgreSQL when enabled. |
| deploy/Chart/templates/database/statefulset.yaml | Mounts init-db scripts into the Postgres StatefulSet. |
| deploy/Chart/templates/database/configmaps.yaml | Fixes POSTGRES_DB secret value to literal "radius". |
| deploy/Chart/templates/database/configmap-initdb.yaml | Adds init-db script ConfigMap to create per-RP DBs/users and the resources table. |
| cmd/rad/cmd/root.go | Wires rad startup and rad shutdown into the CLI root command. |
40a7aaf to 83c7893
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #12214 +/- ##
==========================================
+ Coverage 52.81% 52.84% +0.03%
==========================================
Files 743 751 +8
Lines 47788 48313 +525
==========================================
+ Hits 25238 25532 +294
- Misses 20197 20385 +188
- Partials 2353 2396 +43
Review: externalize control-plane & Terraform state (rad startup / rad shutdown)
I reviewed the code statically, ran the unit tests, reproduced the failures against a real Postgres container, and ran a full local end-to-end install on kind with database.enabled=true. The new core Go packages (gitstate, tfstate, controlplane) are well-structured and the factory.go env-var fix is correct — but the database.enabled=true startup path does not work out of the box. There are two independent blocking bugs, both reproduced end-to-end, plus some non-blocking notes and test-coverage gaps.
The two blocking items also have inline comments anchored to the exact lines (deploy/Chart/values.yaml and deploy/Chart/templates/database/configmap-initdb.yaml).
🔴 Blocking
1. database.tag: 16-alpine points at a mirror image that does not exist
deploy/Chart/values.yaml:200-201 changes tag: latest → 16-alpine, resolving via the radius.image helper to ghcr.io/radius-project/mirror/postgres:16-alpine, which is not in the mirror. database-0 sits in ImagePullBackOff and the control plane never starts.
EXISTS ghcr.io/radius-project/mirror/postgres:latest <- pre-PR value
NOT FOUND ghcr.io/radius-project/mirror/postgres:16-alpine <- this PR's value
NOT FOUND ghcr.io/radius-project/mirror/postgres:16
NOT FOUND ghcr.io/radius-project/mirror/postgres:16.4-alpine
EXISTS docker.io/library/postgres:16-alpine <- real upstream
The upstream tag exists on Docker Hub, but the radius mirror was never populated with it. This breaks database.enabled=true in CI and every default-registry environment. I could only get past it with --set database.image=docker.io/library/postgres.
Fix (pick one): populate the mirror with postgres:16-alpine before merge, revert to a tag that exists in the mirror, or point image/tag at an upstream-resolvable reference. Either way, add CI coverage for database.enabled=true so this can't regress silently.
2. permission denied for table resources — init-db never grants the per-RP users
deploy/Chart/templates/database/configmap-initdb.yaml:33-48 creates the resources table while connected as $POSTGRES_USER (superuser radius), but each RP connects as its own per-RP user (ucp, applications_rp, dynamic_rp) and is never granted privileges. After overriding the image so Postgres started, init-db ran cleanly and created the per-RP DBs/tables, then UCP crashed on startup:
Service api terminated with error: ERROR: permission denied for table resources (SQLSTATE 42501)
followed by a nil-pointer panic on the failed-startup shutdown path. applications-rp/dynamic-rp hit the same wall as soon as they take traffic.
Fix: in the table-creation loop, grant the per-RP user (precedent already exists in build/scripts/start-radius.sh:247-248):
GRANT ALL PRIVILEGES ON TABLE resources TO "$RESOURCE_PROVIDER";
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO "$RESOURCE_PROVIDER";
or create the table while connected as the per-RP user instead of the superuser.
🟡 Non-blocking
- expandEnvURL silently expands unset vars to empty string (pkg/components/database/databaseprovider/factory.go:88-98). os.Getenv returns "" for a missing var, so a typo'd/unset variable yields a malformed connection string and a confusing downstream error rather than a clear "required env var X not set". Consider os.LookupEnv + fail-fast.
- init-db schema duplicates deploy/init-db/db.sql.txt. The resources DDL now lives in two places; they will drift. Source it from one place.
- replicasOf coerces 0 → 1. An explicit replicas: 0 (intentional scale-to-zero) is silently overridden to 1. Distinguish "unset" from "explicitly zero".
- Postgres password is regenerated on every chart apply, which will break an already-initialized data directory whose users were created with the previous password. Persist/Secret-pin it.
- rad startup / rad shutdown require a git repo but the error path when none is present could be friendlier.
- startup.Run defers a second ScaleUp that can double-scale on the success path. Worth a second look.
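The fail-fast suggestion for expandEnvURL could look roughly like the sketch below. It is not the code in factory.go: the function name expandEnvStrict, the ${VAR} pattern, the injected lookup parameter, and the example connection URL are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// envRef matches ${VAR} references. The exact pattern used in the PR is an assumption.
var envRef = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)\}`)

// expandEnvStrict is a hypothetical stand-in for expandEnvURL. It replaces every
// ${VAR} in raw using lookup and returns an error naming the first unset
// variable instead of silently substituting an empty string.
func expandEnvStrict(raw string, lookup func(string) (string, bool)) (string, error) {
	var missing string
	out := envRef.ReplaceAllStringFunc(raw, func(ref string) string {
		name := envRef.FindStringSubmatch(ref)[1]
		val, ok := lookup(name)
		if !ok && missing == "" {
			missing = name
		}
		return val
	})
	if missing != "" {
		return "", fmt.Errorf("required environment variable %s is not set", missing)
	}
	return out, nil
}

func main() {
	// The URL shape below is illustrative only.
	_, err := expandEnvStrict("postgresql://ucp:${POSTGRES_PASSWORD}@database:5432/ucp", os.LookupEnv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connection URL expanded")
}
```

Passing the lookup function in (os.LookupEnv in production, a closure in tests) keeps the helper testable without mutating the process environment, which also avoids the flaky-runner problem of a variable that happens to be set.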
🧪 Test coverage
Core packages are well covered (gitstate ~80%, tfstate ~79%, controlplane ~78%). Gaps:
- pkg/cli/pgbackup has 0% coverage — no test file at all. HasBackup is pure file logic and trivially unit-testable; please add pgbackup_test.go.
- Validate() is 0% in both pkg/cli/cmd/startup and pkg/cli/cmd/shutdown. 51 other cmd packages exercise Validate() via the shared radcli.SharedCommandValidation harness — please follow that pattern. (Run() is otherwise well covered via fakes.)
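As a rough idea of how small a HasBackup test can be, here is a self-contained sketch. The real signature and on-disk layout of pgbackup.HasBackup are not shown in this thread, so hasBackup below is a hypothetical stand-in that assumes a backup is "at least one non-empty *.sql file in the state directory".

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// hasBackup is a hypothetical stand-in for pgbackup.HasBackup. It reports
// whether dir contains at least one non-empty regular *.sql file.
func hasBackup(dir string) bool {
	matches, err := filepath.Glob(filepath.Join(dir, "*.sql"))
	if err != nil {
		return false
	}
	for _, m := range matches {
		if info, err := os.Stat(m); err == nil && info.Mode().IsRegular() && info.Size() > 0 {
			return true
		}
	}
	return false
}

// checkHasBackup exercises the empty-directory, zero-byte, and populated cases
// in a temporary directory and returns the first mismatch it finds.
func checkHasBackup() error {
	dir, err := os.MkdirTemp("", "pgbackup-*")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	if hasBackup(dir) {
		return errors.New("empty directory reported a backup")
	}
	dump := filepath.Join(dir, "ucp.sql")
	if err := os.WriteFile(dump, nil, 0o600); err != nil {
		return err
	}
	if hasBackup(dir) {
		return errors.New("zero-byte dump reported a backup")
	}
	if err := os.WriteFile(dump, []byte("CREATE TABLE resources ();\n"), 0o600); err != nil {
		return err
	}
	if !hasBackup(dir) {
		return errors.New("populated dump was not detected")
	}
	return nil
}

func main() {
	fmt.Println("check error:", checkHasBackup())
}
```

In the repository this would live in pgbackup_test.go as a table-driven test using t.TempDir(); the standalone form here is only so the sketch runs on its own.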
How I validated
Static review of all 30 changed files → go build/go test on changed packages (pass) → standalone postgres:16-alpine container to reproduce the GRANT failure → full kind install with custom-built images and database.enabled=true to confirm both blockers end-to-end. Both reproduce from a clean install. Happy to re-validate once they're addressed.
Thanks for the thorough end-to-end review — both blocking bugs are fixed in 9da4f58, and I've addressed most of the non-blocking notes. Summary:
🔴 Blocking — fixed
🟡 Non-blocking — fixed
🧪 Test coverage — fixed
Deferred (with rationale) — happy to do these if you'd like them in this PR
Ready for re-validation whenever you have a chance.
Filed the deferred follow-ups as tracking issues:
Fixes the pre-existing gaps that prevented database.enabled=true from
producing a working PostgreSQL-backed control plane:

- UCP, Applications RP, and Dynamic RP configmaps/deployments were
  hardcoded to the apiserver provider with no database.enabled
  conditional; they now switch to the postgresql provider and inject
  POSTGRES_PASSWORD from the database secret when database.enabled=true.
- Add init-db ConfigMap (mounted at /docker-entrypoint-initdb.d) that
  creates the per-RP databases, users, and tables on first start
  (option 3 from #8398).
- Fix POSTGRES_DB secret value (was the literal string POSTGRES_DB).
- Pin the postgres image tag to 16-alpine.
- Fix the databaseProvider URL env-var substitution in factory.go, which
  replaced the entire URL with the first captured variable name instead
  of expanding env-var references.

Adds helm-unittest coverage for the conditional rendering and a unit test
for the env-var expansion helper. Also adds the Repo Radius state-storage
technical design note.

Relates to #8096, #8398

Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
Terraform recipes store their state in Kubernetes Secrets (the Terraform
"kubernetes" backend) in the radius-system namespace, not in the Radius
PostgreSQL databases. On an ephemeral control plane those Secrets are
destroyed on teardown, so a second deploy of the same Terraform-backed
resource in a later run plans against an empty backend and either fails
or orphans cloud resources.
Add a pkg/cli/tfstate package that exports the Secrets labelled
tfstate=true to a state directory and restores them into a fresh cluster
before any deploy runs. The label selector also captures the chunked
tfstate-{workspace}-{suffix}-{index} Secrets the backend creates for
large state. Server-managed fields are stripped on backup, and restore
is idempotent (create-or-update).
Covered by unit tests using the client-go fake clientset.
Relates to #8096
Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
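The two properties called out in this commit message, stripping server-managed fields on backup and create-or-update on restore, can be modelled in a few lines. This is a simplified sketch, not the pkg/cli/tfstate code: the secret struct and the in-memory cluster map are stand-ins for the client-go Secret type and the Kubernetes API, and the Secret name is made up.

```go
package main

import "fmt"

// secret is a simplified stand-in for a Kubernetes Secret; it is not the client-go type.
type secret struct {
	Name            string
	Labels          map[string]string
	Data            map[string][]byte
	ResourceVersion string // server-managed
	UID             string // server-managed
}

// stripServerFields returns a shallow copy with the server-managed fields
// cleared, so the object can be re-created in a different cluster.
func stripServerFields(s secret) secret {
	s.ResourceVersion = ""
	s.UID = ""
	return s
}

// cluster is an in-memory stand-in for the Secrets of one namespace.
type cluster map[string]secret

// restore creates the Secret when it is absent and overwrites it when it is
// present, so running it twice leaves the cluster in the same state.
// It reports whether the Secret was newly created.
func restore(c cluster, s secret) (created bool) {
	_, exists := c[s.Name]
	c[s.Name] = stripServerFields(s)
	return !exists
}

func main() {
	c := cluster{}
	s := secret{
		Name:            "tfstate-default-example", // hypothetical name
		Labels:          map[string]string{"tfstate": "true"},
		ResourceVersion: "42",
		UID:             "abc",
	}
	fmt.Println("created:", restore(c, s))
	fmt.Println("created again:", restore(c, s))
}
```

Against a real API server the same shape usually becomes "try Create, fall back to Update on an already-exists error"; the stale resourceVersion from the old cluster is the main reason the server-managed fields have to be cleared first.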
Adds the two commands that back up and restore all durable Radius state
across an ephemeral control plane, plus the end-to-end lifecycle test.

The commands operate on the current workspace's Kubernetes context like
any other command. They do not create or delete clusters and do not
install Radius; cluster lifecycle is the caller's responsibility. There
is no dedicated workspace kind.

- pkg/cli/pgbackup: control-plane PostgreSQL backup/restore via
  "kubectl exec pg_dump/psql".
- pkg/cli/gitstate: persists the state directory to a git orphan branch
  (radius-state) in an isolated worktree. CommitAndPush fails loudly
  when a remote is configured (a failed backup push would otherwise be
  silent data loss) and tolerates the no-remote local/test case.
- rad shutdown: backs up the control-plane databases and the Terraform
  state Secrets, then commits and pushes them.
- rad startup: waits for the database, then restores the control-plane
  databases and the Terraform state Secrets.

The lifecycle test (test/functional-portable/statestore) installs Radius
with database.enabled=true, deploys a Terraform-backed resource, shuts
down, uninstalls, reinstalls, starts up, and deploys an update to the
same resource -- the cross-run path that fails when Terraform state is
lost. It is destructive and requires a cluster, so it is skipped unless
RADIUS_STATE_E2E is set.

Relates to #8096

Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
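For readers unfamiliar with the "kubectl exec pg_dump/psql" approach, the sketch below shows one plausible way to assemble the argument lists. It is an illustration only: the helper names, the choice of --clean/--if-exists and ON_ERROR_STOP, and the assumption that the databases are named after the per-RP users (ucp, applications_rp, dynamic_rp) are not taken from pkg/cli/pgbackup.

```go
package main

import (
	"fmt"
	"strings"
)

// pgDumpArgs builds kubectl arguments that run pg_dump inside the database pod.
// --clean and --if-exists make the dump emit DROP ... IF EXISTS before each CREATE.
func pgDumpArgs(namespace, pod, user, db string) []string {
	return []string{"exec", "-n", namespace, pod, "--",
		"pg_dump", "--clean", "--if-exists", "-U", user, db}
}

// psqlRestoreArgs builds kubectl arguments that feed a dump to psql over stdin (-i).
// ON_ERROR_STOP=1 makes psql exit non-zero on the first failed statement.
func psqlRestoreArgs(namespace, pod, user, db string) []string {
	return []string{"exec", "-i", "-n", namespace, pod, "--",
		"psql", "-v", "ON_ERROR_STOP=1", "-U", user, "-d", db}
}

func main() {
	// Database names are assumed to match the per-RP user names.
	for _, db := range []string{"ucp", "applications_rp", "dynamic_rp"} {
		fmt.Println("kubectl", strings.Join(pgDumpArgs("radius-system", "database-0", "radius", db), " "))
		fmt.Println("kubectl", strings.Join(psqlRestoreArgs("radius-system", "database-0", "radius", db), " "))
	}
}
```

Keeping argument construction in pure functions like these means the command shape can be unit-tested without a cluster; only the thin os/exec call around them needs an integration test.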
…e test

Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
- gitstate: treat a git fetch failure as fatal when the state branch
  exists on the remote, so a transient network/credential error cannot
  silently restore stale or empty state.
- gitstate: inject a fallback git identity (Radius <radius@radapp.io>)
  for the state commit when the repo has none configured, so rad
  shutdown works in fresh CI environments.
- tfstate: drop the write to the deprecated Secret.SelfLink field
  (staticcheck SA1019), which was failing the lint check.
- databaseprovider test: set MISSING explicitly so the env-var expansion
  test is not flaky on runners that happen to have it set.
- statestore e2e: assert at least one tfstate Secret exists rather than
  exactly one, since the backend may shard large state across multiple
  Secrets.
- design note: mark the checksum manifest as future work (not
  implemented in this delivery) and fix a spelling (behaviour ->
  behavior).
- .cspellignore: add Sylvain.

Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
rad startup runs after rad install, so the control-plane pods are
already running and connected to PostgreSQL when state is restored.
Restoring a pg_dump (DROP TABLE / CREATE TABLE) underneath those live
connections can invalidate the providers' cached prepared statements
(pgx default QueryExecModeCacheStatement; the resources table OID
changes), and races the UCP initializer's boot-time writes.

rad startup now scales the database-backed deployments (ucp,
applications-rp, dynamic-rp) to zero before the restore and back to
their previous replica counts afterward, via a new pkg/cli/controlplane
package. This makes the restore atomic with respect to its consumers and
ensures the providers establish fresh connection pools against the
restored schema. The deployment engine and dashboard do not connect to
PostgreSQL and are left running.

The control plane is always scaled back up, including on a failed
restore, and a deployment whose previous replica count was zero is
restored to one.

Adds unit tests for the scaler (fake clientset with a reconciling
reactor that mirrors spec replicas to status) and updates the startup
runner tests to assert the scale-down -> restore -> scale-up ordering.
Records the decision in the state-storage design note.

Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
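The scale-down -> restore -> scale-up ordering, with scale-up guaranteed even when the restore fails, maps naturally onto a single deferred call. The sketch below is a minimal model under assumed names (scaler, restoreWithScaling, fakeScaler); it is not the pkg/cli/controlplane API.

```go
package main

import (
	"errors"
	"fmt"
)

// scaler is a hypothetical interface over the control-plane deployments.
type scaler interface {
	ScaleDown() (prev map[string]int32, err error)
	ScaleUp(prev map[string]int32) error
}

// errRestore is a sentinel used by the demo and tests below.
var errRestore = errors.New("restore failed")

// restoreWithScaling scales the consumers to zero, runs restore, and always
// scales back up exactly once. A restore error takes precedence over a
// scale-up error.
func restoreWithScaling(s scaler, restore func() error) (err error) {
	prev, err := s.ScaleDown()
	if err != nil {
		return err
	}
	defer func() {
		if upErr := s.ScaleUp(prev); upErr != nil && err == nil {
			err = upErr
		}
	}()
	return restore()
}

// fakeScaler records the order of calls for assertions.
type fakeScaler struct{ calls []string }

func (f *fakeScaler) ScaleDown() (map[string]int32, error) {
	f.calls = append(f.calls, "down")
	return map[string]int32{"ucp": 1, "applications-rp": 1, "dynamic-rp": 1}, nil
}

func (f *fakeScaler) ScaleUp(map[string]int32) error {
	f.calls = append(f.calls, "up")
	return nil
}

func main() {
	f := &fakeScaler{}
	err := restoreWithScaling(f, func() error {
		f.calls = append(f.calls, "restore")
		return errRestore
	})
	fmt.Println(f.calls, err)
}
```

Because the deferred closure is the only place ScaleUp is called, the success path cannot scale up twice, which is the double-scale concern raised in the review.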
Two blocking bugs reproduced end-to-end by review, plus several
non-blocking improvements and test-coverage gaps.

Blocking:
- The postgres image resolved to
  ghcr.io/radius-project/mirror/postgres:16-alpine, which the registry
  mirror does not publish (only :latest), causing ImagePullBackOff.
  Point the chart at docker.io/library/postgres:16-alpine, which is
  pullable and keeps the version pinned.
- The init-db script created the resources table as the superuser but
  never granted the per-RP users (ucp, applications_rp, dynamic_rp)
  access, so UCP crashed on startup with "permission denied for table
  resources (42501)". Grant the per-RP user privileges on the table and
  sequences inside the table-creation loop (matches
  build/scripts/start-radius.sh).

Non-blocking:
- expandEnvURL now fails fast (os.LookupEnv) when a referenced env var
  is unset, instead of silently producing a malformed connection string.
- controlplane.replicasOf preserves an explicit replicas: 0 rather than
  coercing it to 1, so ScaleUp faithfully restores the prior state.

Test coverage:
- Add pkg/cli/pgbackup/pgbackup_test.go covering HasBackup.
- Add Validate()/command-shape tests for rad startup and rad shutdown
  via the shared radcli validation harness.
- Add helm-unittest guards asserting the database image is the pullable
  reference and that the init-db script grants the per-RP users, so both
  blocking regressions are caught in CI without a cluster.

Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
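The replicasOf change in this commit comes down to telling "unset" apart from "explicitly zero". A Deployment's spec.replicas is a pointer in the Kubernetes API and defaults to 1 when nil, so a minimal sketch (with an assumed signature, not the one in pkg/cli/controlplane) is:

```go
package main

import "fmt"

// replicasOf returns the replica count to restore later. A nil pointer means
// the field was unset, which Kubernetes treats as one replica; an explicit
// zero is preserved so an intentional scale-to-zero survives the round trip.
func replicasOf(specReplicas *int32) int32 {
	if specReplicas == nil {
		return 1
	}
	return *specReplicas
}

func main() {
	zero := int32(0)
	fmt.Println(replicasOf(nil), replicasOf(&zero))
}
```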
9da4f58 to 9a157fd
Radius functional test overview
Test Status: ⌛ Building Radius and pushing container images for functional tests...

Functional Tests - upgrade-noncloud
3 tests ±0, 1 ✅ -2, 7m 9s ⏱️ +3m 45s
For more details on these failures, see this check.
Results for commit 5e2ee54. ± Comparison against base commit 385f38e.
Repo Radius: externalize control-plane and Terraform state (rad startup / rad shutdown)
Description
Implements Investment 2 of the Repo Radius feature spec — externalization of the Radius data
store, scoped to the state-storage aspects only (control-plane PostgreSQL state and Terraform
recipe state). It adds two kind-agnostic commands, rad startup and rad shutdown, that back up
and restore all durable Radius state across an ephemeral control plane, plus the Helm and
Terraform-state plumbing they depend on.
Design note: eng/design-notes/2026-06-repo-radius-state-storage.md.

There is no dedicated workspace kind. The commands operate on the current workspace's
Kubernetes context like any other command and do not create or delete clusters or install Radius —
cluster lifecycle is the caller's responsibility.
What's included
The change is organized as three logical commits:
1. PostgreSQL enablement (closes the chart gaps for database.enabled=true)
   - Switches the UCP, Applications RP, and Dynamic RP configuration to the postgresql provider and injects POSTGRES_PASSWORD when database.enabled=true (previously hardcoded to apiserver).
   - Adds an init-db ConfigMap that creates the per-RP databases, users, and tables before the servers start.
   - Fixes the POSTGRES_DB secret value and pins the postgres image to 16-alpine.
   - Fixes the databaseProvider URL env-var substitution in factory.go.
2. Terraform recipe state backup/restore (pkg/cli/tfstate)
   - Terraform recipes store their state in Kubernetes Secrets (labelled tfstate=true), not in PostgreSQL, so that state is lost on teardown. This package exports and restores those Secrets, including the chunked tfstate-{workspace}-{suffix}-{index} Secrets for large state.
3. rad startup / rad shutdown + end-to-end lifecycle test
   - pkg/cli/pgbackup: control-plane PostgreSQL backup/restore via kubectl exec pg_dump/psql.
   - pkg/cli/gitstate: persists the state directory to a radius-state git orphan branch in an isolated worktree; the backup push fails loudly when a remote is configured (a failed push would otherwise be silent data loss) and tolerates the no-remote local/test case.
   - rad shutdown backs up both stores then commits and pushes; rad startup waits for the database then restores both stores.
Testing
Unit tests cover the chart rendering, the Terraform-state round-trip (fake Kubernetes client), the
git orphan-branch worktree behavior (real temporary repos), and the command runners
(hand-written fakes). All pass, along with the existing 85 Helm chart tests.
End-to-end test dependency (please read)
The end-to-end lifecycle test lives at test/functional-portable/statestore and exercises the
full path that this work exists to protect: install → deploy a Terraform-backed resource →
rad shutdown → tear down → reinstall → rad startup → deploy an update to the same resource.
The update is the path that fails when Terraform state is lost.

This test depends on the separate Repo Radius workflow code (in flight) that creates the
ephemeral cluster, installs Radius, and runs the deploy. Because rad startup / rad shutdown
are intentionally kind-agnostic and do not manage cluster lifecycle, the test needs that workflow
to stand up the environment. Until it lands, the test drives the install/uninstall itself and is
gated behind the RADIUS_STATE_E2E environment variable, so it does not run in the normal
functional suite. Once the shared cluster-create + deploy workflow is merged, the test's
install/uninstall helpers should be re-pointed at that code rather than duplicating it.
Related
Type of change
This pull request adds new features (state externalization commands) for Radius.