feat: Azure Container Apps deployment for sharing server (Terraform + workflows) by rajbos · Pull Request #677 · rajbos/github-copilot-token-usage

rajbos · 2026-04-27T12:15:17Z

Summary

Adds a full infrastructure-as-code deployment for the sharing server to Azure Container Apps, using Terraform and GitHub Actions.

Architecture

push to main   → GitHub Env: production → sharing-server-prod (min 1 replica)
push to branch → GitHub Env: testing   → sharing-test-<slug><hash> (scale-to-zero)
branch delete  → cleanup workflow      → terraform destroy
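
The branch-to-deployment mapping above could be computed along these lines. This is a minimal sketch, not the workflow's actual logic: the function names, the 12-character slug cap, and the FNV-1a hash are all assumptions made for illustration. Only the `sharing-server-prod` and `sharing-test-<slug><hash>` name shapes come from the description. The slug cap is chosen so the result stays within the 32-character limit on container app names.

```typescript
// Hypothetical helpers: slug length, hash function, and hash width are assumptions.

/** 32-bit FNV-1a hash, returned as 8 lowercase hex characters. */
function fnv1aHex(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

/** Lowercase, collapse non-alphanumerics to "-", trim dashes, cap the length. */
function slugify(branch: string, maxLen = 12): string {
  return branch
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "")
    .slice(0, maxLen)
    .replace(/-+$/g, "");
}

interface Deployment {
  environment: "production" | "testing";
  name: string;
}

/** main maps to the production deployment; every other branch gets its own test deployment. */
function deploymentFor(branch: string): Deployment {
  if (branch === "main") {
    return { environment: "production", name: "sharing-server-prod" };
  }
  const suffix = fnv1aHex(branch).slice(0, 6);
  return { environment: "testing", name: `sharing-test-${slugify(branch)}${suffix}` };
}
```

The hash suffix keeps two branches with the same truncated slug from colliding on one deployment name.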

Resources per deployment:

Azure Container Apps Environment (isolated per deployment)
Azure Storage Account (Standard LRS) + Azure Files share → mounted at /data
Azure Container App (0.25 vCPU / 0.5 GiB, max 1 replica for SQLite safety)
HTTPS ingress with auto-FQDN; BASE_URL is computed from ACA default_domain at apply time
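
To illustrate how BASE_URL and the related outputs follow from the environment's default domain, here is a minimal sketch. It assumes the usual ACA convention that an app's FQDN is `<app-name>.<default_domain>`. The function name and the `/dashboard` path are hypothetical; `/health` and `/auth/github/callback` are the paths mentioned elsewhere in this PR.

```typescript
interface DeploymentUrls {
  baseUrl: string;
  oauthCallbackUrl: string;
  dashboardUrl: string;
  healthUrl: string;
}

/** Derive the public URLs from the container app name and the environment's default domain. */
function deploymentUrls(appName: string, defaultDomain: string): DeploymentUrls {
  const baseUrl = `https://${appName}.${defaultDomain}`;
  return {
    baseUrl,
    oauthCallbackUrl: `${baseUrl}/auth/github/callback`,
    dashboardUrl: `${baseUrl}/dashboard`, // assumed path, not confirmed by the PR
    healthUrl: `${baseUrl}/health`,
  };
}
```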

Estimated cost: ~$0.10/month (prod, always-on), mostly the Azure Files share (~$0.06/GB/month LRS). Test envs scale to zero and cost near nothing while idle.

Changes

| File | Change |
| --- | --- |
| `sharing-server/infra/providers.tf` | AzureRM + Random providers, empty azurerm backend |
| `sharing-server/infra/variables.tf` | All input variables |
| `sharing-server/infra/main.tf` | Storage account, Azure Files, ACA environment + app |
| `sharing-server/infra/outputs.tf` | app_url, oauth_callback_url, dashboard_url, health_url |
| `.github/workflows/sharing-server-deploy.yml` | Build image → terraform apply |
| `.github/workflows/sharing-server-cleanup.yml` | terraform destroy on branch delete |
| `sharing-server/src/db.ts` | Switch SQLite journal_mode WAL → DELETE (Azure Files SMB compatibility) |

Setup required (one-time)

1. Terraform state storage

Pre-create an Azure Storage Account + blob container for Terraform state, then set:

TF_STATE_RESOURCE_GROUP (repo var)
TF_STATE_STORAGE_ACCOUNT (repo var)
TF_STATE_CONTAINER (repo var, e.g. tfstate)

2. GitHub Environments

Create two GitHub Environments: production and testing.

For each, create an Azure AD app registration with a federated credential:

production subject: repo:rajbos/github-copilot-token-usage:environment:production
testing subject: repo:rajbos/github-copilot-token-usage:environment:testing

Assign Contributor role on the resource group + Storage Blob Data Contributor on the TF state storage account.

Add these secrets to each environment:

AZURE_CLIENT_ID
AZURE_TENANT_ID
AZURE_SUBSCRIPTION_ID
SHARING_GITHUB_CLIENT_ID
SHARING_GITHUB_CLIENT_SECRET
SHARING_SESSION_SECRET

3. Repo variables

AZURE_RESOURCE_GROUP — resource group to deploy into
AZURE_LOCATION — Azure region (default: westeurope)
SHARING_ALLOWED_GITHUB_ORG — optional org restriction

4. GitHub OAuth App

Create a GitHub OAuth App. The first deployment will output the callback URL to register. For the production deployment, you can also configure a custom domain via ACA's built-in HTTPS binding.

Notes

Dashboard OAuth on test envs: The dashboard login won't work until the ACA FQDN is registered as a callback URL in the OAuth App. The API endpoints (/health, /api/upload) work without this.
SQLite journal mode: Changed WAL → DELETE so the database works correctly on Azure Files (SMB). At our write frequency (small batches every ~5 min) the performance difference is negligible.
max_replicas = 1: Required to prevent SQLite file corruption on concurrent writes from multiple container instances.

Adds infrastructure-as-code and CI/CD workflows to deploy the sharing server to Azure Container Apps with Azure Files-backed SQLite persistence.

Changes:
- sharing-server/infra/: Terraform module (providers, variables, main, outputs)
  - ACA Environment + Container App per deployment (prod/test isolation)
  - Azure Storage Account + Azure Files share for /data persistence
  - scale-to-zero for test envs, min_replicas=1 for prod
  - BASE_URL auto-derived from ACA environment default_domain
- .github/workflows/sharing-server-deploy.yml: build container image and terraform apply on push to any branch; uses GitHub Environments (production / testing) for OIDC-based Azure auth with env-scoped secrets
- .github/workflows/sharing-server-cleanup.yml: terraform destroy on branch deletion (test envs only; main is excluded)
- sharing-server/src/db.ts: switch SQLite journal_mode from WAL to DELETE; WAL requires shared-memory locking that Azure Files (SMB) does not support

Required repo variables: AZURE_RESOURCE_GROUP, AZURE_LOCATION (opt), TF_STATE_RESOURCE_GROUP, TF_STATE_STORAGE_ACCOUNT, TF_STATE_CONTAINER, SHARING_ALLOWED_GITHUB_ORG (opt)

Required repo secrets (per GitHub Environment): AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, SHARING_GITHUB_CLIENT_ID, SHARING_GITHUB_CLIENT_SECRET, SHARING_SESSION_SECRET

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

OIDC federated credentials are not available on the target subscription. Replace ARM_USE_OIDC + id-token:write with ARM_CLIENT_SECRET from a pre-created service principal stored as a GitHub Environment secret. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ken-usage (subscription 877cc7f4-0de3-4b2a-a3d6-ecd82e7b7cd4)

Adds sharing-server/infra/create-sp.sh to create a service principal scoped to the target resource group and output SDK auth JSON.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The service principal creation script was added during prototyping and should not be committed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ured

- Group multiple >> GITHUB_OUTPUT redirects in 'Compute deployment parameters' step using { ...; } >> file syntax to satisfy shellcheck SC2129 (style: use grouped redirects instead of individual ones)
- Add 'Check Azure credentials' pre-flight step to the deploy job; sets configured=false output when ARM_CLIENT_ID is absent so all Terraform steps are skipped gracefully instead of failing with an opaque backend error when the GitHub environment is not yet set up

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The azurerm_container_app resource uses 'volume_mounts' (plural) for the block type inside container, not 'volume_mount'. This caused Terraform plan to fail with 'Blocks of type volume_mount are not expected here'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds a server-side PAT (ORG_CHECK_TOKEN) to validate org membership without relying on the user's own token having read:org scope or SAML SSO authorization. The PAT is stored as a GitHub Environment secret and passed to the container via an ACA secret reference. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
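
A server-side membership check of this kind could look roughly like the sketch below. The PR does not say which GitHub endpoint the server calls, so the use of `GET /orgs/{org}/members/{username}` is an assumption, as are the function names and the User-Agent value. For that endpoint, GitHub documents 204 for a member, 404 for a non-member, and 302 when the token's owner is not an organization member.

```typescript
interface MembershipRequest {
  url: string;
  headers: Record<string, string>;
}

/** Build the request that checks membership with the server-side token (ORG_CHECK_TOKEN). */
function buildMembershipRequest(org: string, username: string, orgCheckToken: string): MembershipRequest {
  return {
    url: `https://api.github.com/orgs/${encodeURIComponent(org)}/members/${encodeURIComponent(username)}`,
    headers: {
      Authorization: `Bearer ${orgCheckToken}`,
      Accept: "application/vnd.github+json",
      "User-Agent": "sharing-server", // hypothetical value; GitHub rejects requests without a User-Agent
    },
  };
}

/** Only 204 confirms membership; 302 and 404 are both treated as "not confirmed". */
function isMemberFromStatus(status: number): boolean {
  return status === 204;
}
```

Because the request carries the server's own token, the result does not depend on the scopes or SSO authorization of the user's OAuth token.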

…ider

The service principal has Contributor role scoped to the resource group only, not subscription-level permissions. Terraform's default behavior tries to register all supported Azure Resource Providers at the subscription level, which fails with 403 AuthorizationFailed for each provider.

Setting resource_provider_registrations = "none" disables this behavior. The required providers (Microsoft.App, Microsoft.Storage, Microsoft.OperationalInsights) must already be registered in the subscription (which they are for an existing Azure environment).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ACA rejects secrets with empty string values. The testing environment does not set ORG_CHECK_TOKEN, so github_org_check_token defaults to empty string. Use dynamic blocks for both the secret and the env var referencing it so they are only created when a non-empty token value is provided. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two bugs caused a broken database state after any startup lock conflict:

1. No busy_timeout: SQLite returns SQLITE_BUSY immediately on Azure Files when a previous container revision still holds a transient SMB lock. Adding PRAGMA busy_timeout = 10000 lets SQLite retry for up to 10 s.
2. _db assigned before initSchema: if PRAGMA journal_mode = DELETE threw SQLITE_BUSY, _db was set but initSchema had not run. All subsequent getDb() calls saw _db as truthy, skipped init, and crashed with 'no such table: users' on every request until the container restarted.

Fix: use a local variable during init; only assign _db on full success. On failure, close the half-open connection and rethrow so the next request retries the whole initialization.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
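
The "assign only on full success" fix can be sketched as follows. This is an illustration of the pattern, not the code in db.ts: the `Db` interface, the injected `open` function, and the one-column `users` schema are hypothetical stand-ins for whatever SQLite driver and schema the server actually uses.

```typescript
// Minimal driver interface assumed for this sketch.
interface Db {
  pragma(statement: string): void;
  exec(sql: string): void;
  close(): void;
}

type OpenDb = (path: string) => Db;

function initSchema(db: Db): void {
  // Placeholder schema; the real users table has more columns.
  db.exec("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY)");
}

/** Returns a getDb() that caches the connection only after init fully succeeds. */
function createDbProvider(open: OpenDb, path: string): () => Db {
  let cached: Db | undefined; // plays the role of the module-level _db

  return function getDb(): Db {
    if (cached) return cached;

    const db = open(path); // local variable: not visible to other callers yet
    try {
      db.pragma("busy_timeout = 10000");
      db.pragma("journal_mode = DELETE");
      initSchema(db);
    } catch (err) {
      db.close(); // drop the half-open connection
      throw err; // the next getDb() call retries the whole initialization
    }
    cached = db; // publish only after every init step succeeded
    return db;
  };
}
```

With this shape, a SQLITE_BUSY during any init step leaves the cache empty, so a later call starts from scratch instead of reusing a connection without tables.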

Without an onError handler Hono silently converts unhandled exceptions to a plain-text 500 with nothing in the container logs. Add app.onError() to log every unhandled error with timestamp, method, and path. Also wrap upsertUser() in the /auth/github/callback handler — the only uncaught call site — so that database errors (e.g. SQLite init failure on Azure Files) surface as a descriptive error page instead of a blank 500, and the error is logged to stdout for ACA log inspection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
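
A log line with timestamp, method, and path could be produced by a small helper like the one below. The helper name and the exact line format are assumptions; the commented wiring shows how it might be attached through Hono's `app.onError`.

```typescript
interface FailedRequest {
  method: string;
  path: string;
}

/** Format one log line for an unhandled error; prefers the stack trace when available. */
function formatUnhandledError(err: unknown, req: FailedRequest, now: Date = new Date()): string {
  const detail = err instanceof Error ? (err.stack ?? err.message) : String(err);
  return `[${now.toISOString()}] Unhandled error on ${req.method} ${req.path}: ${detail}`;
}

// With Hono this could be wired up roughly as:
//   app.onError((err, c) => {
//     console.error(formatUnhandledError(err, { method: c.req.method, path: c.req.path }));
//     return c.text("Internal Server Error", 500);
//   });
```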

SQLite on Azure Files SMB fails with 'database is locked' (SQLITE_BUSY) because ACA scale-to-zero kills the container process without cleanly closing the SMB file handle. Azure Files retains the byte-range oplock for 30-60s after the process dies, blocking the next container's startup.

Fixes applied:
- min_replicas default changed 0 → 1: container stays running, no more cold starts that race against stale SMB oplocks.
- busy_timeout increased 10s → 60s: safety net during rolling deploys where old and new revisions briefly coexist.
- closeDb() exported from db.ts: properly closes the SQLite connection and releases the SMB lock during shutdown.
- SIGTERM/SIGINT handlers in server.ts: call closeDb() before exit so ACA's drain period is enough for the new revision to get the lock.
- Eager DB init at startup: lock contention is resolved before the first user request rather than surfacing as a mid-request 500 error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
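
The signal handling described above might be structured like this. It is a sketch under assumptions: `ProcessLike` is a hypothetical narrow view of Node's `process` object so the logic can be exercised without real signals, and `installShutdownHandlers` is an illustrative name, not the one used in server.ts.

```typescript
// Narrow, hypothetical view of the process object used by this sketch.
interface ProcessLike {
  on(signal: "SIGTERM" | "SIGINT", handler: () => void): unknown;
  exit(code: number): void;
}

/** Close the database once on SIGTERM or SIGINT, then exit. */
function installShutdownHandlers(
  proc: ProcessLike,
  closeDb: () => void,
  log: (msg: string) => void = console.log,
): void {
  let shuttingDown = false;

  const shutdown = (signal: string): void => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    log(`${signal} received, closing database`);
    let exitCode = 0;
    try {
      closeDb(); // releases the file lock before the process goes away
    } catch (err) {
      exitCode = 1;
      log(`closeDb failed: ${String(err)}`);
    }
    proc.exit(exitCode);
  };

  proc.on("SIGTERM", () => shutdown("SIGTERM"));
  proc.on("SIGINT", () => shutdown("SIGINT"));
}
```

In a Node server the first argument would be the global `process` object.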

…ploy

During an ACA rolling update, old and new revisions briefly share the same Azure Files mount. The old revision's SQLite connection holds a file lock that blocks the new revision's startup.

Previous approach: busy_timeout=60s blocked the event loop for a full minute before the server even started listening; health probes failed and ACA restarted the container in a loop.

New approach:
- busy_timeout reduced to 5 s (short block per attempt, event loop stays live)
- Server starts listening FIRST so health checks respond immediately
- initDbWithRetry() runs async after server start: up to 20 attempts with 5-30 s back-off between them, giving the old revision's SMB oplock time to release (~30-120 s for Azure Files byte-range lock release)
- Each getDb() attempt only blocks the loop for ≤5 s
- On success the module-level _db singleton is set; all subsequent requests use it without any retry overhead

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
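
The retry loop described in this commit could be sketched as below. The commit only states "up to 20 attempts with 5-30 s back-off"; the linear schedule (5 s, 10 s, ... capped at 30 s), the injected `sleep` parameter, and the generic signature are assumptions made so the example is self-contained.

```typescript
/** Assumed schedule: 5 s after the first failure, +5 s per further failure, capped at 30 s. */
function backoffDelayMs(failedAttempts: number): number {
  return Math.min(5000 * failedAttempts, 30000);
}

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Call getDb() until it succeeds, yielding to the event loop between attempts. */
async function initDbWithRetry<T>(
  getDb: () => T,
  sleep: (ms: number) => Promise<void> = realSleep,
  maxAttempts = 20,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return getDb(); // synchronous; blocks for at most busy_timeout
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) await sleep(backoffDelayMs(attempt));
    }
  }
  throw new Error(`database init failed after ${maxAttempts} attempts: ${String(lastError)}`);
}
```

Because the waiting happens in `sleep` rather than inside SQLite's busy handler, the HTTP server can keep answering health probes between attempts.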

…king

Azure Files SMB does not reliably support the POSIX advisory byte-range locks that SQLite requires. This caused persistent 'database is locked' errors even with a single container, as the lock was never released by the Azure Files server within a reasonable time.

Changes:
- db.ts: SQLite now runs on LOCAL_DATA_DIR (/tmp/db, container local ephemeral disk). Azure Files (DATA_DIR /data) is used only for backup/restore via plain file copy (no locking involved). WAL journal mode is now safe since we're on a local filesystem. Added restoreFromBackup() and backupToAzureFiles() exports.
- server.ts: Calls restoreFromBackup() at startup before initDbWithRetry(). SIGTERM handler now calls backupToAzureFiles() before closeDb(). Periodic backup every 5 minutes guards against unexpected SIGKILL.
- main.tf: Adds LOCAL_DATA_DIR=/tmp/db and explicit DATA_DIR=/data env vars to the container. DATA_DIR remains the Azure Files mount point.

Data persistence: backup is written on clean shutdown (SIGTERM) and every 5 minutes. Max data loss on unexpected kill: 5 minutes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
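
The backup/restore idea can be sketched with plain file copies. The function names match the exports named in the commit, but the signatures, the `sharing.db` file name, and the copy-then-rename step are assumptions. One caveat worth stating: in WAL mode, recent writes may still live in the `-wal` file, so a real implementation would likely checkpoint or use SQLite's backup facility before copying the main file.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

const DB_FILE = "sharing.db"; // hypothetical database file name

/** Copy the last backup from the share to local disk. Returns false if no backup exists yet. */
function restoreFromBackup(dataDir: string, localDir: string): boolean {
  const backup = path.join(dataDir, DB_FILE);
  fs.mkdirSync(localDir, { recursive: true });
  if (!fs.existsSync(backup)) return false;
  fs.copyFileSync(backup, path.join(localDir, DB_FILE));
  return true;
}

/** Copy the local database to the share, via a temp file so a partial copy never replaces a good backup. */
function backupToAzureFiles(dataDir: string, localDir: string): boolean {
  const local = path.join(localDir, DB_FILE);
  if (!fs.existsSync(local)) return false;
  fs.mkdirSync(dataDir, { recursive: true });
  const tmp = path.join(dataDir, `${DB_FILE}.tmp`);
  fs.copyFileSync(local, tmp);
  fs.renameSync(tmp, path.join(dataDir, DB_FILE));
  return true;
}
```

The periodic backup could then be something like `setInterval(() => backupToAzureFiles("/data", "/tmp/db"), 5 * 60 * 1000)`, with the same call made once more in the SIGTERM handler before closing the database.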

- variables.tf: add custom_domain variable (optional, default '')
- main.tf: split locals into aca_fqdn + app_fqdn (prefers custom domain), add azurerm_container_app_environment_managed_certificate and azurerm_container_app_custom_domain resources (count-gated on custom_domain)
- outputs.tf: add aca_fqdn output; existing outputs already use app_fqdn
- deploy workflow: add TF_VAR_custom_domain from SHARING_CUSTOM_DOMAIN var, add import step that detects portal-created resources and imports them into Terraform state before plan (idempotent: skips if already managed)
- GitHub env var SHARING_CUSTOM_DOMAIN=ai-fluency-server-test.devopsjournal.io set for testing environment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Polls /health up to 12 times (10s apart, 2 min total) after terraform apply. Fails the workflow if the container never comes up healthy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
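
The polling logic amounts to a bounded retry loop. The workflow step itself is presumably a shell script; the TypeScript below only illustrates the same logic, with `probe` and `sleep` injected as hypothetical parameters so it runs without a network.

```typescript
/** Poll until probe() reports healthy; returns the attempt number that succeeded. */
async function waitForHealthy(
  probe: () => Promise<boolean>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 12,
  delayMs = 10000,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // A failed request (connection refused, DNS not ready) counts as "not healthy yet".
    const healthy = await probe().catch(() => false);
    if (healthy) return attempt;
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  throw new Error(`service not healthy after ${maxAttempts} attempts`);
}
```

A probe for this deployment could be `() => fetch(healthUrl).then((r) => r.ok)`, where `healthUrl` is the health_url Terraform output.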

- Fixed SHARING_ALLOWED_GITHUB_ORGS typo in production env (was plural, workflow reads singular SHARING_ALLOWED_GITHUB_ORG)
- Added aca_fqdn to deployment summary table so the production CNAME target is visible immediately after first deploy
- Added DNS setup instructions to production summary when custom domain is not yet configured

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

terraform import prompts interactively for missing required variables, blocking the workflow and holding the state lock open. Fix by passing all the same env vars as plan/apply, and adding -input=false to fail fast rather than prompt if any variable is still missing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

AzureRM provider uses subject_name, not dns_suffix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…mporting

The previous import approach caused a 400 'CertificateInUse' error because Azure auto-generates cert names in the portal, which don't match the Terraform resource name ('sharing-cert'). After import, TF detected the name drift and planned a destroy+recreate, but Azure blocks cert deletion while a domain binding exists.

New approach:
- Detect if TF state has a cert with a non-TF name (portal-created)
- If so: delete the domain binding first, then delete the cert, remove stale TF state entries, letting Terraform create both resources fresh with the correct names
- If cert is already TF-managed ('sharing-cert'): no-op

Also added lifecycle.create_before_destroy = true on the cert resource as a safety net for future cert replacements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Azure API requires the hostname to be bound to a container app before a managed certificate can be created for it (error: RequireCustomHostnameInEnvironment). This creates a circular dependency in Terraform: the managed cert needs the hostname registered, but azurerm_container_app_custom_domain with SniEnabled needs the cert to exist. Break the cycle with a null_resource that registers the hostname (Disabled binding) via az CLI local-exec, giving the cert resource a hostname to validate against. TF then creates the cert and upgrades the binding to SniEnabled via azurerm_container_app_custom_domain. Also adds the hashicorp/null provider (~> 3.0) to providers.tf. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…or cert binding

azurerm_container_app_custom_domain rejects managedCertificate IDs at plan time (expects .../certificates/... but managed certs use .../managedCertificates/...). This is a known AzureRM provider limitation.

Replace with a null_resource that PATCHes the container app ingress directly via az rest. The PATCH reads the current ingress config (GET), updates customDomains to SniEnabled with the managed cert ID, and writes it back. This bypasses the provider's ID-format validation entirely.

Execution order:
1. null_resource.hostname_registration - az hostname add (Disabled)
2. azurerm_...managed_certificate - cert provisioned
3. null_resource.cert_binding - az rest PATCH to SniEnabled

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
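
The read-modify-write on the ingress config can be illustrated as a pure transformation. This sketch assumes the customDomains entries carry `name`, `bindingType`, and `certificateId` fields, as in the Container Apps ARM schema. The function name and the loosely typed `Ingress` interface are hypothetical, and the actual change is done with az rest rather than TypeScript.

```typescript
interface CustomDomain {
  name: string;
  bindingType: "Disabled" | "SniEnabled";
  certificateId?: string;
}

interface Ingress {
  customDomains?: CustomDomain[];
  [key: string]: unknown; // other ingress settings are passed through untouched
}

/** Return a copy of the ingress config with hostname bound to the certificate via SNI. */
function bindManagedCertificate(ingress: Ingress, hostname: string, certificateId: string): Ingress {
  const existing = ingress.customDomains ?? [];
  const others = existing.filter((d) => d.name.toLowerCase() !== hostname.toLowerCase());
  return {
    ...ingress,
    customDomains: [...others, { name: hostname, bindingType: "SniEnabled", certificateId }],
  };
}
```

Keeping every other ingress field in the written-back body matters because the update replaces the customDomains list as a whole; dropping an unrelated entry would unbind it.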

Terraform parses ${VAR} inside heredoc strings as TF expressions. Using $VAR without braces is valid bash and avoids the conflict. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

rajbos had a problem deploying to testing April 27, 2026 12:15 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 12:17 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 12:20 — with GitHub Actions Failure

chore(infra): remove SP creation script

47d06c6

rajbos had a problem deploying to testing April 27, 2026 12:22 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 13:27 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 13:42 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 13:46 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 13:52 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 14:11 — with GitHub Actions Failure

rajbos temporarily deployed to testing April 27, 2026 14:28 — with GitHub Actions Inactive

rajbos temporarily deployed to testing April 27, 2026 14:40 — with GitHub Actions Inactive

rajbos temporarily deployed to testing April 27, 2026 14:49 — with GitHub Actions Inactive

rajbos temporarily deployed to testing April 27, 2026 18:23 — with GitHub Actions Inactive

Merge branch 'main' into rajbos/azure-aca-terraform-deploy

d08f6bf

rajbos temporarily deployed to testing April 27, 2026 18:55 — with GitHub Actions Inactive

rajbos temporarily deployed to testing April 27, 2026 19:02 — with GitHub Actions Inactive

rajbos temporarily deployed to testing April 27, 2026 19:20 — with GitHub Actions Inactive

rajbos had a problem deploying to testing April 27, 2026 19:40 — with GitHub Actions Error

ci: add post-deploy health check with retry

593e471

rajbos had a problem deploying to testing April 27, 2026 19:46 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 19:56 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 20:06 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 27, 2026 20:10 — with GitHub Actions Failure

fix: use subject_name instead of dns_suffix for managed certificate

5d1e5bc

rajbos had a problem deploying to testing April 27, 2026 20:19 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 28, 2026 08:46 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 28, 2026 09:01 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 28, 2026 09:12 — with GitHub Actions Failure

rajbos had a problem deploying to testing April 28, 2026 09:35 — with GitHub Actions Failure

fix: use $VAR (no braces) in heredoc to avoid Terraform interpolation

bc8c3ca

rajbos temporarily deployed to testing April 28, 2026 09:42 — with GitHub Actions Inactive

rajbos merged commit c1e5ec0 into main Apr 28, 2026
21 checks passed

rajbos deleted the rajbos/azure-aca-terraform-deploy branch April 28, 2026 10:02

rajbos merged 24 commits into main from rajbos/azure-aca-terraform-deploy
