
feat: Azure Container Apps deployment for sharing server (Terraform + workflows)#677

Merged
rajbos merged 24 commits into main from rajbos/azure-aca-terraform-deploy
Apr 28, 2026


@rajbos rajbos commented Apr 27, 2026

Summary

Adds a full infrastructure-as-code deployment for the sharing server to Azure Container Apps, using Terraform and GitHub Actions.

Architecture

push to main   → GitHub Env: production → sharing-server-prod (min 1 replica)
push to branch → GitHub Env: testing   → sharing-test-<slug><hash> (scale-to-zero)
branch delete  → cleanup workflow      → terraform destroy
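
The `sharing-test-<slug><hash>` naming above could be derived from the branch name roughly as follows. This is a minimal sketch, not the workflow's actual logic: the helper name `testDeploymentName`, the 12-character slug, and the 6-character SHA-256 suffix are all assumptions chosen to keep the name short and restricted to lowercase letters, digits, and dashes.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: builds a test deployment name of the form
// sharing-test-<slug><hash>. Slug length (12) and hash length (6) are
// assumptions for this sketch, giving at most 13 + 12 + 6 = 31 characters.
export function testDeploymentName(branch: string): string {
  const slug = branch
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into "-"
    .replace(/^-+|-+$/g, "") // trim leading and trailing dashes
    .slice(0, 12)
    .replace(/-+$/g, ""); // slicing may leave a trailing dash
  // A short hash of the full branch name keeps names distinct when two
  // branches share the same slug prefix.
  const hash = createHash("sha256").update(branch).digest("hex").slice(0, 6);
  return `sharing-test-${slug}${hash}`;
}
```

Hashing the full branch name rather than the slug means `feature/login` and `feature/login-v2` still map to different deployments even though their slugs overlap.
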

Resources per deployment:

  • Azure Container Apps Environment (isolated per deployment)
  • Azure Storage Account (Standard LRS) + Azure Files share → mounted at /data
  • Azure Container App (0.25 vCPU / 0.5 GiB, max 1 replica for SQLite safety)
  • HTTPS ingress with auto-FQDN; BASE_URL is computed from ACA default_domain at apply time

Estimated cost: $0.10/month (prod, always-on) — mostly the Azure Files share ($0.06/GB/month LRS). Test envs scale to zero and cost near nothing while idle.

Changes

| File | Change |
| --- | --- |
| sharing-server/infra/providers.tf | AzureRM + Random providers, empty azurerm backend |
| sharing-server/infra/variables.tf | All input variables |
| sharing-server/infra/main.tf | Storage account, Azure Files, ACA environment + app |
| sharing-server/infra/outputs.tf | app_url, oauth_callback_url, dashboard_url, health_url |
| .github/workflows/sharing-server-deploy.yml | Build image → terraform apply |
| .github/workflows/sharing-server-cleanup.yml | terraform destroy on branch delete |
| sharing-server/src/db.ts | Switch SQLite journal_mode WAL → DELETE (Azure Files SMB compatibility) |

Setup required (one-time)

1. Terraform state storage

Pre-create an Azure Storage Account + blob container for Terraform state, then set:

  • TF_STATE_RESOURCE_GROUP (repo var)
  • TF_STATE_STORAGE_ACCOUNT (repo var)
  • TF_STATE_CONTAINER (repo var, e.g. tfstate)

2. GitHub Environments

Create two GitHub Environments: production and testing.

For each, create an Azure AD app registration with a federated credential:

  • production subject: repo:rajbos/github-copilot-token-usage:environment:production
  • testing subject: repo:rajbos/github-copilot-token-usage:environment:testing

Assign Contributor role on the resource group + Storage Blob Data Contributor on the TF state storage account.

Add these secrets to each environment:

  • AZURE_CLIENT_ID
  • AZURE_TENANT_ID
  • AZURE_SUBSCRIPTION_ID
  • SHARING_GITHUB_CLIENT_ID
  • SHARING_GITHUB_CLIENT_SECRET
  • SHARING_SESSION_SECRET

3. Repo variables

  • AZURE_RESOURCE_GROUP — resource group to deploy into
  • AZURE_LOCATION — Azure region (default: westeurope)
  • SHARING_ALLOWED_GITHUB_ORG — optional org restriction

4. GitHub OAuth App

Create a GitHub OAuth App. The first deployment will output the callback URL to register. For the production deployment, you can also configure a custom domain via ACA's built-in HTTPS binding.

Notes

  • Dashboard OAuth on test envs: The dashboard login won't work until the ACA FQDN is registered as a callback URL in the OAuth App. The API endpoints (/health, /api/upload) work without this.
  • SQLite journal mode: Changed WAL → DELETE so the database works correctly on Azure Files (SMB). At our write frequency (small batches every ~5 min) the performance difference is negligible.
  • max_replicas = 1: Required to prevent SQLite file corruption on concurrent writes from multiple container instances.

Adds infrastructure-as-code and CI/CD workflows to deploy the sharing
server to Azure Container Apps with Azure Files-backed SQLite persistence.

Changes:
- sharing-server/infra/: Terraform module (providers, variables, main, outputs)
  - ACA Environment + Container App per deployment (prod/test isolation)
  - Azure Storage Account + Azure Files share for /data persistence
  - scale-to-zero for test envs, min_replicas=1 for prod
  - BASE_URL auto-derived from ACA environment default_domain
- .github/workflows/sharing-server-deploy.yml: build container image and
  terraform apply on push to any branch; uses GitHub Environments (production
  / testing) for OIDC-based Azure auth with env-scoped secrets
- .github/workflows/sharing-server-cleanup.yml: terraform destroy on branch
  deletion (test envs only; main is excluded)
- sharing-server/src/db.ts: switch SQLite journal_mode from WAL to DELETE;
  WAL requires shared-memory locking that Azure Files (SMB) does not support

Required repo variables: AZURE_RESOURCE_GROUP, AZURE_LOCATION (opt),
  TF_STATE_RESOURCE_GROUP, TF_STATE_STORAGE_ACCOUNT, TF_STATE_CONTAINER,
  SHARING_ALLOWED_GITHUB_ORG (opt)
Required repo secrets (per GitHub Environment):
  AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID,
  SHARING_GITHUB_CLIENT_ID, SHARING_GITHUB_CLIENT_SECRET, SHARING_SESSION_SECRET

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OIDC federated credentials are not available on the target subscription.
Replace ARM_USE_OIDC + id-token:write with ARM_CLIENT_SECRET from a
pre-created service principal stored as a GitHub Environment secret.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ken-usage (subscription 877cc7f4-0de3-4b2a-a3d6-ecd82e7b7cd4)

Adds sharing-server/infra/create-sp.sh to create a service principal scoped to the target resource group and output SDK auth JSON.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The service principal creation script was added during prototyping and should not be committed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ured

- Group multiple >> GITHUB_OUTPUT redirects in 'Compute deployment
  parameters' step using { ...; } >> file syntax to satisfy shellcheck
  SC2129 (style: use grouped redirects instead of individual ones)
- Add 'Check Azure credentials' pre-flight step to the deploy job;
  sets configured=false output when ARM_CLIENT_ID is absent so all
  Terraform steps are skipped gracefully instead of failing with an
  opaque backend error when the GitHub environment is not yet set up

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The azurerm_container_app resource uses 'volume_mounts' (plural) for
the block type inside container, not 'volume_mount'. This caused
Terraform plan to fail with 'Blocks of type volume_mount are not
expected here'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a server-side PAT (ORG_CHECK_TOKEN) to validate org membership
without relying on the user's own token having read:org scope or
SAML SSO authorization. The PAT is stored as a GitHub Environment
secret and passed to the container via an ACA secret reference.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
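
The server-side membership check described in the commit above might look like the sketch below. It assumes the GitHub REST endpoint `GET /orgs/{org}/members/{username}`, which answers 204 for a member. The `HttpGet` type and the function name `isOrgMember` are hypothetical, and the HTTP call is injected so the sketch runs without network access.

```typescript
// Minimal HTTP shape so the sketch runs without a network call; in a real
// server this would wrap fetch(). All names here are hypothetical.
type HttpGet = (
  url: string,
  headers: Record<string, string>,
) => Promise<{ status: number }>;

export async function isOrgMember(
  httpGet: HttpGet,
  org: string,
  username: string,
  orgCheckToken: string, // server-side PAT (ORG_CHECK_TOKEN)
): Promise<boolean> {
  const url =
    `https://api.github.com/orgs/${encodeURIComponent(org)}` +
    `/members/${encodeURIComponent(username)}`;
  const res = await httpGet(url, {
    Authorization: `Bearer ${orgCheckToken}`,
    Accept: "application/vnd.github+json",
  });
  // 204 means "is a member"; every other status (404, 302, 401, ...) is
  // treated as "not a member" in this sketch.
  return res.status === 204;
}
```

Because the request is authenticated with the server's own token, the result does not depend on which scopes the signing-in user granted.
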
…ider

The service principal has Contributor role scoped to the resource group only,
not subscription-level permissions. Terraform's default behavior tries to
register all supported Azure Resource Providers at the subscription level,
which fails with 403 AuthorizationFailed for each provider.

Setting resource_provider_registrations = "none" disables this behavior.
The required providers (Microsoft.App, Microsoft.Storage,
Microsoft.OperationalInsights) must already be registered in the subscription
(which they are for an existing Azure environment).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ACA rejects secrets with empty string values. The testing environment does
not set ORG_CHECK_TOKEN, so github_org_check_token defaults to empty string.

Use dynamic blocks for both the secret and the env var referencing it so
they are only created when a non-empty token value is provided.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two bugs caused a broken database state after any startup lock conflict:

1. No busy_timeout: SQLite returns SQLITE_BUSY immediately on Azure Files
   when a previous container revision still holds a transient SMB lock.
   Adding PRAGMA busy_timeout = 10000 lets SQLite retry for up to 10 s.

2. _db assigned before initSchema: if PRAGMA journal_mode = DELETE threw
   SQLITE_BUSY, _db was set but initSchema had not run. All subsequent
   getDb() calls saw _db as truthy, skipped init, and crashed with
   'no such table: users' on every request until the container restarted.

   Fix: use a local variable during init; only assign _db on full success.
   On failure, close the half-open connection and rethrow so the next
   request retries the whole initialization.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
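
The second fix in the commit above (assign `_db` only after full success) can be illustrated as follows. This is a sketch under stated assumptions: the `Db` interface is a stand-in for whatever SQLite driver db.ts uses, and `open` / `initSchema` are passed in as parameters purely so the example is self-contained.

```typescript
// Minimal database shape for the sketch; the real driver is not shown on
// this page, so this interface is an assumption.
interface Db {
  pragma(statement: string): void;
  close(): void;
}

let _db: Db | undefined;

export function getDb(open: () => Db, initSchema: (db: Db) => void): Db {
  if (_db) return _db;
  const db = open(); // local variable: not visible to other callers yet
  try {
    db.pragma("busy_timeout = 10000"); // retry for up to 10 s on SQLITE_BUSY
    db.pragma("journal_mode = DELETE");
    initSchema(db);
  } catch (err) {
    db.close(); // drop the half-open connection
    throw err; // the next getDb() call retries the whole initialization
  }
  _db = db; // publish the singleton only after every step succeeded
  return db;
}
```

With this shape a failed startup leaves `_db` undefined, so a later request re-runs the initialization instead of hitting a connection with no schema.
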
Without an onError handler Hono silently converts unhandled exceptions to
a plain-text 500 with nothing in the container logs. Add app.onError() to
log every unhandled error with timestamp, method, and path.

Also wrap upsertUser() in the /auth/github/callback handler — the only
uncaught call site — so that database errors (e.g. SQLite init failure on
Azure Files) surface as a descriptive error page instead of a blank 500,
and the error is logged to stdout for ACA log inspection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
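
A handler in the shape of the `app.onError()` callback described above could be sketched like this. The `Ctx` interface is a structural stand-in for the few context members used here (request method, request path, and a text response helper), so the example runs without the framework installed. The log format and function names are assumptions.

```typescript
// Structural stand-in for the parts of the framework context used below.
interface Ctx {
  req: { method: string; path: string };
  text(body: string, status: number): unknown;
}

export function formatUnhandledError(
  err: unknown,
  method: string,
  path: string,
  now: Date,
): string {
  const detail = err instanceof Error ? (err.stack ?? err.message) : String(err);
  return `[${now.toISOString()}] Unhandled error on ${method} ${path}: ${detail}`;
}

// Shaped like the callback passed to app.onError((err, c) => ...).
export function onError(
  err: unknown,
  c: Ctx,
  log: (line: string) => void = console.error,
): unknown {
  // Timestamp, method, and path make the line searchable in container logs.
  log(formatUnhandledError(err, c.req.method, c.req.path, new Date()));
  return c.text("Internal Server Error", 500);
}
```

Keeping the formatter separate from the handler makes the log line easy to check in isolation.
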
SQLite on Azure Files SMB fails with 'database is locked' (SQLITE_BUSY)
because ACA scale-to-zero kills the container process without cleanly
closing the SMB file handle. Azure Files retains the byte-range oplock
for 30-60s after the process dies, blocking the next container's startup.

Fixes applied:
- min_replicas default changed 0 → 1: container stays running, no more
  cold starts that race against stale SMB oplocks.
- busy_timeout increased 10s → 60s: safety net during rolling deploys
  where old and new revisions briefly coexist.
- closeDb() exported from db.ts: properly closes the SQLite connection
  and releases the SMB lock during shutdown.
- SIGTERM/SIGINT handlers in server.ts: call closeDb() before exit so
  ACA's drain period is enough for the new revision to get the lock.
- Eager DB init at startup: lock contention is resolved before the first
  user request rather than surfacing as a mid-request 500 error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
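
The SIGTERM/SIGINT handling described above might be wired along these lines. This is a sketch, not the code in server.ts: `makeShutdownHandler` is a hypothetical name, `closeDb` stands in for the export mentioned in the commit, and `exit` is injectable so the handler can be exercised without ending the process.

```typescript
export function makeShutdownHandler(
  closeDb: () => void,
  exit: (code: number) => void = (code) => process.exit(code),
): (signal: string) => void {
  let shuttingDown = false;
  return (signal) => {
    if (shuttingDown) return; // ignore a second signal during shutdown
    shuttingDown = true;
    console.log(`${signal} received, closing database`);
    let code = 0;
    try {
      closeDb(); // releases the SQLite file lock before the process ends
    } catch (err) {
      console.error("closeDb failed during shutdown", err);
      code = 1;
    }
    exit(code);
  };
}

// Usage (commented out so loading the sketch has no side effects):
// const onSignal = makeShutdownHandler(closeDb);
// process.on("SIGTERM", () => onSignal("SIGTERM"));
// process.on("SIGINT", () => onSignal("SIGINT"));
```

The `shuttingDown` guard matters because a platform may deliver the signal more than once while the first handler is still running.
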
…ploy

During an ACA rolling update, old and new revisions briefly share the same
Azure Files mount. The old revision's SQLite connection holds a file lock
that blocks the new revision's startup.

Previous approach: busy_timeout=60s blocked the event loop for a full minute
before the server even started listening — health probes failed, ACA restarted
the container in a loop.

New approach:
- busy_timeout reduced to 5 s (short block per attempt, event loop stays live)
- Server starts listening FIRST so health checks respond immediately
- initDbWithRetry() runs async after server start: up to 20 attempts with
  5-30 s back-off between them, giving the old revision's SMB oplock time
  to release (~30-120 s for Azure Files byte-range lock release)
- Each getDb() attempt only blocks the loop for ≤5 s
- On success the module-level _db singleton is set; all subsequent requests
  use it without any retry overhead

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
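
A retry loop matching the numbers in the commit above (up to 20 attempts, 5 to 30 s between them) could be sketched as below. The commit does not state the back-off curve, so the linear ramp in `backoffDelayMs` is an assumption, as are the names `RetryOptions` and `initWithRetry`. `sleep` is injectable so the loop can be exercised without waiting.

```typescript
export interface RetryOptions {
  attempts: number; // 20 in the commit message
  minDelayMs: number; // 5000
  maxDelayMs: number; // 30000
  sleep?: (ms: number) => Promise<void>;
}

// Linear ramp from minDelayMs (after attempt 1) to maxDelayMs (after the
// second-to-last attempt). The real curve is not stated; this is a guess.
export function backoffDelayMs(attempt: number, o: RetryOptions): number {
  if (o.attempts <= 2) return o.minDelayMs;
  const step = (o.maxDelayMs - o.minDelayMs) / (o.attempts - 2);
  return Math.min(o.maxDelayMs, Math.round(o.minDelayMs + step * (attempt - 1)));
}

export async function initWithRetry<T>(init: () => T, o: RetryOptions): Promise<T> {
  const sleep =
    o.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 1; attempt <= o.attempts; attempt++) {
    try {
      return init(); // synchronous: blocks for at most busy_timeout per attempt
    } catch (err) {
      lastError = err;
      // Awaiting here hands control back to the event loop, so the server
      // keeps answering health probes between attempts.
      if (attempt < o.attempts) await sleep(backoffDelayMs(attempt, o));
    }
  }
  throw lastError;
}
```

In server.ts terms, this would be called after the server starts listening, with `init` being the synchronous `getDb()` call.
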
…king

Azure Files SMB does not reliably support the POSIX advisory byte-range
locks that SQLite requires. This caused persistent 'database is locked'
errors even with a single container, as the lock was never released by
the Azure Files server within a reasonable time.

Changes:
- db.ts: SQLite now runs on LOCAL_DATA_DIR (/tmp/db, container local
  ephemeral disk). Azure Files (DATA_DIR /data) is used only for
  backup/restore via plain file copy (no locking involved).
  WAL journal mode is now safe since we're on a local filesystem.
  Added restoreFromBackup() and backupToAzureFiles() exports.
- server.ts: Calls restoreFromBackup() at startup before initDbWithRetry().
  SIGTERM handler now calls backupToAzureFiles() before closeDb().
  Periodic backup every 5 minutes guards against unexpected SIGKILL.
- main.tf: Adds LOCAL_DATA_DIR=/tmp/db and explicit DATA_DIR=/data env
  vars to the container. DATA_DIR remains the Azure Files mount point.

Data persistence: backup is written on clean shutdown (SIGTERM) and
every 5 minutes. Max data loss on unexpected kill: 5 minutes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
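
The backup/restore flow described above is plain file copying between the two directories. The sketch below shows one way to write it. The file name `sharing.db` and the directory parameters are assumptions, and the write-to-temp-then-rename step is a choice made for this sketch rather than something the commit describes.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Database file name is an assumption; the PR page does not show it.
const DB_FILE = "sharing.db";

// Copy the backup from the Azure Files mount (DATA_DIR) to local disk
// (LOCAL_DATA_DIR). Returns false when there is nothing to restore yet.
export function restoreFromBackup(dataDir: string, localDataDir: string): boolean {
  const backup = path.join(dataDir, DB_FILE);
  if (!fs.existsSync(backup)) return false; // first start: no backup exists
  fs.mkdirSync(localDataDir, { recursive: true });
  fs.copyFileSync(backup, path.join(localDataDir, DB_FILE));
  return true;
}

// Copy the local database to the mount through a temp file plus rename, so a
// kill in the middle of the copy leaves the previous backup in place.
export function backupToAzureFiles(dataDir: string, localDataDir: string): boolean {
  const local = path.join(localDataDir, DB_FILE);
  if (!fs.existsSync(local)) return false;
  fs.mkdirSync(dataDir, { recursive: true });
  const tmp = path.join(dataDir, `${DB_FILE}.tmp`);
  fs.copyFileSync(local, tmp);
  fs.renameSync(tmp, path.join(dataDir, DB_FILE));
  return true;
}
```

One caveat with WAL mode: recent writes can still sit in the `-wal` file, so copying only the main database file may miss them. Running a checkpoint such as `PRAGMA wal_checkpoint(TRUNCATE)` before the copy is worth considering.
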
- variables.tf: add custom_domain variable (optional, default '')
- main.tf: split locals into aca_fqdn + app_fqdn (prefers custom domain),
  add azurerm_container_app_environment_managed_certificate and
  azurerm_container_app_custom_domain resources (count-gated on custom_domain)
- outputs.tf: add aca_fqdn output; existing outputs already use app_fqdn
- deploy workflow: add TF_VAR_custom_domain from SHARING_CUSTOM_DOMAIN var,
  add import step that detects portal-created resources and imports them into
  Terraform state before plan (idempotent — skips if already managed)
- GitHub env var SHARING_CUSTOM_DOMAIN=ai-fluency-server-test.devopsjournal.io
  set for testing environment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
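
The `aca_fqdn` / `app_fqdn` split above lives in Terraform locals. The same selection logic is restated here in TypeScript purely for illustration. It assumes the ACA hostname has the form `<app name>.<environment default_domain>`. The function name, the interface, and the example domain in the test are hypothetical, and the paths `/auth/github/callback` and `/health` are taken from elsewhere in this PR.

```typescript
export interface DeploymentUrls {
  acaFqdn: string; // hostname assigned by ACA (the CNAME target)
  appFqdn: string; // hostname users actually visit
  appUrl: string;
  oauthCallbackUrl: string;
  healthUrl: string;
}

export function resolveUrls(
  appName: string,
  defaultDomain: string,
  customDomain = "", // mirrors the optional custom_domain variable (default '')
): DeploymentUrls {
  const acaFqdn = `${appName}.${defaultDomain}`;
  // Prefer the custom domain when one is configured.
  const appFqdn = customDomain !== "" ? customDomain : acaFqdn;
  const appUrl = `https://${appFqdn}`;
  return {
    acaFqdn,
    appFqdn,
    appUrl,
    oauthCallbackUrl: `${appUrl}/auth/github/callback`,
    healthUrl: `${appUrl}/health`,
  };
}
```

Keeping `acaFqdn` as a separate value is what lets the deployment summary show the CNAME target even after a custom domain takes over the public URLs.
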
Polls /health up to 12 times (10s apart, 2 min total) after terraform apply.
Fails the workflow if the container never comes up healthy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
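
The post-deploy health check above (12 polls, 10 s apart) runs in the workflow itself. The polling logic is sketched here in TypeScript for illustration only. `waitForHealthy` is a hypothetical name, and both the probe and the sleep function are injected so the example needs no network or real waiting.

```typescript
export async function waitForHealthy(
  probe: () => Promise<boolean>, // e.g. () => fetch(healthUrl).then((r) => r.ok)
  attempts = 12,
  intervalMs = 10000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      if (await probe()) return true;
    } catch {
      // Connection errors count as "not healthy yet" while the app starts.
    }
    if (attempt < attempts) await sleep(intervalMs);
  }
  return false; // the caller fails the workflow in this case
}
```

Treating a thrown probe the same as an unhealthy response avoids failing the deploy on the connection errors that are normal while a new revision is still starting.
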
- Fixed SHARING_ALLOWED_GITHUB_ORGS typo in production env (was plural,
  workflow reads singular SHARING_ALLOWED_GITHUB_ORG)
- Added aca_fqdn to deployment summary table so the production CNAME
  target is visible immediately after first deploy
- Added DNS setup instructions to production summary when custom domain
  is not yet configured

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
terraform import prompts interactively for missing required variables,
blocking the workflow and holding the state lock open. Fix by passing
all the same env vars as plan/apply, and adding -input=false to fail
fast rather than prompt if any variable is still missing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AzureRM provider uses subject_name, not dns_suffix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…mporting

The previous import approach caused a 400 'CertificateInUse' error because
Azure auto-generates cert names in the portal, which don't match the Terraform
resource name ('sharing-cert'). After import, TF detected the name drift and
planned a destroy+recreate, but Azure blocks cert deletion while a domain
binding exists.

New approach:
- Detect if TF state has a cert with a non-TF name (portal-created)
- If so: delete the domain binding first, then delete the cert, remove stale
  TF state entries — letting Terraform create both resources fresh with the
  correct names
- If cert is already TF-managed ('sharing-cert'): no-op

Also added lifecycle.create_before_destroy = true on the cert resource as a
safety net for future cert replacements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Azure API requires the hostname to be bound to a container app before a
managed certificate can be created for it (error: RequireCustomHostnameInEnvironment).
This creates a circular dependency in Terraform: the managed cert needs the
hostname registered, but azurerm_container_app_custom_domain with SniEnabled
needs the cert to exist.

Break the cycle with a null_resource that registers the hostname (Disabled
binding) via az CLI local-exec, giving the cert resource a hostname to
validate against. TF then creates the cert and upgrades the binding to
SniEnabled via azurerm_container_app_custom_domain.

Also adds the hashicorp/null provider (~> 3.0) to providers.tf.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…or cert binding

azurerm_container_app_custom_domain rejects managedCertificate IDs at plan
time (expects .../certificates/... but managed certs use
.../managedCertificates/...). This is a known AzureRM provider limitation.

Replace with a null_resource that PATCHes the container app ingress directly
via az rest. The PATCH reads the current ingress config (GET), updates
customDomains to SniEnabled with the managed cert ID, and writes it back.
This bypasses the provider's ID-format validation entirely.

Execution order:
  1. null_resource.hostname_registration  - az hostname add (Disabled)
  2. azurerm_...managed_certificate        - cert provisioned
  3. null_resource.cert_binding            - az rest PATCH to SniEnabled

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
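
The read-modify-write step in the commit above is done with `az rest` in a shell provisioner. The JSON transformation it performs is sketched below in TypeScript for illustration. It assumes the ingress object carries a `customDomains` array whose entries have `name`, `bindingType`, and `certificateId`. The function names and the sample values in the test are hypothetical.

```typescript
interface CustomDomain {
  name: string;
  bindingType: "Disabled" | "SniEnabled";
  certificateId?: string;
}

interface Ingress {
  customDomains?: CustomDomain[];
  [key: string]: unknown; // other ingress settings are passed through untouched
}

// Returns a copy of the ingress config in which `hostname` is bound with
// SNI to the given managed certificate; other domains are left as they are.
export function withSniBinding(
  ingress: Ingress,
  hostname: string,
  certificateId: string,
): Ingress {
  const others = (ingress.customDomains ?? []).filter((d) => d.name !== hostname);
  return {
    ...ingress,
    customDomains: [...others, { name: hostname, bindingType: "SniEnabled", certificateId }],
  };
}

// Wraps the updated ingress in the nesting a container app PATCH body uses.
export function buildPatchBody(ingress: Ingress, hostname: string, certificateId: string) {
  return {
    properties: { configuration: { ingress: withSniBinding(ingress, hostname, certificateId) } },
  };
}
```

Replacing the existing entry for the hostname, rather than appending a second one, is what upgrades the earlier Disabled binding in place.
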
Terraform parses ${VAR} inside heredoc strings as a TF template expression.
Using $VAR without braces is valid bash and avoids the conflict.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@rajbos rajbos merged commit c1e5ec0 into main Apr 28, 2026
21 checks passed
@rajbos rajbos deleted the rajbos/azure-aca-terraform-deploy branch April 28, 2026 10:02