feat: Azure Container Apps deployment for sharing server (Terraform + workflows)#677
Merged
Conversation
Adds infrastructure-as-code and CI/CD workflows to deploy the sharing server to Azure Container Apps with Azure Files-backed SQLite persistence.

Changes:
- sharing-server/infra/: Terraform module (providers, variables, main, outputs)
  - ACA Environment + Container App per deployment (prod/test isolation)
  - Azure Storage Account + Azure Files share for /data persistence
  - scale-to-zero for test envs, min_replicas=1 for prod
  - BASE_URL auto-derived from the ACA environment default_domain
- .github/workflows/sharing-server-deploy.yml: builds the container image and runs terraform apply on push to any branch; uses GitHub Environments (production / testing) for OIDC-based Azure auth with env-scoped secrets
- .github/workflows/sharing-server-cleanup.yml: terraform destroy on branch deletion (test envs only; main is excluded)
- sharing-server/src/db.ts: switch SQLite journal_mode from WAL to DELETE; WAL requires shared-memory locking that Azure Files (SMB) does not support

Required repo variables: AZURE_RESOURCE_GROUP, AZURE_LOCATION (opt), TF_STATE_RESOURCE_GROUP, TF_STATE_STORAGE_ACCOUNT, TF_STATE_CONTAINER, SHARING_ALLOWED_GITHUB_ORG (opt)

Required secrets (per GitHub Environment): AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, SHARING_GITHUB_CLIENT_ID, SHARING_GITHUB_CLIENT_SECRET, SHARING_SESSION_SECRET

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OIDC federated credentials are not available on the target subscription. Replace ARM_USE_OIDC + id-token:write with ARM_CLIENT_SECRET from a pre-created service principal stored as a GitHub Environment secret. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ken-usage (subscription 877cc7f4-0de3-4b2a-a3d6-ecd82e7b7cd4)

Adds sharing-server/infra/create-sp.sh to create a service principal scoped to the target resource group and output SDK auth JSON.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The service principal creation script was added during prototyping and should not be committed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ured
- Group multiple >> GITHUB_OUTPUT redirects in 'Compute deployment
parameters' step using { ...; } >> file syntax to satisfy shellcheck
SC2129 (style: use grouped redirects instead of individual ones)
- Add 'Check Azure credentials' pre-flight step to the deploy job;
sets configured=false output when ARM_CLIENT_ID is absent so all
Terraform steps are skipped gracefully instead of failing with an
opaque backend error when the GitHub environment is not yet set up
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
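The grouped-redirect style fix described above can be sketched as follows (variable names are hypothetical, and a temp file stands in for `$GITHUB_OUTPUT` so the snippet is self-contained):

```shell
# Sketch of the SC2129 fix: one compound command, one redirect.
out_file="$(mktemp)"   # stands in for "$GITHUB_OUTPUT"

# Before (each line has its own redirect, flagged by shellcheck SC2129):
#   echo "environment=testing"     >> "$out_file"
#   echo "app_name=sharing-server" >> "$out_file"

# After: a single grouped redirect
{
  echo "environment=testing"
  echo "app_name=sharing-server"
} >> "$out_file"

cat "$out_file"
```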
The azurerm_container_app resource uses 'volume_mounts' (plural) for the block type inside container, not 'volume_mount'. This caused Terraform plan to fail with 'Blocks of type volume_mount are not expected here'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a server-side PAT (ORG_CHECK_TOKEN) to validate org membership without relying on the user's own token having read:org scope or SAML SSO authorization. The PAT is stored as a GitHub Environment secret and passed to the container via an ACA secret reference. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
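A hypothetical sketch of the server-side membership check: the REST endpoint is the standard GitHub API call (`GET /orgs/{org}/members/{username}`, 204 for members), but the function name, wiring, and injectable `fetchFn` are assumptions, not the actual code.

```typescript
// Check org membership with the server-side PAT (ORG_CHECK_TOKEN)
// instead of relying on the user's own OAuth token scopes.
async function isOrgMember(
  org: string,
  username: string,
  orgCheckToken: string,
  // fetchFn is injectable purely so the sketch can be tested offline
  fetchFn: (url: string, init: { headers: Record<string, string> }) => Promise<{ status: number }>,
): Promise<boolean> {
  const res = await fetchFn(`https://api.github.com/orgs/${org}/members/${username}`, {
    headers: {
      Authorization: `Bearer ${orgCheckToken}`,
      "User-Agent": "sharing-server",
    },
  });
  // GitHub returns 204 when the user is a member, 404 otherwise
  return res.status === 204;
}
```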
…ider

The service principal has Contributor role scoped to the resource group only, not subscription-level permissions. Terraform's default behavior tries to register all supported Azure Resource Providers at the subscription level, which fails with 403 AuthorizationFailed for each provider.

Setting resource_provider_registrations = "none" disables this behavior. The required providers (Microsoft.App, Microsoft.Storage, Microsoft.OperationalInsights) must already be registered in the subscription (which they are for an existing Azure environment).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ACA rejects secrets with empty string values. The testing environment does not set ORG_CHECK_TOKEN, so github_org_check_token defaults to empty string. Use dynamic blocks for both the secret and the env var referencing it so they are only created when a non-empty token value is provided. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
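A sketch of the dynamic-block shape this describes (the variable and secret names below are assumed, not copied from main.tf):

```hcl
# Inside the azurerm_container_app resource (sketch): only render the
# secret when a non-empty token is supplied, since ACA rejects secrets
# with empty string values.
dynamic "secret" {
  for_each = var.github_org_check_token != "" ? [1] : []
  content {
    name  = "org-check-token"
    value = var.github_org_check_token
  }
}

# ...and the matching env var inside the container block, gated the
# same way so it never references a secret that does not exist:
dynamic "env" {
  for_each = var.github_org_check_token != "" ? [1] : []
  content {
    name        = "ORG_CHECK_TOKEN"
    secret_name = "org-check-token"
  }
}
```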
Two bugs caused a broken database state after any startup lock conflict:

1. No busy_timeout: SQLite returns SQLITE_BUSY immediately on Azure Files when a previous container revision still holds a transient SMB lock. Adding PRAGMA busy_timeout = 10000 lets SQLite retry for up to 10 s.
2. _db assigned before initSchema: if PRAGMA journal_mode = DELETE threw SQLITE_BUSY, _db was set but initSchema had not run. All subsequent getDb() calls saw _db as truthy, skipped init, and crashed with 'no such table: users' on every request until the container restarted.

Fix: use a local variable during init; only assign _db on full success. On failure, close the half-open connection and rethrow so the next request retries the whole initialization.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
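A minimal sketch of the init fix; the `Db` type, `openDb`, `initSchema`, and the `simulateBusy` hook are hypothetical stand-ins for the real db.ts internals:

```typescript
type Db = { open: boolean; tables: string[]; close(): void };

let _db: Db | null = null;
let simulateBusy = false; // test hook standing in for SQLITE_BUSY

function openDb(): Db {
  return { open: true, tables: [], close() { this.open = false; } };
}

function initSchema(db: Db): void {
  if (simulateBusy) throw new Error("SQLITE_BUSY");
  db.tables.push("users");
}

function getDb(): Db {
  if (_db) return _db;
  const db = openDb();   // local variable, NOT the module singleton
  try {
    initSchema(db);      // may throw (e.g. SQLITE_BUSY on Azure Files)
  } catch (err) {
    db.close();          // release the half-open connection
    throw err;           // next call retries the whole initialization
  }
  _db = db;              // only assigned after full success
  return db;
}
```

With the old ordering, a thrown `initSchema` left `_db` set and every later call skipped init; here a failed init leaves `_db` null so the next request starts over.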
Without an onError handler Hono silently converts unhandled exceptions to a plain-text 500 with nothing in the container logs. Add app.onError() to log every unhandled error with timestamp, method, and path. Also wrap upsertUser() in the /auth/github/callback handler — the only uncaught call site — so that database errors (e.g. SQLite init failure on Azure Files) surface as a descriptive error page instead of a blank 500, and the error is logged to stdout for ACA log inspection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
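A sketch of the log line such a handler might emit (the format below is assumed, not taken from the actual code; in Hono it would be produced inside `app.onError((err, c) => ...)`):

```typescript
// Build one descriptive stdout line per unhandled error so ACA log
// inspection shows timestamp, method, path, and the error itself.
function formatUnhandledError(method: string, path: string, err: unknown): string {
  const message = err instanceof Error ? `${err.name}: ${err.message}` : String(err);
  return `[${new Date().toISOString()}] unhandled error ${method} ${path} - ${message}`;
}
```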
SQLite on Azure Files SMB fails with 'database is locked' (SQLITE_BUSY) because ACA scale-to-zero kills the container process without cleanly closing the SMB file handle. Azure Files retains the byte-range oplock for 30-60s after the process dies, blocking the next container's startup.

Fixes applied:
- min_replicas default changed 0 → 1: container stays running, no more cold starts that race against stale SMB oplocks.
- busy_timeout increased 10s → 60s: safety net during rolling deploys where old and new revisions briefly coexist.
- closeDb() exported from db.ts: properly closes the SQLite connection and releases the SMB lock during shutdown.
- SIGTERM/SIGINT handlers in server.ts: call closeDb() before exit so ACA's drain period is enough for the new revision to get the lock.
- Eager DB init at startup: lock contention is resolved before the first user request rather than surfacing as a mid-request 500 error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
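The signal-handler wiring might look like the sketch below (closeDb is named in the commit; everything else, including the log text, is assumed):

```typescript
// Graceful-shutdown sketch: both signals share one drain path so the
// SMB file handle is released before ACA reclaims the replica.
let dbClosed = false;
function closeDb(): void { dbClosed = true; } // stand-in for the real close

function shutdown(signal: string): void {
  console.log(`${signal} received, closing SQLite before exit`);
  closeDb(); // releases the lock on the Azure Files mount
}

process.on("SIGTERM", () => { shutdown("SIGTERM"); process.exit(0); });
process.on("SIGINT", () => { shutdown("SIGINT"); process.exit(0); });
```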
…ploy

During an ACA rolling update, old and new revisions briefly share the same Azure Files mount. The old revision's SQLite connection holds a file lock that blocks the new revision's startup.

Previous approach: busy_timeout=60s blocked the event loop for a full minute before the server even started listening — health probes failed, ACA restarted the container in a loop.

New approach:
- busy_timeout reduced to 5 s (short block per attempt, event loop stays live)
- Server starts listening FIRST so health checks respond immediately
- initDbWithRetry() runs async after server start: up to 20 attempts with 5-30 s back-off between them, giving the old revision's SMB oplock time to release (~30-120 s for Azure Files byte-range lock release)
- Each getDb() attempt only blocks the loop for ≤5 s
- On success the module-level _db singleton is set; all subsequent requests use it without any retry overhead

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
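A sketch of what initDbWithRetry could look like (attempt count and back-off window come from the description above; the injectable `sleep` exists purely so the sketch is testable):

```typescript
// Retry DB init in the background after the server starts listening.
async function initDbWithRetry(
  tryInit: () => void,
  attempts = 20,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      tryInit(); // blocks the event loop for at most busy_timeout (5 s)
      return;
    } catch (err) {
      if (attempt === attempts) throw err; // give up after the last attempt
      // back-off grows from 5 s toward 30 s while the old revision's
      // SMB oplock drains
      await sleep(Math.min(5_000 * attempt, 30_000));
    }
  }
}
```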
…king

Azure Files SMB does not reliably support the POSIX advisory byte-range locks that SQLite requires. This caused persistent 'database is locked' errors even with a single container, as the lock was never released by the Azure Files server within a reasonable time.

Changes:
- db.ts: SQLite now runs on LOCAL_DATA_DIR (/tmp/db, container local ephemeral disk). Azure Files (DATA_DIR /data) is used only for backup/restore via plain file copy (no locking involved). WAL journal mode is now safe since we're on a local filesystem. Added restoreFromBackup() and backupToAzureFiles() exports.
- server.ts: Calls restoreFromBackup() at startup before initDbWithRetry(). SIGTERM handler now calls backupToAzureFiles() before closeDb(). Periodic backup every 5 minutes guards against unexpected SIGKILL.
- main.tf: Adds LOCAL_DATA_DIR=/tmp/db and explicit DATA_DIR=/data env vars to the container. DATA_DIR remains the Azure Files mount point.

Data persistence: backup is written on clean shutdown (SIGTERM) and every 5 minutes. Max data loss on unexpected kill: 5 minutes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
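The backup/restore copy could be sketched like this (the directories are passed as parameters for testability, and the database file name "sharing.db" is an assumption):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

const DB_FILE = "sharing.db"; // file name assumed

// Restore the local working copy from the Azure Files mount at startup.
function restoreFromBackup(localDir: string, backupDir: string): void {
  const backup = path.join(backupDir, DB_FILE);
  if (!fs.existsSync(backup)) return; // first boot: nothing to restore
  fs.mkdirSync(localDir, { recursive: true });
  fs.copyFileSync(backup, path.join(localDir, DB_FILE)); // plain copy, no SQLite locks
}

// Copy the live local database to the Azure Files mount (SIGTERM +
// every 5 minutes). SMB only ever sees whole-file writes.
function backupToAzureFiles(localDir: string, backupDir: string): void {
  const live = path.join(localDir, DB_FILE);
  if (!fs.existsSync(live)) return;
  fs.mkdirSync(backupDir, { recursive: true });
  fs.copyFileSync(live, path.join(backupDir, DB_FILE));
}
```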
- variables.tf: add custom_domain variable (optional, default '') - main.tf: split locals into aca_fqdn + app_fqdn (prefers custom domain), add azurerm_container_app_environment_managed_certificate and azurerm_container_app_custom_domain resources (count-gated on custom_domain) - outputs.tf: add aca_fqdn output; existing outputs already use app_fqdn - deploy workflow: add TF_VAR_custom_domain from SHARING_CUSTOM_DOMAIN var, add import step that detects portal-created resources and imports them into Terraform state before plan (idempotent — skips if already managed) - GitHub env var SHARING_CUSTOM_DOMAIN=ai-fluency-server-test.devopsjournal.io set for testing environment Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Polls /health up to 12 times (10s apart, 2 min total) after terraform apply. Fails the workflow if the container never comes up healthy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
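The health gate might look like the sketch below (12 attempts and 10 s spacing come from the description; the function shape and parameterisation are assumptions so the loop can be exercised quickly):

```shell
# Poll ${url}/health until it responds, or fail after all attempts.
check_health() {
  url="$1"; attempts="${2:-12}"; delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS --max-time 5 "${url}/health" > /dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "container never became healthy after ${attempts} attempts" >&2
  return 1
}
```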
- Fixed SHARING_ALLOWED_GITHUB_ORGS typo in production env (was plural, workflow reads singular SHARING_ALLOWED_GITHUB_ORG)
- Added aca_fqdn to deployment summary table so the production CNAME target is visible immediately after first deploy
- Added DNS setup instructions to production summary when custom domain is not yet configured
terraform import prompts interactively for missing required variables, blocking the workflow and holding the state lock open. Fix by passing all the same env vars as plan/apply, and adding -input=false to fail fast rather than prompt if any variable is still missing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AzureRM provider uses subject_name, not dns_suffix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
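A sketch of the corrected attribute (resource and variable names assumed):

```hcl
resource "azurerm_container_app_environment_managed_certificate" "sharing" {
  name                         = "sharing-cert"
  container_app_environment_id = azurerm_container_app_environment.env.id
  subject_name                 = var.custom_domain # not dns_suffix
}
```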
…mporting
The previous import approach caused a 400 'CertificateInUse' error because
Azure auto-generates cert names in the portal, which don't match the Terraform
resource name ('sharing-cert'). After import, TF detected the name drift and
planned a destroy+recreate, but Azure blocks cert deletion while a domain
binding exists.
New approach:
- Detect if TF state has a cert with a non-TF name (portal-created)
- If so: delete the domain binding first, then delete the cert, remove stale
TF state entries — letting Terraform create both resources fresh with the
correct names
- If cert is already TF-managed ('sharing-cert'): no-op
Also added lifecycle.create_before_destroy = true on the cert resource as a
safety net for future cert replacements.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Azure API requires the hostname to be bound to a container app before a managed certificate can be created for it (error: RequireCustomHostnameInEnvironment). This creates a circular dependency in Terraform: the managed cert needs the hostname registered, but azurerm_container_app_custom_domain with SniEnabled needs the cert to exist.

Break the cycle with a null_resource that registers the hostname (Disabled binding) via az CLI local-exec, giving the cert resource a hostname to validate against. TF then creates the cert and upgrades the binding to SniEnabled via azurerm_container_app_custom_domain.

Also adds the hashicorp/null provider (~> 3.0) to providers.tf.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…or cert binding

azurerm_container_app_custom_domain rejects managedCertificate IDs at plan time (expects .../certificates/... but managed certs use .../managedCertificates/...). This is a known AzureRM provider limitation.

Replace with a null_resource that PATCHes the container app ingress directly via az rest. The PATCH reads the current ingress config (GET), updates customDomains to SniEnabled with the managed cert ID, and writes it back. This bypasses the provider's ID-format validation entirely.

Execution order:
1. null_resource.hostname_registration - az hostname add (Disabled)
2. azurerm_...managed_certificate - cert provisioned
3. null_resource.cert_binding - az rest PATCH to SniEnabled

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
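The three-step ordering can be sketched in Terraform as below (resource names and command arguments are condensed assumptions; required arguments and the real az invocations are elided):

```hcl
resource "null_resource" "hostname_registration" {
  provisioner "local-exec" {
    # step 1: register the hostname with a Disabled binding so the
    # managed cert has something to validate against
    command = "az containerapp hostname add --hostname ${var.custom_domain} ..."
  }
}

resource "azurerm_container_app_environment_managed_certificate" "sharing" {
  subject_name = var.custom_domain
  # step 2: cert is provisioned only after the hostname exists
  depends_on = [null_resource.hostname_registration]
  # (other required arguments elided)
}

resource "null_resource" "cert_binding" {
  provisioner "local-exec" {
    # step 3: az rest PATCH upgrades the binding to SniEnabled, bypassing
    # the provider's .../certificates/... ID-format validation
    command = "az rest --method patch ..."
  }
  depends_on = [azurerm_container_app_environment_managed_certificate.sharing]
}
```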
Terraform parses ${...} inside heredoc strings as TF expressions. Using $VAR without braces is valid bash and avoids the conflict. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds a full infrastructure-as-code deployment for the sharing server to Azure Container Apps, using Terraform and GitHub Actions.
Architecture
Resources per deployment:
- ACA Environment + Container App
- Azure Storage Account + Azure Files share mounted at /data

Estimated cost: $0.10/month (prod, always-on) — mostly the Azure Files share ($0.06/GB/month LRS). Test envs scale to zero and cost near nothing while idle.

Changes
- sharing-server/infra/providers.tf
- sharing-server/infra/variables.tf
- sharing-server/infra/main.tf
- sharing-server/infra/outputs.tf
- .github/workflows/sharing-server-deploy.yml
- .github/workflows/sharing-server-cleanup.yml
- sharing-server/src/db.ts

Setup required (one-time)
1. Terraform state storage
Pre-create an Azure Storage Account + blob container for Terraform state, then set:
- TF_STATE_RESOURCE_GROUP (repo var)
- TF_STATE_STORAGE_ACCOUNT (repo var)
- TF_STATE_CONTAINER (repo var, e.g. tfstate)

2. GitHub Environments
Create two GitHub Environments:
production and testing. For each, create an Azure AD app registration with a federated credential:
- production subject: repo:rajbos/github-copilot-token-usage:environment:production
- testing subject: repo:rajbos/github-copilot-token-usage:environment:testing

Assign Contributor role on the resource group + Storage Blob Data Contributor on the TF state storage account.
Add these secrets to each environment:
- AZURE_CLIENT_ID
- AZURE_TENANT_ID
- AZURE_SUBSCRIPTION_ID
- SHARING_GITHUB_CLIENT_ID
- SHARING_GITHUB_CLIENT_SECRET
- SHARING_SESSION_SECRET

3. Repo variables
- AZURE_RESOURCE_GROUP — resource group to deploy into
- AZURE_LOCATION — Azure region (default: westeurope)
- SHARING_ALLOWED_GITHUB_ORG — optional org restriction

4. GitHub OAuth App
Create a GitHub OAuth App. The first deployment will output the callback URL to register. For the production deployment, you can also configure a custom domain via ACA's built-in HTTPS binding.
Notes
Unauthenticated endpoints (/health, /api/upload) work without this.