This repo holds the ArgoCD/ACK manifests and container build that run Discourse self-hosted on the mzla-eks-workloads01 EKS cluster (eu-central-1, account 668807881758). The forum is exposed at https://discourse.thunderbird.net via Cloudflare Tunnel.
It is the planned replacement for Topicbox. Data migration from Topicbox is a later phase.
Audience: SREs landing here for the first time should read Architecture + Where things live to understand the system. Senior SREs should jump to Operational notes + Gotchas for the things that bit us during initial deployment.
Source of truth for the deployment plan and decision history: thunderbird/platform-infrastructure#279 (epic), #280 (initial Pulumi), #285 (ECR + GHA OIDC), #287 (YACE IRSA).
Discourse is a Rails (Ruby 3.4) forum platform. Upstream officially supports Docker via discourse_docker but does not officially support Kubernetes. The image we ship is built by discourse_docker's launcher bootstrap (which compiles assets, installs gems, and bakes Postgres + Redis binaries that go unused at runtime since we point at external services).
| Component | Image | Purpose |
|---|---|---|
| discourse-web | 668807881758.dkr.ecr.eu-central-1.amazonaws.com/discourse:v0.1.2 | nginx → Puma + Rails on :80. 2 replicas. discourse-prometheus collector on :9405 (pod-network only). |
| discourse-sidekiq | same image, bundle exec sidekiq | Background jobs (mail, search index, digest). Reports metrics to web's collector via IPC. 1 replica. |
| discourse-tunnel | cloudflare/cloudflared (managed by cloudflare-operator) | Outbound tunnel. Zero inbound ports. |
| discourse-postgres (RDS) | postgres:16.6 (AWS RDS) | Application database. 7-day automated backup retention. |
| discourse-redis (ElastiCache) | redis:7.1 (AWS ElastiCache) | Cache + Sidekiq queue. AUTH-token enabled, transit encryption on. |
| mzla-discourse-uploads (S3) | n/a | Avatars, attachments, post images, backups. Pod access via IRSA (no static keys). |
| SES (eu-central-1) | n/a | Outbound SMTP. Production-access enabled (50k/24h, 14/sec). |
| yace | quay.io/prometheuscommunity/yet-another-cloudwatch-exporter:v0.64.0 | Pulls AWS service metrics for the dashboard. |
Bundled by discourse_docker base — openid-connect, chat, solved, presence, data-explorer, plus 30+ others (see Admin → Plugins).
Cloned in by container/containers/app.yml (NOT bundled in the base):
- `discourse-prometheus`: exposes `/metrics` on `:9405` for VMAgent
Internet
│
▼
┌───────────────────────────────┐
│ Cloudflare edge (managed │
│ challenge + TLS termination │
│ on thunderbird.net zone) │
└───────────────────────────────┘
│ outbound tunnel
▼
┌───────────────────────────────┐ namespace: discourse
│ discourse-tunnel │ on mzla-eks-workloads01 (eu-central-1)
│ (cloudflared, 4 conns to │
│ fra03/fra14/fra16/fra08) │
└───────────────────────────────┘
│ http://discourse-web:80 → nginx → unicorn(127.0.0.1:3000)
▼
┌───────────────────────────────┐
│ discourse-web (nginx + Puma) │──── :9405/metrics ──► VMAgent
│ - HTTP API + UI │ (only after PROMETHEUS_WEBSERVER_BIND=0.0.0.0)
│ - discourse-prometheus │
│ - aggregates sidekiq metrics │
│ via Unix socket IPC │
└───────────────────────────────┘
│ │
│ └──► s3:mzla-discourse-uploads (IRSA: workloads-prod-discourse-default)
│
┌────────┴────────┐
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ ElastiCache │ │ RDS Postgres │
│ Redis 7.1 │ │ discourse- │
│ (AUTH+TLS) │ │ postgres │
│ master.disc..│ │ 16.6, db.t4g. │
└──────────────┘ │ medium, 30Gi │
▲ │ gp3, encrypted │
│ jobs │ 7d backups │
│ └──────────────────┘
┌────┴────────────────┐ ▲
│ discourse-sidekiq │────┘
│ (bundle exec sidekiq│ reports metrics → web's collector via Unix socket
└─────────────────────┘
YACE runs in the same namespace, scrapes CloudWatch via IRSA, exposes Prometheus metrics on :5000 for VMAgent.
discourse-deploy/
├── README.md you are here
├── .github/
│ ├── CODEOWNERS @thunderbird/platform-infrastructure
│ └── workflows/build.yml build + push to ECR on tag
├── container/
│ ├── Dockerfile unused (placeholder; we use launcher bootstrap)
│ └── containers/app.yml discourse_docker config (templates, plugin clones)
├── argocd/
│ ├── aws-resources/
│ │ ├── rds-postgres.yaml ACK DBSubnetGroup + DBInstance (postgres 16.6)
│ │ ├── elasticache-redis.yaml ACK CacheSubnetGroup + ReplicationGroup (redis 7.1)
│ │ └── s3-bucket.yaml ACK Bucket: mzla-discourse-uploads
│ ├── secrets/
│ │ ├── discourse-db-credentials.yaml ES → mzla/discourse/db
│ │ ├── discourse-redis-credentials.yaml ES → mzla/discourse/redis
│ │ ├── discourse-app-secrets.yaml ES → mzla/discourse/app
│ │ ├── discourse-smtp-credentials.yaml ES → mzla/discourse/smtp
│ │ └── cloudflare-credentials.yaml ES → mzla/twenty/cloudflare (shared)
│ ├── workloads/
│ │ ├── default-serviceaccount.yaml IRSA annotation overlay (S3 access)
│ │ ├── rds-bootstrap-job.yaml CREATE USER + GRANTs + hstore + pg_trgm + vector
│ │ ├── discourse-config.yaml ConfigMap of DISCOURSE_* env
│ │ ├── discourse-web.yaml Deployment + Service for nginx+Puma (image pinned)
│ │ ├── discourse-sidekiq.yaml Deployment for Sidekiq workers (image pinned)
│ │ ├── discourse-migrate-job.yaml Sync-hook Job: bundle exec rake db:migrate
│ │ └── cloudflare-tunnel.yaml Tunnel + TunnelBinding (discourse.thunderbird.net)
│ └── observability/
│ ├── discourse-vmpodscrape.yaml VMPodScrape on web pods only (port 9405)
│ └── yace.yaml YACE Deployment + ConfigMap + scrape (port 5000)
- `argocd/projects/discourse.yaml`: AppProject scoping discourse to workloads01
- `argocd/workloads/apps/discourse-app-of-apps.yaml`: sync pointer at this repo's `argocd/`
- `pulumi/environments/mzla-workloads/config.prod.yaml`: defines IRSA roles `workloads-prod-discourse-default` (S3), `workloads-prod-yace-discourse` (CloudWatch), `workloads-prod-discourse-deploy` (GHA OIDC for ECR push), the `workloads-prod-discourse-ses` IAM user (SES SMTP), and the four AWS Secrets Manager secrets at `mzla/discourse/{db,redis,app,smtp}`. Plus the ECR repo `discourse` and GitHub OIDC provider.
thunderbird/discourse-theme-bolt holds the Bolt-styled Discourse theme. It is public; v0.1.0 covers the basic palette and the Inter font. Install it via Admin → Customize → Themes → Install from a git repository, using tag v0.1.0.
terraform/dashboards/discourse/overview.json — Grafana dashboard at https://grafana.pi.thunderbird.net/d/discourse-overview. Datasource UID P4169E866C3094E38 (the workloads01 VictoriaMetrics).
Order matters because the ACK CRDs report endpoints in their status only after AWS reconciles them; Discourse needs those endpoints in discourse-config and rds-bootstrap-job to come up.
1. Pulumi up in `mzla-workloads` to create IRSA, the SES IAM user, and the four SM secrets.
2. Provision the ECR repo + GHA OIDC role (also via Pulumi). Create the `production` GitHub Environment in this repo: `gh api -X PUT repos/thunderbird/discourse-deploy/environments/production`.
3. Build the container image by tagging this repo `v0.1.x`. The GHA workflow runs `launcher bootstrap` on `ubuntu-24.04-arm` (so the resulting image is arm64; workloads01's default node groups are Graviton). The image is pushed to ECR as `:v0.1.x` and `:latest`.
4. Register the app by merging `argocd/projects/discourse.yaml` and `argocd/workloads/apps/discourse-app-of-apps.yaml` in `platform-infrastructure`. ArgoCD picks up this repo and starts syncing.
5. Patch `<TBD>` placeholders once RDS + ElastiCache are `available` (~10–15 min for RDS, ~5 min for Redis). Read the endpoints with:
   - `kubectl get dbinstance discourse-postgres -n discourse -o jsonpath='{.status.endpoint.address}'`
   - `kubectl get replicationgroup discourse-redis -n discourse -o jsonpath='{.status.nodeGroups[0].primaryEndpoint.address}'`

   PR the resolved values into `discourse-config.yaml` (`DISCOURSE_DB_HOST`, `DISCOURSE_REDIS_HOST`) and `rds-bootstrap-job.yaml` (`PGHOST`).
6. Wave 4 hooks (`rds-bootstrap-job` + `discourse-migrate-job`) run after the patch syncs. Bootstrap creates `discourse_app_user` + grants + extensions (`hstore`, `pg_trgm`, `vector`); migrate runs all Discourse Rails migrations against RDS.
7. Wave 5 brings up `discourse-web`, `discourse-sidekiq`, the Cloudflare tunnel, and the VMPodScrape. First Discourse boot takes ~3–5 min.
8. Smoke: `curl https://discourse.thunderbird.net/` (Cloudflare may challenge; open in a browser). Admin login uses creds from `mzla/discourse/app`: `aws secretsmanager get-secret-value --secret-id mzla/discourse/app --profile mzla-workloads --region eu-central-1 --query SecretString --output text | jq .`
9. Enable discourse-prometheus: Admin → Plugins → discourse-prometheus → toggle on. Pod `:9405` starts serving metrics; VMAgent scrapes within 30s.
10. Install the Bolt theme: Admin → Customize → Themes → Install from a git repository → `https://github.com/thunderbird/discourse-theme-bolt`, tag `v0.1.0`. Set as default.
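The endpoint patch in the deployment steps can be partly scripted. The sketch below is illustrative only: `wait_for_output` and `fill_placeholder` are hypothetical helpers, not part of this repo, and it assumes each `<TBD>` sits on the same line as its key (the actual layout of `discourse-config.yaml` and `rds-bootstrap-job.yaml` is not shown here, so a two-line `name:`/`value:` env entry would need a different match).

```shell
# wait_for_output TRIES DELAY CMD...: rerun CMD until it prints something.
# Useful because ACK fills .status.endpoint only after AWS reconciles.
wait_for_output() {
  local tries="$1" delay="$2" out i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    # A failing command is treated like an empty result: keep polling.
    out="$("$@" 2>/dev/null)" || out=""
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    if [ "$i" -lt "$tries" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# fill_placeholder FILE KEY VALUE: replace <TBD> on the line mentioning KEY.
fill_placeholder() {
  local file="$1" key="$2" value="$3" tmp
  # Refuse to write an empty value (endpoint not reported yet).
  [ -n "$value" ] || { echo "no value for $key yet" >&2; return 1; }
  tmp="$(mktemp)"
  # Write through a temp file so the edit works with both GNU and BSD sed.
  sed "/${key}/s|<TBD>|${value}|" "$file" > "$tmp" &&
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example wiring (commented out so sourcing this file has no side effects):
# db_host="$(wait_for_output 60 30 kubectl get dbinstance discourse-postgres \
#   -n discourse -o jsonpath='{.status.endpoint.address}')"
# fill_placeholder argocd/workloads/discourse-config.yaml DISCOURSE_DB_HOST "$db_host"
```

The result still goes through a PR, as described above; the helpers only save the copy-and-paste.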
The image: field in argocd/workloads/discourse-{web,sidekiq,migrate-job}.yaml is pinned to a specific v0.1.x tag (not :latest) so a tag-triggered build doesn't auto-replace running pods on restart. To roll a new image:
1. Tag this repo `v0.1.x`; GHA builds and pushes.
2. PR a manifest bump (`v0.1.{x-1}` → `v0.1.x`) in all three places.
3. ArgoCD rolls.
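Because the tag is pinned in three manifests, the bump is easy to get partially wrong. A minimal sketch of the edit follows; `next_patch_tag` and `bump_image_tag` are hypothetical helpers, and the sketch assumes the `image:` value is unquoted and ends the line with `/discourse:<tag>`.

```shell
# next_patch_tag v0.1.2 prints v0.1.3 (assumes vMAJOR.MINOR.PATCH tags).
next_patch_tag() {
  local tag="$1"
  local prefix="${tag%.*}"   # everything before the last dot, e.g. v0.1
  local patch="${tag##*.}"   # everything after the last dot, e.g. 2
  echo "${prefix}.$((patch + 1))"
}

# bump_image_tag OLD NEW FILE...: rewrite "/discourse:OLD" at end of line.
bump_image_tag() {
  local old="$1" new="$2" old_re f tmp
  shift 2
  # Escape dots so the old tag is matched literally by sed.
  old_re="$(printf '%s' "$old" | sed 's/\./\\./g')"
  for f in "$@"; do
    tmp="$(mktemp)"
    sed "s|/discourse:${old_re}\$|/discourse:${new}|" "$f" > "$tmp" &&
      cat "$tmp" > "$f" && rm -f "$tmp"
  done
}

# Example (run from the repo root; commented out to avoid side effects):
# bump_image_tag v0.1.2 "$(next_patch_tag v0.1.2)" \
#   argocd/workloads/discourse-web.yaml \
#   argocd/workloads/discourse-sidekiq.yaml \
#   argocd/workloads/discourse-migrate-job.yaml
```

Running `git diff` afterwards should show exactly three changed lines, one per manifest.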
Two paths:
- Promote via UI: user signs up at `/signup` → admin goes to `/admin/users` → search → "Grant Admin"
- Auto-promote: add their email to the `developerEmails` field in `mzla/discourse/app` (comma-separated). Restart web pods to pick up the new env: `kubectl rollout restart deploy/discourse-web -n discourse`
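For the auto-promote path, the only fiddly part is editing the comma-separated value without creating duplicates. A small sketch, where `add_email` is a hypothetical helper and the addresses are placeholders:

```shell
# add_email LIST EMAIL: append EMAIL to a comma-separated LIST unless present.
add_email() {
  local list="$1" email="$2"
  case ",${list}," in
    *",${email},"*) echo "$list" ;;             # already listed: unchanged
    ",,")           echo "$email" ;;            # empty list: just the email
    *)              echo "${list},${email}" ;;  # otherwise append
  esac
}

# Example: read the current value, then compute the new one.
# current="$(aws secretsmanager get-secret-value --secret-id mzla/discourse/app \
#   --profile mzla-workloads --region eu-central-1 \
#   --query SecretString --output text | jq -r .developerEmails)"
# add_email "$current" "new.admin@example.org"
```

Writing the value back is left out on purpose: the secret is defined through Pulumi, so how it should be updated depends on how that stack manages the secret's value.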
kubectl --context arn:aws:eks:eu-central-1:668807881758:cluster/mzla-eks-workloads01 \
rollout restart deploy/discourse-web deploy/discourse-sidekiq -n discourse
- RDS: 7-day automated retention, daily snapshot at 03:00 UTC. To take a manual snapshot: `aws rds create-db-snapshot --db-instance-identifier discourse-postgres --db-snapshot-identifier discourse-manual-$(date +%Y%m%d) --profile mzla-workloads --region eu-central-1`
- S3 uploads: bucket has versioning enabled; find the wanted version with `aws s3api list-object-versions`, then copy it back with `aws s3api copy-object` (`--copy-source` accepts a `versionId`). `restore-object` is only needed for objects in an archive storage class.
- Discourse-side: Admin → Backups → "Backup" creates a tarball in the S3 uploads bucket under `/backups/`
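RDS snapshot identifiers must be unique, so the date-only name above fails if a second manual snapshot is taken on the same day. A sketch of one way around that; `snapshot_id` is a hypothetical helper and the suffix convention is an assumption, not something this repo defines.

```shell
# snapshot_id [STAMP] [SUFFIX]: build a manual snapshot identifier.
# STAMP defaults to today's date (YYYYMMDD); SUFFIX is optional.
snapshot_id() {
  local stamp="${1:-$(date +%Y%m%d)}" suffix="${2:-}"
  if [ -n "$suffix" ]; then
    echo "discourse-manual-${stamp}-${suffix}"
  else
    echo "discourse-manual-${stamp}"
  fi
}

# Example (commented out; needs AWS credentials):
# id="$(snapshot_id "" pre-upgrade)"
# aws rds create-db-snapshot --db-instance-identifier discourse-postgres \
#   --db-snapshot-identifier "$id" --profile mzla-workloads --region eu-central-1
# aws rds wait db-snapshot-available --db-snapshot-identifier "$id" \
#   --profile mzla-workloads --region eu-central-1
```

The `wait` call blocks until the snapshot is usable, which is handy before starting a risky change.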
- CloudWatch: log group `/eks/mzla/mzla-eks-workloads01/applications`, stream pattern `discourse/<pod>/<container>` (3-day retention). Vector ships there in addition to VictoriaLogs.
- VictoriaLogs: at https://grafana.pi.thunderbird.net (LogsQL datasource), filter `_stream:{namespace="discourse"}`
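The CloudWatch stream pattern maps directly onto a stream-name prefix. A short sketch, assuming AWS CLI v2 (which provides `aws logs tail`); `stream_prefix` is a hypothetical helper and the pod name in the example is made up.

```shell
# stream_prefix [POD [CONTAINER]]: build a prefix for discourse/<pod>/<container>.
stream_prefix() {
  local pod="${1:-}" container="${2:-}"
  if [ -z "$pod" ]; then
    echo "discourse/"                      # every pod in the namespace
  elif [ -z "$container" ]; then
    echo "discourse/${pod}/"               # every container in one pod
  else
    echo "discourse/${pod}/${container}"   # one container
  fi
}

# Example (commented out; needs AWS credentials):
# aws logs tail /eks/mzla/mzla-eks-workloads01/applications \
#   --log-stream-name-prefix "$(stream_prefix discourse-web-abc123)" \
#   --since 1h --profile mzla-workloads --region eu-central-1
```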
- Grafana: https://grafana.pi.thunderbird.net/d/discourse-overview
- Discourse app metrics from discourse-prometheus on web pods `:9405`
- AWS metrics from YACE pod `:5000` (filtered by `project=discourse` tag)
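When a dashboard panel looks wrong, it can help to read a metric straight from a pod. The sketch below sums one metric across its label sets from Prometheus text-format output; `metric_sum` is a hypothetical helper, `demo_total` is a placeholder metric name, and the parser assumes samples carry no trailing timestamp.

```shell
# metric_sum NAME: sum all samples of NAME (any labels) read from stdin.
# Exits non-zero if NAME does not appear at all.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    NF >= 2 {
      metric = $1
      sub(/[{].*/, "", metric)          # drop the label set, if any
      if (metric == name) { total += $NF; found = 1 }
    }
    END { if (found) print total; else exit 1 }
  '
}

# Example (commented out; needs cluster access):
# kubectl port-forward deploy/discourse-web 9405:9405 -n discourse &
# curl -s http://localhost:9405/metrics | metric_sum demo_total
```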
The Bolt-styled theme lives in thunderbird/discourse-theme-bolt (public; v0.1.0). Install via Admin UI → Customize → Themes → "Install from a git repository". Theme covers basic palette + Inter font; full Bolt palette + dark scheme tracked in that repo's TODO.