
Add env-gating monitoring baseline for ATN stage #379

Open

e9e4e5f0faef wants to merge 4 commits into feat/stage-efs-isolation from feat/stage-monitoring-baseline

Conversation

e9e4e5f0faef (Collaborator) commented Apr 16, 2026

Depends on: #378 (must be merged first)

Summary

  • Add a monitoring baseline for ATN stage covering availability, resource utilisation, and notification integrity
  • Create 28 explicit CloudWatch alarms across ALB, target group, ECS (with Container Insights), Amazon MQ, Redis, and SNS
  • Add an availability-first dashboard for fast triage ("is it alive?" before "is it busy?")
  • Route notifications through a single SNS topic with recipients loaded from Secrets Manager
  • Keep thresholds tuneable via config.stage.yaml with per-service overrides where they matter

Changes

  • infra/pulumi/__main__.py: Add explicit SNS topic + subscriptions, 28 CloudWatch alarms, and a CloudWatch dashboard built from module-level widget builders (_alb_requests_widget, _ecs_resources_widget, _mq_widgets, _redis_widgets, _availability_widgets, _build_dashboard_body)
  • infra/pulumi/config.stage.yaml: Add monitoring config block with tuneable thresholds, per-service min_tasks_per_service override, and secret-backed notification recipient configuration
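
As a rough illustration of the config plumbing, a minimal sketch of how __main__.py could read the monitoring block; the key names here are assumptions based on the summary above, not the PR's actual schema:

```python
import pulumi

# Parse the whole monitoring block from config.stage.yaml as a plain dict.
config = pulumi.Config()
monitoring = config.require_object("monitoring")

# Hypothetical keys: thresholds stay tuneable without code changes.
redis_cpu_threshold = monitoring["redis"]["cpu_threshold_percent"]
ecs_cfg = monitoring["ecs"]
```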

Why

Metrics collection already exists, but actionable alerting and visualisation are largely absent. This change adds the minimum env-gating observability layer needed before treating stage as ready for further activation steps.

The alarms focus on the current stage runtime path:

  • ALB / target group health for web and versioncheck (positive availability via HealthyHostCount, plus error-count and response-time alarms); a representative availability alarm is sketched after this list
  • ECS service saturation and availability for web, worker, and versioncheck (CPU, memory, plus RunningTaskCount via Container Insights)
  • Amazon MQ queue and broker health for the dedicated stage broker
  • Redis pressure signals for the stage result backend
  • SNS topic delivery integrity (NumberOfNotificationsFailed)
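
The availability-positive pattern, sketched minimally below; `alb`, `web_tg`, and `alerts_topic` are assumed to be pulumi_aws resources already in scope, and names and thresholds are illustrative rather than the PR's actual values:

```python
import pulumi_aws as aws

# Positive availability: fire when healthy targets drop below 1.
# treat_missing_data="breaching" because a missing HealthyHostCount series
# means the target group itself is gone (see "Implementation notes").
web_healthy_hosts = aws.cloudwatch.MetricAlarm(
    "stage-web-healthy-hosts",
    namespace="AWS/ApplicationELB",
    metric_name="HealthyHostCount",
    dimensions={
        "LoadBalancer": alb.arn_suffix,    # ALB-side metrics use this dimension alone
        "TargetGroup": web_tg.arn_suffix,  # target-group metrics add this dimension
    },
    statistic="Minimum",
    period=60,
    evaluation_periods=3,
    threshold=1,
    comparison_operator="LessThanThreshold",
    treat_missing_data="breaching",
    alarm_actions=[alerts_topic.arn],      # single SNS notification path
    alarm_description="stage web target group has no healthy targets",
)
```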

Redis includes both EngineCPUUtilization and host CPUUtilization because the cache.t3.small node has only 2 vCPUs. AWS explicitly recommends monitoring host CPU alongside engine CPU on nodes with two or fewer vCPUs, as EngineCPUUtilization alone can miss host-level overload.
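
A sketch of that pairing, assuming `redis_rg` is the stage aws.elasticache.ReplicationGroup and reusing the shared `alerts_topic`; the 80% threshold is a placeholder for the config-driven value:

```python
import pulumi_aws as aws

# Use the provider's actual member cluster ID instead of reconstructing "<rg-id>-001".
cache_cluster_id = redis_rg.member_clusters.apply(lambda members: members[0])

# Engine CPU alone can miss host-level overload on nodes with <= 2 vCPUs,
# so alarm on both the engine and host CPU metrics.
for metric, suffix in [("EngineCPUUtilization", "engine-cpu"), ("CPUUtilization", "host-cpu")]:
    aws.cloudwatch.MetricAlarm(
        f"stage-redis-{suffix}",
        namespace="AWS/ElastiCache",
        metric_name=metric,
        dimensions={"CacheClusterId": cache_cluster_id},
        statistic="Average",
        period=300,
        evaluation_periods=2,
        threshold=80,  # placeholder; the real value comes from config.stage.yaml
        comparison_operator="GreaterThanOrEqualToThreshold",
        alarm_actions=[alerts_topic.arn],
        alarm_description=f"stage Redis {metric} sustained above threshold",
    )
```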

The implementation uses explicit alarm resources rather than the CloudWatchMonitoringGroup helper: this keeps a single SNS notification path, correct metric selection (the upstream helper has a target_5xx metric bug), and full control over alarm descriptions and behaviour.

Implementation notes

  • The CloudWatch Broker dimension uses broker_name, not broker.id: AWS publishes Amazon MQ for RabbitMQ metrics keyed by broker name, while the Pulumi aws.mq.Broker.id is the AWS UUID and would point at non-existent metric series.
  • Redis cache cluster ID uses replication_group.member_clusters[0]: the provider's actual output rather than a reconstruction of the AWS naming convention <rg-id>-001, which stays robust against multi-node setups and AWS naming changes.
  • treat_missing_data is per-metric, not uniform: breaching for managed-resource metrics (broker CPU/memory, target group HealthyHostCount, ECS RunningTaskCount) where missing data means the resource is gone; notBreaching for sparse error counts and traffic-dependent metrics under low load.
  • Per-service min_tasks override: ecs.min_tasks_per_service.<svc> falls back to ecs.min_tasks (default 1), letting operators set worker: 0 during intentional drain windows without firing the running-tasks alarm globally; see the fallback helper sketched after this list.
  • Dashboard layout leads with availability: top row shows ALB healthy hosts, ECS running tasks, and MQ consumers, so "is it alive?" reads before load and resource graphs.
  • Dashboard widget code is testable: each widget builder is module-level, takes resolved-output dicts plus plain Python parameters, and returns widget shapes with no Pulumi dependencies, replacing a previous ~420-line inline apply() lambda; a minimal builder shape is sketched after this list.
  • ALB target dimensions are aligned: HTTPCode_Target_5XX_Count and TargetResponseTime use LoadBalancer + TargetGroup in both alarms and dashboard widgets; ALB-side metrics keep LoadBalancer only.
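
Two of those notes sketched concretely; the builder and helper names echo the summary above, while the bodies and widget fields are illustrative, not the actual diff:

```python
# Module-level widget builder: plain Python in, CloudWatch widget dicts out,
# no Pulumi objects involved, so the layout logic is unit-testable.
def _redis_widgets(cache_cluster_id: str, region: str) -> list[dict]:
    return [
        {
            "type": "metric",
            "width": 12,
            "height": 6,
            "properties": {
                "title": "Redis CPU (engine vs host)",
                "region": region,
                "metrics": [
                    ["AWS/ElastiCache", "EngineCPUUtilization", "CacheClusterId", cache_cluster_id],
                    ["AWS/ElastiCache", "CPUUtilization", "CacheClusterId", cache_cluster_id],
                ],
            },
        },
    ]


# Per-service min_tasks override with a global fallback (default 1).
def _min_tasks_for(svc: str, ecs_cfg: dict) -> int:
    return ecs_cfg.get("min_tasks_per_service", {}).get(svc, ecs_cfg.get("min_tasks", 1))
```

Because the builders take resolved values, the only Pulumi-dependent step is a single apply at the edge that feeds resolved outputs into _build_dashboard_body.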

Validation

This PR contributes 31 resources to the stage stack:

  • 1 SNS topic + 1 email subscription
  • 6 ALB alarms (alb-5xx, target-5xx, response-time × web + versioncheck)
  • 4 target group alarms (UnHealthyHostCount, HealthyHostCount × web + versioncheck)
  • 9 ECS alarms (CPU, memory, RunningTaskCount × web + worker + versioncheck)
  • 3 Amazon MQ alarms active; the consumer-count alarm is defined in code but remains disabled while worker-on is not the expected steady state
  • 5 Redis alarms (memory, evictions, engine-cpu, host-cpu, connections)
  • 1 SNS topic delivery alarm (NumberOfNotificationsFailed)
  • 1 CloudWatch dashboard

Resource count assumes the current stage config with one SNS email recipient; if the atn/stage/monitoring_notify_emails list grows, the subscription count grows accordingly.

While stacked on #378, pulumi preview shows + 36 to create / ~ 20 to update / - 3 to delete / +- 3 to replace / = 140 unchanged (combined effect of both PRs against current stage state). Once #378 merges and this PR retargets to stage, pulumi preview for this PR should show only the 31 monitoring resources listed above.

ruff check and ruff format --check pass. Notification recipients are read from Secrets Manager, not committed in the public repo.

Prerequisites

  • The Secrets Manager entry atn/stage/monitoring_notify_emails (comma-separated list of recipient addresses) must exist before pulumi up
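
A sketch of how that wiring might look, assuming the secret stores a plain comma-separated string; resource names here are illustrative:

```python
import pulumi_aws as aws

alerts_topic = aws.sns.Topic("stage-monitoring-alerts")

# Recipients come from Secrets Manager, never from repository config.
secret = aws.secretsmanager.get_secret_version_output(
    secret_id="atn/stage/monitoring_notify_emails"
)
emails = secret.secret_string.apply(
    lambda raw: [addr.strip() for addr in raw.split(",") if addr.strip()]
)

# One email subscription per recipient; the resource count grows with the list.
# Creating resources inside apply is a known Pulumi trade-off, accepted here
# because the secret only affects subscription fan-out.
emails.apply(
    lambda addrs: [
        aws.sns.TopicSubscription(
            f"stage-monitoring-email-{i}",
            topic=alerts_topic.arn,
            protocol="email",
            endpoint=addr,
        )
        for i, addr in enumerate(addrs)
    ]
)
```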

Safety

  • Monitoring is additive only: no existing runtime behaviour is changed
  • Alarm recipients are not hardcoded in repository config
  • The MQ consumer alarm is posture-gated and remains disabled while stage worker desired count is intentionally 0
  • Per-service min_tasks_per_service lets operators drain a service without globally suppressing availability alarms
  • Thresholds remain tuneable without code changes
  • Existing deploy changes from the stacked infra branch (Introduce dedicated stage EFS; fix MQ broker drift and Memcached SG #378) are unchanged

Post-deploy validation

To be run after pulumi up, before this baseline is treated as env-gating green:

Follow-up

  • Revisit MQ consumer alarm once worker-on becomes expected steady state
  • Phase 2: EFS resource alarms (after EFS is in active use), log metric filters for silent degradation, deployment instability beyond RunningTaskCount, external/shared resource monitoring (RDS, OpenSearch), secondary notification channel (SMS / Slack / separate topic) so SNS-side delivery failure doesn't silence itself, standalone triage runbook

e9e4e5f0faef requested a review from Sancus April 16, 2026 16:21
e9e4e5f0faef self-assigned this Apr 16, 2026
e9e4e5f0faef marked this pull request as draft April 29, 2026 08:01
e9e4e5f0faef marked this pull request as ready for review April 29, 2026 16:13