
Add env-gating monitoring baseline for ATN stage #379

Open

e9e4e5f0faef wants to merge 4 commits into feat/stage-efs-isolation from feat/stage-monitoring-baseline

Conversation

e9e4e5f0faef (Collaborator) commented Apr 16, 2026

Depends on: #378 (must be merged first)

Summary

  • Add a monitoring baseline for ATN stage covering availability, resource utilisation, and notification integrity
  • Create 28 explicit CloudWatch alarms across ALB, target group, ECS (with Container Insights), Amazon MQ, Redis, and SNS
  • Add an availability-first dashboard for fast triage ("is it alive?" before "is it busy?")
  • Route notifications through a single SNS topic with recipients loaded from Secrets Manager
  • Keep thresholds tuneable via config.stage.yaml with per-service overrides where they matter

Changes

  • infra/pulumi/__main__.py: Add explicit SNS topic + subscriptions, 28 CloudWatch alarms, and a CloudWatch dashboard built from module-level widget builders (_alb_requests_widget, _ecs_resources_widget, _mq_widgets, _redis_widgets, _availability_widgets, _build_dashboard_body)
  • infra/pulumi/config.stage.yaml: Add monitoring config block with tuneable thresholds, per-service min_tasks_per_service override, and secret-backed notification recipient configuration
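
As a rough illustration of the config plumbing, a minimal sketch of how __main__.py could read the monitoring block; the key names here are assumptions based on the summary above, not the PR's actual schema:

```python
import pulumi

# Parse the whole monitoring block from config.stage.yaml as a plain dict.
config = pulumi.Config()
monitoring = config.require_object("monitoring")

# Hypothetical keys: thresholds stay tuneable without code changes.
redis_cpu_threshold = monitoring["redis"]["cpu_threshold_percent"]
ecs_cfg = monitoring["ecs"]
```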

Why

Metrics collection already exists, but actionable alerting and visualisation are largely absent. This change adds the minimum env-gating observability layer needed before treating stage as ready for further activation steps.

The alarms focus on the current stage runtime path:

  • ALB / target group health for web and versioncheck (positive availability via HealthyHostCount, plus error-count and response-time alarms); a representative availability alarm is sketched after this list
  • ECS service saturation and availability for web, worker, and versioncheck (CPU, memory, plus RunningTaskCount via Container Insights)
  • Amazon MQ queue and broker health for the dedicated stage broker
  • Redis pressure signals for the stage result backend
  • SNS topic delivery integrity (NumberOfNotificationsFailed)
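
The availability-positive pattern, sketched minimally below; `alb`, `web_tg`, and `alerts_topic` are assumed to be pulumi_aws resources already in scope, and names and thresholds are illustrative rather than the PR's actual values:

```python
import pulumi_aws as aws

# Positive availability: fire when healthy targets drop below 1.
# treat_missing_data="breaching" because a missing HealthyHostCount series
# means the target group itself is gone (see "Implementation notes").
web_healthy_hosts = aws.cloudwatch.MetricAlarm(
    "stage-web-healthy-hosts",
    namespace="AWS/ApplicationELB",
    metric_name="HealthyHostCount",
    dimensions={
        "LoadBalancer": alb.arn_suffix,    # ALB-side metrics use this dimension alone
        "TargetGroup": web_tg.arn_suffix,  # target-group metrics add this dimension
    },
    statistic="Minimum",
    period=60,
    evaluation_periods=3,
    threshold=1,
    comparison_operator="LessThanThreshold",
    treat_missing_data="breaching",
    alarm_actions=[alerts_topic.arn],      # single SNS notification path
    alarm_description="stage web target group has no healthy targets",
)
```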

Redis includes both EngineCPUUtilization and host CPUUtilization because the cache.t3.small node has only 2 vCPUs. AWS explicitly recommends monitoring host CPU alongside engine CPU on nodes with two or fewer vCPUs, as EngineCPUUtilization alone can miss host-level overload.
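
A sketch of that pairing, assuming `redis_rg` is the stage aws.elasticache.ReplicationGroup and reusing the shared `alerts_topic`; the 80% threshold is a placeholder for the config-driven value:

```python
import pulumi_aws as aws

# Use the provider's actual member cluster ID instead of reconstructing "<rg-id>-001".
cache_cluster_id = redis_rg.member_clusters.apply(lambda members: members[0])

# Engine CPU alone can miss host-level overload on nodes with <= 2 vCPUs,
# so alarm on both the engine and host CPU metrics.
for metric, suffix in [("EngineCPUUtilization", "engine-cpu"), ("CPUUtilization", "host-cpu")]:
    aws.cloudwatch.MetricAlarm(
        f"stage-redis-{suffix}",
        namespace="AWS/ElastiCache",
        metric_name=metric,
        dimensions={"CacheClusterId": cache_cluster_id},
        statistic="Average",
        period=300,
        evaluation_periods=2,
        threshold=80,  # placeholder; the real value comes from config.stage.yaml
        comparison_operator="GreaterThanOrEqualToThreshold",
        alarm_actions=[alerts_topic.arn],
        alarm_description=f"stage Redis {metric} sustained above threshold",
    )
```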

The implementation uses explicit alarm resources rather than the CloudWatchMonitoringGroup helper: this keeps a single SNS notification path, correct metric selection (the upstream helper has a target_5xx metric bug), and full control over alarm descriptions and behaviour.

Implementation notes

  • The CloudWatch Broker dimension uses broker_name, not broker.id: AWS publishes Amazon MQ for RabbitMQ metrics keyed by broker name, while the Pulumi aws.mq.Broker.id is the AWS UUID and would point at non-existent metric series.
  • Redis cache cluster ID uses replication_group.member_clusters[0]: the provider's actual output rather than a reconstruction of the AWS naming convention <rg-id>-001, which stays robust against multi-node setups and AWS naming changes.
  • treat_missing_data is per-metric, not uniform: breaching for managed-resource metrics (broker CPU/memory, target group HealthyHostCount, ECS RunningTaskCount) where missing data means the resource is gone; notBreaching for sparse error counts and traffic-dependent metrics under low load.
  • Per-service min_tasks override: ecs.min_tasks_per_service.<svc> falls back to ecs.min_tasks (default 1), letting operators set worker: 0 during intentional drain windows without firing the running-tasks alarm globally; see the fallback helper sketched after this list.
  • Dashboard layout leads with availability: top row shows ALB healthy hosts, ECS running tasks, and MQ consumers, so "is it alive?" reads before load and resource graphs.
  • Dashboard widget code is testable: each widget builder is module-level, takes resolved-output dicts plus plain Python parameters, and returns widget shapes with no Pulumi dependencies, replacing a previous ~420-line inline apply() lambda; a minimal builder shape is sketched after this list.
  • ALB target dimensions are aligned: HTTPCode_Target_5XX_Count and TargetResponseTime use LoadBalancer + TargetGroup in both alarms and dashboard widgets; ALB-side metrics keep LoadBalancer only.
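
Two of those notes sketched concretely; the builder and helper names echo the summary above, while the bodies and widget fields are illustrative, not the actual diff:

```python
# Module-level widget builder: plain Python in, CloudWatch widget dicts out,
# no Pulumi objects involved, so the layout logic is unit-testable.
def _redis_widgets(cache_cluster_id: str, region: str) -> list[dict]:
    return [
        {
            "type": "metric",
            "width": 12,
            "height": 6,
            "properties": {
                "title": "Redis CPU (engine vs host)",
                "region": region,
                "metrics": [
                    ["AWS/ElastiCache", "EngineCPUUtilization", "CacheClusterId", cache_cluster_id],
                    ["AWS/ElastiCache", "CPUUtilization", "CacheClusterId", cache_cluster_id],
                ],
            },
        },
    ]


# Per-service min_tasks override with a global fallback (default 1).
def _min_tasks_for(svc: str, ecs_cfg: dict) -> int:
    return ecs_cfg.get("min_tasks_per_service", {}).get(svc, ecs_cfg.get("min_tasks", 1))
```

Because the builders take resolved values, the only Pulumi-dependent step is a single apply at the edge that feeds resolved outputs into _build_dashboard_body.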

Validation

This PR contributes 31 resources to the stage stack:

  • 1 SNS topic + 1 email subscription
  • 6 ALB alarms (alb-5xx, target-5xx, response-time × web + versioncheck)
  • 4 target group alarms (UnHealthyHostCount, HealthyHostCount × web + versioncheck)
  • 9 ECS alarms (CPU, memory, RunningTaskCount × web + worker + versioncheck)
  • 3 Amazon MQ alarms active; the consumer-count alarm is defined in code but remains disabled while worker-on is not the expected steady state
  • 5 Redis alarms (memory, evictions, engine-cpu, host-cpu, connections)
  • 1 SNS topic delivery alarm (NumberOfNotificationsFailed)
  • 1 CloudWatch dashboard

Resource count assumes the current stage config with one SNS email recipient; if the atn/stage/monitoring_notify_emails list grows, the subscription count grows accordingly.

While stacked on #378, pulumi preview shows + 36 to create / ~ 20 to update / - 3 to delete / +- 3 to replace / = 140 unchanged (combined effect of both PRs against current stage state). Once #378 merges and this PR retargets to stage, pulumi preview for this PR should show only the 31 monitoring resources listed above.

ruff check and ruff format --check pass. Notification recipients are read from Secrets Manager, not committed in the public repo.

Prerequisites

  • The Secrets Manager entry atn/stage/monitoring_notify_emails (comma-separated list of recipient addresses) must exist before pulumi up
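
A sketch of how that wiring might look, assuming the secret stores a plain comma-separated string; resource names here are illustrative:

```python
import pulumi_aws as aws

alerts_topic = aws.sns.Topic("stage-monitoring-alerts")

# Recipients come from Secrets Manager, never from repository config.
secret = aws.secretsmanager.get_secret_version_output(
    secret_id="atn/stage/monitoring_notify_emails"
)
emails = secret.secret_string.apply(
    lambda raw: [addr.strip() for addr in raw.split(",") if addr.strip()]
)

# One email subscription per recipient; the resource count grows with the list.
# Creating resources inside apply is a known Pulumi trade-off, accepted here
# because the secret only affects subscription fan-out.
emails.apply(
    lambda addrs: [
        aws.sns.TopicSubscription(
            f"stage-monitoring-email-{i}",
            topic=alerts_topic.arn,
            protocol="email",
            endpoint=addr,
        )
        for i, addr in enumerate(addrs)
    ]
)
```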

Safety

  • Monitoring is additive only: no existing runtime behaviour is changed
  • Alarm recipients are not hardcoded in repository config
  • The MQ consumer alarm is posture-gated and remains disabled while stage worker desired count is intentionally 0
  • Per-service min_tasks_per_service lets operators drain a service without globally suppressing availability alarms
  • Thresholds remain tuneable without code changes
  • Existing deploy changes from the stacked infra branch (Introduce dedicated stage EFS; fix MQ broker drift and Memcached SG #378) are unchanged

Post-deploy validation

To be run after pulumi up, before this baseline is treated as env-gating green:

Follow-up

  • Revisit MQ consumer alarm once worker-on becomes expected steady state
  • Phase 2: EFS resource alarms (after EFS is in active use), log metric filters for silent degradation, deployment instability beyond RunningTaskCount, external/shared resource monitoring (RDS, OpenSearch), secondary notification channel (SMS / Slack / separate topic) so SNS-side delivery failure doesn't silence itself, standalone triage runbook

e9e4e5f0faef requested a review from Sancus April 16, 2026 16:21
e9e4e5f0faef self-assigned this Apr 16, 2026
e9e4e5f0faef marked this pull request as draft April 29, 2026 08:01
e9e4e5f0faef marked this pull request as ready for review April 29, 2026 16:13