Add env-gating monitoring baseline for ATN stage #379
Open
e9e4e5f0faef wants to merge 4 commits into feat/stage-efs-isolation
Summary
- [...] `config.stage.yaml` with per-service overrides where they matter

Changes
- `infra/pulumi/__main__.py`: [...] `_alb_requests_widget`, `_ecs_resources_widget`, `_mq_widgets`, `_redis_widgets`, `_availability_widgets`, `_build_dashboard_body`)
- `infra/pulumi/config.stage.yaml`: `monitoring` config block with tuneable thresholds, per-service `min_tasks_per_service` override, and secret-backed notification recipient configuration

Why
Metrics collection already exists, but actionable alerting and visualisation are largely absent. This change adds the minimum env-gating observability layer needed before treating stage as ready for further activation steps.
The alarms focus on the current stage runtime path:
- [...] `HealthyHostCount`, plus error-count and response-time alarms)
- [...] `RunningTaskCount` via Container Insights)
- [...] `NumberOfNotificationsFailed`)

Redis includes both `EngineCPUUtilization` and host `CPUUtilization` because the `cache.t3.small` node has only 2 vCPUs. AWS explicitly recommends monitoring host CPU alongside engine CPU on nodes with two or fewer vCPUs, as `EngineCPUUtilization` alone can miss host-level overload.

The implementation is explicit instead of using `CloudWatchMonitoringGroup` so we keep a single SNS notification path, correct metric selection (the upstream helper has a `target_5xx` metric bug), and full control over alarm descriptions and behaviour.

Implementation notes
- `Broker` dimension uses `broker_name`, not `broker.id`: AWS publishes Amazon MQ for RabbitMQ metrics keyed by broker name; the Pulumi `aws.mq.Broker.id` is the AWS UUID and would point at non-existent metric series.
- `replication_group.member_clusters[0]`: provider's actual output rather than reconstructing the AWS naming convention `<rg-id>-001`. Robust against multi-node setups and AWS naming changes.
- `treat_missing_data` is per-metric, not uniform: `breaching` for managed-resource metrics (broker CPU/memory, target group `HealthyHostCount`, ECS `RunningTaskCount`) where missing data means the resource is gone; `notBreaching` for sparse error counts and traffic-dependent metrics under low load.
- `min_tasks` override: `ecs.min_tasks_per_service.<svc>` falls back to `ecs.min_tasks` (default 1). Lets operators set `worker: 0` during intentional drain windows without firing the running-tasks alarm globally.
- [...] `apply()` lambda.
- `HTTPCode_Target_5XX_Count` and `TargetResponseTime` use `LoadBalancer + TargetGroup` in both alarms and dashboard widgets; ALB-side metrics keep `LoadBalancer` only.

Validation
This PR contributes 31 resources to the stage stack:
- [...] `UnHealthyHostCount`, `HealthyHostCount` × web + versioncheck)
- [...] `NumberOfNotificationsFailed`)

Resource count assumes the current stage config with one SNS email recipient; if the `atn/stage/monitoring_notify_emails` list grows, the subscription count grows accordingly.

While stacked on #378, `pulumi preview` shows `+ 36 to create / ~ 20 to update / - 3 to delete / +- 3 to replace / = 140 unchanged` (combined effect of both PRs against current stage state). Once #378 merges and this PR retargets to `stage`, `pulumi preview` for this PR should show only the 31 monitoring resources listed above.

`ruff check` and `ruff format --check` pass. Notification recipients are read from Secrets Manager, not committed in the public repo.

Prerequisites
- `atn/stage/monitoring_notify_emails` (comma-separated list of recipient addresses) must exist before `pulumi up`

Safety
- [...] `0` `min_tasks_per_service` lets operators drain a service without globally suppressing availability alarms

Post-deploy validation
To be run after `pulumi up`, before this baseline is treated as env-gating green:
- `pulumi preview` after #378 ("Introduce dedicated stage EFS; fix MQ broker drift and Memcached SG") merge / rebase shows no unintended service drain
- [...] `pulumi up`
- `ECS/ContainerInsights:RunningTaskCount` emits for web, worker, and versioncheck (verify by reading one CloudWatch datapoint)
- [...] `network_configuration` is explicit in code since PR #362 ("feat: Add ECS Fargate infrastructure and deployment configuration") but still needs runtime ENI confirmation)

Follow-up
[...] `RunningTaskCount`, external/shared resource monitoring (RDS, OpenSearch), secondary notification channel (SMS / Slack / separate topic) so SNS-side delivery failure doesn't silence itself, standalone triage runbook
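
For reviewers, here is a minimal sketch of three of the implementation notes above: the per-service `min_tasks` fallback, the per-metric `treat_missing_data` choice, and the ALB dimension split. It is written as plain Python without Pulumi. The function names, the dict-shaped config, and the exact metric-to-policy table are assumptions made for illustration; the actual code in `infra/pulumi/__main__.py` may be organised differently.

```python
from typing import Any, Dict, Mapping

# Hypothetical helpers: names and config shape are illustrative assumptions,
# not the actual code in infra/pulumi/__main__.py.


def resolve_min_tasks(ecs_cfg: Mapping[str, Any], service: str) -> int:
    """Return ecs.min_tasks_per_service.<svc>, else ecs.min_tasks, else 1."""
    overrides = ecs_cfg.get("min_tasks_per_service") or {}
    # Membership test rather than `or`, so an explicit 0 is respected.
    if service in overrides:
        return int(overrides[service])
    return int(ecs_cfg.get("min_tasks", 1))


# Assumed classification, using only metric names mentioned in this PR.
TREAT_MISSING_DATA: Dict[str, str] = {
    "HealthyHostCount": "breaching",              # missing data: resource is gone
    "RunningTaskCount": "breaching",              # missing data: resource is gone
    "HTTPCode_Target_5XX_Count": "notBreaching",  # sparse error count
    "TargetResponseTime": "notBreaching",         # traffic-dependent
}


def treat_missing_data(metric_name: str) -> str:
    """Look up the policy; unknown metrics must be classified explicitly."""
    if metric_name not in TREAT_MISSING_DATA:
        raise KeyError(f"no treat_missing_data policy for {metric_name!r}")
    return TREAT_MISSING_DATA[metric_name]


# Target-level ALB metrics named in the implementation notes.
TARGET_LEVEL_METRICS = frozenset({"HTTPCode_Target_5XX_Count", "TargetResponseTime"})


def alb_dimensions(metric_name: str, load_balancer: str, target_group: str) -> Dict[str, str]:
    """Target-level metrics get LoadBalancer + TargetGroup; others LoadBalancer only."""
    dims = {"LoadBalancer": load_balancer}
    if metric_name in TARGET_LEVEL_METRICS:
        dims["TargetGroup"] = target_group
    return dims


if __name__ == "__main__":
    cfg = {"min_tasks": 1, "min_tasks_per_service": {"worker": 0}}
    for svc in ("web", "worker", "versioncheck"):
        print(svc, resolve_min_tasks(cfg, svc))
```

The membership test matters for the drain case: `worker: 0` is falsy, so a fallback written as `overrides.get(service) or default` would silently restore the default of 1 during a drain window.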