The monitor directory can contain two files: monitor/alarmsets.yaml and monitor/logging.yaml. These files contain CloudWatch Alarm and CloudWatch Agent Log Source configuration. These alarms and log sources are grouped into named sets, and sets of alarms and logs can be applied to resources.
Currently only CloudWatch is supported, but support for other monitoring and logging services is intended in the future.
Alarm Sets are defined in the file monitor/alarmsets.yaml.
AlarmSets are named to match a Paco Resource type, then a unique AlarmSet name.
# AutoScalingGroup alarms
ASG:
  launch-health:
    GroupPendingInstances-Low:
      # alarm config here ...
    GroupPendingInstances-Critical:
      # alarm config here ...
# Application LoadBalancer alarms
LBApplication:
  instance-health:
    HealthyHostCount-Critical:
      # alarm config here ...
  response-latency:
    TargetResponseTimeP95-Low:
      # alarm config here ...
    HTTPCode_Target_4XX_Count-Low:
      # alarm config here ...
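The naming scheme (resource type, then AlarmSet name, then alarm name) can be walked like any nested mapping. The following is a minimal sketch in Python, assuming the file has already been parsed into plain dictionaries. The `iter_alarms` helper and the `alarmsets` variable are hypothetical and not part of Paco.

```python
# Hypothetical parsed form of monitor/alarmsets.yaml:
# resource type -> AlarmSet name -> alarm name -> alarm config.
alarmsets = {
    "ASG": {
        "launch-health": {
            "GroupPendingInstances-Low": {},
            "GroupPendingInstances-Critical": {},
        },
    },
    "LBApplication": {
        "instance-health": {"HealthyHostCount-Critical": {}},
        "response-latency": {
            "TargetResponseTimeP95-Low": {},
            "HTTPCode_Target_4XX_Count-Low": {},
        },
    },
}


def iter_alarms(config):
    """Yield (resource_type, alarm_set_name, alarm_name) for every alarm."""
    for resource_type, alarm_sets in config.items():
        for alarm_set_name, alarms in alarm_sets.items():
            for alarm_name in alarms:
                yield resource_type, alarm_set_name, alarm_name


if __name__ == "__main__":
    for resource_type, alarm_set_name, alarm_name in iter_alarms(alarmsets):
        print(f"{resource_type}/{alarm_set_name}/{alarm_name}")
```

An AlarmSet name only needs to be unique within its resource type, which is why the resource type is kept as part of each yielded tuple.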
The base Alarm schema contains fields to add additional metadata to alarms. For CloudWatchAlarms, this metadata is set in the AlarmDescription field as JSON.
Alarms can have different contexts, which increases the amount of metadata that is populated in the AlarmDescription field:
- Global context. Only has base context. e.g. a CloudTrail log alarm.
- NetworkEnvironment context. Base and NetworkEnvironment context. e.g. a VPC flow log alarm.
- Application context alarm. Base, NetworkEnvironment and Application contexts. e.g. an external HTTP health check alarm.
- Resource context alarm. Base, NetworkEnvironment, Application and Resource contexts. e.g. an AutoScalingGroup CPU alarm.
Base context for all alarms
----------------------------
"project_name": Project name
"project_title": Project title
"account_name": Account name
"alarm_name": Alarm name
"classification": Classification
"severity": Severity
"topic_arns": SNS Topic ARN subscriptions
"description": Description (only if supplied)
"runbook_url": Runbook URL (only if supplied)
NetworkEnvironment context alarms
---------------------------------
"netenv_name": NetworkEnvironment name
"netenv_title": NetworkEnvironment title
"env_name": Environment name
"env_title": Environment title
"envreg_name": EnvironmentRegion name
"envreg_title": EnvironmentRegion title
Application context alarms
--------------------------
"app_name": Application name
"app_title": Application title
Resource context alarms
-----------------------
"resource_group_name": Resource Group name
"resource_group_title": Resource Group title
"resource_name": Resource name
"resource_title": Resource title
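To make the layering of contexts concrete, here is a minimal Python sketch that merges the context fields listed above into one JSON document, as it might appear in an AlarmDescription. The `build_alarm_description` function and all field values (project name, ARN, and so on) are made up for illustration; this is not Paco's own code.

```python
import json


def build_alarm_description(base, netenv=None, app=None, resource=None):
    """Merge the contexts that apply to an alarm and serialize them as JSON."""
    metadata = dict(base)
    for context in (netenv, app, resource):
        if context:
            metadata.update(context)
    return json.dumps(metadata)


# Hypothetical values; only the key names come from the lists above.
base = {
    "project_name": "myproject",
    "project_title": "My Project",
    "account_name": "prod",
    "alarm_name": "CPUTotal",
    "classification": "performance",
    "severity": "critical",
    "topic_arns": ["arn:aws:sns:us-west-2:123456789012:cloud_ops"],
}
netenv = {
    "netenv_name": "mynet",
    "netenv_title": "My Network",
    "env_name": "prod",
    "env_title": "Production",
    "envreg_name": "us-west-2",
    "envreg_title": "US West (Oregon)",
}
app = {"app_name": "app", "app_title": "Application"}
resource = {
    "resource_group_name": "site",
    "resource_group_title": "Site",
    "resource_name": "web",
    "resource_title": "Web servers",
}

# A Resource context alarm carries all four contexts.
description = build_alarm_description(base, netenv, app, resource)

if __name__ == "__main__":
    print(description)
```

A Global context alarm would call the same function with only `base`, which yields the shortest description.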
Alarms can be set in the monitoring: field for Application and Resource objects. The name of each AlarmSet should be listed in the alarm_sets: field. It is possible to override the individual fields of an Alarm in a netenv file.
environments:
  prod:
    title: "Production"
    default:
      enabled: true
      applications:
        app:
          monitoring:
            enabled: true
            alarm_sets:
              special-app-alarms:
          groups:
            site:
              resources:
                alb:
                  monitoring:
                    enabled: true
                    alarm_sets:
                      core:
                      performance:
                        # Override the SlowTargetResponseTime Alarm threshold field
                        SlowTargetResponseTime:
                          threshold: 2.0
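The effect of overriding a single Alarm field can be pictured as a recursive merge of the netenv values over the AlarmSet definition. The Python sketch below shows that idea under the assumption that fields not mentioned in the override keep their original values. `apply_overrides` is a hypothetical helper, not a Paco function.

```python
import copy


def apply_overrides(base, overrides):
    """Return a copy of base with override values merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Fields as they might be defined in the "performance" AlarmSet.
performance = {
    "SlowTargetResponseTime": {
        "severity": "low",
        "threshold": 1.5,
        "period": 60,
    },
}
# Override taken from the netenv example: only the threshold changes.
netenv_override = {"SlowTargetResponseTime": {"threshold": 2.0}}

effective = apply_overrides(performance, netenv_override)

if __name__ == "__main__":
    print(effective["SlowTargetResponseTime"])
```

Because the merge works on a deep copy, the AlarmSet definition itself stays unchanged and can be reused by other environments.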
Stylistically, monitoring and alarm_sets can be specified in the base applications: section in a netenv file, and set to enabled: false. Then only the production environment can override the enabled field to true. This makes it easy to enable a dev or test environment if you want to test alarms before using them in a production environment.
Alternatively, you may wish to specify the monitoring only in the environments: section of your netenv file for production, and keep the base applications: configuration shorter.
Alarm notifications tell alarms which SNS Topics to notify. Alarm notifications are set with the notifications: field at the Application, Resource, AlarmSet and Alarm level.
applications:
  app:
    enabled: true
    # Application level notifications
    notifications:
      ops_team:
        groups:
        - cloud_ops
    groups:
      site:
        resources:
          web:
            monitoring:
              # Resource level notifications
              notifications:
                web_team:
                  groups:
                  - web
              alarm_sets:
                instance-health-cwagent:
                  notifications:
                    # AlarmSet notifications
                    alarmsetnotif:
                      groups:
                      - misterteam
                  SwapPercent-Low:
                    # Alarm level notifications
                    notifications:
                      singlealarm:
                        groups:
                        - oneguygetsthis
Notifications can be filtered for specific severity and classification levels. This allows you to direct critical severity to one group and low severity to another, or to send only performance classification alarms to one group and security classification alarms to another.
notifications:
  severe_security:
    groups:
    - security_group
    severity: 'critical'
    classification: 'security'
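The filtering described above can be sketched as a simple predicate. This Python example assumes that an unset or empty filter matches every alarm and that a set filter must match the alarm's value exactly; Paco's actual matching rules may differ. The `notification_applies` function is hypothetical.

```python
def notification_applies(notification, alarm):
    """Return True if the alarm passes the notification's optional filters."""
    for field in ("severity", "classification"):
        wanted = notification.get(field)
        # Assumption: a missing or empty filter places no restriction.
        if wanted and alarm.get(field) != wanted:
            return False
    return True


severe_security = {
    "groups": ["security_group"],
    "severity": "critical",
    "classification": "security",
}
unfiltered = {"groups": ["cloud_ops"]}

critical_security_alarm = {"severity": "critical", "classification": "security"}
low_performance_alarm = {"severity": "low", "classification": "performance"}

if __name__ == "__main__":
    print(notification_applies(severe_security, critical_security_alarm))
    print(notification_applies(severe_security, low_performance_alarm))
    print(notification_applies(unfiltered, low_performance_alarm))
```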
Note that although you can configure multiple SNS Topics to subscribe to a single alarm, CloudWatch has a maximum limit of five SNS Topics that a given alarm may be subscribed to.
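A configuration check for this limit could look like the following Python sketch. It takes the limit of five from the sentence above; the `check_topic_limit` helper and the example ARNs are hypothetical, and Paco may handle this situation differently.

```python
MAX_SNS_TOPICS_PER_ALARM = 5  # limit stated in the text above


def check_topic_limit(alarm_name, topic_arns):
    """Return the de-duplicated topic ARNs, or raise if there are too many."""
    unique = sorted(set(topic_arns))
    if len(unique) > MAX_SNS_TOPICS_PER_ALARM:
        raise ValueError(
            f"Alarm {alarm_name!r} has {len(unique)} SNS Topics, "
            f"maximum is {MAX_SNS_TOPICS_PER_ALARM}"
        )
    return unique


if __name__ == "__main__":
    # Made-up ARNs for illustration only.
    arns = [f"arn:aws:sns:us-west-2:123456789012:team{i}" for i in range(6)]
    print(check_topic_limit("SwapPercent-Low", arns[:5]))
    try:
        check_topic_limit("SwapPercent-Low", arns)
    except ValueError as error:
        print(error)
```

Duplicates are removed first because the same notification group can be listed at more than one level.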
It is also possible to write a Paco add-on that overrides the default CloudWatch notifications and instead notifies a single SNS Topic. This is intended to allow you to write an add-on that directs all alarms through a single Lambda (regardless of account or region), which is then responsible for delivering or taking action on alarms.
Currently Global and NetworkEnvironment alarms are only supported through Paco add-ons.
App:
  special-app-alarms:
    CustomMetric:
      description: "Custom metric has been triggered."
      classification: health
      severity: low
      metric_name: "custom_metric"
      period: 86400 # 1 day
      evaluation_periods: 1
      threshold: 1
      comparison_operator: LessThanThreshold
      statistic: Average
      treat_missing_data: breaching
      namespace: 'CustomMetric'
LBApplication:
  core:
    HealthyHostCount-Critical:
      classification: health
      severity: critical
      description: "Alert if fewer than X number of backend hosts are passing health checks"
      metric_name: "HealthyHostCount"
      dimensions:
        - name: LoadBalancer
          value: paco.ref netenv.wa.applications.ap.groups.site.resources.alb.fullname
        - name: TargetGroup
          value: paco.ref netenv.wa.applications.ap.groups.site.resources.alb.target_groups.ap.fullname
      period: 60
      evaluation_periods: 5
      statistic: Minimum
      threshold: 1
      comparison_operator: LessThanThreshold
      treat_missing_data: breaching
  performance:
    SlowTargetResponseTime:
      severity: low
      classification: performance
      description: "Average HTTP response time is unusually slow"
      metric_name: "TargetResponseTime"
      period: 60
      evaluation_periods: 5
      statistic: Average
      threshold: 1.5
      comparison_operator: GreaterThanOrEqualToThreshold
      treat_missing_data: missing
      dimensions:
        - name: LoadBalancer
          value: paco.ref netenv.wa.applications.ap.groups.site.resources.alb.fullname
        - name: TargetGroup
          value: paco.ref netenv.wa.applications.ap.groups.site.resources.alb.target_groups.ap.fullname
    HTTPCode4XXCount:
      classification: performance
      severity: low
      description: "Large number of 4xx HTTP error codes"
      metric_name: "HTTPCode_Target_4XX_Count"
      period: 60
      evaluation_periods: 5
      statistic: Sum
      threshold: 100
      comparison_operator: GreaterThanOrEqualToThreshold
      treat_missing_data: notBreaching
    HTTPCode5XXCount:
      classification: performance
      severity: low
      description: "Large number of 5xx HTTP error codes"
      metric_name: "HTTPCode_Target_5XX_Count"
      period: 60
      evaluation_periods: 5
      statistic: Sum
      threshold: 100
      comparison_operator: GreaterThanOrEqualToThreshold
      treat_missing_data: notBreaching
ASG:
  core:
    StatusCheck:
      classification: health
      severity: critical
      metric_name: "StatusCheckFailed"
      namespace: AWS/EC2
      period: 60
      evaluation_periods: 5
      statistic: Maximum
      threshold: 0
      comparison_operator: GreaterThanThreshold
      treat_missing_data: breaching
    CPUTotal:
      classification: performance
      severity: critical
      metric_name: "CPUUtilization"
      namespace: AWS/EC2
      period: 60
      evaluation_periods: 30
      threshold: 90
      statistic: Average
      treat_missing_data: breaching
      comparison_operator: GreaterThanThreshold
  cwagent:
    SwapPercentLow:
      classification: performance
      severity: low
      metric_name: "swap_used_percent"
      namespace: "CWAgent"
      period: 60
      evaluation_periods: 5
      statistic: Maximum
      threshold: 80
      comparison_operator: GreaterThanThreshold
      treat_missing_data: breaching
    DiskSpaceLow:
      classification: health
      severity: low
      metric_name: "disk_used_percent"
      namespace: "CWAgent"
      period: 300
      evaluation_periods: 1
      statistic: Minimum
      threshold: 60
      comparison_operator: GreaterThanThreshold
      treat_missing_data: breaching
    DiskSpaceCritical:
      classification: health
      severity: low
      metric_name: "disk_used_percent"
      namespace: "CWAgent"
      period: 300
      evaluation_periods: 1
      statistic: Minimum
      threshold: 80
      comparison_operator: GreaterThanThreshold
      treat_missing_data: breaching
  # CloudWatch Log Alarms
  log-alarms:
    CfnInitError:
      type: LogAlarm
      description: "CloudFormation Init Errors"
      classification: health
      severity: critical
      log_set_name: 'cloud'
      log_group_name: 'cfn_init'
      metric_name: "CfnInitErrorMetric"
      period: 300
      evaluation_periods: 1
      threshold: 1.0
      treat_missing_data: notBreaching
      comparison_operator: GreaterThanOrEqualToThreshold
      statistic: Sum
    CodeDeployError:
      type: LogAlarm
      description: "CodeDeploy Errors"
      classification: health
      severity: critical
      log_set_name: 'cloud'
      log_group_name: 'codedeploy'
      metric_name: "CodeDeployErrorMetric"
      period: 300
      evaluation_periods: 1
      threshold: 1.0
      treat_missing_data: notBreaching
      comparison_operator: GreaterThanOrEqualToThreshold
      statistic: Sum
    WsgiError:
      type: LogAlarm
      description: "HTTP WSGI Errors"
      classification: health
      severity: critical
      log_set_name: 'ap'
      log_group_name: 'httpd_error'
      metric_name: "WsgiErrorMetric"
      period: 300
      evaluation_periods: 1
      threshold: 1.0
      treat_missing_data: notBreaching
      comparison_operator: GreaterThanOrEqualToThreshold
      statistic: Sum
    HighHTTPTraffic:
      type: LogAlarm
      description: "High number of http access logs"
      classification: performance
      severity: low
      log_set_name: 'ap'
      log_group_name: 'httpd_access'
      metric_name: "HttpdLogCountMetric"
      period: 300
      evaluation_periods: 1
      threshold: 1000
      treat_missing_data: ignore
      comparison_operator: GreaterThanOrEqualToThreshold
      statistic: Sum
RDSMysql:
  basic-database:
    CPUTotal-Low:
      classification: performance
      severity: low
      metric_name: "CPUUtilization"
      namespace: AWS/RDS
      period: 300
      evaluation_periods: 6
      threshold: 90
      comparison_operator: GreaterThanOrEqualToThreshold
      statistic: Average
      treat_missing_data: breaching
    FreeableMemoryAlarm:
      classification: performance
      severity: low
      metric_name: "FreeableMemory"
      namespace: AWS/RDS
      period: 300
      evaluation_periods: 1
      threshold: 100000000
      comparison_operator: LessThanOrEqualToThreshold
      statistic: Minimum
      treat_missing_data: breaching
    FreeStorageSpaceAlarm:
      classification: performance
      severity: low
      metric_name: "FreeStorageSpace"
      namespace: AWS/RDS
      period: 300
      evaluation_periods: 1
      threshold: 5000000000
      comparison_operator: LessThanOrEqualToThreshold
      statistic: Minimum
      treat_missing_data: breaching
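The comparison_operator, threshold and evaluation_periods fields used throughout these alarms interact in a way that a short sketch can illustrate. The Python example below is a deliberately simplified model: it assumes one already-aggregated value per period, requires every one of the most recent periods to breach, and ignores treat_missing_data entirely. It is not how CloudWatch is implemented, and `is_breaching` is a hypothetical name.

```python
import operator

# The four operators listed in the CloudWatchAlarm schema.
COMPARISONS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def is_breaching(alarm, datapoints):
    """Return True if the last evaluation_periods datapoints all breach."""
    compare = COMPARISONS[alarm["comparison_operator"]]
    periods = alarm["evaluation_periods"]
    recent = datapoints[-periods:]
    if len(recent) < periods:
        return False  # not enough data in this simplified model
    return all(compare(value, alarm["threshold"]) for value in recent)


# Field values copied from the SlowTargetResponseTime alarm.
slow_response = {
    "threshold": 1.5,
    "evaluation_periods": 5,
    "comparison_operator": "GreaterThanOrEqualToThreshold",
}

if __name__ == "__main__":
    print(is_breaching(slow_response, [1.2, 1.6, 1.7, 1.5, 2.0, 1.9]))
    print(is_breaching(slow_response, [1.6, 1.7, 1.4, 2.0, 1.9]))
```

With a period of 60 seconds and five evaluation periods, this alarm would need roughly five consecutive minutes of slow responses before it fires.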
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
A container of Alarm objects.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
resource_type | String | Resource type | Must be a valid AWS resource type |
Base Schemas Named, Notifiable, Title
A Paco Alarm.
This is a base schema which defines metadata useful to categorize an alarm.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
classification | String (required) | Classification | Must be one of: 'performance', 'security' or 'health' | unset |
description | String | Description | ||
notification_groups | List<String> | List of notification groups the alarm is subscribed to. | ||
runbook_url | String | Runbook URL | ||
severity | String | Severity | Must be one of: 'low', 'critical' | low |
Base Schemas Deployable, Named, Notifiable, Title
A dimension of a metric
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
name | String | Dimension name | ||
value | PacoReference or String | String or a Paco Reference to resource output. | Paco Reference to Interface. String Ok. | |
Container for AlarmNotification objects.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
Alarm Notification
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
classification | String | Classification filter | Must be one of: 'performance', 'security', 'health' or ''. | |
groups | List<String> (required) | List of groups | ||
severity | String | Severity filter | Must be one of: 'low', 'critical' |
A Simple CloudWatch Alarm
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
actions_enabled | Boolean | Actions Enabled | ||
alarm_description | String | Alarm Description | Valid JSON document with Paco fields. | |
comparison_operator | String | Comparison operator | Must be one of: 'GreaterThanThreshold','GreaterThanOrEqualToThreshold', 'LessThanThreshold', 'LessThanOrEqualToThreshold' | |
dimensions | List<Dimension> | Dimensions | ||
evaluation_periods | Int | Evaluation periods | ||
metric_name | String (required) | Metric name | ||
namespace | String | Namespace | ||
period | Int | Period in seconds | ||
statistic | String | Statistic | ||
threshold | Float | Threshold |
Container for MetricFilter objects.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
Metric filter
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
filter_pattern | String | Filter pattern | ||
metric_transformations | List<MetricTransformation> | Metric transformations |
Metric Transformation
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
default_value | Float | The value to emit when a filter pattern does not match a log event. | ||
metric_name | String (required) | The name of the CloudWatch Metric. | ||
metric_namespace | String | The namespace of the CloudWatch metric. If not set, the namespace used will be 'AIM/{log-group-name}'. | ||
metric_value | String (required) | The value that is published to the CloudWatch metric. | ||
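How filter_pattern, metric_value and default_value relate can be sketched as follows. This Python example follows the field descriptions in the tables above and makes two loud assumptions: the filter pattern is treated as a plain substring (real CloudWatch Logs filter patterns have a richer syntax), and metric_value is a numeric literal. The `metric_values` function is hypothetical.

```python
def metric_values(log_events, filter_pattern, transformation):
    """Return the values a metric filter would publish, one per log event."""
    values = []
    for event in log_events:
        if filter_pattern in event:
            # Matching event: publish metric_value (a string in the schema).
            values.append(float(transformation["metric_value"]))
        elif transformation.get("default_value") is not None:
            # Non-matching event: publish default_value if one is set.
            values.append(float(transformation["default_value"]))
    return values


events = [
    "ERROR cfn-init failed",
    "INFO cfn-init finished",
    "ERROR could not fetch file",
]
with_default = {"metric_name": "CfnInitErrorMetric", "metric_value": "1", "default_value": 0.0}
without_default = {"metric_name": "CfnInitErrorMetric", "metric_value": "1"}

if __name__ == "__main__":
    print(metric_values(events, "ERROR", with_default))
    print(metric_values(events, "ERROR", without_default))
```

Summing the published values over a period gives the number that a LogAlarm with statistic Sum would compare against its threshold.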
A set of metrics to collect and an optional collection interval:
- name: disk
  measurements:
    - free
  collection_interval: 900
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
collection_interval | Int | Collection interval | ||
drop_device | Boolean | Drops the device name from disk metrics | True | |
measurements | List<String> | Measurements | ||
name | String | Metric(s) group name | ||
resources | List<String> | List of resources for this metric |
CloudWatch Logging configuration
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
log_sets | Container<CloudWatchLogSets> | A CloudWatchLogSets container |
Base Schemas CloudWatchLogRetention, Named, Title
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
expire_events_after_days | String | Expire Events After. Retention period of logs in this group |
Container for CloudWatchLogSet objects.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
A set of Log Group objects
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
log_groups | Container<CloudWatchLogGroups> | A CloudWatchLogGroups container |
Base Schemas CloudWatchLogRetention, Named, Title
Container for CloudWatchLogGroup objects.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
A CloudWatchLogGroup is responsible for retention, access control and metric filters
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
log_group_name | String | Log group name. Can override the LogGroup name used from the name field. | ||
metric_filters | Container<MetricFilters> | Metric Filters | ||
sources | Container<CloudWatchLogSources> | A CloudWatchLogSources container |
Base Schemas CloudWatchLogRetention, Named, Title
A container of CloudWatchLogSource objects.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
Log source for a CloudWatch agent.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|
encoding | String | Encoding | utf-8 | |
log_stream_name | String (required) | Log stream name | CloudWatch Log Stream name | |
multi_line_start_pattern | String | Multi-line start pattern | ||
path | String (required) | Path | Must be a valid filesystem path expression. Wildcard * is allowed. | |
timestamp_format | String | Timestamp format | ||
timezone | String | Timezone | Must be one of: 'Local', 'UTC' | Local |
Base Schemas CloudWatchLogRetention, Named, Title
Container for Route53HealthCheck objects.
Field name | Type | Purpose | Constraints | Default |
---|---|---|---|---|