Update error and usage alarms #377

Merged 8 commits on Sep 27, 2022
200 changes: 36 additions & 164 deletions cicd/3-app/javabuilder/template.yml.erb
@@ -541,134 +541,6 @@ Resources:
ForwardedValues: {QueryString: true}
ViewerProtocolPolicy: redirect-to-https

-HighConcurrentExecutionsAlarm:
-Type: AWS::CloudWatch::Alarm
-Properties:
-AlarmName: !Sub "${SubDomainName}_high_concurrent_executions"
-AlarmDescription: !Sub |
-This will page the DOTD if javabuilder usage exceeds 50 concurrent
-executions for 10 minutes. Occasional spikes are expected, but
-long-running high usage is an indication of an attack. Go to the
-following URLs and set reserved concurrency to 10 immediately
-<%JAVALAB_APP_TYPES.each do | name | -%>
-https://console.aws.amazon.com/lambda/home?region=${AWS::Region}#/functions/${BuildAndRunJava<%=name%>ProjectFunction}/edit/concurrency?tab=configure
-<%end -%>
-Then post in #ap-csa-dev.
-ActionsEnabled: true
-AlarmActions:
-- !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:CDO-Urgent"]
-EvaluationPeriods: 10
-DatapointsToAlarm: 10
-Threshold: 50
-ComparisonOperator: GreaterThanThreshold
-TreatMissingData: notBreaching
-Metrics:
-- Id: e1
-Label: Concurrent Executions Across All Lambdas
-ReturnData: true
-Expression: SUM(METRICS())
-<%{Theater: "m2", Neighborhood: "m3", Console: "m4"}.each do |name, id| -%>
-- Id: <%=id%>
-ReturnData: false
-MetricStat:
-Metric:
-Namespace: AWS/Lambda
-MetricName: ConcurrentExecutions
-Dimensions:
-- Name: FunctionName
-Value: !Ref BuildAndRunJava<%=name%>ProjectFunction
-Period: 60
-Stat: Maximum
-<%end -%>
-
-HighWebsocketConnectionsAlarm:
-Type: AWS::CloudWatch::Alarm
-Properties:
-AlarmName: !Sub "${SubDomainName}_high_websocket_connections"
-AlarmDescription: Significantly higher websocket connections than normal detected. Investigate if there is a DDOS.
-ActionsEnabled: false
-EvaluationPeriods: 20
-DatapointsToAlarm: 20
-ComparisonOperator: GreaterThanUpperThreshold
-TreatMissingData: notBreaching
-Metrics:
-- Id: m1
-ReturnData: true
-MetricStat:
-Metric:
-Namespace: AWS/ApiGateway
-MetricName: ConnectCount
-Dimensions:
-- Name: Stage
-Value: !Sub "${StageName}"
-- Name: ApiId
-Value: !Ref WebSocketApi
-Period: 60
-Stat: Sum
-- Id: ad1
-Label: ConnectCount (expected)
-ReturnData: true
-Expression: ANOMALY_DETECTION_BAND(m1, 8)
-ThresholdMetricId: ad1
-
-HighHttpRequestsAlarm:
-Type: AWS::CloudWatch::Alarm
-Properties:
-AlarmName: !Sub "${SubDomainName}_high_http_requests"
-AlarmDescription: Significantly higher HTTP requests than normal detected.
-Investigate if there is a DDOS.
-ActionsEnabled: true
-OKActions: []
-AlarmActions: []
-InsufficientDataActions: []
-EvaluationPeriods: 20
-DatapointsToAlarm: 20
-ComparisonOperator: GreaterThanUpperThreshold
-TreatMissingData: notBreaching
-Metrics:
-- Id: m1
-ReturnData: true
-MetricStat:
-Metric:
-Namespace: AWS/ApiGateway
-MetricName: Count
-Dimensions:
-- Name: ApiId
-Value: !Ref HttpApi
-Period: 60
-Stat: Sum
-- Id: ad1
-Label: Count (expected)
-ReturnData: true
-Expression: ANOMALY_DETECTION_BAND(m1, 8)
-ThresholdMetricId: ad1
-
-HighUsageCompositeAlarm:
-Type: AWS::CloudWatch::CompositeAlarm
-DependsOn:
-- ConsoleHighInvocationsAlarm
-- HighHttpRequestsAlarm
-- HighWebsocketConnectionsAlarm
-- NeighborhoodHighInvocationsAlarm
-- TheaterHighInvocationsAlarm
-Properties:
-ActionsEnabled: true
-AlarmActions:
-# TODO: after we have run at high usage for a while, consider re-enabling this alarm. Right now it is too noisy
-# - !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:javabuilder-high-usage"]
-- !Ref AWS::NoValue
-AlarmDescription: Send message if abnormally high Javabuilder usage detected.
-Monitors usage across the HTTP API, WebSocket API, and all Build and Run
-Lambdas.
-AlarmName: !Sub "${SubDomainName}_high_usage_composite"
-AlarmRule: !Sub "ALARM(${SubDomainName}_console_high_invocations) OR
-ALARM(${SubDomainName}_high_http_requests) OR
-ALARM(${SubDomainName}_high_websocket_connections) OR
-ALARM(${SubDomainName}_neighborhood_high_invocations) OR
-ALARM(${SubDomainName}_theater_high_invocations)"
-InsufficientDataActions: []
-OKActions: []
-
<%JAVALAB_APP_TYPES.each do | name | -%>
<%{
TenPercentSevereErrorRateAlarm: {Threshold: 10, AlarmName: 'ten_percent_severe_error_rate'},
@@ -871,35 +743,6 @@ Resources:
Threshold: 2500
Period: 60

-<%=name%>HighInvocationsAlarm:
-Type: AWS::CloudWatch::Alarm
-Properties:
-AlarmName: !Sub "${SubDomainName}_<%=name.downcase%>_high_invocations"
-AlarmDescription: Significantly higher <%=name%> build and run invocations than
-normal detected. Investigate if there is a DDOS.
-ActionsEnabled: false
-EvaluationPeriods: 20
-DatapointsToAlarm: 20
-ComparisonOperator: GreaterThanUpperThreshold
-TreatMissingData: notBreaching
-Metrics:
-- Id: m1
-ReturnData: true
-MetricStat:
-Metric:
-Namespace: AWS/Lambda
-MetricName: Invocations
-Dimensions:
-- Name: FunctionName
-Value: !Ref BuildAndRunJava<%=name%>ProjectFunction
-Period: 60
-Stat: Sum
-- Id: ad1
-Label: Invocations (expected)
-ReturnData: true
-Expression: ANOMALY_DETECTION_BAND(m1, 8)
-ThresholdMetricId: ad1
-
<%=name%>MinimumUsageAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
@@ -932,7 +775,8 @@ Resources:
AlarmDescription: Alarm if Javabuilder severe error rate exceeds 10% every 5 minutes for 20
minutes and there are at least 100 requests every 5 minutes.
Occasional spikes are expected, but a sustained elevated severe error rate is an indication of an issue.
-Severe errors are generated and emitted by our code.
+Severe errors are generated and emitted by our code. Please follow the instructions in this document to mitigate
+https://docs.google.com/document/d/1bHvV6pvUcwxgZpw0YWBmxFggQL5KqYx9zwolwkZhjU8/edit#bookmark=id.2gh4dxmz643n
ActionsEnabled: true
AlarmActions:
- !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:Javabuilder-high-error-rate"]
@@ -954,10 +798,11 @@ Resources:
AlarmDescription: Alarm if Javabuilder severe error rate exceeds 90% every 5 minutes for 20
minutes and there are at least 100 requests every 5 minutes.
Occasional spikes are expected, but a sustained high severe error rate is an indication of an outage.
-Severe errors are generated and emitted by our code.
+Severe errors are generated and emitted by our code. Please follow the instructions in this document to mitigate
+https://docs.google.com/document/d/1bHvV6pvUcwxgZpw0YWBmxFggQL5KqYx9zwolwkZhjU8/edit#bookmark=id.2gh4dxmz643n
ActionsEnabled: true
AlarmActions:
-- !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:Javabuilder-high-error-rate"]
+- !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:CDO-Urgent"]
AlarmRule: !Sub "ALARM(${SubDomainName}_<%=name.downcase%>_ninety_percent_severe_error_rate) AND
ALARM(${SubDomainName}_<%=name.downcase%>_minimum_usage)"
InsufficientDataActions: []
@@ -974,7 +819,8 @@ Resources:
AlarmDescription: Alarm if Javabuilder severe error rate exceeds 25% every 5 minutes for 20
minutes and there are at least 100 requests every 5 minutes.
Occasional spikes are expected, but a sustained elevated error rate is an indication of an issue.
-Errors are generated by the Lambda system.
+Errors are generated by the Lambda system. Please follow the instructions in this document to mitigate
+https://docs.google.com/document/d/1bHvV6pvUcwxgZpw0YWBmxFggQL5KqYx9zwolwkZhjU8/edit#bookmark=id.2gh4dxmz643n
ActionsEnabled: true
AlarmActions:
- !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:Javabuilder-high-error-rate"]
@@ -996,15 +842,41 @@ Resources:
AlarmDescription: Alarm if Javabuilder error rate exceeds 90% every 5 minutes for 20
minutes and there are at least 100 requests every 5 minutes.
Occasional spikes are expected, but a sustained high error rate is an indication of an outage.
-Errors are generated by the Lambda system.
+Errors are generated by the Lambda system. Please follow the instructions in this document to mitigate
+https://docs.google.com/document/d/1bHvV6pvUcwxgZpw0YWBmxFggQL5KqYx9zwolwkZhjU8/edit#bookmark=id.2gh4dxmz643n
ActionsEnabled: true
AlarmActions:
-- !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:Javabuilder-high-error-rate"]
+- !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:CDO-Urgent"]
AlarmRule: !Sub "ALARM(${SubDomainName}_<%=name.downcase%>_ninety_percent_error_rate) AND
ALARM(${SubDomainName}_<%=name.downcase%>_minimum_usage)"
InsufficientDataActions: []
OKActions: []


+<%=name%>HighConcurrentExecutionsAlarm:
+Type: AWS::CloudWatch::Alarm
+Properties:
+AlarmName: !Sub "${SubDomainName}_<%=name.downcase%>_high_concurrent_executions"
+AlarmDescription: !Sub |
+Alarm if javabuilder usage exceeds 400 concurrent
+executions for 10 minutes. Occasional spikes are expected, but
+long-running high usage is an indication of an attack. Page the student learning
+team for further investigation. See this doc for investigation steps
+https://docs.google.com/document/d/1bHvV6pvUcwxgZpw0YWBmxFggQL5KqYx9zwolwkZhjU8/edit#bookmark=id.xs1gcuxrw6ze
+ActionsEnabled: true
+AlarmActions:
+- !If [SilenceAlertsCondition, !Ref AWS::NoValue, !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:CDO-Urgent"]
+EvaluationPeriods: 10
+DatapointsToAlarm: 10
+Period: 60
+Threshold: 400
+ComparisonOperator: GreaterThanThreshold
+TreatMissingData: notBreaching
+MetricName: ConcurrentExecutions
+Namespace: AWS/Lambda
+Statistic: Maximum
+Dimensions:
+- Name: FunctionName
+Value: !Ref BuildAndRunJava<%=name%>ProjectFunction
<%end -%>

# We use shortened versions of names for partition keys (eg, user_id),
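
For readers unfamiliar with the `.erb` templating, here is a minimal Ruby sketch of how a block like the new per-function alarm expands into one CloudFormation resource per app type. It is an illustration, not code from this PR: the value of `JAVALAB_APP_TYPES` is an assumption (the three names are taken from the hash in the removed `HighConcurrentExecutionsAlarm`), the template is a trimmed-down stand-in with guessed indentation, and `render_alarms` is a hypothetical helper.

```ruby
require 'erb'

# Assumption: the real JAVALAB_APP_TYPES constant is defined elsewhere in the
# repository; these names come from the hash in the removed aggregate alarm.
JAVALAB_APP_TYPES = %w[Theater Neighborhood Console].freeze

# Trimmed-down stand-in for the per-function alarm block (not the full resource).
TEMPLATE = <<~'YAML'
  <%JAVALAB_APP_TYPES.each do | name | -%>
  <%=name%>HighConcurrentExecutionsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${SubDomainName}_<%=name.downcase%>_high_concurrent_executions"
      Threshold: 400
  <%end -%>
YAML

# Hypothetical helper: render the template with '-' trim mode so that
# lines ending in "-%>" do not leave blank lines in the output.
def render_alarms
  ERB.new(TEMPLATE, trim_mode: '-').result(binding)
end

puts render_alarms
```

Running it prints three alarm resources, one each for Theater, Neighborhood, and Console, each with its own lowercase `AlarmName` suffix. That is the shape this PR moves to in place of the single alarm that summed concurrency across all three functions.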