Update error and usage alarms #377


Merged
merged 8 commits into from
Sep 27, 2022

molly-moen
Contributor

This PR makes the following updates:

  • Remove the high usage alarms. These did not provide a useful signal, and high usage is already covered by the high concurrent executions alarm.
  • Update the high concurrent executions alarm to a threshold of 400 concurrent executions, sustained for 10 minutes, on a single build and run lambda. Previously it alarmed at 50 concurrent executions across all lambdas, which was too low. The alarm is evaluated per minute: if we see 400 concurrent executions at least once in each of 10 consecutive minutes, it alarms and pages the DOTD.
  • Update the elevated severe error rate and elevated error rate alarms to page the DOTD instead of posting to Slack. These have been up and running for a week now with no issues.
  • Include a link to the DOTD debug steps in the error rate and high concurrent executions alarms.
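
To illustrate the evaluation rule for the high concurrent executions alarm, here is a minimal sketch in plain Python. It is not the alarm definition from this PR: the function name `should_page`, the input shape (one peak value per minute), and the use of `>=` rather than `>` for the comparison are all assumptions made for illustration. Only the 400 threshold and the 10-minute window come from the description above.

```python
from typing import Sequence

THRESHOLD = 400          # concurrent executions (from the PR description)
EVALUATION_MINUTES = 10  # consecutive one-minute periods (from the PR description)


def should_page(per_minute_max: Sequence[int],
                threshold: int = THRESHOLD,
                window: int = EVALUATION_MINUTES) -> bool:
    """Return True if each of the last `window` minutes peaked at or above `threshold`.

    `per_minute_max` is a hypothetical input: the maximum concurrent executions
    seen in each one-minute period for a single lambda, oldest first.
    Comparing with >= (rather than >) is an assumption.
    """
    if len(per_minute_max) < window:
        return False  # not enough data to fill a full evaluation window
    return all(peak >= threshold for peak in per_minute_max[-window:])
```

Under these assumptions, `should_page([400] * 10)` is `True`, while a single minute that dips below the threshold, as in `should_page([400] * 9 + [399])`, keeps the alarm from firing.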

Testing

Tested on a dev instance.

@molly-moen molly-moen requested review from cat5inthecradle and a team September 22, 2022 17:02
@molly-moen molly-moen merged commit 4d4e029 into main Sep 27, 2022
@molly-moen molly-moen deleted the molly-update-alarm branch September 27, 2022 18:27
2 participants