Configure health monitoring alerts #250

adunkman · 2020-06-04T23:39:08Z

As the Court, so that we can ensure we have a secure and available system, we need assurance that after code updates, known weak points are not vulnerable to leak sensitive data and the application is available.

Acceptance criteria

Alerts are configured when:

The UI is unavailable. Monitored by alerting on uptime/ping testing for dawson.ustaxcourt.gov and app.dawson.ustaxcourt.gov (and the equivalent in other environments). Implemented in Add alarms for UIs being unavailable. #497.
The system health endpoints return red. Monitored by alerting on the system health endpoints (as implemented in Tech Lead: Health and Configuration Endpoints flexion/ef-cms#6281). Blocked on BUG: Public API endpoint unknown to health monitoring alerts. flexion/ef-cms#6903. Implemented in Add alert for the application’s status endpoint. #650.
The Elasticsearch cluster has a non-green status. Monitored by alerting on the cluster health status. Implemented in Add Cloudwatch Alarms for Elasticsearch clusters. #539.

adunkman · 2020-10-01T14:18:57Z

Blocked by #387.

adunkman · 2020-10-22T15:33:09Z

Had a quick chat with @julialeague about this to help set my head straight on a path forward — thanks Julia!!

There are a lot of things we could be looking at, but without knowing usage patterns well (#137), we can’t know what unusual looks like. Therefore, I propose:

This issue tackles the obvious fires. These are things we know would mean an outage or a problem. They include:
- The UI is unavailable. Monitored by alerting on uptime/ping testing for dawson.ustaxcourt.gov and app.dawson.ustaxcourt.gov (and the equivalent in other environments).
- The system health endpoints return red. Monitored by alerting on the system health endpoints (as implemented in Tech Lead: Health and Configuration Endpoints flexion/ef-cms#6281).
- The Elasticsearch cluster has a non-green status. Monitored by alerting on the cluster health status.

At a future point, once we know what "normal" traffic looks like, we can consider monitoring things like:

Unexpected traffic volume.
Login or new signup rates abnormal.

adunkman · 2020-10-22T17:22:35Z

Configured S3 buckets are not publicly available.

Elasticsearch cannot be accessed directly.

Speaking to these points, given the rate of change of these, I think they’d result in fragile tests. Considering that the "pass" state is a URL is inaccessible, I think we’d quickly be asserting things which were no longer of help (for example, that a nonsense URL was inaccessible).

Instead, I think we might want to consider introducing a security scanner for these. GitHub’s super-linter seems to be a good option, which uses tflint for identifying formatting/known linter problems and terrascan for identifying security risks.

I’ll file a new issue for these, for post-MVP.

adunkman · 2020-10-23T16:12:37Z

The UI is unavailable. Monitored by alerting on uptime/ping testing for dawson.ustaxcourt.gov and app.dawson.ustaxcourt.gov (and the equivalent in other environments).

I believe this can be accomplished with Route53 health checks which would trigger a CloudWatch alarm.

The system health endpoints return red. Monitored by alerting on the system health endpoints (as implemented in flexion#6281).

This can be achieved by a Route53 health check as well, hitting the health check endpoint on the public API. Unfortunately, we no longer have a set URL for the public API — it will be either https://public-api-green.dawson.ustaxcourt.gov/public-api/health or https://public-api-blue.dawson.ustaxcourt.gov/public-api/health. I’ll need to file an issue (closely related to flexion#6864) and have it fixed before I can fully implement this health check.

The Elasticsearch cluster has a non-green status. Monitored by alerting on the cluster health status.

Already handled by a CloudWatch metric, and we can add an alarm.

adunkman · 2020-10-23T16:13:26Z

CloudWatch Alarms looks like a natural clearing house for status, and it uses SNS for notifications.

adunkman · 2020-10-23T16:26:02Z

Speaking with @mmarcotte on direction here — we’ll use a simple SNS configuration for now and go with the Route53 approach.

We know we have a blindspot that if AWS has a catastrophic outage, we will not be notified (because the notification system may also be offline), and I’ll file an issue to consider using an external notification service like Opsgenie for post-MVP.

adunkman · 2020-10-26T22:45:27Z

Reported flexion#6903 to get a single API endpoint for the system health JSON.

adunkman · 2020-10-30T18:10:35Z

~~Speaking w/ Mike, he’d like the API endpoint covered by health alerts for the first pass as well. Updating the description!~~ I misunderstood.

adunkman · 2020-11-05T15:27:06Z

Latest Elasticsearch alarms are in https://github.com/ustaxcourt/ef-cms/compare/add-es-alarms, blocked on running account-specific terraform steps, communication in Slack.

adunkman · 2020-12-02T15:48:30Z

flexion#6903 is completed by flexion#7177, awaiting PR to the Court.

adunkman · 2020-12-03T15:04:35Z

Awaiting #608.

adunkman mentioned this issue Jul 2, 2020

Determine a strategy for user management in Cognito pools. #226

Closed

adunkman changed the title ~~Ensure smoke tests cover all critical paths.~~ Ensure all critical paths are covered by either smoke tests or health monitoring. Sep 3, 2020

vickimcfadden changed the title ~~Ensure all critical paths are covered by either smoke tests or health monitoring.~~ Configure health monitoring alerts Sep 22, 2020

vickimcfadden assigned adunkman Sep 22, 2020

vickimcfadden mentioned this issue Oct 6, 2020

Load test the Court's migration environment. #138

Closed

4 tasks

This was referenced Oct 20, 2020

Application monitoring #317

Closed

Load test production environment #472

Closed

adunkman mentioned this issue Oct 22, 2020

Consider using a security scanner to catch insecure infrastructure configuration. #477

Closed

This was referenced Oct 26, 2020

Add alarms for UIs being unavailable. #497

Merged

BUG: Public API endpoint unknown to health monitoring alerts. flexion/ef-cms#6903

Closed

adunkman mentioned this issue Nov 12, 2020

Add Cloudwatch Alarms for Elasticsearch clusters. #539

Merged

adunkman mentioned this issue Dec 14, 2020

Add alert for the application’s status endpoint. #650

Merged

mmarcotte closed this as completed in #650 Dec 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Configure health monitoring alerts #250

Configure health monitoring alerts #250

adunkman commented Jun 4, 2020 •

edited

Loading

adunkman commented Oct 1, 2020

adunkman commented Oct 22, 2020

adunkman commented Oct 22, 2020

adunkman commented Oct 23, 2020

adunkman commented Oct 23, 2020

adunkman commented Oct 23, 2020

adunkman commented Oct 26, 2020

adunkman commented Oct 30, 2020 •

edited

Loading

adunkman commented Nov 5, 2020

adunkman commented Dec 2, 2020

adunkman commented Dec 3, 2020

Configure health monitoring alerts #250

Configure health monitoring alerts #250

Comments

adunkman commented Jun 4, 2020 • edited Loading

adunkman commented Oct 1, 2020

adunkman commented Oct 22, 2020

adunkman commented Oct 22, 2020

adunkman commented Oct 23, 2020

adunkman commented Oct 23, 2020

adunkman commented Oct 23, 2020

adunkman commented Oct 26, 2020

adunkman commented Oct 30, 2020 • edited Loading

adunkman commented Nov 5, 2020

adunkman commented Dec 2, 2020

adunkman commented Dec 3, 2020

adunkman commented Jun 4, 2020 •

edited

Loading

adunkman commented Oct 30, 2020 •

edited

Loading