Configure health monitoring alerts #250

Closed
3 tasks done
adunkman opened this issue Jun 4, 2020 · 11 comments · Fixed by #650
adunkman commented Jun 4, 2020

As the Court, so that we can ensure we have a secure and available system, we need assurance that after code updates, known weak points do not leak sensitive data and the application remains available.

Acceptance criteria

Alerts are configured when:

  • The UI is unavailable.
  • The system health endpoints return red.
  • The Elasticsearch cluster has a non-green status.

adunkman changed the title from "Ensure smoke tests cover all critical paths." to "Ensure all critical paths are covered by either smoke tests or health monitoring." on Sep 3, 2020
vickimcfadden changed the title from "Ensure all critical paths are covered by either smoke tests or health monitoring." to "Configure health monitoring alerts" on Sep 22, 2020
adunkman commented Oct 1, 2020

Blocked by #387.

adunkman commented:
Had a quick chat with @julialeague about this to help set my head straight on a path forward — thanks Julia!!

There are a lot of things we could be looking at, but without knowing usage patterns well (#137), we can't know what "unusual" looks like. Therefore, I propose:

  • This issue tackles the obvious fires. These are things we know would mean an outage or a problem. They include:
    • The UI is unavailable. Monitored by alerting on uptime/ping testing for dawson.ustaxcourt.gov and app.dawson.ustaxcourt.gov (and the equivalent in other environments).
    • The system health endpoints return red. Monitored by alerting on the system health endpoints (as implemented in flexion/ef-cms#6281, "Tech Lead: Health and Configuration Endpoints").
    • The Elasticsearch cluster has a non-green status. Monitored by alerting on the cluster health status.

At a future point, once we know what "normal" traffic looks like, we can consider monitoring things like the following (one possible shape is sketched after the list):

  • Unexpected traffic volume.
  • Abnormal login or new-signup rates.
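Purely as a post-MVP illustration, a CloudWatch anomaly-detection alarm is one way to encode "unusual" once a baseline exists. A minimal Terraform sketch; the namespace, metric, dimension, and band width here are all hypothetical:

```hcl
# Alarm against a learned band instead of a fixed threshold.
# Namespace/metric/dimension values below are hypothetical.
resource "aws_cloudwatch_metric_alarm" "traffic_anomaly" {
  alarm_name          = "unexpected-traffic-volume"
  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "band"

  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(requests, 2)"
    label       = "expected request volume (2 std devs)"
    return_data = true
  }

  metric_query {
    id          = "requests"
    return_data = true

    metric {
      namespace   = "AWS/ApiGateway"
      metric_name = "Count"
      period      = 300
      stat        = "Sum"

      dimensions = {
        ApiName = "public-api" # hypothetical API name
      }
    }
  }
}
```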

adunkman commented:
> • Configured S3 buckets are not publicly available.
> • Elasticsearch cannot be accessed directly.

Speaking to these points: given how often these change, I think they'd result in fragile tests. Considering that the "pass" state is that a URL is inaccessible, I think we'd quickly be asserting things which were no longer of help (for example, that a nonsense URL was inaccessible).

Instead, I think we might want to consider introducing a security scanner for these. GitHub's super-linter seems to be a good option; it uses tflint to catch formatting and known linter problems, and terrascan to identify security risks.

I’ll file a new issue for these, for post-MVP.

adunkman commented:
> The UI is unavailable. Monitored by alerting on uptime/ping testing for dawson.ustaxcourt.gov and app.dawson.ustaxcourt.gov (and the equivalent in other environments).

I believe this can be accomplished with Route53 health checks which would trigger a CloudWatch alarm.
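A rough Terraform sketch of that wiring (the resource names, thresholds, and the us_east_1 provider alias are my assumptions, not anything in the repo yet):

```hcl
# Route53 probes the UI over HTTPS from multiple AWS regions.
resource "aws_route53_health_check" "ui" {
  fqdn              = "dawson.ustaxcourt.gov"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/"
  failure_threshold = 3  # consecutive failures before "unhealthy"
  request_interval  = 30 # seconds between probes
}

# Route53 publishes HealthCheckStatus (1 = healthy, 0 = unhealthy)
# to CloudWatch in us-east-1 only, hence the assumed provider alias.
resource "aws_cloudwatch_metric_alarm" "ui_unavailable" {
  provider            = aws.us_east_1
  alarm_name          = "ui-unavailable"
  namespace           = "AWS/Route53"
  metric_name         = "HealthCheckStatus"
  statistic           = "Minimum"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1

  dimensions = {
    HealthCheckId = aws_route53_health_check.ui.id
  }
}
```

An equivalent pair of resources would cover app.dawson.ustaxcourt.gov and the other environments.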

> The system health endpoints return red. Monitored by alerting on the system health endpoints (as implemented in flexion#6281).

This can be achieved by a Route53 health check as well, hitting the health check endpoint on the public API. Unfortunately, we no longer have a set URL for the public API — it will be either https://public-api-green.dawson.ustaxcourt.gov/public-api/health or https://public-api-blue.dawson.ustaxcourt.gov/public-api/health. I’ll need to file an issue (closely related to flexion#6864) and have it fixed before I can fully implement this health check.
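Once a stable URL exists, the same mechanism extends to the health endpoint, and a string-match check can assert on the response body rather than just a 200. A sketch, with the hostname and expected body both placeholders pending that issue:

```hcl
# HTTPS_STR_MATCH marks the check unhealthy unless search_string
# appears in the first 5120 bytes of the response body, so a 200
# that reports "red" still fails the check.
resource "aws_route53_health_check" "public_api_health" {
  fqdn              = "public-api.dawson.ustaxcourt.gov" # placeholder URL
  port              = 443
  type              = "HTTPS_STR_MATCH"
  resource_path     = "/public-api/health"
  search_string     = "\"overall\":true" # assumed shape of the health JSON
  failure_threshold = 3
  request_interval  = 30
}
```

The alarm side would be identical to the UI check above: a HealthCheckStatus alarm in us-east-1.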

> The Elasticsearch cluster has a non-green status. Monitored by alerting on the cluster health status.

Already handled by a CloudWatch metric, and we can add an alarm.
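For reference, a sketch of that alarm; the domain name is a placeholder, while the namespace, metric, and dimensions are the documented ones for the managed Elasticsearch service:

```hcl
data "aws_caller_identity" "current" {}

# The managed Elasticsearch service emits ClusterStatus.red (and
# ClusterStatus.yellow) as 0/1 gauges in the AWS/ES namespace,
# dimensioned by domain name and owning account.
resource "aws_cloudwatch_metric_alarm" "es_status_red" {
  alarm_name          = "elasticsearch-cluster-red"
  namespace           = "AWS/ES"
  metric_name         = "ClusterStatus.red"
  statistic           = "Maximum"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1

  dimensions = {
    DomainName = "efcms-search" # placeholder domain name
    ClientId   = data.aws_caller_identity.current.account_id
  }
}
```

A second alarm on ClusterStatus.yellow would cover the rest of the non-green states.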

adunkman commented:
CloudWatch Alarms looks like a natural clearinghouse for status, and it uses SNS for notifications.
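Sketched in Terraform, with the topic name and address as placeholders:

```hcl
resource "aws_sns_topic" "health_alerts" {
  name = "health-alerts" # placeholder name
}

# Email subscriptions stay "pending" until the recipient clicks the
# confirmation link SNS sends them.
resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.health_alerts.arn
  protocol  = "email"
  endpoint  = "oncall@example.gov" # placeholder address
}
```

Each alarm would then set alarm_actions = [aws_sns_topic.health_alerts.arn] (and ok_actions, if we want recovery notices).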

adunkman commented:
Speaking with @mmarcotte on direction here — we’ll use a simple SNS configuration for now and go with the Route53 approach.

We know we have a blind spot: if AWS has a catastrophic outage, we will not be notified (because the notification system may also be offline). I'll file an issue to consider using an external notification service like Opsgenie post-MVP.

adunkman commented:
Reported flexion#6903 to get a single API endpoint for the system health JSON.

adunkman commented Oct 30, 2020

After speaking w/ Mike: he'd like the API endpoint covered by health alerts in the first pass as well. I misunderstood; updating the description!

adunkman commented Nov 5, 2020

Latest Elasticsearch alarms are in https://github.com/ustaxcourt/ef-cms/compare/add-es-alarms; blocked on running account-specific terraform steps (communication in Slack).

adunkman commented Dec 2, 2020

flexion#6903 is completed by flexion#7177, awaiting PR to the Court.

adunkman commented Dec 3, 2020

Awaiting #608.
