Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postmortem: wpt.fyi status checks not showing up on WPT repo #1661

Open
3 of 6 tasks
stephenmcgruer opened this issue Nov 22, 2019 · 14 comments
Open
3 of 6 tasks

Postmortem: wpt.fyi status checks not showing up on WPT repo #1661

stephenmcgruer opened this issue Nov 22, 2019 · 14 comments
Assignees
Labels
postmortem This issue is a postmortem for some outage

Comments

@stephenmcgruer
Copy link
Contributor

stephenmcgruer commented Nov 22, 2019

Owner: @stephenmcgruer
Postmortem Created: 2019-11-22 09:57 EST
Status: Published
Issue: #1660

Impact: Approximately 20.5 hours of PRs to WPT did not have status checks run on them to report the change in pass/fail rate of affected tests. 34 PRs were merged during this time. Any affected PR would not have Safari or Edge results uploaded to wpt.fyi.

Root Cause: The wpt.fyi app's secret was changed inadvertently. This is believed to have been caused by Chrome's password manager auto-filling the secret field when the app name was changed.

Timeline

  • 2019-11-21 12:58 EST (approx): We change the name of the wpt.fyi app. At the same time, we believe, the app's webhook secret token was inadvertently changed.
  • 2019-11-21 12:58 EST: The first call to /api/webhook/check hits a 500 error for payload signature check failed. This goes unnoticed.
  • OUTAGE BEGINS
  • 2019-11-22 04:59 EST: @foolip notices a PR is missing the wpt.fyi status checks, and asks @stephenmcgruer to investigate.
  • 2019-11-22 07:32 EST: @stephenmcgruer wakes up and determines that the problem occurs to more than one PR.
  • 2019-11-22 07:45 EST: Tracking issue opened. 500 errors are discovered in wpt.fyi logs.
  • 2019-11-22 07:55 EST: Not knowing what secret to look for, @stephenmcgruer attempts to replace the datastore secret with the wpt.fyi app's "client secret".
  • 2019-11-22 08:06 EST: Attempted fix confirmed not to work, and datastore is reverted.
  • 2019-11-22 08:24 EST: @stephenmcgruer attempts to investigate further, but is stymied by lack of experience and documentation.
  • 2019-11-22 09:43 EST: @Hexcles wakes up, and is able to point @stephenmcgruer to the correct secret to change to match the datastore.
  • 2019-11-22 09:45 EST: Suspected root cause found when @stephenmcgruer updates the secret and discovers that his Chrome password manager was also trying to override it at the same time.
  • 2019-11-22 09:49 EST: First non-500-ing call to /api/webhook/check is logged on the server.
  • OUTAGE ENDS
  • 2019-11-22 10:21 EST: @stephenmcgruer confirms that checks are showing up in PRs again, example.

Lessons Learnt

Things that went well

  • The server logs from wpt.fyi were reasonably clear in what had failed (signature mismatch). The debugging engineer did have to do a Google search for the error string, however, to fully understand.
  • Once alerted to the problem, @Hexcles was able to quickly point out where the secret for /api/webhook/check was set in GitHub.

Things that went poorly

  • There is no monitoring of 500 errors on the wpt.fyi GCP project. As such, the spike in 500 errors went unnoticed.
  • There is otherwise also no monitoring of 'expected' checks for a PR to WPT.
  • /api/webhook/check is undocumented, which meant the engineer debugging was not sure what even calls that endpoint.
  • GitHub apps do not show up in the repository's list of webhooks (they have their own special location), and the engineer debugging was not aware of that.

Where we got lucky

  • @foolip was quick to notice when a change they reviewed did not have the wpt.fyi status checks. Without their sharp eyes, it is unclear when we would have realized checks were missing.

Action Items

  • File bug on Chrome Password Manager (type=prevent, owner=@Hexcles)
  • File bug on GitHub regarding hidden form (type=prevent, owner=@stephenmcgruer)
  • Setup monitoring for 500 errors from wpt.fyi (type=detect, owner=@Hexcles)
  • Write design doc for monitoring 'expected' checks on WPT PRs (type=detect, owner=???)
  • Document /api/webhook/check (type=mitigate, owner=@stephenmcgruer)
  • Determine how to re-run checks for PRs that missed it? (type=repair, owner=???) - CANCELLED
@stephenmcgruer stephenmcgruer added the postmortem This issue is a postmortem for some outage label Nov 22, 2019
@stephenmcgruer stephenmcgruer self-assigned this Nov 22, 2019
@Hexcles
Copy link
Member

Hexcles commented Nov 22, 2019

Action item 1: filed a bug to Chrome Password Manager: https://crbug.com/1027556

@stephenmcgruer
Copy link
Contributor Author

@foolip @Hexcles postmortem should be ready for review, PTAL

@Hexcles
Copy link
Member

Hexcles commented Nov 22, 2019

Setup monitoring for 500 errors from wpt.fyi (type=detect, owner=@Hexcles)

Done via Stackdriver. Tentatively set the threshold to 0.02 5XX responses per second (~= 1 error per minute).

@stephenmcgruer
Copy link
Contributor Author

File bug on GitHub regarding hidden form (type=prevent, owner=@stephenmcgruer)

GitHub does not have a public issue tracker. Sent email to support@github.com with details of problem.

@Hexcles
Copy link
Member

Hexcles commented Nov 22, 2019

GitHub does not have a public issue tracker. Sent email to support@github.com with details of problem.

I stumbled upon https://github.community/ just now. Not sure if this is their UserVoice or issue tracker or both.

@stephenmcgruer
Copy link
Contributor Author

From that page:

If you want to submit a feature request or feedback about either the GitHub Community Forum or GitHub itself, please use our contact form.

(The contact form is equivalent to emailing support@github.com afaik)

@Hexcles
Copy link
Member

Hexcles commented Nov 25, 2019

Setup monitoring for 500 errors from wpt.fyi (type=detect, owner=@Hexcles)

Follow-up: turns out Stackdriver cannot send emails to groups, so we have to put individual emails there.

@stephenmcgruer
Copy link
Contributor Author

GitHub does not have a public issue tracker. Sent email to support@github.com with details of problem.

Follow-up: I received an email from support saying they will pass it onto the engineering team.

@foolip
Copy link
Member

foolip commented Nov 27, 2019

There is otherwise also no monitoring of 'expected' checks for a PR to WPT

@stephenmcgruer I think this warrants an action item as it would catch problems wherever in the chain they occur, and we could also monitor Taskcluster and Azure Pipelines on PRs with this.

@foolip
Copy link
Member

foolip commented Nov 27, 2019

Other than that, this postmortem is great, LGTM!

@stephenmcgruer
Copy link
Contributor Author

@stephenmcgruer I think this warrants an action item as it would catch problems wherever in the chain they occur, and we could also monitor Taskcluster and Azure Pipelines on PRs with this.

Added action item to write a design doc for monitoring 'expected' checks on WPT PRs.

@stephenmcgruer
Copy link
Contributor Author

In terms of the repair task:

Determine how to re-run checks for PRs that missed it

At this point I don't believe we're going to do this (most of the affected PRs have landed anyway), so marking it as such.

@stephenmcgruer
Copy link
Contributor Author

We have two out-standing AIs here:

Write design doc for monitoring 'expected' checks on WPT PRs (type=detect, owner=???)
Document /api/webhook/check (type=mitigate, owner=@stephenmcgruer)

The former is on our 2020 OKRs and related to the productionization effort being led by @LukeZielinski , so I think we can expect that to happen this year. The latter is still on me; I'll try to get that done soon so we can close this out :)

@Hexcles
Copy link
Member

Hexcles commented Mar 12, 2020

Re-assigning to folks with action items

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
postmortem This issue is a postmortem for some outage
Projects
None yet
Development

No branches or pull requests

4 participants