Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wpt.fyi status checks are not showing up on WPT repo #1660

Closed
stephenmcgruer opened this issue Nov 22, 2019 · 20 comments
Closed

wpt.fyi status checks are not showing up on WPT repo #1660

stephenmcgruer opened this issue Nov 22, 2019 · 20 comments
Assignees
Labels
bug Something isn't working regression Something isn't working, but used to

Comments

@stephenmcgruer
Copy link
Contributor

See web-platform-tests/wpt#20386 (comment)

cc @foolip @Hexcles

Investigating now. I'm seeing 500 errors on /api/webhook/check for prod wpt.fyi, which may be related:

2019-11-22 04:42:08.264 PST payload signature check failed

@stephenmcgruer stephenmcgruer added bug Something isn't working regression Something isn't working, but used to labels Nov 22, 2019
@stephenmcgruer stephenmcgruer self-assigned this Nov 22, 2019
@stephenmcgruer
Copy link
Contributor Author

I would hazard a guess that the error comes from https://github.com/google/go-github/blob/master/github/messages.go#L201

@stephenmcgruer
Copy link
Contributor Author

My suspicion would be that when the app was changed for #1557 , this changed the secret in some way, but I'm not familiar with the flow there so that may be a red herring.

@stephenmcgruer
Copy link
Contributor Author

We read the secret from

secret, err := shared.GetSecret(ds, "github-check-webhook-secret")
:

secret, err := shared.GetSecret(ds, "github-check-webhook-secret")

Which reads it from the datastore.

@stephenmcgruer
Copy link
Contributor Author

The secret at https://github.com/organizations/web-platform-tests/settings/apps/wpt-fyi and the one stored in the datastore appear to be different. I'm going to try backing up the datastore entry, changing it, and seeing if that helps.

@stephenmcgruer
Copy link
Contributor Author

stephenmcgruer commented Nov 22, 2019

That does not appear to have helped. I am now suspecting that https://github.com/organizations/web-platform-tests/settings/apps/wpt-fyi is not the right secret; so much for the quick fix.

@stephenmcgruer
Copy link
Contributor Author

For now, I have restored the previous secret in the datastore.

@stephenmcgruer
Copy link
Contributor Author

For the life of me, I cannot figure out who actually sends requests to /api/webhook/check... it doesn't seem to be documented anywhere.

@stephenmcgruer
Copy link
Contributor Author

Ok, there's a webhook secret stored in https://github.com/organizations/web-platform-tests/settings/apps/wpt-fyi that I missed. Resetting that to the one in the datastore.

@stephenmcgruer
Copy link
Contributor Author

Also, I think we've found the cause. Chrome password manager is overriding bits of the settings page, which means that when @Hexcles changed the app name his password manager probably reset the secret too.

@stephenmcgruer
Copy link
Contributor Author

From the logs side, things are recovering. Looking for a WPT PR that has the checks showing up.

@foolip
Copy link
Member

foolip commented Nov 22, 2019

Thanks @stephenmcgruer for tackling this today!

@stephenmcgruer
Copy link
Contributor Author

I believe we're good here, e.g. see web-platform-tests/wpt#20402

Postmortem is well underway too.

@stephenmcgruer
Copy link
Contributor Author

This happened again (see web-platform-tests/wpt#20418 (comment)). I checked the app settings page, and for some reason the value of the webhook secret input was value='value=[secret]'... I've reset it to just value=[secret] and expect that will fix the 500s.

@Hexcles were you editing the wpt.fyi app again?

Also @Hexcles - did our alert trigger? /api/webhook/check was 500-ing like crazy...

@Hexcles
Copy link
Member

Hexcles commented Nov 24, 2019 via email

@Hexcles
Copy link
Member

Hexcles commented Nov 25, 2019

Also @Hexcles - did our alert trigger? /api/webhook/check was 500-ing like crazy...

Yes, but apparently Stackdriver failed to send emails to group aliases. Would you mind me adding your email directly, along with mine?

@stephenmcgruer
Copy link
Contributor Author

Sgtm. If we are worried about the corp filter blocking it, also happy to add my chromium or even personal email...

@Hexcles
Copy link
Member

Hexcles commented Nov 25, 2019

I might've got an explanation of what happened: I forgot to save the setting when trying to fix my screw-up last Friday -- remember I told you I accidentally did it again last Friday @stephenmcgruer ? I told you I fixed it, but looking at the open tabs of my workstation, I realized the settings tab was still open so chances are I did not actually click save... Sorry about that.

On the bright side, this has been a good test of our alerting setup -- and it didn't work.

@Hexcles
Copy link
Member

Hexcles commented Nov 25, 2019

There is an audit log for admin activities, but somehow it doesn't include changes to apps... Sigh.

https://github.com/organizations/web-platform-tests/settings/audit-log

@stephenmcgruer
Copy link
Contributor Author

Oh dear :(. These things happen though, and if I had more control over the app settings I would definitely have included changing that UI as part of the postmortem (adding a confirmation box, for one!).

Now that we seem to have alerting working (I got an email saying the outage was over); would we have continued to receive emails if we thought we had fixed it but it was still actually broken? (I.e. are there 'update' emails whilst an outage still continues?)

@Hexcles
Copy link
Member

Hexcles commented Nov 25, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Something isn't working, but used to
Projects
None yet
Development

No branches or pull requests

3 participants