Postmortem: Chrome Dev 78→79 broke PR and master runs #19362

foolip opened this issue Sep 27, 2019 · 6 comments

Owner: @stephenmcgruer
Postmortem Created: 2019-09-27 05:03 EST
Status: Published
Issue: #19297

Impact: When the Chrome Dev channel moved from 78 to 79, full runs on Taskcluster (for wpt.fyi) stopped working, and the infrastructure/ tests that PRs trigger began failing. This led to 2.5 days of complete outage for Chrome Dev results (until we pinned to the working 78), and delayed the upgrade from 78 to 79 by over a month.

Root Cause: A Chromium change to use X11 shared-memory capabilities did not work in the Taskcluster environment specifically (it worked correctly in all other known environments), causing the browser under test to crash.

Timeline in UTC:

Lessons Learnt

Things that went well

  • Despite the lack of dedicated tooling, bisecting Chromium builds on Taskcluster was actually fairly simple: bisect-builds.py prints a snapshot URL which can (with a small amount of hacking) be given to the run_tc.py script to use (see the sketch after this list).
  • Once the Chromium author was alerted, they were both eager to help fix it and able to quickly identify the likely problem.
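
To make the bisection workflow above concrete, here is a rough sketch. The bisect-builds.py flags shown are standard Chromium tooling; exactly how the snapshot URL gets handed to run_tc.py is the "small hacking" mentioned above, so treat that step as an assumption rather than the exact command that was used.

```sh
# Run Chromium's bisect script from a Chromium checkout (tools/bisect-builds.py),
# bisecting Linux x64 snapshot builds between a known-good and a known-bad revision.
python tools/bisect-builds.py -a linux64 -g <good-rev> -b <bad-rev>

# At each step the script downloads and reports a snapshot archive, e.g.
#   https://commondatastorage.googleapis.com/chromium-browser-snapshots/Linux_x64/<rev>/chrome-linux.zip
# That URL is what can be fed (with small hacking) into wpt's run_tc.py so that the
# Taskcluster task runs against that exact build instead of the installed Chrome Dev.
```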

Things that went poorly

  • We don't run Chrome Canary for wpt.fyi, so we did not spot the issue when it actually landed in Chromium.
  • It took 2.5 days and a custom PR to pin the Chrome release to M78.
  • Since pinning to M78 stopped the obvious bleeding, the work to upgrade to M79 was put to one side for nearly 2 weeks, resulting in wpt.fyi ultimately being behind Chrome Dev for over a month.
  • Running the WPT Docker image locally could not reproduce the bug, despite the fact that Docker is supposed to be a self-contained runtime. (It turns out that shared-memory privileges bleed through from the host; see the sketch after this list.)
  • The Chromium bug went un-addressed for almost a week, as the assignee was out of office.
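
To illustrate why a local Docker run was not hermetic here: X11 shared memory (the MIT-SHM capability the root-cause change started using) only works when the X client and server share an IPC namespace, so whether it is available inside a container depends on how the container is launched, not just on the image contents. The commands below are a hypothetical sanity check (assuming xdpyinfo is installed and using a placeholder image name), not the actual debugging steps from this incident.

```sh
# Inside the environment where the browser runs, ask the X server whether MIT-SHM is usable.
xdpyinfo -ext MIT-SHM

# A default container gets a private IPC namespace and a small (64 MB) /dev/shm ...
docker run --rm <wpt-image> df -h /dev/shm

# ... whereas sharing the host's IPC namespace exposes the host's /dev/shm instead,
# which is one way host shared-memory behaviour can "bleed through" into the container.
docker run --rm --ipc=host <wpt-image> df -h /dev/shm
```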

Where we got lucky

  • (No known entries)

Action Items

@foolip commented Sep 27, 2019

With infrastructure/ working again in #18830 (comment), I'll go around and tell people on PRs if they need to retry Taskcluster.

@foolip commented Sep 27, 2019

`is:pr is:open updated:>2019-09-24T00:00:00Z -status:success` is the search to go through...

@stephenmcgruer commented Nov 22, 2019

Eugh, this is why one should do postmortems as soon as possible afterwards. @foolip @Hexcles, can you take a look and see if any other action items come to mind for you? All the things I can think of seem a little 'one-off' as opposed to addressing a structural issue.

@foolip commented Nov 27, 2019

@stephenmcgruer I'd say a release process for Chrome+ChromeDriver Dev, so that web developers can rely on Chrome Dev for their testing if they like. What happened here is that we started getting the tip-of-tree chromedriver, and while the mismatch wasn't the cause of the breakage, we might have had more time if there had been a dev channel for chromedriver. The flip side, however, is that we would have noticed later.

@stephenmcgruer commented Nov 27, 2019

Sorry, it's not clear to me why having a release-processed chromedriver would have helped here. The bug was in chrome, not chromedriver, and we were able to pin chrome to avoid the problem. I didn't feel any time pressure (indeed, I think the lack of time pressure was itself a bug here that caused us to let this lie for too long). Can you explain further?

@foolip commented Nov 28, 2019

OK, so the mismatched chromedriver was merely a red herring in my initial investigation and not really part of the problem. Let's wait for it to actually break something before calling it out as an action item :)

By "might have had more time" I mean that if chromedriver breaks then having a chromedriver dev release channel would cause us to notice later.
