Meta: use tests and implementation bugs to shorten time to interoperability #1849

Open
foolip opened this Issue Oct 3, 2016 · 36 comments


@foolip
Member
foolip commented Oct 3, 2016 edited

One year ago, I removed navigator.appCodeName and similar from workers, while ignoring @domenic's suggestion to write web-platform-tests. Today, we still don't have interop, and it even looks like things have moved in the direction of exposing more things.
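
For illustration, a test for that kind of change could be a short dedicated-worker test along these lines (a minimal sketch using the testharness.js `.worker.js` convention; the file name and assertion are illustrative, not an existing test in the suite):

```js
// navigator-appcodename.worker.js (illustrative sketch, not an actual test in the suite)
importScripts("/resources/testharness.js");

test(() => {
  // If appCodeName is removed from workers, it should not appear on WorkerNavigator.
  assert_false("appCodeName" in navigator,
               "appCodeName should not be exposed on WorkerNavigator");
}, "navigator.appCodeName is not exposed in dedicated workers");

done();
```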

This is not an unusual situation: small changes are easily overlooked by those who don't participate and easily forgotten by those who do. When a change is a (textually) minor change in prose, it's even more likely to go unnoticed than an IDL change, which can at least in principle be tracked.

We can do better, and sometimes do: #1233 (comment)

The main goal of a standard is interop between implementations, so to help reach this goal more consistently and possibly faster, I propose a number of changes over time:

  • Require normative changes to come with tests.
  • Learn which implementations fail those tests.
  • File issues for those implementations.
  • Track those issues over time, revisiting spec issues if necessary.

For all but the first step, tools and infrastructure could help immensely.

@foolip foolip added the spec tooling label Oct 3, 2016
@foolip
Member
foolip commented Oct 3, 2016

@domenic @annevk @zcorpan @sideshowbarker, what do you think of this overall direction? I'd like to prepare a PR for the first step, which I think will be useful in itself. There would be no requirement that the same people write/review the spec and test change, just that there are in fact tests. I think we should not require pass/fail status in implementations before we have some infrastructure for it, which is something @RByers and I have talked about.

@annevk
Member
annevk commented Oct 3, 2016

I think making sure issues are filed against implementations is actually more important. Implementations typically have the resources to provide tests for any changes they make too and are already moving in the direction of putting those in WPT.

There are also a couple of areas around navigation, document lifecycles, browsing contexts, etc. that are rather hard to write tests for. I'm not entirely sure what to do there.

@annevk
Member
annevk commented Oct 3, 2016

But I'd also like to see that automated in some manner. My main concern is that we already lack the resources to define everything properly; putting more constraints on editors is not going to help with that.

@sideshowbarker
Member

As somebody who’s spent time trying to help make sure WPT is as useful as it can be, I’d love to see more high-value tests going into it—and to see those tests helping to get better interoperability among engines. This is related to something that @RByers brought up during the WPT discussion we had at TPAC—about ensuring that for WPT we’re making it a priority to focus the effort on real interoperability problems.

So the proposal in the issue description sounds like something that could help with getting some good stuff into WPT and in focusing efforts on improving interoperability.

Actually I think the last three items/checkboxes listed in the issue description are things that anybody should already be doing when they write WPT for existing features with known interop problems. The one thing that’s not is the “Require normative changes to come with tests” item.

On that I agree with @annevk about it maybe being not so helpful to put further constraints on editors (and other contributors too). But I hope there’s some way to try it out with at least a few issues and see whether it ends up being too much of an additional PITA or not.

@foolip
Member
foolip commented Oct 3, 2016

I had begun to prepare #1850 before any of the above comments, submitting it as a reference for discussion.

@domenic
Member
domenic commented Oct 3, 2016

In general I support this direction strongly. This seems like the single biggest thing we can do to boost our effectiveness as editors, as well as increasing the credibility of the specification.

I think we should discuss softening the requirement in two potential ways:

  • Features that are hard to write tests for, such as navigation, as Anne mentioned. Often these can be managed, but sometimes it's just extremely hard or requires manual tests or similar. Maybe we want to allow exceptions for that sort of thing.
  • Large features or integrations, such as custom elements. Although these are becoming increasingly rare in HTML, I think we should recognize that when they occur, requiring the editor to write a comprehensive test suite is too much of a burden in addition to their job of writing a comprehensive spec, and we should instead look toward building a test suite as a collaborative effort between the editors and the engineering teams of browser vendors.

I'd envision these as ways in which you could petition for an exception, reminiscent of the Blink launch process's various "we would prefer if you had X, but if there's a good reason, explain it and the API owners may grant you your wish". (Where X is something like "positive signals from other vendors" or "sufficient devtools support" or similar.) The baseline would still be requiring tests, so it's either "tests or a good explanation as to why it is unreasonable to provide them." And in this case you'd of course be expected to file a bug on WPT at least.

@RByers
RByers commented Oct 3, 2016 edited

This sounds great, thank you! FWIW the Pointer Events working group has been mostly operating in such a mode for the past 6 months and it's been working alright there - although that's obviously a much simpler space than all of HTML. In practice it has ended up being the browser engineer who writes the spec pull requests, test pull requests, and browser implementation changes, and the editors primarily do the reviews.

I agree with Domenic that there's no need for absolutism here. I see this as moving to a culture that's exactly like the testing culture in any healthy software project: there's an expectation that changes either show how they are adequately tested or explain why testing isn't worth the cost. Most changes tend to land with tests, and a reviewer who forgets to ask where the tests are is seen as not upholding their responsibility to product quality. But nobody is absolutist about it - it's a judgement call. It's often tempting to skip writing good tests because it feels like a hassle and often takes much longer than making the product change. But that just accrues technical debt, which in the long run leads to severe product quality issues (which I'd argue the web is definitely suffering from). If somebody wants to do some rapid prototyping or other iteration that's not production-quality, they can do that in a branch. But when the focus is actually on building a great product (not just on shipping your features as fast as possible), then what's in master is expected to be production-quality, and production-quality product changes are always thoughtful about appropriate testing.

I disagree that impl bugs are more valuable than tests. How will someone even know which browser to file the bugs on if they haven't tested the current behavior? How will engine developers decide how to prioritize these bugs? How do we reduce the risk of browser engineers failing to understand what was really intended by long, complex spec text? Instead, all the engines are already prioritizing investigation of web-platform-tests which other browsers pass and they fail. Fundamentally I see this as the age-old spec question of what matters more: a beautiful specification document or running code? High quality tests are the number one protection against the lure of spec fiction.

To help ease the burden this new process places on editors/reviewers, I can commit to funding some tooling improvements. In particular, I'm exploring having a tool for web-platform-tests PRs which automatically runs the changed tests (with and without the patch) on a number of browsers and surfaces the results directly in the PR. I'm open to other suggestions for what tools Google should fund.

@foolip
Member
foolip commented Oct 3, 2016

To expand a bit on the possible tools/infrastructure:

  • Learn which implementations fail those tests: #1849 (comment)
  • File issues for those implementations: Based on the results of the above, buttons to directly file implementation issues, with the option to edit the title/description.
  • Track those issues over time, revisiting spec issues if necessary: The simplest might be a cron job to check the status of open issues filed using the above method, highlighting the oldest issues (a rough sketch follows below).
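
As a rough sketch of that tracking piece (the tracked issue list and repository names are hypothetical; this only works for trackers with a public JSON API such as GitHub's, and the browsers' own bug trackers would need their own adapters):

```js
// check-tracked-issues.js: hypothetical periodic check of implementation issues
// filed from spec changes, listing the oldest ones that are still open.
const tracked = [
  // Illustrative entry only; real tracking data would live elsewhere.
  { repo: "example-org/engine-bugs", number: 123 },
];

async function main() {
  const open = [];
  for (const { repo, number } of tracked) {
    // GitHub's REST API returns state and created_at for an issue.
    const res = await fetch(`https://api.github.com/repos/${repo}/issues/${number}`);
    const issue = await res.json();
    if (issue.state === "open") {
      open.push({ repo, number, created: new Date(issue.created_at) });
    }
  }
  open.sort((a, b) => a.created - b.created);
  for (const { repo, number, created } of open) {
    console.log(`${repo}#${number} still open since ${created.toISOString()}`);
  }
}

main();
```
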
@foolip
Member
foolip commented Oct 3, 2016

Regarding "we already lack the resources to define everything properly," I agree there's some risk of shifting the burden of testing onto editors.

Bad scenario: an editor notices a 3 vs. 1 split in behavior where the shortest path to interop is clearly the majority behavior. No implementer is interested, so the editor has to write the test and file an issue, but it's never fixed.

Good scenario: an implementer finds a bug in an algorithm and submits a PR. Everyone agrees it makes sense, but since it's most urgent for the implementer in question, she does the web-platform-tests PR which is also used to test the code change. Another engine also fails the test, an issue is filed, and it's fixed before long.

I also agree that we should be pragmatic: for any case where up-front testing is not worth the effort, we can just make note of it and move on. There will still be some bad scenarios, but I suspect that will be outweighed by the benefits of understanding existing behavior before making a change.

Another thing that might help is if implementers prioritize web-platform-tests failures where they are the single failing engine.

@cdumez
Contributor
cdumez commented Oct 3, 2016

I also strongly support @foolip's proposal, including the part about making sure that normative changes to the specification are covered by web-platform-tests. Web platform tests have been extremely useful for improving WebKit's compliance with the specifications (and interoperability, by making it easier to test in other browsers). WebKit gets a lot of bug reports and triaging them is not always easy. It is also sometimes not easy to understand a specification change, and tests help a lot with this.

I am not saying that this applies to all features (some are hard to write tests for) but hopefully this would be the general rule, with exceptions. I think writing the tests does not necessarily have to be the editor's responsibility, especially in cases where the editor is specifying a feature suggested by developers of a particular engine. What matters is that changes to the specification get timely coverage in web-platform-tests.

Note that while working on running web-platform-tests, I have noticed that it is not uncommon for some of them to be out of date with the latest specification. Hopefully, this would be less likely to happen with the proper processes in place.

@foolip foolip changed the title from Process: use tests and implementation bugs to shorten time to interoperability to Meta: use tests and implementation bugs to shorten time to interoperability Oct 3, 2016
@foolip
Member
foolip commented Oct 3, 2016

I have softened the language in #1850 somewhat, to "For normative changes, you will typically be asked for a web-platform-tests PR testing the new behavior. If testing is not practical, please explain why and if appropriate file an issue to follow up later."

@annevk, is there anything you'd add to cover who is expected to write the tests? I think this is mostly a matter of getting the culture right, and that it's hard to spell out in much detail, but would like your thoughts.

@annevk
Member
annevk commented Oct 4, 2016

"A web-platform-tests PR is typically required before a normative PR can land." would be a little more neutral as to who is responsible.

@annevk
Member
annevk commented Oct 4, 2016 edited

In practice it has ended up being the browser engineer who writes the spec pull requests, test pull requests, and browser implementation changes, and the editors primarily do the reviews.

This is the nightmare scenario by the way.

@foolip
Member
foolip commented Oct 4, 2016

Latest wording is "For normative changes, a corresponding web-platform-tests PR is typically required. The author and reviewer can be different than for the spec change. If testing is not practical, please explain why and if appropriate file an issue to follow up later."

For cases where only one person is motivated enough to write the spec and test PRs, if it's non-trivial, I think it's very reasonable to ask someone else to review the tests.

@annevk
Member
annevk commented Oct 4, 2016

The other thing we need for this is to have generated copies of the standard post-PR. Otherwise reviewing the tests requires reviewing the change to the standard too. That would be good to have in general, also for reviewing the change to the standard, but this seems to make that more of a necessity.

@foolip
Member
foolip commented Oct 4, 2016

Disclosure: @annevk and I had a VC earlier today. There are some cases where requiring tests up front could be unhelpful:

  • An entirely new area, where writing a lot of tests without any implementation will scarcely result in quality tests.
  • Under-tested old areas, where testing the single thing you changed would require you to build the whole test infrastructure for the area – might be too demanding in some cases.
  • Changes that are hard to understand before they're merged into the spec, where testing and iterating after merge is more productive.

As long as we have some kind of tracking, it's fine to make exceptions for cases like this, and perhaps others.

So that we can learn as we go, I'm going to downgrade the language from "typically required" to "highly appreciated". Over the coming months, it'll be interesting to see what works and what does not. If the only outcome is that things that are trivial to test are tested, that's still an improvement.

@foolip
Member
foolip commented Oct 5, 2016

Step 1 landed, thanks all!

@inikulin
Member
inikulin commented Oct 5, 2016

Hi guys, what about parser-related tests?

@domenic
Member
domenic commented Oct 5, 2016

What about them? They'd be appreciated/encouraged for any changes to the parser, for sure :)

@foolip
Member
foolip commented Oct 5, 2016

@inikulin @gsnedders, are parser tests still maintained outside of web-platform-tests?

@gsnedders

@foolip Yes. @jgraham and I discussed this a while back, and I think he was happy to follow my suggestion that we ought to improve the situation for tests that require a build step in w-p-t first. Notably, we don't want the parser tests to become web-browser-specific, given that we sorely want more HTML parsers to follow the spec.
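
For context, a tree-construction test in html5lib-tests uses a plain-text format roughly like the following (a simplified example; real tests also list expected parse errors under #errors). Files in this format have to be compiled or wrapped before they can run as ordinary web-platform-tests, which is the build-step problem mentioned above.

```
#data
<p>One<p>Two
#errors
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
```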

@inikulin
Member
inikulin commented Oct 5, 2016

@domenic I mean, shouldn't we add a reference to them like it's done for WPT in 6408a8f?

@zcorpan
Member
zcorpan commented Oct 5, 2016

Yeah... But would it not be better to document that in web-platform-tests?

@foolip
Member
foolip commented Oct 5, 2016

How are browsers importing the html5lib-tests, if at all?

@gsnedders

@foolip WebKit and Blink both run html5lib-tests in mine and Philip's ancient runner (which is terrible), as well as via the imported compiled copies in web-platform-tests.

@domenic
Member
domenic commented Oct 5, 2016

To me it sounds like the solution is to keep the advice in the HTML Standard as-is, but add a README to web-platform-tests (possibly in a subdirectory of it) explaining that to get parsing tests into WPT, you need to go do something in some other repo?

@annevk
Member
annevk commented Oct 6, 2016

Yeah, and possibly explaining what needs to be solved in WPT before the tests can be moved there. (So that if someone wants to tackle the issue @gsnedders mentioned they have a starting point.)

@RByers
RByers commented Oct 7, 2016

In practice it has ended up being the browser engineer who writes the spec pull requests, test pull requests, and browser implementation changes, and the editors primarily do the reviews.

This is the nightmare scenario by the way.

We haven't seen it be much of a problem in practice on the Chrome team. Certainly a better situation than having specs that don't match any implementation (which happens ALL the time) or not having any interop tests (and so implementations deviating from the specs in many ways people don't realize).

@RByers
RByers commented Oct 7, 2016 edited

There are some cases where requiring tests up front could be unhelpful:
An entirely new area, where writing a lot of tests without any implementation will scarcely result in quality tests.

IMHO having text in the spec without any implementation anywhere indefinitely is unhelpful too. When no implementation exists for something in the spec, I'd argue that's a spec bug which should be tracked by an open issue. It's fine to land the spec changes prior to implementation, but our work as standards engineers isn't done until the spec actually reflects reality. So, to resolve the issue, either tests should land that pass on some implementation, or (if that doesn't happen in some reasonable timeframe) the spec text should be reverted to match implementations.

[Edited to clarify I'm not opposed to spec changes preceding implementation / testing]

@foolip
Member
foolip commented Oct 7, 2016

having text in the spec without any implementation anywhere indefinitely is unhelpful too

Happily, this is no longer on the menu for WHATWG specs, or at least we've been removing cases that have come to light. I don't know what the optimal (eventual-interop-wise) process for entirely new areas is, but probably some kind of parallel progression of implementation, spec and tests.

Big picture, I think we need to somehow track the progression of new features on the platform. "Track those issues over time, revisiting spec issues if necessary" is in that vein, but I think we need to experiment a bit. If we require either tests or bugs for missing tests, maybe working backwards from long-standing test failures and long-open test bugs would suffice.

@annevk
Member
annevk commented Oct 10, 2016

We haven't seen it be much of a problem in practice on the Chrome team.

Right, if one person/browser does all the things, you don't notice the problems either. That was the point.

In any event, it seems we largely agree here.

@foolip
Member
foolip commented Jan 11, 2017 edited

Survey time! It's now been just over 3 months since 6408a8f where we began "appreciating" tests with spec changes.

Some other WHATWG specs have been handled in the same way, so @annevk suggested on IRC that maybe the time has come to elevate this to a generic WHATWG process.

But first, how has this been working in practice, and what most needs improvement?

The most active mergers of PRs have been @domenic, @zcorpan, @annevk, and @sideshowbarker. The most active PR authors in addition have been @yuyokk, @jungkees, @junov, and @mikewest. I'd be very interested to hear how it's been on both sides.

For my own part, I've found it pretty much entirely positive. Knowing that others will now expect tests has prompted me to write some tests for older changes that I otherwise wouldn't have. aaf2435 and ad778cb were the two PRs I merged (I wrote one of them), and it worked well in both cases.

As for needed improvements, "Learn which implementations fail those tests" is definitely at the top of my list, and it's tooling that Blink's platform predictability effort will prioritize this quarter.

@zcorpan
Member
zcorpan commented Jan 14, 2017

I have been on both sides and think that it has worked really well. Tests and bugs speed up the turnaround time a lot. Without this, it could take years before a browser vendor picked up or objected to a change (even if someone from that browser had given an A-OK for the change). Now some changes have patches in browsers before the change has landed in the spec.

Writing tests also increases the quality of the spec, since problems become clear in the tests. It also seems reasonable to assume that the tests help in getting interoperable implementations, which is the goal of the spec.

wpt-stability-checker is very useful for seeing actual results (and whether the tests are stable); it would be nice to have Safari TP and Edge tested as well.

Some kind of bug-filer to file the same bug in several browsers could be useful, but we typically need to search for an existing bug anyway, and it's not that hard to just copy/paste. Also, typically one wants to pick a good component and maybe cc someone, so I'm not sure how much it would help.

All in all, I think the extra work is well worth it, and it is probably a smarter use of time. When you write a spec change you understand the problem, so you might as well write some tests before that understanding is swapped out and forgotten.

@domenic
Member
domenic commented Jan 17, 2017 edited

I also feel that this is working really really well. In addition to seconding everything @zcorpan wrote, I want to say that writing tests and filing bugs has increased my feeling of participation in the whole process. That is, I feel much more sure that when I make a spec change, I am doing all I can to get it implemented everywhere, in precisely the manner I intended.

One interesting thing that's come up is the work in URL and Streams where we maintain a "reference implementation" alongside the spec. The reference implementation is more or less a line-by-line transcription of the spec text. This allows us to update the reference implementation for any spec change and make sure it passes the new tests. For URL in particular this has caught several incorrect spec changes (where the spec text was wrong, even if the test updates were right). See e.g. whatwg/url#196 (comment) and w3c/web-platform-tests#4405 (comment). And for streams it has led to very high test coverage.

This kind of reference implementation approach requires a higher up-front cost and is only applicable in limited situations involving self-contained systems that are relatively easy to write from scratch. But I could see it extending to e.g. Encoding or MIME Sniffing in the future, or to specific self-contained parts of larger specs (e.g. mini parsers).
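
As a rough illustration of the style (not code from the actual URL or Streams reference implementations), a line-by-line transcription keeps each spec step visible as a comment above the statement that implements it, so spec changes and code changes can be reviewed side by side. The example below transcribes a deliberately tiny algorithm, stripping leading and trailing ASCII whitespace:

```js
// Hypothetical transcription of a small spec-style algorithm:
// "strip leading and trailing ASCII whitespace from input".
// ASCII whitespace is U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, and U+0020 SPACE.
const isASCIIWhitespace = c =>
  c === "\u0009" || c === "\u000A" || c === "\u000C" || c === "\u000D" || c === "\u0020";

function stripLeadingAndTrailingASCIIWhitespace(input) {
  // 1. Let start be the index of the first code point in input that is not
  //    ASCII whitespace, or input's length if there is none.
  let start = 0;
  while (start < input.length && isASCIIWhitespace(input[start])) {
    start += 1;
  }
  // 2. Let end be the index after the last code point in input that is not
  //    ASCII whitespace, or start if there is none.
  let end = input.length;
  while (end > start && isASCIIWhitespace(input[end - 1])) {
    end -= 1;
  }
  // 3. Return the code points of input from start to end.
  return input.slice(start, end);
}
```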

Regarding bug-filing improvements, I think the best thing we could do there is have some kind of "hub" page/wiki entry/markdown file that provides quick links to all the relevant trackers. It could also include advice on how to get yourself edit-bugs access for different trackers, or how to label things to get them maximum attention. Also perhaps a link to https://browser-issue-tracker-search.appspot.com/ to help you search for dupes. This kind of info is especially important for spreading the process beyond those of us who know how to do this already. For Web IDL, @tobie put together https://github.com/heycam/webidl/#filing-issues-elsewhere, which is a nice start.

@annevk
Member
annevk commented Jan 18, 2017

I was extremely skeptical and extremely wrong. This is indeed a great system.

I agree that it would be good to have more reference implementations to catch subtle/blatant bugs. (I want a different term for them though since other implementations should not use them as reference. The standard and tests should be used as reference always, ideally.)

@gsnedders

Historically, the opposition to reference implementations was two-fold: firstly, not wanting browser vendors to all just use a single implementation rather than implementing it independently; secondly, not wanting to end up with all the issues that come from disagreements between specification and implementation.

That said, there's been at least one notable example of specifying using executable code (SML in the ES4 drafts), which avoids both the first and the second risk (nobody's going to ship SML code!)… given much of what we're touching, is there much risk in using JS in the specification? Is there much risk of browsers just shipping that?

It becomes so much easier to start answering questions about specification interpretation and correctness when we have a specification in a formal language.
