Extra CSS testsuite requirements #10053
The status quo is that we have a number of lints to keep the CSS test harness (and the build system it relies on) working; however, these do not suffice to ensure everything keeps working.
Ideally, we'd get rid of all the extra requirements the test harness imposes (directly, and indirectly through the build system), as this appears to be the only achievable path to a common set of requirements across the entire repo (I think it's clear from prior discussions that extending the currently CSS-only ones to the whole repo is not a feasible option).
Things the build system relies on we don't lint for
Things I'm aware of missing that are required to keep the built output correct:
There are, as far as I'm aware, a number of ways to solve this:
As far as I'm aware:
Ways in which wpt.fyi doesn't suffice
When it comes to option 4, my understanding of the reasons why wpt.fyi doesn't currently work is (ignoring current infra issues like reliability of test runs and the fact we're currently running old versions of browsers):
It would be nice to know how important each of these is considered and whether it blocks moving away from the existing test harness.
I think there's rough consensus among many of us that keeping the extra rules isn't workable longer term, because it stops us from doing large-scale cleanup of the CSS testsuite (most obviously, getting rid of the large number of duplicate files within it, which currently all have to be changed simultaneously), and I think 4 above is the right option (2 and 3 seem particularly unlikely, given that despite the CSS WG caring significantly about the test harness, nobody has been paid to address any of its issues in a long time).
In the short term, I think at the very least we should make test.csswg.org run the build system only on css/ (currently this causes two tests to disappear:
There are some special things in the CSS tooling that one can easily overlook/forget/miss when writing/reviewing a new test:
This means we have to duplicate files (which is clearly not good), or put them in common folders (which is bad if the file is specific to a spec) and use absolute paths (which is bad because you wouldn't be able to just open the test without running the server).
I agree with @gsnedders that the path to follow seems to be 4., making https://wpt.fyi/ (or other tools) better so they cover the use cases of the CSS build system (such as running manual tests and storing results).
The key functionality that I've been relying on lately is the CSSWG test harness's ability to
As for the build system, I've mainly just used its indexes to help analyze spec coverage; and this sort of indexing system is easy to rebuild if necessary. In the past it was needed to convert XHTML<->HTML, since some CSS systems could only read XHTML files, but that seems to be less necessary now that the HTML parsing algorithm is well-specced and more widely implemented. At this point my main concern in this area would be people working with EPUB engines, and it might be worth talking to e.g. @dauwhe and his colleagues to see how they're using (or unable to use) our web platform tests.
The last concern I have is that the use of absolute paths requires the tests to be served up on a server, which makes WPT harder to use and develop locally. This doesn't affect folks who are comfortable with tooling much, but we have CSS test contributors who struggle with that. Hardly any of the CSS tests actually need a server, and so up until now it was possible to work with tests without setting up anything other than a browser and a text editor; but using absolute paths for basic resources like images or fonts changes that.
@gsnedders your solution #3 actually shouldn't be very hard. Removing the file format conversions and output restructuring from the build should be straightforward. The only thing the build would really need to produce would be the manifest files to feed the test harness. But getting the basic needs met by the normal wpt infrastructure would be my preference as well (#3 could be an interim step).
IIRC the reason we made the CSS build scan the entire repo was when the CSSOM tests got moved out of /css; looks like those are back in /css now (or never quite got moved).
@mrego you can use relative paths in the CSS tests for .css, .js, or image files; they just can't go up ('../') and need to be either in the same directory or in a './support' directory. There's no difference between handling of .css, .js, or images. The only relative paths that get fixed up currently are for reference files.
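The convention described above could plausibly be expressed as a lint rule. The sketch below is hypothetical (the function name and exact rules are my reading of the comment, not wpt's actual lint code):

```python
import re

def check_relative_path(href: str) -> bool:
    """Return True if `href` follows the convention described above:
    relative paths may not go up ('../') and must point either into the
    same directory or into a './support' directory.  Absolute paths and
    full URLs are outside this check's scope and are accepted here."""
    if re.match(r'^(https?:)?/', href):
        return True  # absolute path/URL: not this rule's concern
    if href.startswith('../') or '/../' in href:
        return False  # going up is not allowed
    if '/' not in href.rstrip('/'):
        return True  # plain filename: same directory
    first = href.split('/')[0]
    return first in ('.', 'support') or href.startswith('./support/')
```

For example, `green.png` and `support/ahem.css` would pass, while `../support/a.png` or `other-dir/a.png` would be flagged.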
Two other things the test harness does:
Yes, I was mentioning that, it's quite annoying. But that's needed as you cannot use
Yes, but if you decide to have a subfolder for spec section (like we have in grid layout, e.g.
And yeah it seems there's no difference between
And checking the repository I can find more relative paths, and different outputs in test.csswg.org:
harness cannot find the
Also I see
More suites (apart from CSS2) that contain
Also I found that css-fonts uses
I don't think we should be redefining what pass/fail mean for reftests (nor should we expect people to manually run them; there's no reason for people to manually run the majority of the testsuite, because that's just a waste of time). If tests are failing because of anti-aliasing, we should give "fail" as a result in the implementation report (because that's the result of the test), and then give an explanation as to why we believe the failure doesn't matter (i.e., it's an anti-aliasing bug we don't believe is a significant part of the feature). We should definitely be aiming for there to be sufficiently few manual tests that a harness is a marginal benefit.
Note we already require
The hard part is modifying the test harness, as the test harness assumes references are referred to as "ref ids" all over the place, IIRC, and that you can just prefix "reference/" to them and have it work. Obviously not insurmountable, but it was hardcoded in enough places that I didn't want to try and change it when I looked at it before.
Yes, that's my recollection, I think driven by @zcorpan wanting to use wptserve features that don't work within the test harness (and I think those tests are still broken within the test harness).
As @dbaron basically said, same directory doesn't work: they have to be in a
As @mrego says:
References are put in a
FWIW: I think manually running tests is mostly a distraction, because nobody is really going to run many of them, and we should work towards having as much as possible automated so that vendors are actually running them regularly (and yes, there are things that are unlikely to get automated, but they're few and far between).
Yeah, Shepherd works against the source repo, while the harness works against the built test suites, a bit of apples and oranges.
It's been a while since I've been in the harness code, but I think the key to solving that is to have the manifest files include the path to the reference file as part of the reference id, then just remove the "reference/" prefixes in the harness. (It'll have to use a full path as the test file id as well anyway since the tests won't be living in the root of the test suite either, which it also assumes.) Unfortunately that will mean that if a test gets moved within the repo the harness will consider it a different test and old results will get lost (the test equivalency code could possibly be leveraged to address that). We also have to lose the multiple test format feature of the harness (or at least disable it for current suites and keep it for the old ones).
Unfortunately there are tests that no one has figured out how to test in an automated fashion yet. So until that gets solved, manual tests aren't a "distraction", they're a necessity. Yes, we all want as few of them as possible, but we can't simply ignore them.
If reftests are failing but the feature is implemented correctly, we need to prioritize fixing that by some means. We certainly shouldn't sweep this under the carpet by manually overriding the results, because that's not a solution that vendors can reasonably use.
There are several possible approaches to improving things here; the most obvious is allowing specified fuzziness in reftests (N pixels different, maximum difference of X per channel).
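The fuzziness model described here (at most N differing pixels, each channel off by at most X) can be sketched as a comparison function. wpt has since grown a fuzzy-matching syntax along these lines; the code below is an illustrative sketch of the semantics only, not wpt's implementation:

```python
def fuzzy_match(pixels_a, pixels_b, max_channel_diff, max_diff_pixels):
    """Compare two same-sized images given as lists of (r, g, b) tuples.
    Passes if no channel of any pixel differs by more than
    `max_channel_diff`, and at most `max_diff_pixels` pixels differ."""
    differing = 0
    for (r1, g1, b1), (r2, g2, b2) in zip(pixels_a, pixels_b):
        diffs = (abs(r1 - r2), abs(g1 - g2), abs(b1 - b2))
        if max(diffs) > max_channel_diff:
            return False  # a channel exceeded the allowed fuzziness
        if any(diffs):
            differing += 1
    return differing <= max_diff_pixels
```

With zero tolerance this degenerates to today's exact pixel-for-pixel comparison; a small allowance absorbs anti-aliasing differences without accepting genuinely wrong renderings.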
Generally wpt is not expected to work without a server, and I object to having restrictions that are based on the desire to avoid a server in a subset of cases. There is already tooling to make running the server easier (e.g.
They are a distraction if we are building systems around the assumption that there will be a lot of manual tests. Requirements for features like crowdsourcing results and long-term results storage seems predicated on the idea that there are so many manual tests it's not reasonable for someone to sit down and run all the manual tests for a single spec in a session, generating a results file. If that is true, it's a problem for other reasons. If it's not true, then a simple client side harness that produces a json file is all the infrastructure required.
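The "simple client side harness that produces a json file" could emit something like the following. This schema is purely illustrative (the keys and statuses are assumptions loosely modeled on wpt-style result reports, not an official format):

```python
import json

def write_results(run_info, results, path):
    """Collate manual-test verdicts into a single JSON results file.
    `results` maps test paths to "PASS"/"FAIL"; `run_info` records the
    browser/OS the run was performed on.  Illustrative schema only."""
    report = {
        "run_info": run_info,
        "results": [{"test": test, "status": status}
                    for test, status in sorted(results.items())],
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
```

A tester would run through the manual tests for a spec in one session and submit the resulting file, rather than results being crowdsourced piecemeal.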
I think it's valuable to not require the server for the large numbers of tests that don't need it. This allows things like developing the tests elsewhere and then copying them into the wpt repo at the end, or other patterns that are often useful, such as allowing reftests developed for other reftest systems to be used as wpt reftests and vice-versa. It also sometimes allows debugging techniques or tools (especially if you need to debug tests on devices that aren't desktop/laptop computers) that are a good bit harder if the server is required.
I have also found that relative paths which go up ('../') for fonts cause the font not to be loaded, as this seems to trigger a cross-origin restriction, at least when viewing from local disk.
Testing APIs is straightforward: automatic results are returned. Testing CSS can sometimes do that, but not always; CSS is primarily used to produce a visual result, which can sometimes be automatically compared to a reference (but not always); and thus there is a need to manually run tests.
Those of us who do frequently run tests manually appreciate being able to do so, and having the nice, autogenerated reports to examine, filter, and drill down into to see who is failing what, at which browser version, on which platform. And having those test results go away because someone made a non-substantive change to a test is frustrating, and wastes a lot of time re-running tests that we already had data for.
I think that the fact that the rest of WPT does not have the requirement the CSS tests have, of actually linking to the specific section or sections of the spec(s) being tested, is a bug and reduces the value of WPT. If a test author can't be bothered to document, or perhaps does not even know, what precisely they are testing, then the test is at best suspect (causing extra work for others to figure out what they thought they were testing) and at worst wrong or misleading.
Personally, if I revise a test and change a support file, it is faster and more efficient to know that I only have to care that my changes are correct for that test. If I have to search the entire test suite for every other test that uses a file of the same name, and verify one by one that the updated support file is still valid for all those other tests, then that is significant extra workload. Disk space is cheap, skilled labor is not.
Just to note that I also very much value those features, and use them daily.
Yes, that's faster to review, but then you might end up with, for example, 10 duplicated helper files that start to diverge... IMHO that's not good for the maintenance of those helper files in the repository, and it'd be better to have just one copy of them.
You still have to update the support file for other tests, though, because all the support files must be byte-for-byte identical.
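The byte-for-byte-identical invariant mentioned here is easy to check mechanically. A hypothetical sketch (the function and the grouping-by-basename heuristic are assumptions, not the actual build-system code):

```python
import hashlib
from collections import defaultdict

def find_divergent_support_files(files):
    """`files` maps repo paths to file contents (bytes).  Group support
    files by basename and report any basename whose copies are not
    byte-for-byte identical, which is the invariant described above."""
    by_name = defaultdict(set)
    for path, content in files.items():
        name = path.rsplit('/', 1)[-1]
        by_name[name].add(hashlib.sha256(content).hexdigest())
    return sorted(name for name, hashes in by_name.items()
                  if len(hashes) > 1)
```

A check like this run in CI would surface exactly the situation described: editing one copy of a shared support file without updating its duplicates elsewhere.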
@svgeesus Wrt support files, the CSSWG test infrastructure used to have scripts that would copy the contents of support/ directories further up in a directory structure to those further down and ensure binary compatibility, so files shared by a set of tests would go into the support/ directory of their common ancestor directory and get copied down for use throughout, e.g. hixie's swatch-color.png files or his cat would get propagated through the css2.1 directory structure and be available for use in new tests. I don't know what the state of the tooling is now, though.
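The propagation behaviour described could look roughly like this. This is a reconstruction from the description above, not the original CSSWG script (whose name and details I don't know):

```python
import shutil
from pathlib import Path

def propagate_support(root):
    """Copy every file in a support/ directory into the support/
    directories of all descendant test directories, so a test anywhere
    below the common ancestor can reference it as './support/<name>'.
    Existing files are never overwritten."""
    root = Path(root)
    support_dirs = sorted(p for p in root.rglob('support') if p.is_dir())
    for src in support_dirs:
        for dst in support_dirs:
            # only propagate downwards, from an ancestor to a descendant
            if dst != src and src.parent in dst.parent.parents:
                for f in src.iterdir():
                    if f.is_file() and not (dst / f.name).exists():
                        shutil.copy2(f, dst / f.name)
```

Combined with a byte-identity check, this keeps the shared copies in sync while letting tests use purely local relative paths.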
The Working Group just discussed
The full IRC log of that discussion:
<dael_> Topic: WPT - Extra CSS testsuite requirements
<fantasai> DECREED: fantasai to be listed as former editor of CSS2.1
<dael_> link: https://github.com//issues/10053
<gsnedders> GitHub: https://github.com//issues/10053
<dael_> gsnedders: There's a lot in here. I won't cover all of it.
<dael_> gsnedders: Big pain points are mostly caused by CSS test rebuild system. The current one we have because the CSS test harness relies on it.
<dael_> gsnedders: We've had agreement before to make CSS test harness not rely on build system. afaict the WG believes we need test harness for sake of impl reports
<dael_> ChrisL: Correct
<dael_> gsnedders: But no one in this room will pay anyone to work on test harness.
<dael_> ChrisL: Pretty much.
<dael_> gsnedders: This is unfortunate. If we care as much as we say we do it shouldn't be hard to find someone to pay to fix the major issues that we know it has.
<dael_> florian: Who can estimate how much work that would be?
<dael_> plinss: I'm probably the only one still around.
<dael_> gsnedders: I can talk build system, not test harness.
<dael_> florian: We either fix the test harness or improve the build system.
<dael_> ChrisL: But they have different goals. One is a getting out of CR system and the other is regression system.
<dael_> gsnedders: There is a JS program that creates an impl report from automated test run data.
<dael_> ChrisL: And that's very partial for CSS
<dael_> gsnedders: Depends on the spec.
<dael_> ChrisL: Those are API specs
<dael_> gsnedders: Even those that aren't we have very few manual tests. Majority of modules don't have manual.
<dael_> ChrisL: ref tests are non manual and they fail because the reference isn't exactly what it says?
<dael_> gsnedders: Then it fails. The pass is these two files are pixel to pixel.
<dael_> fantasai: I think if you generate an impl report that says all these tests fail but they actually pass and did you check it's just the anti-aliasing? For specs with the anti-aliasing it shouldn't be pixel to pixel it should be can a human tell the difference. It's not ideal but we can't get out of this situation any time soon.
<fantasai> s/it shouldn't be/pass condition shouldn't be/
<dael_> astearns: 2 things. First we started with the requirements in the build system that are at times blindly broken by WPT because they don't have those requirements. Second is getting WPT to where we can use for impl reports and we should table that.
<dael_> florian: If second was easy we wouldn't need the first one. It probably isn't so we need the first.
<dael_> fantasai: I'd also like to assert that in terms of regression testing system for things that are manual only WPT ignores all those things and there are test suites where we haven't figured out how to test these things. Ideally we should run these tests every year or so. We did regression testing manually in the past and it sucks and we should avoid it but there are cases where we must and WPT is ignoring that.
<dael_> dbaron: There are a bunch of things browsers can automate internally so we're not interested in those tests.
<dael_> florian: They're necessary to go to rec.
<dael_> fantasai: No, this is different.
<dael_> fantasai: No one is asking the browsers to do it but it should be possible to test the browsers.
<dael_> dbaron: We're not interested in running them.
<dael_> dbaron: I think their existence is mostly a waste of time
<dael_> florian: But how do we go to rec if we don't have tests for things that cannot be automated.
<dael_> dbaron: You try and find ways to automate them.
<dael_> emilio: There's a lot of work going on to automate previously hard to automate things.
<dael_> florian: I'd like to not rabbit hole.
<dael_> gsnedders: As example you can't automate a test to see if the Media Query works for a monitor that has P3 colors.
<dael_> gsnedders: You can't automate that.
<dael_> astearns: One of the problems of fixing the competing systems is we can't find someone to pay to fix our system. Will someone pay to make WPT better for our impl reports?
<dael_> gsnedders: Yes.
<dael_> astearns: Has anyone tried?
<dael_> florian: Part of the discussion on github was trying to gather our needs and I don't think that's concluded.
<dael_> gsnedders: 2 sides to that. We know there are other WGs that have used WPT tools to take specs to rec. How much would those tools need to improve for the CSSWG to be happy using them. There will prob be more and less painful cases.
<gsnedders> s/2 sides to that.//
<dael_> fantasai: Given current situation we need to be able to deal with a set of manual tests. WPT handles ref tests but we can't do that for manual tests and depend on those for a number of specs.
<dael_> gsnedders: But for doing CR we don't need to run them all the time.
<dael_> gsnedders: A proposal was for WPT to stop writing test results but it assumes a future build won't be different from a previous. Current assumes Chrome 1 will be same as Chrome 60.
<dael_> plinss: It'll store a result from each. If there's a pass in a past version and fail on the current you can see that but it will still pass
<dael_> florian: That's sufficient to show it's impl.
<dael_> fantasai: It's a different analysis because different goal
<dael_> gsnedders: WPT looks for interop
<dael_> plinss: They have different goals so do different things in different ways. IT's a different view of the same data.
<dael_> florian: For requirements I think a desirable improvement of what we have is that currently the manual tests can be run by anybody. Some results are bogus though that's not entirely useless. If someone adds a result that is different that's worthwhile. Storing the things that are known to be correct is valuable.
<dael_> gsnedders: What groups have done with manual tests is have someone run them in a browser and create a file with the results.
<dael_> florian: When I create an impl report I create all the tests. But I don't want to have to run all the tests just to check if it's time to run an impl report. Having the history of knowing something has passed before saves time for me so I can just verify at the end.
<dael_> florian: We don't have to have that, we've lived without it, but if we're re-implementing it's useful.
<dael_> fantasai: 2 things I'd like to see, one is to have the manual test harness deprioritize ref tests. It should put them at the end of the list. And WPT exporting test results in a way that the test harness can import.
<dael_> plinss: Test harness already orders in the way for most needed to exit CR.
<dael_> gsnedders: If we get it running every 12 hours that's a lot of data going into the system. No reason to import that often. But if you go through writing modes there's 200 fails.
<dael_> fantasai: I think the writing modes test case we don't need to optimize, we just need to deal with it.
<fantasai> plinss^: If we load in the automated results, they'll get automatically deprioritized
<dael_> florian: I think we'll write fewer of these tests. First because we know and second because we know these tests have problems on high DPI screens.
<dael_> gsnedders: Currently WPT assumes you're in low DPI.
<dael_> rego: Current problem that it has is you'll have to duplicate different things and you don't realize that because you'll have something break in the build system when something changes. If there's a solution for that it's not good that things pass in a browser and then it's bad for csswg.
<dael_> florian: If we start to share a bunch of files should we forbid modifying them?
<dael_> florian: Now every test folder has its own files so you know you're the only user. If we centralize there's a lot of test suites using them.
<dael_> gsnedders: Keep in mind the merging of support directories.
<dael_> florian: So you will not have files like that?
<dael_> gsnedders: They have to be bit to bit identical.
<dael_> florian: So if we put it in one place there's nothing to detect. So if one test changes and forgets it's shared....
<dael_> dbaron: If you're reviewing a patch that changes a support file you have to review all the places it touches. That's a general statement: you must review all places that touch that file. It should be possible for reviewers to know that easily so reviewers can figure it out.
<dael_> ChrisL: Other thing you can do is if you attempt to modify the file, don't, fork it. But that gives you lots of extra files.
<dael_> fantasai: Yeah, font glyphs can just be modified in most cases.
<dael_> plinss: Yes, but sometimes you're testing for a glyph and the glyph isn't there.
<dael_> gsnedders: Potentially easier because hopefully this year we can run all of WPT on every pr.
<dael_> fremy: Will still work but for wrong reasons.
<dael_> gsnedders: Totally. Doesn't solve everything but it will show up.
<dael_> astearns: One thing I heard mentioned is that we still lack way to take WPT results and importing. Is that more or less important then other tasks you outlined?
<dael_> gsnedders: That relies on knowing what source file each test in the build suite comes from.
<dael_> astearns: I don't want details on the how. But I've heard many people want it so I'm asking if that's most important or are there other possible items you outlined that are more important.
<dael_> gsnedders: One of the big questions is can we move to WPT report and away from test harness.
<dael_> fremy: Quick thing. As of today WPT FYI is still completely useless for CSS. They still don't use the Ahem font on Windows so these results are useless.
<dael_> fantasai: Sounds like getting Ahem resolved on those machines is the #1 problem.
<dael_> fremy: That has been open since last TPAC. As long as it's not fixed this is useless for CSS.
<dbaron> We could use @font-face rules when we use Ahem...
<dael_> florian: how come this is so important to us and no one wants to care? How many people have recently tried to take a spec out of CR? Not a lot. But everyone who does needs it.
<cbiesinger> @font-face also requires using document.fonts.ready, fyi
<dael_> gsnedders: I mean, the implication is how much does this group care about leaving CR.
<gsnedders> dbaron: the problem is updating all the tests for that
<dael_> fantasai: Then it means how much do we care about patent protections...those kick in on REC.
<dael_> florian: Those that have tried to exit CR care. That's only a few people which is why only a few people feel the pain.
<dael_> astearns: What do you want out of this discussion?
<dael_> gsnedders: High level questions: We've answered can we have static safe results saved somewhere and the answer is you want to know if we seemingly have all tests passing
<dael_> fantasai: I also want to be able to run a lot of manual tests without a lot of pain.
<dael_> fantasai: It's much easier to click buttons in the test harness than trying to take hours to run the tests and then have to generate a report and figure out what needs to be fixed in the spec.
<dael_> gsnedders: Anything apart from manual tests?
<dael_> fantasai: Ahem font needs to be installed.
<dael_> gsnedders: Yes.
<dael_> astearns: fuzzy matching for ref tests.
<dael_> fantasai: Random questions, does WPT handle paginated?
<dael_> gsnedders: No.
<dael_> florian: afaict WPT doesn't care much about helping with evaluating test coverage.
<fantasai> s/No/No. So, those are all manual as well atm/
<dael_> ChrisL: That is useful about the current because you can add a link at the top of the document to show you your coverage.
<dbaron> Mozilla css testsuite importing code for adding @font-face rules on import is https://searchfox.org/mozilla-central/source/layout/reftests/w3c-css/import-tests.py#243-256,268-274
<dael_> TabAtkins: I'm working on a section in bikeshed to include tests. This lets you see what's tested in the source. And bikeshed can warn you about tests.
<dael_> astearns: And it's bi-directional where as you move you note thesea re tests.
<astearns> s/thesea re tests/which tests need to be updated/
<ChrisL> dbaron, nice. Can we get that added to the wpt.fyi build system?
<cbiesinger> dbaron: does gecko block onload on font loading?
<dael_> fantasai: Goal of this...we have tests linking to sections but it's not very granular. For tests with small sections it's good enough, but for complex specs you want...to understand coverage you need to annotate the spec paragraph by paragraph and you might need notes saying this test is for this section, and you want to be able to write an annotated spec as a test report
<gsnedders> cbiesinger: yes
<dael_> fantasai: To make that possible without having to keep a sep. spec copy the idea was to include those in bikeshed so you can include them for the test coverage and exclude them when publishing the spec
<gsnedders> dbaron: in general we can't rely on the Ahem font
<cbiesinger> gsnedders: ok. I don't think blink does
<dael_> florian: I don't have a problem with that, but only if it's not used to ignore annotations.
<dael_> fantasai: I think this will make it trivial to ensure you have annotations.
<gsnedders> s/rely on the Ahem font/rely on the Ahem flag/
<gsnedders> cbiesinger: it doesn't
<dael_> TabAtkins: In so far as tests have annotations I want to make sure you can include them.
<dael_> ChrisL: Will that cope with tests changing?
<gsnedders> cbiesinger: https://github.com/w3c/csswg-drafts/issues/1088
<dael_> TabAtkins: I dunno. I can prob make it rerunnable. We'll see.
<dael_> florian: You prob need to.
<cbiesinger> gsnedders: so if want to use this for wpt.fyi it would need to be smarter, somehow (re ChrisL suggestion)
<dael_> TabAtkins: Once you start and you add a new test bikeshed will warn you hey there's a test not in your spec.
<dael_> TabAtkins: Point of this is not to produce a 2-way it's to help with import.
<gsnedders> cbiesinger: wptrunner already waits on font loading, FWIW
<dael_> florian: Once you import and someone makes changes, what happens?
<gsnedders> cbiesinger: because there are plenty of other tests that use web fonts
<cbiesinger> ah ok
<dael_> TabAtkins: bikeshed will note that there's a bunch of different tests.
<dael_> florian: In the future bikeshed will have a flag to let you import and it will check for differences?
<dael_> TabAtkins: Yes, in general. Check is separate from import.
<dael_> florian: Okay. I guess. Yeah. Something similar to what it does for can i use.
<dael_> TabAtkins: Yeah, something like that.
<dael_> florian: That does relieve some pressure from the test harness.
<dael_> fantasai: I like that I can search for tests tagged against a section, but that's trivial to do if test harness isn't there.
<dael_> fantasai: I think the collation of test results is important.
<dael_> rego: Will WPT be open to add support to WPT FYI to run manual tests?
<fantasai> and being able to run all the manual tests against a particular spec or section of a spec
<dael_> gsnedders: What someone mentioned, I think plh, was that...It hasn't been discussed. Some people aren't opposed to storing the tests in this revision run against this browser got this result.
<dael_> florian: That's stored, but how to enter? I wish this not to be me submitting a PR with a JSON file.
<dael_> gsnedders: I would expect that not to be a solution.
<dael_> astearns: At this point it's just an idea.
<dael_> gsnedders: Yeah.
<dael_> florian: Seems like this is more likely to materialize than paying for the harness work.
<dael_> astearns: I'm hearing we have all sorts of ideas, but no commitment to work on anything except TabAtkins and bikeshed.
<dael_> TabAtkins: Not too slow of work.
<dael_> plinss: How are you getting data on tests?
<gsnedders> s/plh/Philip Jägenstedt/
<dael_> TabAtkins: Eventually through WPT FYI's API.
<dael_> florian: Another missing thing from WPT is differentiating the tests with may and should flag and the rest.
<dael_> dbaron: Especially for tests with a may.
<dael_> florian: Should are equally important to know.
<dael_> dbaron: May are often "you really shouldn't fail this"
<dael_> florian: May is sometimes there to allow you to do something bad.
<dael_> fantasai: When you write tests, they're supposed to pass when you do the thing we want to do.
<dael_> ChrisL: Or you can write tests that have multiple pass options.
<dael_> dbaron: I've seen tests testing against the thing we don't want you to do.
<dael_> [many unhappy people]
<dael_> fantasai: I think submit a PR for removing that test.
<dael_> astearns: Can we close this?
thanks. On the plus side, this kind of report is really to the point. On the flip side, it is awfully dry and hard to interpret if you don't have context. In other words, I am way more verbose when I do this :) see the css-ui report for example.
Anyway, the wording of the report is different, but the test result part is quite similar (except that it is missing the optional / mandatory classification).