-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC 114: Change results aggregation on wpt.fyi and visualization on Interop-20** views #114
Changes from 1 commit
e05638c
942be3a
960bde5
b132fb0
a04cc5e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
# RFC: Change results aggregation and visualization on wpt.fyi to increase accuracy and utility | ||
|
||
## Summary | ||
|
||
The aggregate scoring that is displayed in the results view of wpt.fyi has | ||
consistently been inaccurate in some aspects. This proposal will describe a | ||
change to the way that test results are aggregated, as well as a change to | ||
how those results are displayed. | ||
|
||
## Details | ||
There are a number of aspects to this change. Each subsection will describe | ||
these aspects in detail. | ||
### Remove harness status from subtest aggregation | ||
|
||
When interpreting the results of test suite runs, a status is present that | ||
represents either the overall test's status or the "harness status", which | ||
represents whether any harness errors were present for `testharness.js` test | ||
runs. The current results aggregation method counts this harness status as a | ||
separate subtest itself, which can inflate the overall passing percentage of | ||
a test. There are many instances of a test outright failing its only subtest, | ||
but the test is weighted as 50% passing because there were no harness errors | ||
[(example)](https://wpt.fyi/results/audio-output/selectAudioOutput-sans-user-activation.https.html?run_id=5642791446118400&run_id=5668568732532736&run_id=5739999172493312&run_id=5735075831349248). | ||
The solution proposed is to no longer aggregate this harness status into the | ||
subtest totals. | ||
|
||
### Count test as a failure if a harness error exists | ||
|
||
Although it seems intuitive to simply disregard the harness status when | ||
calculating tests, there are instances where a harness error will occur, which | ||
causes a number of subtests to not be executed. In some circumstances, it is | ||
possible that a harness error will occur, allowing for some subtests to run | ||
and pass as expected, but some subtests to be not run and not be registered, | ||
as they are not present during aggregation. This has the possibility of hiding | ||
the scope of failure for a specific test in this scenario. The solution | ||
proposed, although not perfect, is to mark these tests with harness errors | ||
as 0% passing, regardless of subtest results. | ||
|
||
_Note: Individual subtest results will still be visible normally even with_ | ||
_this change. This only affects the total percentage that the test counts as_ | ||
_passing._ | ||
|
||
### Display percentages to represent the fraction of tests passing. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm pretty solidly opposed to changing the display to be based on percentages. It has been a long term goal to avoid wpt.fyi becoming a primary source for non-browser developers trying to do cross-browser comparisons. It's not clear to me that percentages here are helpful to browser developers, but by making random wpt.fyi results pages look like Interop-202x scores we are encouraging third parties to make value judgements using test sets that haven't been subject to the extensive discussion/auditing of the Interop-selected tests. I'd expect that to affect whether people are willing to submit tests to web-platform-tests in the first place. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I definitely see the issue you're describing with percentages giving a false sense of normalized numbers to compare between browsers. This is definitely not the intention of the change. Maybe this is the most contentious change proposed, and finding a middle-ground here would be a better option. |
||
|
||
Currently, wpt.fyi displays a fraction of the number of subtests that pass and | ||
the number of subtests that exist for each test or every test that exists in a | ||
directory. These fractions can get large and be hard to quickly parse their | ||
meaning, and a single test with many subtests is heavily weighted. The | ||
proposed change is twofold: display this number as a percentage, and instead | ||
count each test as 1 total unit, with its pass rate counted as the percentage | ||
of subtests that pass. Directories will aggregate based on this each test pass | ||
percentage rather than the accumulation of every subtest. | ||
|
||
Passing percentage for a single test = `passing subtests / total subtests` | ||
|
||
Passing percentage for directories containing multiple tests = `sum of passing percentages of tests in directory / total number of tests in directory` | ||
|
||
For example, if a test has 5 out of 10 subtests that pass, the it will now be | ||
displayed in its representing cell as `50%` rather than `5 / 10` for that run. | ||
Additionally, if we have a directory containing 3 tests and 2 of those tests | ||
pass 100% of their subtests and the last test passes 50%, the directory will | ||
display `83.3%`. | ||
|
||
### Display the number of tests in a directory next to the directory name | ||
|
||
The current implementation of displaying fractions of subtests has a benefit | ||
of signaling where a bulk of test information is located and how much to | ||
expect inside of that directory. Displaying percentages causes some loss of | ||
this information. To mitigate this, the proposal is to instead display test | ||
totals next to a directory name. This will be all tests that exist in the | ||
directory as well as the number of tests that exist in all subdirectories | ||
within that directory. | ||
|
||
### Display test results in single test cells rather than percentages | ||
|
||
If a test has no subtests and only a pass/fail result, that result status | ||
( `PASS` , `FAIL` , `TIMEOUT` , etc.) will be displayed in the cell that | ||
represents that test rather than the existing `0 / 1` or `1 / 1` . The user | ||
will no longer have to drill down into the test to view the status message in | ||
these scenarios. | ||
|
||
## Risks | ||
* This change will cause a net loss in the percentage of passing | ||
tests/subtests, as the harness status passing has been artificially inflating | ||
some subtest numbers. | ||
* Some tests might pass most or all subtest but encounter a harness error, | ||
which will cause the entire test suite to be counted as a failure. This will | ||
also result in a lower rate of overall passing percentages. | ||
* This change will require a rework of the current process for aggregating and | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this the final conclusion, or can we use the search cache when there aren't pre-computed summary numbers in the new format? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can reword this a bit - viewing runs using old vs. new summary files will not affect the scores being displayed in the Interop-20** views. However, there will be a discrepancy when comparing old and new results in the default view. See here for an example of results that are identical but were aggregated differently so their subtest counts look different (new aggregation on the left, old on the right). There is a larger than expected scope of work for no longer counting the harness status in old summary files. The results view on wpt.fyi assumes there is ALWAYS a summary file available to use and bypasses any aggregation on simple queries (simple = results views with no search query in the search bar). As I can see it, these are the options available:
I think option 2 is promising, but it will feasibly cause a small performance hit when viewing results from these old runs. However, this performance hit should be lessened the further we move away from these old runs and compare them less. |
||
summarizing test results, so older test runs will still be aggregated with the | ||
harness status and display inflated passing percentages. | ||
* Tests will be weighted equally, regardless of the number of subtests that | ||
exist in that test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the caveat here is important: this change doesn't seem to help browser developers at all; they still need to go down into the individual test and figure out "oh actually there are some passing subtests, the problem is that the test is slow running, so we time out" (or whatever). That suggests the change isn't actually very helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that this change isn't perfect, but hopefully still a net positive.
To use the example you mentioned, it could be the case that, for example, 7 subtests pass as expected, but a timeout occurs on the 8th test. There are instances where a test like this will be marked as 7/8 subtests passing (or 7/9 since the harness status is currently counted as a subtest), but the test could have many more subtests that were not run and are missing from the results. The current aggregation method can hide the scope of an issue like this, so the hope is that this change will bring attention to areas of improvement that could currently be overlooked.
That being said, I'm not certain about how common this scenario is. Perhaps delving into this further and having numbers would be of use.