
RFC 114: Change results aggregation on wpt.fyi and visualization on Interop-20** views (#114, merged)

File: rfcs/test_scoring_change.md (92 additions, 0 deletions)

# RFC: Change results aggregation and visualization on wpt.fyi to increase accuracy and utility

## Summary

The aggregate scoring displayed in the results view of wpt.fyi has long been
inaccurate in several respects. This proposal describes a change to the way
test results are aggregated, as well as a change to how those results are
displayed.

## Details

There are several aspects to this change; each subsection describes one in
detail.

### Remove harness status from subtest aggregation

When interpreting the results of a test suite run, each test carries an
overall status; for `testharness.js` tests this is the "harness status",
which indicates whether any harness errors occurred during the run. The
current results aggregation method counts this harness status as a separate
subtest in itself, which can inflate the overall passing percentage of a
test. There are many instances of a test outright failing its only subtest
yet being weighted as 50% passing because there were no harness errors
[(example)](https://wpt.fyi/results/audio-output/selectAudioOutput-sans-user-activation.https.html?run_id=5642791446118400&run_id=5668568732532736&run_id=5739999172493312&run_id=5735075831349248).
The proposed solution is to no longer aggregate this harness status into the
subtest totals.
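
As a minimal sketch of the proposed counting rule (hypothetical names, not
wpt.fyi's actual aggregation code), the harness status would simply be
excluded from the subtest tally:

```python
def count_passing_subtests(subtest_statuses):
    """Tally passing subtests under the proposed rule (illustrative).

    The harness status is no longer appended as an extra subtest, so a
    test failing its only subtest scores 0 / 1 instead of 1 / 2.
    """
    passing = sum(1 for status in subtest_statuses if status == "PASS")
    return passing, len(subtest_statuses)
```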

### Count test as a failure if a harness error exists

Although it seems intuitive to simply disregard the harness status when
scoring tests, a harness error can cause a number of subtests not to execute
at all. In such a run, some subtests may execute and pass as expected while
others never run and are never registered, so they are absent from
aggregation entirely. This can hide the true scope of a test's failure. The
proposed solution, although not perfect, is to mark any test with a harness
error as 0% passing, regardless of its subtest results.
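
Combined with the previous rule, per-test scoring might look like the
following sketch (again hypothetical names and status strings; wpt.fyi's
real implementation differs):

```python
def test_pass_fraction(subtest_statuses, harness_status):
    """Score a single test in [0, 1] under the proposed rules (sketch)."""
    if harness_status == "ERROR":
        # A harness error zeroes the whole test: an unknown number of
        # subtests may never have run, so partial credit could hide the
        # real scope of the failure.
        return 0.0
    passing = sum(1 for status in subtest_statuses if status == "PASS")
    return passing / len(subtest_statuses) if subtest_statuses else 0.0
```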

_Note: Individual subtest results will still be visible normally even with
this change. This only affects the total percentage that the test counts as
passing._

**@jgraham** (Contributor) commented on May 30, 2022:

I think the caveat here is important: this change doesn't seem to help browser developers at all; they still need to go down into the individual test and figure out "oh actually there are some passing subtests, the problem is that the test is slow running, so we time out" (or whatever). That suggests the change isn't actually very helpful.

**Author** replied:

It is true that this change isn't perfect, but hopefully still a net positive.

To use the example you mentioned, it could be the case that, for example, 7 subtests pass as expected, but a timeout occurs on the 8th subtest. There are instances where a test like this will be marked as 7/8 subtests passing (or 7/9, since the harness status is currently counted as a subtest), but the test could have many more subtests that were not run and are missing from the results. The current aggregation method can hide the scope of an issue like this, so the hope is that this change will bring attention to areas of improvement that could currently be overlooked.

That being said, I'm not certain how common this scenario is. Perhaps delving into this further and having concrete numbers would be useful.

### Display percentages to represent the fraction of tests passing

**Contributor** commented:

I'm pretty solidly opposed to changing the display to be based on percentages. It has been a long term goal to avoid wpt.fyi becoming a primary source for non-browser developers trying to do cross-browser comparisons. It's not clear to me that percentages here are helpful to browser developers, but by making random wpt.fyi results pages look like Interop-202x scores we are encouraging third parties to make value judgements using test sets that haven't been subject to the extensive discussion/auditing of the Interop-selected tests. I'd expect that to affect whether people are willing to submit tests to web-platform-tests in the first place.

**Author** replied:

I definitely see the issue you're describing with percentages giving a false sense of normalized numbers to compare between browsers. This is definitely not the intention of the change. Maybe this is the most contentious change proposed, and finding a middle ground here would be a better option.


Currently, wpt.fyi displays, for each test (or for all tests in a
directory), a fraction: the number of passing subtests over the total number
of subtests. These fractions can grow large and are hard to parse at a
glance, and a single test with many subtests is heavily weighted. The
proposed change is twofold: display this number as a percentage, and count
each test as one unit, with its pass rate equal to the percentage of its
subtests that pass. Directories will aggregate these per-test pass
percentages rather than accumulating every subtest.

Passing percentage for a single test = `passing subtests / total subtests`

Passing percentage for directories containing multiple tests = `sum of passing percentages of tests in directory / total number of tests in directory`

For example, if a test passes 5 of its 10 subtests, its cell will now display
`50%` rather than `5 / 10` for that run. Additionally, if a directory
contains 3 tests, 2 of which pass 100% of their subtests while the third
passes 50%, the directory will display `83.3%`.
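
As a sketch of the two-level aggregation (a hypothetical helper, reusing the
per-test fractions described above; not wpt.fyi's implementation):

```python
def directory_pass_percentage(test_fractions):
    """Average per-test pass fractions; each test counts as one unit.

    `test_fractions` holds one value in [0, 1] per test in the
    directory. Illustrative sketch only.
    """
    if not test_fractions:
        return 0.0
    return 100.0 * sum(test_fractions) / len(test_fractions)

# The worked example above: two fully passing tests plus one at 50%.
print(round(directory_pass_percentage([1.0, 1.0, 0.5]), 1))  # 83.3
```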

### Display the number of tests in a directory next to the directory name

The current display of subtest fractions has the benefit of signaling where
the bulk of the test information is located and how much to expect inside a
directory. Displaying percentages loses some of this information. To
mitigate this, the proposal is to display a test total next to each
directory name: the number of tests in the directory itself plus the number
of tests in all of its subdirectories.
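
A recursive count over the directory tree would suffice; a minimal sketch
(the dict layout is assumed purely for illustration):

```python
def total_tests(directory):
    """Count tests in a directory and all of its subdirectories.

    `directory` is assumed to look like
    {"tests": [...], "subdirs": [<directory>, ...]} for this sketch.
    """
    return len(directory["tests"]) + sum(
        total_tests(sub) for sub in directory["subdirs"]
    )
```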

### Display test results in single test cells rather than percentages

If a test has no subtests and only a pass/fail result, that result status
(`PASS`, `FAIL`, `TIMEOUT`, etc.) will be displayed in the cell that
represents that test rather than the existing `0 / 1` or `1 / 1`. The user
will no longer have to drill down into the test to view the status message in
these scenarios.
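
The cell-rendering decision then reduces to something like this sketch (a
hedged illustration; the actual wpt.fyi front end differs):

```python
def cell_label(test_status, passing, total):
    """Pick the label for a results cell (illustrative sketch).

    Tests without subtests show their status string directly; tests
    with subtests show the rounded pass percentage.
    """
    if total == 0:
        return test_status  # e.g. "PASS", "FAIL", "TIMEOUT"
    return f"{100.0 * passing / total:.1f}%"
```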

## Risks

* This change will cause a net drop in reported passing percentages, as
counting the harness status as a passing subtest has artificially inflated
some subtest totals.
* Some tests might pass most or all of their subtests but encounter a harness
error, which will cause the entire test to be counted as a failure. This will
also lower overall passing percentages.
* This change will require a rework of the current process for aggregating and
summarizing test results, so older test runs will still be aggregated with the
harness status and display inflated passing percentages.
* Tests will be weighted equally, regardless of the number of subtests that
exist in that test.

**Member** commented:

Is this the final conclusion, or can we use the search cache when there aren't pre-computed summary numbers in the new format?

**Author** replied:

I can reword this a bit - viewing runs using old vs. new summary files will not affect the scores being displayed in the Interop-20** views. However, there will be a discrepancy when comparing old and new results in the default view. See here for an example of results that are identical but were aggregated differently so their subtest counts look different (new aggregation on the left, old on the right).

There is a larger than expected scope of work involved in no longer counting the harness status in old summary files. The results view on wpt.fyi assumes there is ALWAYS a summary file available to use and bypasses any aggregation on simple queries (simple = results views with no search query in the search bar). As I see it, these are the options available:

  1. Remove old summary files and regenerate the summary files with the new format. This would take a very large amount of time and resources to run.
  2. When fetching a summary file server side, check if the summary file is an old format and don't use it if that is the case. Instead aggregate in the same way as is done with search queries.
  3. Accept that there will be small subtest count discrepancies when comparing new runs against older runs.

I think option 2 is promising, but it will likely cause a small performance hit when viewing results from these old runs. However, this hit should diminish over time as these old runs are viewed and compared less often.
