
RFC 114: Change results aggregation on wpt.fyi and visualization on Interop-20** views #114

Conversation

@DanielRyanSmith (Contributor) commented May 18, 2022

This is a proposal for a larger change to the results view of wpt.fyi.

Screenshot examples (with links)

Tests with harness errors display warnings with title text next to results (link)


Interop 2022 results are shown with an aggregation that more closely reflects actual Interop 2022 scores (link)


Harness status will no longer count toward subtest totals (link)

The run in the left column is aggregated using the new method, compared to the old totals displayed in the right column. This change more accurately represents the scope of test failures.
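For concreteness, here is a minimal sketch of the two subtest-total calculations (illustrative only, not the actual wpt.fyi code; the status strings and function names are assumptions):

```python
# Illustrative sketch only: how dropping the harness status changes a cell's totals.

def old_totals(harness_status, subtest_results):
    """Old aggregation: the harness status is counted as one extra 'subtest'."""
    passes = sum(1 for r in subtest_results if r == "PASS")
    total = len(subtest_results) + 1            # +1 for the harness status itself
    if harness_status == "OK":
        passes += 1                             # an OK harness status counts as a pass
    return passes, total

def new_totals(harness_status, subtest_results):
    """New aggregation: only real subtests are counted."""
    passes = sum(1 for r in subtest_results if r == "PASS")
    return passes, len(subtest_results)

# A test with 3 of 4 subtests passing and an OK harness status:
print(old_totals("OK", ["PASS", "PASS", "PASS", "FAIL"]))  # (4, 5)
print(new_totals("OK", ["PASS", "PASS", "PASS", "FAIL"]))  # (3, 4)
```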

@DanielRyanSmith force-pushed the DanielRyanSmith/wpt.fyi-scoring-change branch from 705683d to e05638c on May 19, 2022
@DanielRyanSmith marked this pull request as ready for review on May 19, 2022
@DanielRyanSmith changed the title from "Change results aggregation and visualization on wpt.fyi" to "RFC 114: Change results aggregation and visualization on wpt.fyi" on May 19, 2022
@zcorpan (Member) commented May 24, 2022

The proposed changes make sense to me.

Does this change affect the scoring for Interop 2022?

@foolip (Member) commented May 24, 2022

> Does this change affect the scoring for Interop 2022?

No, but I would like us to pick a scoring method that is as close as possible. And if it's not exactly the same, I'd like to use the scoring method we pick here for Interop 2023.

@foolip (Member) commented May 24, 2022

@DanielRyanSmith I love how this resolves web-platform-tests/wpt.fyi#421, nice!

On the precision, I'm leaning towards no decimal points, because it's more compact, and a difference of <1% rarely matters.

@jgraham (Contributor) commented May 24, 2022

I haven't looked at this in detail, but I'm concerned that we originally avoided percentage scores because they're so easy to misinterpret into some kind of ranking. They're also lossy; you can't tell whether a score of 50% is based on passing 1/2 tests or 500/1000. The latter is at least suggestive of a more complete testsuite.

For Interop-* we are putting a lot of care into selecting the tests so that everyone agrees that it's OK to publish this kind of simplified score. I'm not sure who the intended audience is for doing that for all tests; I don't see that it's especially helpful to browser developers who I consider to be the main target for the raw test scores.

@foolip (Member) commented May 24, 2022

@jgraham note that the number of tests is preserved next to directory names. In an earlier prototype it was next to the percentage, but since it's the same for each cell it was moved outside. Some other way of showing the number of tests would be OK too, I think.

@DanielRyanSmith (Contributor, Author) commented

@zcorpan

> Does this change affect the scoring for Interop 2022?

This will not affect Interop 2022 scores, but it will move wpt.fyi a bit closer to mirroring them.

@jgraham

> They're also lossy; you can't tell whether a score of 50% is based on passing 1/2 tests or 500/1000. The latter is at least suggestive of a more complete testsuite.

This is true for a single test with multiple subtests. Like @foolip mentioned, an attempt to mitigate this has been to make the number of tests in a directory visible next to the path name. Not displaying subtests is a potential risk, but hopefully mitigated by the test numbers display.

@davidsgrogan (Member) commented

I'd like to use the staging instance to see what this long diff query looks like, but I get an error when I try to just substitute the staging domain into that url

> 'diff' was requested, but is only valid when comparing two runs.

Are those runs specified at the end of the URL not available to the staging instance?

@DanielRyanSmith (Contributor, Author) commented

> I'd like to use the staging instance to see what this long diff query looks like

Thanks @davidsgrogan for taking a look 😊 - the original plan was to only change the results view to percentages and leave the diff query view with its current implementation. Perhaps this is something that people have opinions about?

The view you linked is not available because the run IDs differ between the staging and production environments, but the view should look mostly the same as it currently is in production.

@foolip (Member) commented May 25, 2022

> Not displaying subtests is a potential risk, but hopefully mitigated by the test numbers display.

I think that counting subtests in aggregate, like for all of css/ or editing/, doesn't make sense, and I'm glad to see that go away. But if anyone thinks it's important to continue to show subtest counts at the test level, an option would be to continue to show "3 / 4" instead of 75% for the cell for an individual test.

@zcorpan (Member) commented May 25, 2022

Or for tests that have subtests, show the number of subtests next to the filename

@foolip (Member) commented May 25, 2022

The tricky part of that is that the number of subtests can be different between browsers, in cases like these:
https://wpt.fyi/results/media-capabilities/decodingInfo.webrtc.html?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned&q=status%3Amissing
https://wpt.fyi/results/touch-events/multi-touch-interactions.html?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned

We can just pick the biggest number to show, since it would be purely informational and not used for scores. I think that'd be OK, and it's how the single "Showing 18 subtests" count in the blue banner is currently computed.

@DanielRyanSmith do you think displaying that next to test names would be OK? What do we do about names that are too long? (This can also happen with directories, but much more rarely.)

@DanielRyanSmith (Contributor, Author) commented

> Or for tests that have subtests, show the number of subtests next to the filename

This is a quick and easy addition, and here is an example of the change. If any subtests exist in the test, the number will be displayed next to the name.

In terms of text wrapping, it would happen the same as it does for long test file names now. Here is an example of long test names. Some are already wrapping without the new subtest numbers.

One possible issue I could see is whether there is a need to differentiate the numbers displayed for a directory (tests) from the numbers displayed for a single test (subtests), but it seems relatively intuitive to me. Great suggestion!

@davidsgrogan (Member) commented

> Thanks @davidsgrogan for taking a look 😊 - the original plan was to only change the results view to percentages and leave the diff query view with its current implementation. Perhaps this is something that people have opinions about?

I have opinions on it! 😊

I will need to calculate the interop2022 score difference caused by engines changing behavior for a large subset of tests. It will be absurdly tedious to do manually based on the existing diff view because of subtests. So showing score change in the diff view would eliminate a lot of manual calculation on my part.

Optimally we could have a toggle button that lets users switch the view between score differences and the existing view that shows the number of test differences, but maybe that would add too much complexity on your side?

@mfreed7 commented May 26, 2022

Overall, thanks for implementing this! It'll be helpful as we work through improving Interop2022 scores.

> On the precision, I'm leaning towards no decimal points, because it's more compact, and a difference of <1% rarely matters.

I think we need to keep the decimal precision. In the third screenshot example, "backgrounds" has 480 sub-tests and a pass rate of 99.8%, meaning a "combined" score of 480*0.998 = 479. So there's a full point to pick up there, which wouldn't be visible if you rounded that to 100%.

@DanielRyanSmith (Contributor, Author) commented

@davidsgrogan

> I have opinions on it! 😊

Your pain point is an important one, and is tangentially related. I think it would be very useful to describe your situation in an issue on wpt.fyi if it's not already there. Because it's a separate view with a use case that is typically different from this results view, it might take some more discussion, but this feels like very useful information.

@jgraham (Contributor) left a review comment

The RFC isn't really clear on who the intended beneficiary of these changes is. As far as I can tell it doesn't really help browser engineers to find and address problems in their engines, which I consider the main use case for wpt.fyi. Making the results look similar to interop-2022 seems explicitly like it should be a non-goal; that's a carefully curated set of tests which vendors have mutually agreed are useful for public consumption. The full set of web-platform-tests doesn't have such agreement, and I'm extremely reluctant to turn it into something that looks like an externally meaningful metric.

Inline comment on rfcs/test_scoring_change.md, on these lines:

> …this change. This only affects the total percentage that the test counts as passing.
>
> ### Display percentages to represent the fraction of tests passing.

A Contributor commented

I'm pretty solidly opposed to changing the display to be based on percentages. It has been a long term goal to avoid wpt.fyi becoming a primary source for non-browser developers trying to do cross-browser comparisons. It's not clear to me that percentages here are helpful to browser developers, but by making random wpt.fyi results pages look like Interop-202x scores we are encouraging third parties to make value judgements using test sets that haven't been subject to the extensive discussion/auditing of the Interop-selected tests. I'd expect that to affect whether people are willing to submit tests to web-platform-tests in the first place.

@DanielRyanSmith (Contributor, Author) replied

I definitely see the issue you're describing with percentages giving a false sense of normalized numbers to compare between browsers. That is not the intention of the change. This may be the most contentious change proposed, and finding a middle ground here would be a better option.

Inline comment on rfcs/test_scoring_change.md, on these lines:

> …proposed, although not perfect, is to mark these tests with harness errors as 0% passing, regardless of subtest results.
>
> _Note: Individual subtest results will still be visible normally even with…_

@jgraham (Contributor) commented May 30, 2022

I think the caveat here is important: this change doesn't seem to help browser developers at all; they still need to go down into the individual test and figure out "oh actually there are some passing subtests, the problem is that the test is slow running, so we time out" (or whatever). That suggests the change isn't actually very helpful.

@DanielRyanSmith (Contributor, Author) replied

It is true that this change isn't perfect, but hopefully still a net positive.

To use the example you mentioned, it could be the case that 7 subtests pass as expected, but a timeout occurs on the 8th subtest. There are instances where a test like this will be marked as 7/8 subtests passing (or 7/9, since the harness status is currently counted as a subtest), but the test could have many more subtests that were never run and are missing from the results. The current aggregation method can hide the scope of an issue like this, so the hope is that this change will bring attention to areas of improvement that could currently be overlooked.

That being said, I'm not certain about how common this scenario is. Perhaps delving into this further and having numbers would be of use.

@foolip (Member) commented May 30, 2022

@jgraham this change is related to (motivated by) these wpt.fyi issues, some very old, some new:

The intention is of course that it should be an improvement for users of wpt.fyi, and not regress any important use cases.

That last issue is about being able to browse (and sort) the Interop 2022 tests. The original suggestion was a fraction like "43.95 / 90", and that would actually work better when drilling down into a part of the test suite, since a percentage would be normalized at each level, while a test count would not.

One could split this all up into the following considerations:

  • Not counting the harness status as a subtest
  • Counting tests instead of subtests
  • The effect of a non-OK harness status on either subtest or test counts
  • Showing percentages instead of fractions (for either subtest or test counts)
  • Changing what the defaults are

Which parts of this would be fine and which need more discussion?

@jgraham (Contributor) commented May 31, 2022

I think that wanting the results on wpt.fyi to look like Interop 2022 scores is unnecessarily constraining the solution space. If we drop that requirement then there are more options to surface relevant information in a more usable way, without it having to look like a single numerical score.

For reftests I agree that it makes sense to inline the actual test result in the directory level view.

For tests with subtests, the limiting assumption seems to be that we have to combine both the overall harness status and the subtest scores into a single meaningful number. I don't think that's true. For example, for tests with both some passing subtests and a non-OK status we could display the result like "6/7 ⚠" with title text on the warning triangle to convey the problematic status. For directory-level views we could display the test count, the subtest pass rate, and also the warning triangle with details of any non-OK overall statuses. Maybe clicking the triangle could take you through to a view of just the problematic tests. From that thought, I wonder if we should actually be displaying the subtest failure count rather than the pass count, and making it possible to click through to a view only containing the failures (or tests containing failures at the directory level). Maybe that's too confusing, and I'm certainly not wedded to the failure-triangle suggestion; I'm all in favour of a better way to convey the same information. But in general, setting up the dashboard to be useful to developers looking for problems that need to be solved involves surfacing different information than setting it up as a source of metrics.
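A rough sketch of the "6/7 ⚠" cell idea described above (purely illustrative; the names and status values are assumptions, not the actual wpt.fyi rendering code):

```python
# Illustrative sketch of the "pass count plus warning" cell idea (not wpt.fyi code).

def cell_text(harness_status, subtest_results):
    """Return (cell label, hover title) for a test that has subtests."""
    passes = sum(1 for r in subtest_results if r == "PASS")
    label = f"{passes}/{len(subtest_results)}"
    title = ""
    if harness_status != "OK":
        label += " ⚠"                                  # flag the non-OK harness status
        title = f"Harness status: {harness_status}"    # shown as title text on hover
    return label, title

print(cell_text("TIMEOUT", ["PASS"] * 6 + ["FAIL"]))   # ('6/7 ⚠', 'Harness status: TIMEOUT')
```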

I also understand that there are requests to be able to reflect Interop metrics directly in the wpt.fyi display. Generally I think this kind of thing is a bit misguided; knowing that fixing case A will improve the score by 0.5% whereas B will improve it by 1% doesn't mean that B is going to be twice as useful as A; we just haven't calibrated the scores down to that level, and I don't think we could even if we tried. That's different from performance benchmarks, where a 1% win is pretty much twice as desirable as a 0.5% win.

Of course once you have a metric that's a proxy for some underlying goal (e.g. "allow developers to use without running into interop problems") people are going to want to optimise for the metric rather than the goal. I don't think we should encourage that, but to the extent that people disagree with me there, I think we should at least make sure that any view designed to support the metric is gated on only being available when displaying tests that are part of the metric. So anything specifically for surfacing Interop 2022 scores should be contingent on first having filtered down to one of the interop-2022 labels.

@mfreed7 commented May 31, 2022

> I think that wanting the results on wpt.fyi to look like Interop 2022 scores is unnecessarily constraining the solution space. If we drop that requirement then there are more options to surface relevant information in a more usable way, without it having to look like a single numerical score.

As a browser implementer who raised one of the issues leading to this change, I just wanted to chime in that I do find changes along these lines helpful. The old display improperly (in my opinion) weighted sub-test failures over test-file failures. My request (which I believe this change fixes) was to report fractional files-passing statistics, rather than subtest-passing statistics. I'm generally agnostic to the many different ways that can be achieved, but I do think it's an important change to make. This was motivated by Interop 2022, which, for the first time, set out a methodology for how to score a set of tests. While we don't need to be tied to that exact methodology, I think it is more valuable to use that approach than to stick to just counting sub-tests.

> I also understand that there are requests to be able to reflect Interop metrics directly in the wpt.fyi display. Generally I think this kind of thing is a bit misguided; knowing that fixing case A will improve the score by 0.5% whereas B will improve it by 1% doesn't mean that B is going to be twice as useful as A; we just haven't calibrated the scores down to that level, and I don't think we could even if we tried. That's different from performance benchmarks, where a 1% win is pretty much twice as desirable as a 0.5% win.

Yes, Goodhart's law applies. But Goodhart doesn't offer a better solution to the problem. We need to measure things to try to improve them, and to do that, we need a well-defined metric. Certainly the Interop2022 methodology isn't perfect, but it is very practically helpful. Compat2021 was fairly successful I think, using a similar methodology.

@DanielRyanSmith (Contributor, Author) commented

Hello everyone, I've drafted some changes to the view and have made it available to test in the staging environment.

To summarize:

  • Results are displayed as fractions similar to the original view, but are now aggregated by number of tests, and each test's pass value is calculated as the % of subtests that pass within that test (see the sketch at the end of this comment).
  • At the single-test view, cells will instead display subtest counts. These are shown in parentheses to differentiate them from test counts.
  • Tests with harness errors will display a warning sign next to their subtest count, and the cell is colored red, rather than just displaying the harness error.
  • Test/subtest counts have been removed from the test/directory name cell on the left.

Some views that have a new look as a result:

I think this looks like a promising direction, but criticisms are encouraged here. Thanks @jgraham for bringing up the suggestion for parts of this, and I think this is closer to your original suggestion if I'm not mistaken @mfreed7. I think we're moving closer to something with the most utility for users. 😎
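To make the first bullet above concrete, here is a minimal sketch of that aggregation (illustrative only; the status values and helper names are assumptions, not the actual wpt.fyi implementation):

```python
# Illustrative sketch (not wpt.fyi code): directory score aggregated by tests,
# where each test contributes the fraction of its subtests that pass.

def test_score(harness_status, subtest_results):
    """Score a single test between 0 and 1."""
    if not subtest_results:                              # e.g. reftest/crashtest
        return 1.0 if harness_status == "PASS" else 0.0
    passes = sum(1 for r in subtest_results if r == "PASS")
    return passes / len(subtest_results)

def directory_summary(tests):
    """Display string like '43.95 / 90' for a directory of (status, subtests) pairs."""
    scores = [test_score(status, subtests) for status, subtests in tests]
    return f"{sum(scores):.2f} / {len(scores)}"

tests = [
    ("OK", ["PASS"] * 8 + ["FAIL"] * 2),   # testharness test: 0.8
    ("PASS", []),                          # reftest that passes: 1.0
    ("TIMEOUT", ["PASS", "FAIL"]),         # 0.5; the harness error is surfaced separately
]
print(directory_summary(tests))            # 2.30 / 3
```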

@mfreed7 commented Jun 2, 2022

> I think this is closer to your original suggestion if I'm not mistaken @mfreed7. I think we're moving closer to something with the most utility for users. 😎

This definitely still solves my problem. And I think I actually like it better - there's more information available now. Not just the fraction of tests passing but also the total number, in the same place.

I do kind of think the parentheses might not be enough to distinguish these. The difference between "19 / 19" and "(10 / 10)" is pretty subtle, especially if you haven't been in this conversation. I wish I had a great alternative suggestion to make, but I don't. Perhaps it's too verbose, but what about "0.8/1 (8/10 subtests)" or something like that?

@zcorpan (Member) commented Jun 3, 2022

Is there a need to differentiate? The fact that it's a folder vs a file seems clear to me. You can click through to get detailed results of the subtests.

@foolip (Member) commented Jun 3, 2022

Thanks for the update @DanielRyanSmith!

I think it will be hard to distinguish between test and subtest fractions in a way that is clear without explanation. It's already pretty clear what's a directory and what's a test, but neither that nor the parentheses makes it obvious what the fractions mean without explanation.

Maybe we should have a "how to read these results" explanation somewhere? And/or we could have a tooltip that says "8 of 10 subtests pass"?

@foolip (Member) commented Jun 3, 2022

On harness errors, I'm not sure about the warning triangles. I would prefer to inline the status into the test listing, making it even more obvious that there's a problem. But this depends on how scores are aggregated, and it makes the most sense if we count a test as 0 if there's a harness error, which I think we should since the test didn't run to completion.

@DanielRyanSmith (Contributor, Author) commented

Hello everyone - I have been OOO and haven't been able to respond here.
I was hoping the wording on the blue banner would shed some light on how to read the subtest numbers:

[Screenshot of the blue banner wording]

@jgraham do you have any further opinions on this implementation? Soon I'll update the RFC proposal with more info and focus on the user we're targeting here (which has been the browser engineers).

@jgraham (Contributor) commented Jun 13, 2022

I had a discussion about this with @foolip, @gsnedders, and @past.

My conclusions:

  • I still think it's very misleading to make the entire dashboard look like an interop metric. Changes to display something that looks like Interop-202x scores should be gated on being in a mode that only shows the labelled tests which we've agreed are suitable for use in that kind of metric.
  • For all other cases, the fractional scores aren't adding any useful information, but are confusing.
  • I'm happy with not counting harness scores as part of the overall summary (for test types with subtests), but think we should bubble overall status information upwards in some way (e.g. the warning triangles). I think it's actively unhelpful to "score" those tests which run some subtests as 0; that's a clear example of where the current proposal makes a tradeoff in favour of something that looks like a score rather than something that actually helps browser engineers understand the state of the test.
  • For tests without subtests (reftest, crashtests), having seen the current state I think we should present the result in directory views as a score out of 1, rather than directly as PASS, FAIL, etc. Mixing the two kinds of presentation in a single view is visually noisy.
  • In terms of helping browser engineers, I think we'd get more mileage out of making some common filters easily accessible (e.g. "all tests in the current directory that don't pass in [browser]" or "[browser]-specific failures"), rather than putting more effort into tweaking the presentation of the existing data. Although it's currently possible to construct queries for those cases, it's not trivial, and so I suspect we're leaving a lot of value on the table.

@DanielRyanSmith (Contributor, Author) commented Jun 17, 2022

> I still think it's very misleading to make the entire dashboard look like an interop metric. Changes to display something that looks like Interop-202x scores should be gated on being in a mode that only shows the labelled tests which we've agreed are suitable for use in that kind of metric.
> For all other cases, the fractional scores aren't adding any useful information, but are confusing.

There seems to be a consensus that this new view would be more useful specifically for Interop scores, so I've been working to make that so.

> I'm happy with not counting harness scores as part of the overall summary (for test types with subtests), but think we should bubble overall status information upwards in some way (e.g. the warning triangles). I think it's actively unhelpful to "score" those tests which run some subtests as 0; that's a clear example of where the current proposal makes a tradeoff in favour of something that looks like a score rather than something that actually helps browser engineers understand the state of the test.

There also seems to be a consensus on not counting the harness status toward the subtest count and displaying harness errors above in some way. 👌 I don't think I explained this very well, but my rationale for marking tests with a harness error as 0 was less about mirroring the Interop scoring and more about increasing the visibility of possibly larger failures (a test marked as 0 will catch more eyes than one with a better pass rate because missing tests aren't weighted). However, making the harness error more visible above is a solid middle ground there.

I've kept in the warning triangles with title text currently.

> For tests without subtests (reftest, crashtests), having seen the current state I think we should present the result in directory views as a score out of 1, rather than directly as PASS, FAIL, etc. Mixing the two kinds of presentation in a single view is visually noisy.

I agree that the mixture of numbers and PASS/FAIL can look cluttered in some views where they show together. I've removed this from all views as of now.

> In terms of helping browser engineers, I think we'd get more mileage out of making some common filters easily accessible (e.g. "all tests in the current directory that don't pass in [browser]" or "[browser]-specific failures"), rather than putting more effort into tweaking the presentation of the existing data. Although it's currently possible to construct queries for those cases, it's not trivial, and so I suspect we're leaving a lot of value on the table.

Perhaps this is something to look into for the future if there is enough demand - I think it sounds like a valuable idea.


Here is a demonstration of the newest iteration. Our staging environment has some resource issues so an error message might occur, but here is an example of an Interop score with:

  • EDIT: These links are no longer relevant and have been removed. See this PR description for examples.

Adding the view param to the query string switches this view as desired. The subtest view will be the default for all views, but the links in the Interop 2022 page will direct to a different view (I've set them to use the percentage view because that is how they are displayed on the Interop 2022 page).

The main page should look mostly unchanged with the exception of removing the harness status from subtest aggregation and having harness error notices.

Thanks for the detailed thoughts @jgraham 😊 Maybe having these alternate hidden views and adding them where it makes sense is a workable approach.

@jgraham (Contributor) commented Jun 17, 2022

I haven't fully looked at this, but a quick comment: I'd really like views like the percentage view to be gated on having a filter that selects interop-* labels, i.e. if you don't have such a filter the view parameter should be ignored (alternatively the view parameter could apply a prefix to the specified filter so that only interop-* tests are visible). I know the generic filtering mechanism makes this tricky, but I think it's important that the policy is actually enforced in the code.
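A sketch of the kind of gating being asked for here (hypothetical parameter and label names; not the actual wpt.fyi query-handling code):

```python
# Illustrative sketch (not wpt.fyi code): only honor a score-style view when the
# query is already filtered down to an interop-* label.

def effective_view(requested_view, query_labels):
    """Fall back to the default subtest view unless an interop-* label is selected."""
    has_interop_filter = any(label.startswith("interop-") for label in query_labels)
    if requested_view == "interop" and has_interop_filter:
        return "interop"
    return "subtest"                        # default view; the view param is ignored

print(effective_view("interop", {"interop-2022-cascade"}))  # interop
print(effective_view("interop", {"experimental"}))          # subtest
```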

Inline comment on rfcs/test_scoring_change.md, on these lines:

> * This change will cause a net loss in the percentage of passing tests/subtests, as the harness status passing has been artificially inflating some subtest numbers.
> * This change will require a rework of the current process for aggregating and…

A Member commented

Is this the final conclusion, or can we use the search cache when there aren't pre-computed summary numbers in the new format?

@DanielRyanSmith (Contributor, Author) replied

I can reword this a bit - viewing runs using old vs. new summary files will not affect the scores being displayed in the Interop-20** views. However, there will be a discrepancy when comparing old and new results in the default view. See here for an example of results that are identical but were aggregated differently so their subtest counts look different (new aggregation on the left, old on the right).

There is a larger-than-expected scope of work for no longer counting the harness status in old summary files. The results view on wpt.fyi assumes there is ALWAYS a summary file available to use and bypasses any aggregation on simple queries (simple = results views with no search query in the search bar). As far as I can see, these are the options available:

  1. Remove old summary files and regenerate the summary files with the new format. This would take a very large amount of time and resources to run.
  2. When fetching a summary file server side, check if the summary file is an old format and don't use it if that is the case. Instead aggregate in the same way as is done with search queries.
  3. Accept that there will be small subtest count discrepancies when comparing new runs against older runs.

I think option 2 is promising (a rough sketch follows), but it could cause a small performance hit when viewing results from these old runs. However, this hit should lessen as we move further away from these old runs and compare against them less often.
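A rough sketch of what option 2 could look like (the version field and function names are assumptions, not the actual summary-file schema):

```python
# Illustrative sketch (not wpt.fyi code) of option 2: detect an old-format summary
# file and fall back to re-aggregating raw results instead of trusting its counts.

def load_summary(run_id, fetch_summary, aggregate_raw_results):
    """Use the pre-computed summary only if it is in the new format."""
    summary = fetch_summary(run_id)
    if summary.get("format_version", 1) >= 2:    # hypothetical version marker
        return summary                           # new format: harness status excluded
    # Old format: take the slower path and aggregate from raw results,
    # the same way search queries already do.
    return aggregate_raw_results(run_id)
```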

@DanielRyanSmith force-pushed the DanielRyanSmith/wpt.fyi-scoring-change branch from af7872e to 960bde5 on June 24, 2022
@DanielRyanSmith changed the title from "RFC 114: Change results aggregation and visualization on wpt.fyi" to "RFC 114: Change results aggregation on wpt.fyi and visualization on Interop-20** views" on Jun 24, 2022
@DanielRyanSmith (Contributor, Author) commented

I've updated the RFC, as well as the screenshots and links in the PR description (see these for examples of the current iteration). I believe I've responded to the concerns that have arisen at this point. Are there any more items anyone would like to raise?

@foolip (Member) commented Jun 29, 2022

If there is no more feedback on this by Friday I'll consider this RFC accepted and merge it. (Per our process.)

@gsnedders (Member) commented

> If there is no more feedback on this by Friday I'll consider this RFC accepted and merge it. (Per our process.)

I disagree that's our process; I think it's clear there was substantive disagreement from @jgraham's comments above (and I broadly agree with them).

@foolip (Member) commented Jul 2, 2022

@gsnedders the requested changes were made though, right? Do you have additional changes you'd like to see?

@foolip (Member) commented Jul 2, 2022

The part of the process I was referring to is this step:

> When no new information is forthcoming, a decision should be made within 1 additional week.

This isn't explicit about what happens if the person requesting a change doesn't get back to the RFC within a week. An out-of-band ping would be helpful, and I didn't do that here. I'll do that now.

@jgraham (Contributor) left a review comment

I think this is probably OK, but I would also appreciate @gsnedders confirming that it addresses their concerns.

@foolip (Member) commented Jul 5, 2022

@gsnedders do you want to review/approve as well, given #114 (comment)?

@foolip (Member) commented Jul 12, 2022

@gsnedders said in web-platform-tests/wpt#34752 (comment) they'd be out this week, so I'll go ahead and declare this RFC accepted due to a lack of further feedback. Sam, feel free to raise any issues when you're back, nothing is set in stone.

@foolip merged commit c815d13 into web-platform-tests:master on Jul 12, 2022