Remove harness status aggregation and display percentages on Interop-202X scores #2858

DanielRyanSmith · 2022-05-09T22:28:27Z

Description

Addresses #2825 and #1958

See RFC #114 for detailed proposal information.

This change adds a new display for how test results are shown on wpt.fyi's Interop-2022 label view. Instead of displaying the flat number of passing tests over the test total, the view will display this information as a percentage. Additionally, a new scoring method has been implemented that no longer counts the Harness status toward the subtest count, and that marks the test as a failure if the test's status is neither "OK" nor "PASS".

Here is a staging view of these changes in action. The summaries used on the runs in this link were generated with the new aggregation and display method.

Here is another staging view displaying similar Chrome results. The FIRST run shown on the left is displaying results with the new aggregation method. This is a way to visualize the effect on the totals that will display from older vs. newer aggregations.

Changes

"Harness Status" will no longer be displayed on subtest views that do not use TestHarness and has been replaced with "Test Status".
Interop-2022 results views will display a more accurate aggregation that reflects how the scores are calculated.
Tests that experience harness errors will display a warning with harness error as title text.
A new aggregation method has been added for creating and interpreting runs and run summary files. A test's Harness status will no longer be counted toward its subtest total. In addition, a test is marked as 0% passing if the Harness status is not "OK" (and the test's status is not "PASS"). The rationale is that these errors might prevent further subtests from running, which can hide the scope of the problem. Marking the test as 0% makes it as visible as possible that something has gone wrong.
NOTE: This will result in an overall drop of passing percentages in these scenarios:
- A test passes some subtests but has a non-"OK" Harness status. This passing percentage will drop to 0%.
- A test has subtest failures but an "OK" Harness status. This status will not be counted toward the passing percentage. e.g. A test with 1 subtest that fails and an "OK" Harness status will be marked as 0% passing rather than 50%.
Old summary files that were generated before this change will NOT take this new Harness status aggregation into account, so those results will display with the old aggregation method (which will likely show an artificially inflated percentage compared to newer summaries).
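
The difference between the two aggregation methods can be illustrated with a minimal sketch. This is not the code from results-processor/wptreport.py; the function names `old_counts` and `new_counts` and the tuple return shape are hypothetical, and the handling of tests without subtests is an assumption based on the description above.

```python
from typing import List, Tuple

# Test-level statuses treated as passing (assumption drawn from the description).
PASSING_TEST_STATUSES = {"OK", "PASS"}


def old_counts(status: str, subtests: List[str]) -> Tuple[int, int]:
    """Old method: the harness/test status is counted as one extra subtest."""
    passes = sum(1 for s in subtests if s == "PASS")
    if status in PASSING_TEST_STATUSES:
        passes += 1
    return passes, len(subtests) + 1


def new_counts(status: str, subtests: List[str]) -> Tuple[int, int]:
    """New method: the status is not counted, but a bad status zeroes the test."""
    if not subtests:
        # A test without subtests is scored on its own status (assumption).
        return (1 if status in PASSING_TEST_STATUSES else 0), 1
    if status not in PASSING_TEST_STATUSES:
        # Harness error: every subtest is treated as not passing.
        return 0, len(subtests)
    return sum(1 for s in subtests if s == "PASS"), len(subtests)


def percent(counts: Tuple[int, int]) -> float:
    """Convert (passes, total) into a passing percentage."""
    passes, total = counts
    return 100.0 * passes / total
```

With these definitions, a test with one failing subtest and an "OK" Harness status goes from `percent(old_counts("OK", ["FAIL"]))`, which is 50%, to `percent(new_counts("OK", ["FAIL"]))`, which is 0%, matching the example in the note above.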

Screenshots

Tests with harness errors display warnings with title text next to results

Interop-2022 results are viewed with an aggregation that is more indicative of actual interop-2022 scores

Harness status will no longer count toward subtest totals

The run in the left column is aggregated using the new method, compared to the old totals displayed in the right column. This change more accurately represents the scope of test failures.

DanielRyanSmith · 2022-05-13T22:09:53Z

@KyleJu The deployment CI run has been having issues with resources it seems. Everything seems to successfully deploy except the new results processor. Maybe you have seen this problem before?

DanielRyanSmith · 2022-05-17T21:42:06Z

~~I think it's probably best to separate this into two PRs - one with the visual changes to the UI and one implementing the scoring change. I'll separate them shortly.~~

Keeping these changes in a single PR for now and opening an RFC.

api/query/query.go

api/query/query_test.go

api/query/search_test.go

KyleJu · 2022-05-26T22:19:53Z

api/query/search.go

@@ -187,21 +187,45 @@ func prepareSearchResponse(filters *shared.QueryFilter, testRuns []shared.TestRu
 	// Dedup visited file names via a map of results.
 	resMap := make(map[string]shared.SearchResult)
 	for i, s := range summaries {


Could you also double-check what other queries could be affected? See https://github.com/web-platform-tests/wpt.fyi/tree/main/api/query#readme. I will double-check as well.

webapp/components/test-file-results.js

KyleJu · 2022-05-26T22:29:14Z

Errors from staging deployment (I will take a look):

ERROR: (gcloud.app.deploy) Error Response: [9] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2022-05-26T20:59:07.274Z601.fm.2: 
No sufficient free disk space left for your App Engine Flexible application.
Please increase your VM instances disk size in the resource settings in the
app.yaml file for your deployment and retry.
See https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#resource-settings
for how to set disk resource.

KyleJu · 2022-05-28T00:41:22Z

OK I cleaned up some resources and got CI to build. I have proposed a solution for this issue in #2867.

davidsgrogan · 2022-05-31T19:36:11Z

The description in this PR says it fixes #2825, but I don't think that's accurate. The request in 2825 is for "the numbers in the table are the actual scores used for Interop2022." AFAICT, this PR shows the % in the cells but does not take Interop2022 into account.

foolip · 2022-05-31T19:57:52Z

If combined with filtering by label, we get something that's similar:
https://scoring-change-dot-wptdashboard-staging.uk.r.appspot.com/results/?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned&q=label%3Ainterop-2022-forms

(If you see "Failed to fetch test runs" that's a resource issue on staging instances, you might have to imagine it working.)

But actually, this doesn't do a perfect job of making it clear where the biggest wins are. Starting at the top level and drilling down, the score is 0-100% at each level, so a few levels down it will still require some math to know how much the Interop 2022 score would increase by fixing all the tests in view.

The original suggestion of "43.95 / 90" in #2825 would give numbers that can be compared between subdirectories, since any 10 tests would contribute as much to the score. Although one still needs to compute (90 - 43.95) / number_of_tests to figure out the score improvement.
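
The arithmetic in this comment can be sketched as follows. This is only an illustration, assuming that the score of an Interop area is the mean of per-test passing fractions; the helper names and the area size used in the usage note are hypothetical.

```python
from typing import List, Tuple


def directory_score(test_fractions: List[float]) -> Tuple[float, int]:
    """Return (sum of per-test passing fractions, number of tests).

    This is the "43.95 / 90" style of display: the two numbers are
    comparable across subdirectories because each test has equal weight.
    """
    return sum(test_fractions), len(test_fractions)


def potential_gain(dir_sum: float, dir_total: int, area_total_tests: int) -> float:
    """Score gain (on a 0-1 scale) if every test in the directory passed.

    area_total_tests is the number of tests in the whole area being scored.
    """
    return (dir_total - dir_sum) / area_total_tests
```

For example, if a directory shows 43.95 / 90 and the whole area contained a hypothetical 900 tests, `potential_gain(43.95, 90, 900)` gives roughly 0.051, or about 5 percentage points of the area score.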

shared/models.go

webapp/views/wpt-results.js

results-processor/wptreport.py

jcscottiii · 2022-07-20T20:23:35Z

api/query/search_test.go

@@ -1,3 +1,4 @@
+//go:build small


Could you remove this line and the same one added to ~~api/query/search.go~~ (I meant api/query/query_test.go, sorry about that) please? This syntax is only valid for Go 1.17+: https://pkg.go.dev/go/build/constraint

DanielRyanSmith · 2022-07-20

Yes! This was automatically added by my editor - I'll remove it.

DanielRyanSmith force-pushed the scoring-change branch from 4c2b872 to 9004c18 Compare May 13, 2022 20:41

DanielRyanSmith requested review from KyleJu and foolip May 13, 2022 22:08

DanielRyanSmith marked this pull request as ready for review May 13, 2022 22:10

DanielRyanSmith force-pushed the scoring-change branch 3 times, most recently from b12f265 to 41c7ba2 Compare May 25, 2022 19:07

KyleJu reviewed May 26, 2022

DanielRyanSmith force-pushed the scoring-change branch 3 times, most recently from 11f07c4 to 2994e3c Compare May 27, 2022 19:29

KyleJu mentioned this pull request May 27, 2022

[CI] Processor staging deployment failure #2867

Closed

DanielRyanSmith added 12 commits June 16, 2022 17:27

display percents

06a8c34

update summary creation

f63e23e

update tests

2c3bc45

Keep test status in test summaries

cfc7c28

display harness error on cell

c780282

keep old diff view

5414298

show subtest counts next to test name

72ed300

changes suggested by @KyleJu

53e2db6

don't round to 0 or 100 for percents

95bee62

Display test fractions

d54512c

Add title text to warning

91ae4fb

new query string param 'views'

99d3ab3

DanielRyanSmith added 8 commits June 21, 2022 11:17

stop yelling about Missing

a923258

harness warnings can show on missing rows

17baf0f

Only show percent view for interop results

ed09412

handle web component error

01284ae

handle diffRun reference

2a0e187

code cleanup

6164b26

total row will display color for interop views

ce342db

Correctly sort based on view

7737760

DanielRyanSmith force-pushed the scoring-change branch from d9f4a62 to 7737760 Compare June 21, 2022 18:18

DanielRyanSmith added 4 commits June 22, 2022 12:00

single test shows "subtests" text

404c5c1

display test or subtest total text

166ef92

change view name to 'interop'

6bcec82

rename vars to match rfc

7c3e6d0

DanielRyanSmith requested a review from jgraham July 13, 2022 03:05

jcscottiii reviewed Jul 13, 2022

shared/models.go

jcscottiii reviewed Jul 13, 2022

webapp/views/wpt-results.js

jcscottiii reviewed Jul 13, 2022

results-processor/wptreport.py

DanielRyanSmith added 3 commits July 14, 2022 15:59

add changes suggested by @jcscottiii & @jgraham

9858346

remove log statement

259b4d5

update test

3fc0ef3

foolip approved these changes Jul 19, 2022

jcscottiii reviewed Jul 20, 2022

DanielRyanSmith added 3 commits July 20, 2022 13:39

Update search_test.go

b6938af

Add TODO for old summary format

fa782c9

remove unnecessary tag

8bc8856

jcscottiii approved these changes Jul 20, 2022

DanielRyanSmith mentioned this pull request Jul 20, 2022

Resource issues occur sporadically for staging redeployments #2913

Open

DanielRyanSmith merged commit 39e07b6 into main Jul 20, 2022

DanielRyanSmith deleted the scoring-change branch July 20, 2022 22:06

jugglinmike mentioned this pull request Dec 21, 2022

api/results documentation does not reflect current behaviour #2940

Closed
