Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove harness status aggregation and display percentages on Interop-202X scores #2858

Merged
merged 39 commits into from
Jul 20, 2022

Conversation

DanielRyanSmith
Copy link
Contributor

@DanielRyanSmith DanielRyanSmith commented May 9, 2022

Description

Addresses #2825 and #1958

See RFC #114 for detailed proposal information.

This change adds a new display for how test results are shown on wpt.fyi's Interop-2022 label view. Instead of displaying the flat number of tests that have passed over the test total, a percentage will display with this information. Additionally, a new scoring method has been implemented that no longer counts the Harness status toward the subtest count, and will mark the test as a failure if the test's status was not "OK" or "PASS".

Here is a staging view of these changes in action. The summaries used on the runs in this link were generated with the new aggregation and display method.

Here is another staging view displaying similar Chrome results. The FIRST run shown on the left is displaying results with the new aggregation method. This is a way to visualize the effect on the totals that will display from older vs. newer aggregations.

Changes

  • "Harness Status" will no longer be displayed on subtest views that do not use TestHarness and have been replaced with "Test Status".
  • Interop-2022 results views will display more accurate aggregation that reflects how the scores are calculated.ry next to the directory name.
  • Tests that experience harness errors will display a warning with harness error as title text.
  • A new aggregation method has been added for creating and interpreting runs and run summary files. A Harness status for a test will no longer be counted toward the subtest total of a test. In addition, a test is marked at 0% passing if the Harness status is not "OK" (and a test's status is not "PASS"). The rationale here is that these error might stop further subtest failures from running, which can hide the scope of the problem. Marking as 0% makes it as visible as possible that something has gone wrong.
    NOTE: This will result in an overall drop of passing percentages in these scenarios:
    • A test passes some subtests but has a non-"OK" Harness status. This passing percentage will drop to 0%.
    • A test has subtest failures but an "OK" Harness status. This status will not be counted toward the passing percentage. e.g. A test with 1 subtest that fails and an "OK" Harness status will be marked as 0% passing rather than 50%.
  • Old summary files that were generated before this change will NOT take this new Harness status aggregation change into account, and so those results will display with the old aggregation method. (which will likely have an artificially inflated percentage when compared to newer summaries.

Screenshots

Tests with harness errors display warnings with title text next to results

Screen Shot 2022-06-21 at 10 45 08 AM

Interop-2022 results are viewed with an aggregation that is more indicative of actual interop-2022 scores

Screen Shot 2022-06-21 at 10 52 19 AM

Harness status will no longer count toward subtest totals

The run in the left column is aggregated using the new method, compared to the old totals displayed in the right column. This change more accurately represents the scope of test failures.
Screen Shot 2022-06-21 at 10 45 58 AM

@DanielRyanSmith
Copy link
Contributor Author

@KyleJu The deployment CI run has been having issues with resources it seems. Everything seems to successfully deploy except the new results processor. Maybe you have seen this problem before?

@DanielRyanSmith DanielRyanSmith marked this pull request as ready for review May 13, 2022 22:10
@DanielRyanSmith
Copy link
Contributor Author

DanielRyanSmith commented May 17, 2022

I think it's probably best to separate this into two PRs - one with the visual changes to the UI and one implementing the scoring change. I'll separate them shortly.

Keeping these changes in a single PR for now and opening an RFC.

@DanielRyanSmith DanielRyanSmith force-pushed the scoring-change branch 3 times, most recently from b12f265 to 41c7ba2 Compare May 25, 2022 19:07
api/query/query.go Show resolved Hide resolved
api/query/query.go Outdated Show resolved Hide resolved
api/query/query.go Outdated Show resolved Hide resolved
api/query/query_test.go Outdated Show resolved Hide resolved
api/query/search_test.go Show resolved Hide resolved
@@ -187,21 +187,45 @@ func prepareSearchResponse(filters *shared.QueryFilter, testRuns []shared.TestRu
// Dedup visited file names via a map of results.
resMap := make(map[string]shared.SearchResult)
for i, s := range summaries {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also double check what other queries could be affected? https://github.com/web-platform-tests/wpt.fyi/tree/main/api/query#readme. I will double check as well

webapp/components/test-file-results.js Show resolved Hide resolved
@KyleJu
Copy link
Collaborator

KyleJu commented May 26, 2022

Errors from staging deployment (I will take a look):

ERROR: (gcloud.app.deploy) Error Response: [9] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2022-05-26T20:59:07.274Z601.fm.2: 
No sufficient free disk space left for your App Engine Flexible application.
Please increase your VM instances disk size in the resource settings in the
app.yaml file for your deployment and retry.
See https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#resource-settings
for how to set disk resource.

@KyleJu
Copy link
Collaborator

KyleJu commented May 28, 2022

Errors from staging deployment (I will take a look):

ERROR: (gcloud.app.deploy) Error Response: [9] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2022-05-26T20:59:07.274Z601.fm.2: 
No sufficient free disk space left for your App Engine Flexible application.
Please increase your VM instances disk size in the resource settings in the
app.yaml file for your deployment and retry.
See https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#resource-settings
for how to set disk resource.

OK I cleaned up some resources and got CI to build. I have proposed a solution for this issue in #2867.

@davidsgrogan
Copy link
Member

davidsgrogan commented May 31, 2022

The description in this PR says it fixes #2825, but I don't think that's accurate. The request in 2825 is for "the numbers in the table are the actual scores used for Interop2022." AFAICT, this PR shows the % in the cells but does not take Interop2022 into account.

@foolip
Copy link
Member

foolip commented May 31, 2022

If combined with filtering by label, we get something that's similar:
https://scoring-change-dot-wptdashboard-staging.uk.r.appspot.com/results/?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned&q=label%3Ainterop-2022-forms

(If you see "Failed to fetch test runs" that's a resource issue on staging instances, you might have to imagine it working.)

But actually, this doesn't do a perfect job of making it clear where the biggest wins are. Starting at the top level and drilling down, the score is 0-100% at each level, so a few levels down it will still require some math to know how much the Interop 2022 score would increase by fixing all the tests in view.

The original suggestion of "43.95 / 90" in #2825 would give numbers that can be compared between subdirectories, since any 10 tests would contribute as much to the score. Although one still needs to compute (90 - 43.95) / number_of_tests to figure out the score improvement.

@@ -1,3 +1,4 @@
//go:build small
Copy link
Collaborator

@jcscottiii jcscottiii Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this line and the same one added to api/query/search.go ( i meant api/query/query_test.go sorry about that) please? This syntax is only valid for go 1.17+ https://pkg.go.dev/go/build/constraint

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This was automatically added by my editor - I'll remove it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants