Finding regressions in diff view is too hard #411

Closed
foolip opened this Issue Aug 8, 2018 · 7 comments

3 participants
foolip commented Aug 8, 2018

https://staging.wpt.fyi/results/?product=chrome%5Bexperimental%5D&product=chrome%5Bstable%5D&diff shows this:
[Screenshot of the diff view]

The numbers in the 3rd column are the differences between the pass / total numbers in columns 1 and 2.

Because the number of failures isn't shown, it's difficult to see whether it has changed when both the total number of tests and the number of passes have changed. The appropriate comparison is the difference in the number of failures, which is (totalA - passA) - (totalB - passB). That equals (totalA - totalB) - (passA - passB), so as it turns out one can subtract the numerator from the denominator in column 3 to check if there are more failures, but that is not possible to do at a glance.
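
To make the arithmetic concrete, here is a rough sketch in Python. It is not wpt.fyi's actual code; the function names and the numbers are made up for illustration.

```python
def failure_delta(pass_a: int, total_a: int, pass_b: int, total_b: int) -> int:
    """Return the number of failures in run A minus the failures in run B."""
    failures_a = total_a - pass_a
    failures_b = total_b - pass_b
    return failures_a - failures_b


def failure_delta_from_diff_column(pass_diff: int, total_diff: int) -> int:
    """Same quantity, computed only from the diff column (A minus B)."""
    return total_diff - pass_diff


# Hypothetical numbers: run A passes 90 of 100, run B passes 80 of 85.
# The diff column would read +10 / +15, yet A has 10 failures and B has 5.
direct = failure_delta(90, 100, 80, 85)
from_column = failure_delta_from_diff_column(90 - 80, 100 - 85)
print(direct, from_column)
```

Both calls give the same value, which is the point of the identity: a positive result means more failures in run A, even though its pass count went up.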

(#408 is a separate issue about the order, which also makes it tricky to skim the view for interesting things.)

foolip commented Aug 8, 2018

Actually, that formula doesn't quite cut it either. There is a regression in https://staging.wpt.fyi/results/css/css-values?product=chrome[experimental]&product=chrome[stable]&diff=true (viewport-units-css2-001.html), but it wasn't possible to see that from the high-level numbers. I just happened to drill down there because the numbers for css were large and I wanted to see why.

jgraham commented Aug 8, 2018

A comparison based on totals rather than actual differences seems rather problematic; finding that e.g. nightly is +8 passes vs stable doesn't mean there aren't regressions if it's actually +10 -2 i.e. 10 tests started passing, but two regressed.

Separately, the view would be much clearer if it filtered the columns to only show directories/tests with changes.
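
A minimal sketch of both points, in Python and under assumptions: each run is taken to be a plain mapping from test path to a status string, and anything other than "PASS" counts as not passing. The function name and test paths are hypothetical, not part of wpt.fyi.

```python
from typing import Dict


def changed_tests(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, str]:
    """Map each test that flipped between passing and not passing to
    'regression' or 'progression'. Unchanged tests are left out, which
    doubles as the "only show changes" filter."""
    changes = {}
    for test in sorted(before.keys() & after.keys()):
        was_pass = before[test] == "PASS"
        is_pass = after[test] == "PASS"
        if was_pass and not is_pass:
            changes[test] = "regression"
        elif is_pass and not was_pass:
            changes[test] = "progression"
    return changes


# Hypothetical data: 10 tests start passing, 2 regress, 5 stay the same.
stable = {f"/fixed/{i}.html": "FAIL" for i in range(10)}
stable.update({f"/broken/{i}.html": "PASS" for i in range(2)})
stable.update({f"/same/{i}.html": "PASS" for i in range(5)})
nightly = {f"/fixed/{i}.html": "PASS" for i in range(10)}
nightly.update({f"/broken/{i}.html": "FAIL" for i in range(2)})
nightly.update({f"/same/{i}.html": "PASS" for i in range(5)})

changes = changed_tests(stable, nightly)
regressions = [t for t, kind in changes.items() if kind == "regression"]
progressions = [t for t, kind in changes.items() if kind == "progression"]
net = len(progressions) - len(regressions)
print(f"+{len(progressions)} -{len(regressions)} (net {net:+d})")
```

The net figure alone would read as a pure improvement; the per-test comparison keeps the two regressions visible.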

foolip commented Aug 8, 2018

Yep, these are both changes @lukebjerring and I have discussed. This view is based on comparing summary numbers. It would be improved by comparing the pass/total numbers on the test level and aggregating the differences from tests that could be regressions.

To make it entirely reliable the comparison has to be done using the full results including subtests, but one can probably emulate that quite well using just the summary numbers in a different way.
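
One way the summary-only emulation might look, sketched in Python. The heuristic here is an assumption, not something settled in this thread: a test "could be a regression" if its pass count dropped or its failure count rose. All names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Summary = Tuple[int, int]  # (passing subtests, total subtests) for one test


def could_be_regression(before: Summary, after: Summary) -> bool:
    """Assumed heuristic: fewer passes or more failures than before."""
    pass_before, total_before = before
    pass_after, total_after = after
    fewer_passes = pass_after < pass_before
    more_failures = (total_after - pass_after) > (total_before - pass_before)
    return fewer_passes or more_failures


def suspicious_tests(before: Dict[str, Summary], after: Dict[str, Summary]) -> List[str]:
    """Tests present in both runs whose summaries hint at a regression."""
    shared = before.keys() & after.keys()
    return sorted(t for t in shared if could_be_regression(before[t], after[t]))


# Hypothetical directory: a.html gains 10 passes, b.html loses 2.
before = {"/dir/a.html": (0, 10), "/dir/b.html": (4, 4)}
after = {"/dir/a.html": (10, 10), "/dir/b.html": (2, 4)}

dir_pass_diff = sum(p for p, _ in after.values()) - sum(p for p, _ in before.values())
flagged = suspicious_tests(before, after)
print(dir_pass_diff, flagged)
```

At the directory level this looks like a plain gain in passes, while the per-test check still flags b.html. The sketch cannot catch a test where one subtest regressed and another started passing, since its summary is unchanged; that case needs the full subtest results.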

foolip commented Sep 9, 2018

The problem with comparing totals became apparent here:
https://staging.wpt.fyi/results/?product=firefox[experimental,buildbot]@8657903435&product=firefox[experimental,taskcluster]@8657903435&diff=true

For css/, the total change is +12 passing tests, but drilling down there are actually a bunch of regressions in css/selectors/ that were obscured by even more changes in the positive direction.

foolip commented Oct 1, 2018

Discussed with @lukebjerring today. A simple improvement here would probably be to use three numbers in the diff column: total number of regressions, total number of "progressions" and diff of number of subtests.

That would mean there's always one number you can check for regressions, even if the regressions and "progressions" cancel out.
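
A sketch of how those three numbers could be computed, in Python. It assumes full results are available as a mapping from test path to subtest name to status, and that only subtests present in both runs can count as regressions or "progressions". Both assumptions, and all names, are illustrative rather than how wpt.fyi does it.

```python
from typing import Dict, Tuple

Results = Dict[str, Dict[str, str]]  # test path -> subtest name -> status


def diff_column(before: Results, after: Results) -> Tuple[int, int, int]:
    """Return (regressions, progressions, change in number of subtests)."""
    regressions = 0
    progressions = 0
    for test in before.keys() & after.keys():
        old, new = before[test], after[test]
        # Assumption: added or removed subtests only affect the third number.
        for subtest in old.keys() & new.keys():
            was_pass = old[subtest] == "PASS"
            is_pass = new[subtest] == "PASS"
            if was_pass and not is_pass:
                regressions += 1
            elif is_pass and not was_pass:
                progressions += 1
    count_before = sum(len(subtests) for subtests in before.values())
    count_after = sum(len(subtests) for subtests in after.values())
    return regressions, progressions, count_after - count_before


# Hypothetical results: one subtest regresses, one starts passing, one is new.
before = {
    "/css/a.html": {"s1": "PASS", "s2": "FAIL"},
    "/css/b.html": {"s1": "PASS"},
}
after = {
    "/css/a.html": {"s1": "FAIL", "s2": "PASS", "s3": "PASS"},
    "/css/b.html": {"s1": "PASS"},
}

result = diff_column(before, after)
print(result)
```

In this example the pass/total summary goes from 2/3 to 3/4, so the failure count is unchanged, but the first number still shows one regression.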

lukebjerring commented Jan 8, 2019

@foolip is this still considered open? Or resolved by the changes?

foolip commented Jan 16, 2019

@lukebjerring, the filtering has made it way way better. I'll open a new issue if I still struggle next time.

@foolip foolip closed this Jan 16, 2019
