Finding regressions in diff view is too hard #411

Closed
opened this Issue Aug 8, 2018 · 7 comments

Contributor

foolip commented Aug 8, 2018

 The numbers in the 3rd column are the differences between the pass / total numbers in columns 1 and 2. Because the number of failures isn't counted, when both the total number of tests and the total number of passes have changed, it's difficult to see whether the number of failures has changed. The appropriate comparison would be of the number of failures, which is `(totalA - passA) - (totalB - passB)`. That's equal to `(totalA - totalB) - (passA - passB)`, so as it turns out one can compare the numerator and denominator in column 3 to check whether there are more failures, but that isn't possible to do at a glance. (#408 is a separate issue about the ordering, which also makes it tricky to skim the view for interesting things.)
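The identity above can be sketched as follows (a minimal Python illustration; the helper names are hypothetical and not part of the wpt.fyi codebase):

```python
# Illustrative sketch only; these helpers are not from wpt.fyi.

def failure_diff(pass_a, total_a, pass_b, total_b):
    """Change in failure count between run A and run B."""
    return (total_a - pass_a) - (total_b - pass_b)

def failure_diff_from_deltas(pass_delta, total_delta):
    """The same value, derived only from the column-3 deltas."""
    return total_delta - pass_delta

# Example: run A passes 90/100, run B passes 85/98.
# A has 10 failures, B has 13, so the diff is -3 (3 more failures in B).
print(failure_diff(90, 100, 85, 98))                # -3
print(failure_diff_from_deltas(90 - 85, 100 - 98))  # -3
```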
Contributor Author

foolip commented Aug 8, 2018

 Actually, that formula doesn't quite cut it either. There is a regression in https://staging.wpt.fyi/results/css/css-values?product=chrome[experimental]&product=chrome[stable]&diff=true (viewport-units-css2-001.html), but it wasn't possible to see that from the high-level numbers; I just happened to drill down there because the numbers for css were large and I wanted to see why.

jgraham commented Aug 8, 2018

 A comparison based on totals rather than actual differences seems rather problematic; finding that e.g. nightly is +8 passes vs. stable doesn't mean there aren't regressions if it's actually +10/−2, i.e. 10 tests started passing but two regressed. Separately, the view would be much clearer if it filtered the columns to only show directories/tests with changes.
Contributor Author

foolip commented Aug 8, 2018

 Yep, these are both changes @lukebjerring and I have discussed. This view is based on comparing summary numbers. It would be improved by comparing the pass/total numbers at the test level and aggregating the differences from tests that could be regressions. To make it entirely reliable, the comparison has to be done using the full results including subtests, but one can probably emulate that quite well using just the summary numbers in a different way.
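A test-level comparison along these lines could look like the sketch below, assuming a simple mapping from test name to (passes, total) per run (the data shape and function name are illustrative assumptions, not the actual wpt.fyi implementation):

```python
# Sketch of a test-level comparison; the run_a/run_b data shape
# (test name -> (passes, total)) is assumed for illustration.

def possible_regressions(run_a, run_b):
    """Tests whose failure count (total - passes) grew from run A to run B."""
    regressed = []
    for test, (pass_b, total_b) in sorted(run_b.items()):
        pass_a, total_a = run_a.get(test, (0, 0))
        if (total_b - pass_b) > (total_a - pass_a):
            regressed.append(test)
    return regressed

run_a = {"a.html": (2, 3), "b.html": (3, 3)}
run_b = {"a.html": (1, 3), "b.html": (3, 3)}
print(possible_regressions(run_a, run_b))  # ['a.html']
```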

foolip referenced this issue Aug 17, 2018

Open

Widespread timeouts in Edge 17 #563

Contributor Author

foolip commented Sep 9, 2018

 The problem with comparing totals became apparent here: https://staging.wpt.fyi/results/?product=firefox[experimental,buildbot]@8657903435&product=firefox[experimental,taskcluster]@8657903435&diff=true
For css/, the total change is +12 passing tests, but drilling down there are actually a bunch of regressions in css/selectors/ that were obscured by even more changes in the positive direction.
Contributor Author

foolip commented Oct 1, 2018

 Discussed with @lukebjerring today. A simple improvement here would probably be to use three numbers in the diff column: the total number of regressions, the total number of "progressions", and the diff in the number of subtests. That way there's always one number to look at when scanning for regressions, even if the regressions and "progressions" cancel out.
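The three-number diff column could be computed per directory along these lines (a hedged sketch assuming a test name → (passes, total) mapping per run; not the actual wpt.fyi code):

```python
# Hedged sketch of the proposed three-number diff column; the
# per-test (passes, total) mapping is an assumed data shape.

def diff_column(run_a, run_b):
    """Return (regressions, progressions, subtest count delta)."""
    regressions = progressions = subtest_delta = 0
    for test in set(run_a) | set(run_b):
        pass_a, total_a = run_a.get(test, (0, 0))
        pass_b, total_b = run_b.get(test, (0, 0))
        fail_a, fail_b = total_a - pass_a, total_b - pass_b
        if fail_b > fail_a:
            regressions += fail_b - fail_a   # newly failing subtests
        elif fail_a > fail_b:
            progressions += fail_a - fail_b  # newly passing subtests
        subtest_delta += total_b - total_a
    return regressions, progressions, subtest_delta

# Even though the 2 progressions outweigh the 1 regression, the
# regression count stays visible instead of cancelling out.
a = {"x.html": (2, 3), "y.html": (0, 2)}
b = {"x.html": (1, 3), "y.html": (2, 2), "z.html": (1, 1)}
print(diff_column(a, b))  # (1, 2, 1)
```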


Collaborator

lukebjerring commented Jan 8, 2019

 @foolip is this still considered open? Or resolved by the changes?
Contributor Author

foolip commented Jan 16, 2019

 @lukebjerring, the filtering has made it way way better. I'll open a new issue if I still struggle next time.