Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for fuzzy matching in reftests. #12187

Merged
merged 1 commit into from Mar 7, 2019

Conversation

@jgraham
Copy link
Contributor

commented Jul 25, 2018

This allows fuzzy matching in reftests in which a comparison can
succeed if the images are different within a specified tolerance. It
is useful in the case of antialiasing, and in other scenarios where
it's not possible to make an exact match in all cases.

Differences between tests are characterised by two values:

  • The maximum difference for any pixel on any color channel (in the
    range 0 to 255)

  • The maximum total number of differing pixels

The fuzziness can be supplied in two places, according to whether it's
a property of the test or of the implementation:

  • In the reftest itself, using a tag

  • In the expectation metadata file using a fuzzy: key that takes a
    list

The general format of the fuzziness specifier is

range = [digits, "-"], digits
fuzziness = [ url, "-" ], range, ";", range

The first range represents the maximum difference of any channel per
pixel and the second represents the total number of pixel
differences. So for example a specifier could be:

  • "10;300" - meaning a difference of up to 10 per color channel and up to
    300 pixels different in total (all ranges are inclusive).

  • "5-10;200-300" - meaning a maximum difference of between 5 and 10
    per color channel and between 200 and 300 pixels differing in total

The definition of url is a little different between the meta element
and the expecation metadata. In the first case the url is resolved
against the current file, and applies to any reference in the current
file with that name. So for example

would allow a fuzziness of up to 5 on a specific channel and up to 200
opixels different for comparisons involving the file containing the
meta element and option-1-ref.html.

In the case of expectation metadata, the metadata is always associated
with the root test, so urls are always resolved relative to that. In
the case as above where only a single URL is supplied, any reference
document with that URL will have the fuzziness applied for whatever
comparisons it's involved in e.g.

[test1.html]
fuzzy: option-1-ref.html:5;200

would apply the fuziness to any comparison involving option-1-ref.html
whilst running the set of reftests rooted on test1.html. To specify an
exact comparison for the fuzziness, one can also supply a full
reference pair e.g.

[test1.html]
fuzzy: subtest.html==option-1-ref.html:5;200

in which case the fuzziness would only apply to "match" comparison
involving subtest.html on the lhs and option-1-ref.html on the
rhs (both resolved relative to test1.html).


This change is Reviewable

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Jul 25, 2018

Seems to work, but missing testing etc. The PR is to get feedback on the general approach.

@dbaron

This comment has been minimized.

Copy link
Member

commented Aug 8, 2018

For what it's worth, I think the original design of fuzziness in Gecko was broken, because it was done in a way that never produced "unexpected pass" results to cause removal of no-longer-needed annotations. I started to fix this by introducing ranges a while ago, but many of the annotations haven't been fixed yet.

A concrete change that I would have done differently from the current Gecko design is that I would prefer that a number without an explicit range (say "10") mean the same as "10-10" rather than meaning the same as "0-10". In the Gecko reftest harness I had to do the latter, at least temporarily (although I'd like to change to the former after changing all the annotations), but I think the former (meaning "10-10") is a better default since it encourages annotations that are more likely to produce unexpected-pass results when the problem is fixed.

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Aug 8, 2018

@gsnedders you wanted to be mentioned here

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Aug 8, 2018

@dbaron What's the primary situation in which expected fuzziness is reduced, browser-side changes or test-side changes?

The reason I ask is that automatic sync typically requires us to update metadata in a conservative way i.e. if we have an expectation of 8-10 and get 6 we will probably update the metadata to be 6-10 (or maybe 5-10 or something), because we don't really have any way to automatiically differentiate between "this test changed in a way that makes the difference consistently smaller" and "this test changed in a way that increases the variance in possible results" and assuming the former results in frequent backouts and manual work and assuming the latter doesn't.

The proposal seems reasonable either way and it's not difficult to change of course, but it may not entirely solve the problem.

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Aug 8, 2018

Actually I guess it's an open question how we will deal with out-of-range fuzziness; maybe we will just mark the test as expected fail in that case any rely on a human to notice that actually the amount of fuzz decreased. But in that case the tradeoff is more clear; assuming 10 means 0-10 would mean that the sync would not set a bad test status on 8 but assuming it meant 10-10 would cause it to make the test EXPECTED-FAIL

tools/wptrunner/wptrunner/executors/base.py Outdated Show resolved Hide resolved
docs/_writing-tests/reftests.md Outdated Show resolved Hide resolved
docs/_writing-tests/reftests.md Show resolved Hide resolved
@dbaron

This comment has been minimized.

Copy link
Member

commented Aug 8, 2018

@dbaron What's the primary situation in which expected fuzziness is reduced, browser-side changes or test-side changes?

My concern is browser-side changes that fix the bugs covered by annotations.

I'd also note that part of the underlying problem here is that fuzzy matching is used for two different things:

  • tests that are not actually expected to match perfectly
  • browser bugs where the test only "fails just a little bit", and you want the test harness to ensure that it doesn't get worse

So to put it another way, my concern is about the second category: I want browser fixes to yield "unexpected pass" results so that we can remove the "only fails a little bit" annotations when the bugs are fixed, and so that we don't then regress back within that tolerance.

@jgraham jgraham force-pushed the reftest_fuzzy branch from e5b05df to dfb8dc5 Aug 29, 2018

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Aug 29, 2018

I changed this to treat non-ranges as exact matches rather than 0-value.

@jgraham jgraham force-pushed the reftest_fuzzy branch 2 times, most recently from 408c81a to 075c960 Aug 29, 2018

@jgraham jgraham force-pushed the reftest_fuzzy branch from 075c960 to d5f6c71 Sep 25, 2018

@Hexcles

This comment has been minimized.

Copy link
Member

commented Sep 28, 2018

cc @foolip FWIW, I'm looking into the general approach here and see how Blink infra could adapt to it.

@Hexcles

This comment has been minimized.

Copy link
Member

commented Oct 9, 2018

I carefully read through the explanation (not the implementation yet) and it looks good to me (apart from one question; see below). In Blink, we will probably need to implement this in our image differ, which isn't too hard. (Of course, there's also the long-term possibility/discussion of using (more pieces from) wptrunner).

The question: why do we have a range of maximum allowed differences? What does the lower bound of a maximum value mean?

@Hexcles

This comment has been minimized.

Copy link
Member

commented Oct 9, 2018

And I'd like to introduce Ned, @nedn . He's on the Chrome Core Automation team and will also be working on Blink layout test runner. Ned, this is the fuzzy matching mechanism for reftests currently being implemented in upstream WPT.

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Oct 9, 2018

For the per channel differences we have three channels, and the algorithm is to take the channel with the maximum difference, throw away the other two values and then compare to the range.

@Hexcles

This comment has been minimized.

Copy link
Member

commented Oct 9, 2018

So, if the channel with the maximum difference has a difference smaller than the lower bound of the range, then the test fails?

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Oct 9, 2018

Or greater than the upper bound, yes.

@Hexcles

This comment has been minimized.

Copy link
Member

commented Oct 9, 2018

Or greater than the upper bound, yes.

This is intuitive, but the lower bound isn't IMHO. Would it be better to use mismatch & an upper bound instead?

@jgraham jgraham force-pushed the reftest_fuzzy branch 3 times, most recently from a99bf95 to 0d10b3e Nov 8, 2018

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Nov 9, 2018

This now passes CI and has some somewhat more reasonable tests. In my mind the biggest open question is whether the syntax is OK or if it's too hard to understand. And if the latter, what it ought to look like. In any case it would be very helpful to get this landed soon since we know this helps with a major issue in result correctness.

@foolip

This comment has been minimized.

Copy link
Contributor

commented Nov 12, 2018

@Hexcles, will you take another review pass?

@jgraham jgraham force-pushed the reftest_fuzzy branch from 0d10b3e to 947cdf4 Nov 12, 2018

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Nov 12, 2018

I made the syntax a bit more friendly by allowing the arguments to be named.

@jgraham jgraham force-pushed the reftest_fuzzy branch from 947cdf4 to 6486992 Nov 12, 2018

@Hexcles
Copy link
Member

left a comment

A bit confused by the updated doc

docs/_writing-tests/reftests.md Show resolved Hide resolved
docs/_writing-tests/reftests.md Show resolved Hide resolved

@jgraham jgraham force-pushed the reftest_fuzzy branch from 6486992 to 9454b8e Nov 14, 2018

@jgraham

This comment has been minimized.

Copy link
Contributor Author

commented Nov 15, 2018

So, does anyone want to review the actual code changes? :)

@gsnedders gsnedders self-assigned this Dec 18, 2018

@foolip

This comment has been minimized.

Copy link
Contributor

commented Dec 20, 2018

This PR is blocked on the required "Travis CI - Pull Request" check after #14499. In order to trigger it, I will close and reopen this PR.

@foolip foolip closed this Dec 20, 2018

@foolip foolip reopened this Dec 20, 2018

@gsnedders

This comment has been minimized.

Copy link
Contributor

commented Jan 7, 2019

To note, the Marionette internal reftest runner needs https://bugzilla.mozilla.org/show_bug.cgi?id=1500158 for this, which is riding the trains for Firefox 65 (which should ship this month).

characterised by two parameters:

* A maximum difference in the per-channel color value for any pixel.
* A number of total pixels that may be different

This comment has been minimized.

Copy link
@gsnedders

gsnedders Jan 8, 2019

Contributor
Suggested change
* A number of total pixels that may be different
* A number of total pixels that may be different.
docs/_writing-tests/reftests.md Show resolved Hide resolved

@jgraham jgraham force-pushed the reftest_fuzzy branch from 9454b8e to dc87d4c Jan 15, 2019

@jgraham jgraham force-pushed the reftest_fuzzy branch from dc87d4c to 561172d Jan 28, 2019

@jgraham jgraham force-pushed the reftest_fuzzy branch 2 times, most recently from 4eb2271 to 3ae7005 Mar 6, 2019

Add support for fuzzy matching in reftests.
This allows fuzzy matching in reftests in which a comparison can
succeed if the images are different within a specified tolerance. It
is useful in the case of antialiasing, and in other scenarios where
it's not possible to make an exact match in all cases.

Differences between tests are characterised by two values:

* The maximum difference for any pixel on any color channel (in the
  range 0 to 255)

* The maximum total number of differing pixels

The fuzziness can be supplied in two places, according to whether it's
a property of the test or of the implementation:

* In the reftest itself, using a <meta name=fuzzy> tag

* In the expectation metadata file using a fuzzy: key that takes a
  list

The general format of the fuzziness specifier is

range = [name "="] [digits, "-"], digits
fuzziness = [ url, "-" ], range, ";", range
name = "maxDifference" | "totalPixels"

The first range represents the maximum difference of any channel per
pixel and the second represents the total number of pixel
differences. So for example a specifier could be:

* "maxDifference=10;totalPixels=300" - meaning a difference of exactly
10 per color channel and exactly 300 pixels different in total (all
ranges are inclusive).

* "5-10;200-300" - meaning a maximum difference of between 5 and 10
  per color channel and between 200 and 300 pixels differing in total

The definition of url is a little different between the meta element
and the expecation metadata. In the first case the url is resolved
against the current file, and applies to any reference in the current
file with that name. So for example

<meta name="fuzzy" content="option-1-ref.html:5;200">

would allow a fuzziness of up to 5 on a specific channel and up to 200
opixels different for comparisons involving the file containing the
meta element and option-1-ref.html.

In the case of expectation metadata, the metadata is always associated
with the root test, so urls are always resolved relative to that. In
the case as above where only a single URL is supplied, any reference
document with that URL will have the fuzziness applied for whatever
comparisons it's involved in e.g.

[test1.html]
  fuzzy: option-1-ref.html:5;200

would apply the fuziness to any comparison involving option-1-ref.html
whilst running the set of reftests rooted on test1.html. To specify an
exact comparison for the fuzziness, one can also supply a full
reference pair e.g.

[test1.html]
  fuzzy: subtest.html==option-1-ref.html:5;200

in which case the fuzziness would only apply to "match" comparison
involving subtest.html on the lhs and option-1-ref.html on the
rhs (both resolved relative to test1.html).

@jgraham jgraham force-pushed the reftest_fuzzy branch from 3ae7005 to f477376 Mar 6, 2019

@gsnedders gsnedders merged commit 1f570a6 into master Mar 7, 2019

10 of 18 checks passed

Azure Pipelines Build #20190306.98 has failed
Details
Azure Pipelines (infrastructure/ tests (macOS)) infrastructure/ tests (macOS) was canceled
Details
staging.wpt.fyi - chrome[experimental] Chrome results
Details
staging.wpt.fyi - firefox[experimental] Firefox results
Details
staging.wpt.fyi - safari[experimental] Safari results
Details
wpt.fyi - chrome[experimental] Chrome results
Details
wpt.fyi - firefox[experimental] Firefox results
Details
wpt.fyi - safari[experimental] Safari results
Details
Azure Pipelines (./wpt test-jobs) ./wpt test-jobs succeeded
Details
Azure Pipelines (affected tests (Safari Technology Preview)) affected tests (Safari Technology Preview) succeeded
Details
Azure Pipelines (affected tests without changes (Safari Technology Preview)) affected tests without changes (Safari Technology Preview) succeeded
Details
Azure Pipelines (tools/ unittests (macOS)) tools/ unittests (macOS) succeeded
Details
Azure Pipelines (tools/wpt/ tests (macOS)) tools/wpt/ tests (macOS) succeeded
Details
Azure Pipelines (tools/wptrunner/ unittests (macOS)) tools/wptrunner/ unittests (macOS) succeeded
Details
Azure Pipelines (wpt.fyi hook: safari-preview-affected-tests) wpt.fyi hook: safari-preview-affected-tests succeeded
Details
Azure Pipelines (wpt.fyi hook: safari-preview-affected-tests-without-changes) wpt.fyi hook: safari-preview-affected-tests-without-changes succeeded
Details
Taskcluster (pull_request) TaskGroup: success
Details
Travis CI - Pull Request Build Passed
Details

@gsnedders gsnedders deleted the reftest_fuzzy branch Mar 7, 2019

marcoscaceres added a commit that referenced this pull request Jul 23, 2019
Add support for fuzzy matching in reftests (#12187)
This allows fuzzy matching in reftests in which a comparison can
succeed if the images are different within a specified tolerance. It
is useful in the case of antialiasing, and in other scenarios where
it's not possible to make an exact match in all cases.

Differences between tests are characterised by two values:

* The maximum difference for any pixel on any color channel (in the
  range 0 to 255)

* The maximum total number of differing pixels

The fuzziness can be supplied in two places, according to whether it's
a property of the test or of the implementation:

* In the reftest itself, using a <meta name=fuzzy> tag

* In the expectation metadata file using a fuzzy: key that takes a
  list

The general format of the fuzziness specifier is

range = [name "="] [digits, "-"], digits
fuzziness = [ url, "-" ], range, ";", range
name = "maxDifference" | "totalPixels"

The first range represents the maximum difference of any channel per
pixel and the second represents the total number of pixel
differences. So for example a specifier could be:

* "maxDifference=10;totalPixels=300" - meaning a difference of exactly
10 per color channel and exactly 300 pixels different in total (all
ranges are inclusive).

* "5-10;200-300" - meaning a maximum difference of between 5 and 10
  per color channel and between 200 and 300 pixels differing in total

The definition of url is a little different between the meta element
and the expecation metadata. In the first case the url is resolved
against the current file, and applies to any reference in the current
file with that name. So for example

<meta name="fuzzy" content="option-1-ref.html:5;200">

would allow a fuzziness of up to 5 on a specific channel and up to 200
opixels different for comparisons involving the file containing the
meta element and option-1-ref.html.

In the case of expectation metadata, the metadata is always associated
with the root test, so urls are always resolved relative to that. In
the case as above where only a single URL is supplied, any reference
document with that URL will have the fuzziness applied for whatever
comparisons it's involved in e.g.

[test1.html]
  fuzzy: option-1-ref.html:5;200

would apply the fuziness to any comparison involving option-1-ref.html
whilst running the set of reftests rooted on test1.html. To specify an
exact comparison for the fuzziness, one can also supply a full
reference pair e.g.

[test1.html]
  fuzzy: subtest.html==option-1-ref.html:5;200

in which case the fuzziness would only apply to "match" comparison
involving subtest.html on the lhs and option-1-ref.html on the
rhs (both resolved relative to test1.html).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.