Show expected and actual renderings for reftests #57

Closed · foolip opened this issue Apr 17, 2018 · 24 comments
Labels: enhancement (New feature or request)


foolip commented Apr 17, 2018

This is the frontend part of web-platform-tests/results-collection#217. If we record the screenshots in the runners and store them somewhere, then the frontend would also need to show them, and possibly diffs between them.


foolip commented Apr 17, 2018

Renaming to also track what the receiver (@Hexcles) would have to do.

(This issue is quite aspirational at this point; it'd be a lot of work, but also very valuable if we could pull it off.)


foolip commented May 15, 2018

@mrego, you asked in https://chromium-review.googlesource.com/c/chromium/src/+/1054770#message-b3087aea32526011c2a83a3a094013263e2aedfb: "Where I can see the screenshots from wpt.fyi?"

The answer is that you can't and this is the issue for it. Could you describe what it is you were trying to do and how much time (if any) you ended up spending to get to the same answer without having the screenshots? Is it only for failed tests that you wanted to see screenshots, or passing ones too?


mrego commented May 15, 2018

Ok, good to know they're nowhere, so it's expected I couldn't find them. 😄

I needed to see the output of a WPT test on Mac, so I ended up using an actual Mac to check it.
In this case it was a test passing on Mac and failing on the rest of the platforms, but I admit that looking for screenshots of passing tests doesn't seem like the most common thing.
Just my two cents.


foolip commented May 15, 2018

Thanks @mrego. It's interesting that you wanted to see why a test was passing, as it seems like it might be easier/faster to only record failure cases. But I think that many times when you have some browser passing and some failing, looking at the passing ones will tell you something interesting.


mrego commented May 15, 2018

I agree it's probably the strange case, but the test was failing in all browsers and passing only in Safari on Mac. So I wanted to see a screenshot of what was going on there.


foolip commented May 15, 2018

I'm not sure it is a strange case; I'll keep asking people who ask about screenshots to see if they're normally interested in the passing or the failing cases.


jgraham commented Aug 8, 2018

For gecko we only record cases where the result is "unexpected". For wpt.fyi that would always correspond to not passing, since the expectations are not set and we default to assuming all tests pass. The way this works is that the logger records a base64-encoded PNG for the last two screenshots it took (typically the test and the ref; things are more complex with chained references). That can be dumped to the log. So for example this log shows a failure. That data is interpreted by the reftest analyzer (links may break after a while as logs are expired).

For wpt.fyi the main challenge is that we have many, many failures per run. Dumping every failing screenshot pair to a log would be somewhere on the order of 100 MB of data per browser per run, which is not untenable but isn't small either. Then wpt.fyi could either directly store the log links, or import the screenshots from the logs somehow and provide links to the reftest analyzer (possibly forked).

A more efficient approach in terms of storage would be to publish some kind of filter matching known test+screenshot ids so that we would only store new screenshots explicitly and would otherwise just record the id for the screenshot pair. But that itself may be somewhat complex.


foolip commented Aug 8, 2018

IRC between me and @jgraham starts at https://w3.logbot.info/testing/20180808#c430986

There I suggest that we'd want to get the screenshots into wpt.fyi so that old links keep working, although it isn't strictly speaking necessary to keep all screenshots forever. However, I suspect that if we deduplicate, perhaps by hashing the decoded size+RGBA data, that the size of the screenshot bucket wouldn't grow very fast. (Even flaky tests won't have very many different renderings, except perhaps animation tests.)
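
To illustrate that deduplication idea, here's a minimal sketch, assuming Pillow is available (the function name is made up):

```python
# Sketch: hash the decoded pixel data (size + raw RGBA bytes) rather
# than the PNG file, so that two encoders producing different PNG bytes
# for the same rendering still collide. Pillow is an assumed dependency.
import hashlib
import io

from PIL import Image


def rendering_hash(png_bytes):
    image = Image.open(io.BytesIO(png_bytes)).convert("RGBA")
    h = hashlib.sha1()
    h.update(("%dx%d" % (image.width, image.height)).encode("ascii"))
    h.update(image.tobytes())  # raw RGBA pixel data
    return h.hexdigest()
```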


foolip commented Oct 12, 2018

Yesterday, I sent reports of WPT failures to webkit-dev, dev-platform, and blink-dev. With that, @jgraham noted the obvious need to look at screenshots to understand failures.

First, three goals that lead to some complexity:

  • Possible to deploy on Taskcluster or other CI where we store no secrets, but where the results are extracted from the logs and submitted to the results receiver. (@Hexcles)
  • Don't accidentally build free image hosting.
  • Make it feasible to eventually store all screenshots, not just failures.

Here's my proposal for how to achieve this:

  • Add an /api/screenshot POST endpoint to wpt.fyi where anyone can upload PNG images unauthenticated. You get back a new unique token every time, and it's not possible to get the data back.
  • Make ./wpt run use that endpoint either during execution or at the end, and include the screenshot tokens in the wpt_report.json output.
  • When the results receiver gets the wpt_report.json, it uses the token to get the screenshots.

This is probably enough to store just failure screenshots. As an optimization, which should also make it possible to eventually store all screenshots (a client-side sketch follows this list):

  • Add a /api/screenshot/dowehaveit API which takes a list of image checksums (not tokens) and tells you which ones are not already in storage.
  • ./wpt run uses this API to avoid uploading screenshots which are already in storage.
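
To make the flow concrete, here's a client-side sketch. The endpoint names come from the proposal above, but the request and response shapes are assumptions, not a settled API:

```python
# Hypothetical client flow for the proposal above; the JSON shapes
# used here are assumptions.
import requests

WPT_FYI = "https://wpt.fyi"


def upload_screenshots(screenshots):
    """screenshots: dict mapping checksum -> PNG bytes.

    Returns a dict mapping checksum -> upload token."""
    # Ask which checksums are not already in storage (assumed shape).
    resp = requests.post(WPT_FYI + "/api/screenshot/dowehaveit",
                         json={"checksums": sorted(screenshots)})
    missing = resp.json()["missing"]  # assumed response field

    tokens = {}
    for checksum in missing:
        # Unauthenticated upload; a fresh unique token comes back each time.
        r = requests.post(WPT_FYI + "/api/screenshot",
                          headers={"Content-Type": "image/png"},
                          data=screenshots[checksum])
        tokens[checksum] = r.json()["token"]  # assumed response field
    return tokens
```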

@lukebjerring, can we build it? :)


foolip commented Oct 12, 2018

@jgraham points out on IRC that we can simplify this to just use checksums of the images, by not making it possible to download images by SHA, or at least not until those SHAs have passed through the results collector once.

I imagine that @Hexcles also has ideas here.

gsnedders commented:

> However, I suspect that if we deduplicate, perhaps by hashing the decoded size+RGBA data, that the size of the screenshot bucket wouldn't grow very fast. (Even flaky tests won't have very many different renderings, except perhaps animation tests.)

SPARTAN used to store all screenshots ever seen, deduplicating based on the entire file (i.e., including any metadata in the image), and that ultimately ended up with a few hundred megabytes of screenshots IIRC. Note that some browsers are pretty poor at compressing PNGs, and even relatively computationally cheap compression tools can massively reduce the file size (of our often largely white images).
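
For illustration, a computationally cheap re-encode with Pillow's PNG optimizer is often enough for largely white screenshots (a sketch; dedicated tools like optipng compress harder at more CPU cost):

```python
# Sketch: re-encode a browser-emitted PNG so zlib tries harder.
import io

from PIL import Image


def recompress_png(png_bytes):
    out = io.BytesIO()
    Image.open(io.BytesIO(png_bytes)).save(out, format="PNG", optimize=True)
    return out.getvalue()
```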

> When the results receiver gets the wpt_report.json, it uses the token to get the screenshots.

Is the results receiver authenticated? (It is, right?) If not, can't anyone just upload arbitrary images by uploading the "screenshot" and then uploading a bogus result set?

> Add a /api/screenshot/dowehaveit API which takes a list of image checksums (not tokens) and tells you which ones are not already in storage.

@jgraham's suggestion of using a client-side Bloom filter seems sensible, but let's actually think about this for a second: a Bloom filter can only tell us that a screenshot is definitely not in the set. In the common case, we'll have almost no new screenshots, so almost everything will be a "maybe in the set". We can unconditionally upload things that aren't in the set, but we still need to check everything that might be in the set. So using a Bloom filter seems to give very little gain here.

The list of image checksums is still likely to be pretty big (and pretty incompressible), which is awkward.

IIRC, SPARTAN provided the last n checksums along with the test allocation, so in the common case of a screenshot not changing from one run to the next, you don't need to go back to the server to check whether it has previously been seen. Simply exposing the last run's checksums would probably work as a decent optimisation here, as it means the majority of hashes can be excluded quickly.


jgraham commented Oct 12, 2018

Yes, I agree that the Bloom filter has the opposite semantics to the ones we want. But maybe we shouldn't have anything on the client side: we upload a list of hashes we have and get back a list of hashes to upload. That seems cheaper than keeping any client-side list up to date, and can be handled by whatever technique on the server side (a database, for example).
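
The server side of that exchange can be as simple as a set difference against whatever store holds the known hashes; a sketch, where `known_hashes` stands in for a real database lookup:

```python
# Sketch: the client posts the hashes it has; the server answers with
# the ones it hasn't seen, which are the ones to upload.
def hashes_to_upload(client_hashes, known_hashes):
    return [h for h in client_hashes if h not in known_hashes]
```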

In terms of compression I don't know how much we should worry; hopefully the screenshots don't change that often (although I guess that anti-aliasing etc. might disagree with that). If the typical number of screenshots to upload on a run is in the single digits, trying to save some percentage of the bandwidth by recompressing the images isn't that important. But if we do want to do that, I think Pillow will be a dependency for fuzzy reftests anyway, so we could recompress all the images. But it would have to be all of them, not just the ones we upload, so I'm a little concerned about the cost of doing that.


foolip commented Oct 16, 2018

The need for this when trying to understand reftest failures, while perhaps already obvious, became very clear on the blink-dev thread, where I had to run the tests again locally and use reftest-analyzer. It was actually pretty cool, and we should definitely take inspiration from it:
[image: reftest-analyzer showing the test and reference renderings]

gsnedders commented:

Talking to @jgraham, it sounded like his envisioned approach requires some changes to wptrunner:

  • If the results collector is going to continue to use the JSON format, we need to output screenshots in the JSON output (or store them somewhere with paths in the JSON file).
  • Some way of filtering which screenshots get output (though that sounds contrary to what he concluded above about client-side filtering?).

I feel like he listed more last week and I've forgotten.


foolip commented Jan 15, 2019

Some filtering seems necessary if we're also to get screenshots of passing tests, which is useful when some browsers are passing and some are failing the same test.


jgraham commented Jan 15, 2019

Screenshots of passing tests should be out of scope for the initial work; they are a significant performance concern for the runner, since actually generating the screenshot as a PNG is quite a lot of work that's often unnecessary. In addition, screenshots of passing tests are usually not that helpful; they're only interesting if the test happens to pass in a bogus way (like rendering nothing), and that's usually easy to determine just by loading the test in the browser.

I don't think my current thinking is very different to the above:

  • We generate screenshots for failing tests
  • We pass a list of hashes to wpt.fyi and it returns a list of hashes that it doesn't already have (this is a new endpoint)
  • We add the hashes of failing tests to the wptreport.json data, and add a top-level key with a list of screenshots (which can then be hashed on the server to fill in missing data); a sketch of the resulting report shape follows this list.
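
Sketched concretely, the augmented wptreport.json data might look like this (key names are assumptions, apart from "screenshots", which matches the commit message quoted later in this thread):

```python
# Hypothetical shape of the augmented report, written as a Python dict.
report = {
    "results": [
        {
            "test": "/css/example/test.html",
            "status": "FAIL",
            # Per-test hashes of the screenshots taken (test and ref).
            "screenshots": {
                "/css/example/test.html": "sha1:0123abcd",
                "/css/example/test-ref.html": "sha1:4567cdef",
            },
        },
    ],
    # Top-level list of the actual images the server said it was
    # missing; the server hashes these to fill in the data.
    "screenshots": ["data:image/png;base64,iVBORw0KGgo="],
}
```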

The main issue with this approach is that we have to store all the screenshots somehow until the end of the test run. But the alternative where we prefilter means that we have to get a list of every hash the server has ever seen. The more efficient alternative requires us to accept some probability of not uploading a screenshot that is actually new.


foolip commented Jan 15, 2019

I think we should leave it up to the people who design and implement this to recommend whether to include passing tests. For executors other than ExecutorMarionette there's already PNG encoding involved, and whether sending the screenshots for passing tests is practical depends on the choice of where/when to check if a screenshot has already been uploaded.


jgraham commented Jan 15, 2019

What is this thread for other than discussing requirements and constraints? I think that "leave it up to the people implementing it to decide" sounds reasonable but is actually dangerously close to "we will make the actual decisions in internal meetings, not in public".

Of course if there's some clever way to make passing screenshots work without much extra effort, that would be nice. But I still think that the utility of that is relatively low compared to seeing what's happening when things fail and we shouldn't regard it as a requirement for this work.


foolip commented Jan 15, 2019

I don't know who will be working on this yet, but I expect it will end up in https://github.com/web-platform-tests/rfcs for discussion with the design evolving in public with you as the main reviewer. I don't know what the best trade-off is, perhaps it is obvious with your knowledge of certain internals, but plainly declaring it out of scope is what I reacted to.

lukebjerring commented:

A design doc has been created to begin capturing requirements/constraints/options, which might be a more appropriate place to dump details, and discuss tradeoffs.
https://docs.google.com/document/d/1IhZa4mrjK1msUMhtamKwKJ_HhXD-nqh_4-BcPWM6soQ/edit#

As noted in the doc, there's a few deliberately separate areas of work:

  • Either the ability to upload a recorded screenshot (wptrunner), or the ability to scrape a recorded screenshot (wpt.fyi)
  • UI for viewing/comparing said screenshots (wpt.fyi)
  • Filtering known items / deduplication optimization (wptrunner + wpt.fyi)

"Out of scope" or "defer til later", I agree that we can't just turn turn on screenshots for everything without the last item being completed (and, not even necessarily with it completed).


foolip commented Jan 16, 2019

@lukebjerring @Hexcles, cool, I didn't know design work had started :D

I'm not able to comment on the doc, can you fiddle with permissions?

gsnedders commented:

@lukebjerring @Hexcles I can't even see the doc. This ain't great as a design doc for a change to an openly run project…


Hexcles commented Jan 16, 2019

I apologize on behalf of Luke. He clicked a wrong button in the permission dialog. The link should be working now.

(The doc is barely started and only has a skeleton. I plan to flesh it out in the next couple days.)

Hexcles added a commit to web-platform-tests/wpt that referenced this issue Feb 27, 2019
This is working towards
web-platform-tests/wpt.fyi#57 (displaying
reftest screenshots on wpt.fyi).

This change includes the hashes of screenshots (failing reftests only
for now, but it can be easily extended to include all screenshots
later) in a new field in test results:

test:
  status:
  ...
  screenshots:
    url_of_test: sha1:checksum
    url_of_ref1: sha1:checksum
    ...

(Because of an implementation detail, we currently only include the last
two screenshots taken during a complex reftest.)

Another important behaviour change is that internally we are now hashing
the image *after* base64-decoding it, but this change should not affect
any existing infra since checksums were not externally visible.
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Apr 1, 2019
…report JSON along with a new formatter for screenshot data URIs, a=testonly

Automatic update from web-platform-tests
[wptrunner] Include screenshot hashes in JSON

This is working towards
web-platform-tests/wpt.fyi#57 (displaying
reftest screenshots on wpt.fyi).

This change includes the hashes of screenshots (failing reftests only
for now, but it can be easily extended to include all screenshots
later) in a new field in test results:

test:
  status:
  ...
  screenshots:
    url_of_test: sha1:checksum
    url_of_ref1: sha1:checksum
    ...

(Because of an implementation detail, we currently only include the last
two screenshots taken during a complex reftest.)

Another important behaviour change is that internally we are now hashing
the image *after* base64-decoding it, but this change should not affect
any existing infra since checksums were not externally visible.

--
Add a new formatter for screenshots

Add --log-wptscreenshot to output screenshots not found in cache into
the line-based format (one data URI per line) expected by wpt.fyi.

--
Consolidate with the existing reftest_screenshots

We don't need a new field in `extra`. And the converter is moved from
the formatter into executors/base.py to avoid processing the same log
multiple times.

In addition, strip server from test URLs and fix data URIs.

--

wpt-commits: 2403a22349e1c5363595765c6462247a27a0308d, 68c4978923670012ec96fbeb840469545df0231d, d1daf5e3d3f0526a067e7df19fc0fbd768dcfaa5
wpt-pr: 15570

Hexcles commented Apr 4, 2019

Closing as fixed. Woohoo!

Hexcles closed this as completed Apr 4, 2019