Show expected and actual renderings for reftests #57

Closed · foolip opened this issue Apr 17, 2018 · 24 comments
Labels: enhancement (New feature or request)


foolip commented Apr 17, 2018

This is the frontend part of web-platform-tests/results-collection#217. If we record the screenshots in the runners and store them somewhere, then the frontend would also need to show them, and possibly diffs between them.


foolip commented Apr 17, 2018

Renaming to also track what the receiver (@Hexcles) would have to do.

(This issue is quite aspirational at this point; it'd be a lot of work, but also very valuable if we could pull it off.)


foolip commented May 15, 2018

@mrego, you asked in https://chromium-review.googlesource.com/c/chromium/src/+/1054770#message-b3087aea32526011c2a83a3a094013263e2aedfb: "Where I can see the screenshots from wpt.fyi?"

The answer is that you can't and this is the issue for it. Could you describe what it is you were trying to do and how much time (if any) you ended up spending to get to the same answer without having the screenshots? Is it only for failed tests that you wanted to see screenshots, or passing ones too?


mrego commented May 15, 2018

Ok, good to know they're nowhere, so it's expected I couldn't find them. 😄

I needed to see the output of a WPT test on Mac, so I ended up using an actual Mac to check it.
In this case it was a test passing on Mac and failing on the rest of the platforms, but I admit that looking for screenshots of passing tests doesn't seem like the most common thing.
Just my two cents.


foolip commented May 15, 2018

Thanks @mrego. It's interesting that you wanted to see why a test was passing, as it seems like it might be easier/faster to only record failure cases. But I think that many times when you have some browser passing and some failing, looking at the passing ones will tell you something interesting.


mrego commented May 15, 2018

I agree it's probably the strange case, but the test was failing in all browsers and passing only in Safari on Mac. So I wanted to see a screenshot of what was going on there.


foolip commented May 15, 2018

I'm not sure it is a strange case; I'll keep asking people who ask about screenshots to see if they're normally interested in the passing or the failing cases.


jgraham commented Aug 8, 2018

For gecko we only record cases where the result is "unexpected". For wpt.fyi that would always correspond to not passing, since the expectations are not set and we default to assuming all tests pass. The way this works is that the logger records a base64-encoded PNG for the last two screenshots it took (typically the test and the ref; things are more complex with chained references). That can be dumped to the log. So for example this log shows a failure. That data is interpreted by the reftest analyzer (links may break after a while as logs are expired).

For wpt.fyi the main challenge is that we have many, many failures per run. Dumping every failing screenshot pair to a log would be somewhere on the order of 100 MB of data per browser per run, which is not untenable but isn't small either. Then wpt.fyi could either directly store the log links, or import the screenshots from the logs somehow and provide links to the reftest analyzer (possibly forked).

A more efficient approach in terms of storage would be to publish some kind of filter matching known test+screenshot ids so that we would only store new screenshots explicitly and would otherwise just record the id for the screenshot pair. But that itself may be somewhat complex.


foolip commented Aug 8, 2018

IRC between me and @jgraham starts at https://w3.logbot.info/testing/20180808#c430986

There I suggest that we'd want to get the screenshots into wpt.fyi so that old links keep working, although it isn't strictly speaking necessary to keep all screenshots forever. However, I suspect that if we deduplicate, perhaps by hashing the decoded size+RGBA data, that the size of the screenshot bucket wouldn't grow very fast. (Even flaky tests won't have very many different renderings, except perhaps animation tests.)
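
To illustrate that deduplication idea, here's a minimal sketch, assuming Pillow is available (the function name is made up):

```python
# Sketch: hash the decoded pixel data (size + raw RGBA bytes) rather
# than the PNG file, so that two encoders producing different PNG bytes
# for the same rendering still collide. Pillow is an assumed dependency.
import hashlib
import io

from PIL import Image


def rendering_hash(png_bytes):
    image = Image.open(io.BytesIO(png_bytes)).convert("RGBA")
    h = hashlib.sha1()
    h.update(("%dx%d" % (image.width, image.height)).encode("ascii"))
    h.update(image.tobytes())  # raw RGBA pixel data
    return h.hexdigest()
```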


foolip commented Oct 12, 2018

Yesterday, I sent reports of WPT failures to webkit-dev, dev-platform, and blink-dev. With that, @jgraham noted the obvious need to look at screenshots to understand failures.

First, three goals that lead to some complexity:

  • Possible to deploy on Taskcluster or other CI where we store no secrets, but where the results are extracted from the logs and submitted to the results receiver. (@Hexcles)
  • Don't accidentally build free image hosting.
  • Make it feasible to eventually store all screenshots, not just failures.

Here's my proposal for how to achieve this:

  • Add an /api/screenshot POST endpoint to wpt.fyi where anyone can upload PNG images unauthenticated. You get back a new unique token every time, and it's not possible to get the data back.
  • Make ./wpt run use that endpoint either during execution or at the end, and include the screenshot tokens in the wpt_report.json output.
  • When the results receiver gets the wpt_report.json, it uses the token to get the screenshots.

This is probably enough to store just failure screenshots. As an optimization, which should also make it possible to eventually store all screenshots (a client-side sketch follows this list):

  • Add a /api/screenshot/dowehaveit API which takes a list of image checksums (not tokens) and tells you which ones are not already in storage.
  • ./wpt run uses this API to avoid uploading screenshots which are already in storage.
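
To make the flow concrete, here's a client-side sketch. The endpoint names come from the proposal above, but the request and response shapes are assumptions, not a settled API:

```python
# Hypothetical client flow for the proposal above; the JSON shapes
# used here are assumptions.
import requests

WPT_FYI = "https://wpt.fyi"


def upload_screenshots(screenshots):
    """screenshots: dict mapping checksum -> PNG bytes.

    Returns a dict mapping checksum -> upload token."""
    # Ask which checksums are not already in storage (assumed shape).
    resp = requests.post(WPT_FYI + "/api/screenshot/dowehaveit",
                         json={"checksums": sorted(screenshots)})
    missing = resp.json()["missing"]  # assumed response field

    tokens = {}
    for checksum in missing:
        # Unauthenticated upload; a fresh unique token comes back each time.
        r = requests.post(WPT_FYI + "/api/screenshot",
                          headers={"Content-Type": "image/png"},
                          data=screenshots[checksum])
        tokens[checksum] = r.json()["token"]  # assumed response field
    return tokens
```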

@lukebjerring, can we build it? :)


foolip commented Oct 12, 2018

@jgraham points out on IRC that we can simplify this to just use checksums of the images, by not making it possible to download images by SHA, or at least not until those SHAs have passed through the results collector once.

I imagine that @Hexcles also has ideas here.

gsnedders commented:

> However, I suspect that if we deduplicate, perhaps by hashing the decoded size+RGBA data, that the size of the screenshot bucket wouldn't grow very fast. (Even flaky tests won't have very many different renderings, except perhaps animation tests.)

SPARTAN used to store all screenshots ever seen, deduplicating based on the entire file (i.e., including any metadata in the image), and that ultimately ended up with a few hundred megabytes of screenshots IIRC. Note that some browsers are pretty poor at compressing PNGs, and even relatively computationally cheap compression tools can massively reduce the file size (of our often largely white images).
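
For illustration, a computationally cheap re-encode with Pillow's PNG optimizer is often enough for largely white screenshots (a sketch; dedicated tools like optipng compress harder at more CPU cost):

```python
# Sketch: re-encode a browser-emitted PNG so zlib tries harder.
import io

from PIL import Image


def recompress_png(png_bytes):
    out = io.BytesIO()
    Image.open(io.BytesIO(png_bytes)).save(out, format="PNG", optimize=True)
    return out.getvalue()
```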

> When the results receiver gets the wpt_report.json, it uses the token to get the screenshots.

Is the results receiver authenticated? (It is, right?) If not, can't anyone just upload arbitrary images by uploading the "screenshot" and then uploading a bogus result set?

> Add a /api/screenshot/dowehaveit API which takes a list of image checksums (not tokens) and tells you which ones are not already in storage.

@jgraham's suggestion of using a client-side Bloom filter seems sensible, but let's actually think about this for a second: a Bloom filter can only tell us that a screenshot is definitely not in the set. In the common case, we'll have almost no new screenshots, so almost everything will be a "maybe in the set". We can unconditionally upload things that aren't in the set, but we still need to check everything that might be in the set. So using a Bloom filter seems to give very little gain here.

The list of image checksums is still likely to be pretty big (and pretty incompressible), which is awkward.

IIRC, SPARTAN provided the last n checksums along with the test allocation, so in the common case of a screenshot not changing from one run to the next, you don't need to go back to the server to check whether it has previously been seen. Simply exposing the last run's checksums would probably work as a decent optimisation here, as it means the majority of hashes can be excluded quickly.


jgraham commented Oct 12, 2018

Yes, I agree that the Bloom filter has the opposite semantics to the ones we want. But maybe we shouldn't have anything on the client side: we upload a list of hashes we have and get back a list of hashes to upload. That seems cheaper than keeping any client-side list up to date, and can be handled by whatever technique on the server side (a database, for example).
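
The server side of that exchange can be as simple as a set difference against whatever store holds the known hashes; a sketch, where `known_hashes` stands in for a real database lookup:

```python
# Sketch: the client posts the hashes it has; the server answers with
# the ones it hasn't seen, which are the ones to upload.
def hashes_to_upload(client_hashes, known_hashes):
    return [h for h in client_hashes if h not in known_hashes]
```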

In terms of compression I don't know how much we should worry; hopefully the screenshots don't change that often (although I guess that anti-aliasing etc. might disagree with that). If the typical number of screenshots to upload on a run is in the single digits, trying to save some percentage of the bandwidth by recompressing the images isn't that important. But if we do want to do that, I think Pillow will be a dependency for fuzzy reftests anyway, so we could recompress all the images. But it would have to be all of them, not just the ones we upload, so I'm a little concerned about the cost of doing that.


foolip commented Oct 16, 2018

The need for this when trying to understand reftest failures, while perhaps already obvious, became very clear on the blink-dev thread, where I had to run the tests again locally and use reftest-analyzer. It was actually pretty cool, and we should definitely take inspiration from it:
[image: reftest-analyzer showing the test and reference renderings]

gsnedders commented:

Talking to @jgraham, it sounded like his envisioned approach requires some changes to wptrunner:

  • If the results collector is going to continue to use the JSON format, we need to output screenshots in the JSON output (or store them somewhere with paths in the JSON file).
  • Some way of filtering which screenshots get output (though that sounds contrary to what he concluded above about client-side filtering?).

I feel like he listed more last week and I've forgotten.


foolip commented Jan 15, 2019

Some filtering seems necessary if we're also to get screenshots of passing tests, which is useful when some browsers are passing and some are failing the same test.


jgraham commented Jan 15, 2019

Screenshots of passing tests should be out of scope for the initial work; they are a significant performance concern for the runner, since actually generating the screenshot as a PNG is quite a lot of work that's often unnecessary. In addition, screenshots of passing tests are usually not that helpful; they're only interesting if the test happens to pass in a bogus way (like rendering nothing), and that's usually easy to determine just by loading the test in the browser.

I don't think my current thinking is very different to the above:

  • We generate screenshots for failing tests
  • We pass a list of hashes to wpt.fyi and it returns a list of hashes that it doesn't already have (this is a new endpoint)
  • We add the hashes of failing tests to the wptreport.json data, and add a top-level key with a list of screenshots (which can then be hashed on the server to fill in missing data); a sketch of the resulting report shape follows this list.
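
Sketched concretely, the augmented wptreport.json data might look like this (key names are assumptions, apart from "screenshots", which matches the commit message quoted later in this thread):

```python
# Hypothetical shape of the augmented report, written as a Python dict.
report = {
    "results": [
        {
            "test": "/css/example/test.html",
            "status": "FAIL",
            # Per-test hashes of the screenshots taken (test and ref).
            "screenshots": {
                "/css/example/test.html": "sha1:0123abcd",
                "/css/example/test-ref.html": "sha1:4567cdef",
            },
        },
    ],
    # Top-level list of the actual images the server said it was
    # missing; the server hashes these to fill in the data.
    "screenshots": ["data:image/png;base64,iVBORw0KGgo="],
}
```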

The main issue with this approach is that we have to store all the screenshots somehow until the end of the test run. But the alternative where we prefilter means that we have to get a list of every hash the server has ever seen. The more efficient alternative requires us to accept some probability of not uploading a screenshot that is actually new.


foolip commented Jan 15, 2019

I think we should leave it up to the people who design and implement this to recommend whether to include passing tests. For executors other than ExecutorMarionette there's already PNG encoding involved, and whether sending the screenshots for passing tests is practical depends on the choice of where/when to check if a screenshot has already been uploaded.


jgraham commented Jan 15, 2019

What is this thread for other than discussing requirements and constraints? I think that "leave it up to the people implementing it to decide" sounds reasonable but is actually dangerously close to "we will make the actual decisions in internal meetings, not in public".

Of course if there's some clever way to make passing screenshots work without much extra effort, that would be nice. But I still think that the utility of that is relatively low compared to seeing what's happening when things fail and we shouldn't regard it as a requirement for this work.


foolip commented Jan 15, 2019

I don't know who will be working on this yet, but I expect it will end up in https://github.com/web-platform-tests/rfcs for discussion with the design evolving in public with you as the main reviewer. I don't know what the best trade-off is, perhaps it is obvious with your knowledge of certain internals, but plainly declaring it out of scope is what I reacted to.

lukebjerring commented:

A design doc has been created to begin capturing requirements/constraints/options, which might be a more appropriate place to dump details, and discuss tradeoffs.
https://docs.google.com/document/d/1IhZa4mrjK1msUMhtamKwKJ_HhXD-nqh_4-BcPWM6soQ/edit#

As noted in the doc, there's a few deliberately separate areas of work:

  • Either the ability to upload a recorded screenshot (wptrunner), or the ability to scrape a recorded screenshot (wpt.fyi)
  • UI for viewing/comparing said screenshots (wpt.fyi)
  • Filtering known items / deduplication optimization (wptrunner + wpt.fyi)

"Out of scope" or "defer til later", I agree that we can't just turn turn on screenshots for everything without the last item being completed (and, not even necessarily with it completed).


foolip commented Jan 16, 2019

@lukebjerring @Hexcles, cool, I didn't know design work had started :D

I'm not able to comment on the doc, can you fiddle with permissions?

gsnedders commented:

@lukebjerring @Hexcles I can't even see the doc. This ain't great as a design doc for a change to an openly run project…


Hexcles commented Jan 16, 2019

I apologize on behalf of Luke. He clicked a wrong button in the permission dialog. The link should be working now.

(The doc is barely started and only has a skeleton. I plan to flesh it out in the next couple days.)

Hexcles added a commit to web-platform-tests/wpt that referenced this issue Feb 27, 2019
This is working towards
web-platform-tests/wpt.fyi#57 (displaying
reftest screenshots on wpt.fyi).

This change includes the hashes of screenshots (failing reftests only
for now, but it can be easily extended to include all screenshots
later) in a new field in test results:

test:
  status:
  ...
  screenshots:
    url_of_test: sha1:checksum
    url_of_ref1: sha1:checksum
    ...

(Because of an implementation detail, we currently only include the last
two screenshots taken during a complex reftest.)

Another important behaviour change is that internally we are now hashing
the image *after* base64-decoding it, but this change should not affect
any existing infra since checksums were not externally visible.
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Apr 1, 2019
…report JSON along with a new formatter for screenshot data URIs, a=testonly

Automatic update from web-platform-tests
[wptrunner] Include screenshot hashes in JSON

This is working towards
web-platform-tests/wpt.fyi#57 (displaying
reftest screenshots on wpt.fyi).

This change includes the hashes of screenshots (failing reftests only
for now, but it can be easily extended to include all screenshots
later) in a new field in test results:

test:
  status:
  ...
  screenshots:
    url_of_test: sha1:checksum
    url_of_ref1: sha1:checksum
    ...

(Because of an implementation detail, we currently only include the last
two screenshots taken during a complex reftest.)

Another important behaviour change is that internally we are now hashing
the image *after* base64-decoding it, but this change should not affect
any existing infra since checksums were not externally visible.

--
Add a new formatter for screenshots

Add --log-wptscreenshot to output screenshots not found in cache into
the line-based format (one data URI per line) expected by wpt.fyi.

--
Consolidate with the existing reftest_screenshots

We don't need a new field in `extra`. And the converter is moved from
the formatter into executors/base.py to avoid processing the same log
multiple times.

In addition, strip server from test URLs and fix data URIs.

--

wpt-commits: 2403a22349e1c5363595765c6462247a27a0308d, 68c4978923670012ec96fbeb840469545df0231d, d1daf5e3d3f0526a067e7df19fc0fbd768dcfaa5
wpt-pr: 15570

Hexcles commented Apr 4, 2019

Closing as fixed. Woohoo!

Hexcles closed this as completed Apr 4, 2019