Show expected and actual renderings for reftests #57
Renaming to also track what the receiver (@Hexcles) would have to do. (This issue is quite aspirational at this point, it'd be a lot of work, but also very valuable if we could pull it off.) |
@mrego, you asked in https://chromium-review.googlesource.com/c/chromium/src/+/1054770#message-b3087aea32526011c2a83a3a094013263e2aedfb: "Where I can see the screenshots from wpt.fyi?" The answer is that you can't and this is the issue for it. Could you describe what it is you were trying to do and how much time (if any) you ended up spending to get to the same answer without having the screenshots? Is it only for failed tests that you wanted to see screenshots, or passing ones too? |
Ok, good to know they're nowhere, so it's expected I couldn't find them. 😄 I needed to see the output of a WPT test on Mac, I ended up using an actual Mac to check it. |
Thanks @mrego. It's interesting that you wanted to see why a test was passing, as it seems like it might be easier/faster to only record failure cases. But I think that many times when you have some browser passing and some failing, looking at the passing ones will tell you something interesting. |
I agree it's probably a strange case, but the test was failing in all browsers and passing only in Safari on Mac. So I wanted to see a screenshot of what was going on there. |
I'm not sure it is a strange case; I'll keep asking people who ask about screenshots to see if they're normally interested in the passing or the failing cases. |
For gecko we only record cases where the result is "unexpected". For wpt.fyi that would always correspond to not passing, since the expectations are not set and we default to assuming all tests pass.

The way this works is that the logger records a base64-encoded PNG for the last two screenshots it took (typically the test and the ref; things are more complex with chained references). That can be dumped to the log. So for example this log shows a failure. That data is interpreted by the reftest analyzer (links may break after a while as logs are expired).

For wpt.fyi the main challenge is that we have many, many failures per run. Dumping every failing screenshot pair to a log would be somewhere on the order of 100 MB of data per browser per run, which is not untenable but isn't small either. Then wpt.fyi could either directly store the log links, or could import the screenshots from the logs somehow and provide links to the reftest analyzer (possibly forked).

A more efficient approach in terms of storage would be to publish some kind of filter matching known test+screenshot ids, so that we would only store new screenshots explicitly and would otherwise just record the id for the screenshot pair. But that itself may be somewhat complex. |
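For illustration, a failing reftest's record in such a log might look roughly like the sketch below. Only the `reftest_screenshots` field is taken from this thread (a later commit message mentions it); the other field names are assumptions about the shape of the record, not wptrunner's precise schema.

```python
# Illustrative shape of a log record carrying the last two screenshots.
# Field names other than "reftest_screenshots" are assumptions.
log_entry = {
    "action": "test_end",
    "test": "/css/filter-effects/blur-001.html",  # hypothetical test
    "status": "FAIL",
    "expected": "PASS",
    "extra": {
        "reftest_screenshots": [
            {"url": "/css/filter-effects/blur-001.html",
             "screenshot": "iVBORw0KGgo..."},       # base64 PNG (truncated)
            "==",                                   # relation to the ref
            {"url": "/css/filter-effects/blur-001-ref.html",
             "screenshot": "iVBORw0KGgo..."},
        ],
    },
}
```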
An IRC conversation between me and @jgraham starts at https://w3.logbot.info/testing/20180808#c430986. There I suggest that we'd want to get the screenshots into wpt.fyi so that old links keep working, although it isn't strictly speaking necessary to keep all screenshots forever. However, I suspect that if we deduplicate, perhaps by hashing the decoded size+RGBA data, the size of the screenshot bucket wouldn't grow very fast. (Even flaky tests won't have very many different renderings, except perhaps animation tests.) |
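A minimal sketch of that deduplication idea, assuming Pillow is available (the function name is illustrative): hashing the decoded size plus RGBA bytes means two byte-for-byte different PNGs of the same rendering collapse to one key.

```python
import base64
import hashlib
import io

from PIL import Image

def screenshot_key(png_base64):
    """Key a screenshot by its decoded size + RGBA pixels, not its PNG bytes."""
    img = Image.open(io.BytesIO(base64.b64decode(png_base64))).convert("RGBA")
    h = hashlib.sha1()
    h.update(("%dx%d" % img.size).encode("ascii"))  # fold in the dimensions
    h.update(img.tobytes())                         # then the raw RGBA data
    return h.hexdigest()
```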
Yesterday, I sent reports of WPT failures to webkit-dev, dev-platform, and blink-dev. In response, @jgraham noted the obvious need to look at screenshots to understand failures. First, two goals that lead to some complexity:
Here's my proposal for how to achieve this:
This is probably enough to store just failure screenshots. As an optimization, which should also make it possible to eventually store all screenshots:
@lukebjerring, can we build it? :) |
SPARTAN used to store all screenshots ever seen, deduplicating based on the entire file (i.e., including any metadata in the image), and that ultimately ended up with a few hundred megabytes of screenshots IIRC. Note that some browsers are pretty poor at compressing PNGs, and even a relatively computationally cheap compression tool can massively reduce the file size (of our often largely white images).
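A minimal sketch of that recompression step, assuming Pillow (a real pipeline might prefer a dedicated tool such as optipng): re-encode the browser-produced PNG with the encoder's optimize pass, which tends to shrink largely white images a lot.

```python
import io

from PIL import Image

def recompress_png(png_bytes):
    """Re-encode a PNG with Pillow's optimize pass (maximum zlib compression)."""
    img = Image.open(io.BytesIO(png_bytes))
    out = io.BytesIO()
    img.save(out, format="PNG", optimize=True)
    return out.getvalue()
```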
Is the results receiver authenticated? (It is, right?) If not, can we not just upload images by uploading the "screenshot" and then uploading a bogus result set?
@jgraham's suggestion of using a client-side Bloom filter seems sensible, but let's actually think about this for a second: in the common case, we'll have almost no new screenshots, so almost everything will maybe be in the set. We can unconditionally upload things that aren't in the set, but we need to check everything that might be in the set to see whether it actually is. So using a Bloom filter seems to give very little gain here. The list of image checksums is still likely to be pretty big (and pretty incompressible), which is awkward. IIRC, SPARTAN provided the last n checksums along with the test allocation, so in the common case of a screenshot not changing from one run to the next you don't need to go back to the server to check if it has previously been seen. Simply exposing the last run's checksums would probably work as a decent optimisation here, as it means the majority of hashes can be excluded quickly. |
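To make the semantics concrete, here is a toy Bloom filter (purely illustrative, not a proposed implementation). A negative answer is definitive, but a positive answer is only "maybe", so when nearly every screenshot is a repeat, nearly every lookup still needs a server check.

```python
import hashlib

class ToyBloomFilter:
    """No false negatives, but a hit only means 'maybe seen'."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(b"%d:" % i + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, item):
        # False => definitely new: safe to upload unconditionally.
        # True => only "maybe seen": still needs a server round trip,
        # which is the common case when almost every screenshot repeats.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```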
Yes, I agree that the Bloom filter has the opposite semantics to the ones we want. But maybe we shouldn't have anything on the client side; we upload a list of hashes we have and get back a list of hashes to upload. That seems cheaper than keeping any client-side list up to date and can be handled by whatever technique on the server side (a database, for example). In terms of compression I don't know how much we should worry; hopefully the screenshots don't change that often (although I guess that anti-aliasing &c. might disagree with that). If the typical number of screenshots to upload on a run is in the single digits, trying to save some percentage of the bandwidth by recompressing the images isn't that important. But if we do want to do that, I think Pillow will be a dependency for fuzzy reftests anyway, so we could recompress all the images. But it would have to be all of them, not just the ones we upload, so I'm a little concerned about the cost of doing that. |
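A sketch of that hash exchange under stated assumptions: the endpoint URL and JSON response shape are hypothetical, not an existing wpt.fyi API. The client posts every hash it has; the server replies with the subset it hasn't seen, and only those screenshots get uploaded.

```python
import json
import urllib.request

# Hypothetical endpoint; the real API would be defined by wpt.fyi.
HASH_CHECK_URL = "https://wpt.fyi/api/screenshots/hashes"

def hashes_to_upload(local_hashes):
    """Ask the server which of our screenshot hashes it hasn't seen yet."""
    body = json.dumps(sorted(local_hashes)).encode("utf-8")
    req = urllib.request.Request(
        HASH_CHECK_URL, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return set(json.load(resp))  # only these screenshots get uploaded
```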
The need for this to understand reftest failures, while perhaps already obvious, became very clear on the blink-dev thread where I had to run the tests again locally and use reftest-analyzer. It was actually pretty cool, and we should definitely take inspiration from it: |
Talking to @jgraham, it sounded like his envisioned approach requires some changes to wptrunner:
I feel like he listed more last week and I've forgotten. |
Some filtering seems necessary if we're also to get screenshots of passing tests, which is useful when some browsers are passing and some are failing the same test. |
Screenshots of passing tests should be out of scope for the initial work; that is a significant performance concern for the runner, since actually generating the screenshot as a PNG is quite a lot of work that's often unnecessary. In addition, screenshots of passing tests are usually not that helpful; it's only interesting if a test happens to pass in a bogus way (like rendering nothing), and that's usually possible to determine easily just by loading the test in the browser. I don't think my current thinking is very different to the above:
The main issue with this approach is that we have to store all the screenshots somehow until the end of the test run. But the alternative where we prefilter means that we have to get a list of every hash the server has ever seen. The more efficient alternative requires us to accept some probability of not uploading a screenshot that is actually new. |
I think we should leave it up to the people who design and implement this to recommend whether to include passing tests or not. For executors other than ExecutorMarionette there's already PNG encoding involved, and whether sending the screenshots for passing tests is practical depends on the choice of where/when to check if a screenshot has already been uploaded. |
What is this thread for other than discussing requirements and constraints? I think that "leave it up to the people implementing it to decide" sounds reasonable but is actually dangerously close to "we will make the actual decisions in internal meetings, not in public". Of course if there's some clever way to make passing screenshots work without much extra effort, that would be nice. But I still think that the utility of that is relatively low compared to seeing what's happening when things fail and we shouldn't regard it as a requirement for this work. |
I don't know who will be working on this yet, but I expect it will end up in https://github.com/web-platform-tests/rfcs for discussion with the design evolving in public with you as the main reviewer. I don't know what the best trade-off is, perhaps it is obvious with your knowledge of certain internals, but plainly declaring it out of scope is what I reacted to. |
A design doc has been created to begin capturing requirements/constraints/options, and it might be a more appropriate place to dump details and discuss tradeoffs. As noted in the doc, there are a few deliberately separate areas of work:
"Out of scope" or "defer til later", I agree that we can't just turn turn on screenshots for everything without the last item being completed (and, not even necessarily with it completed). |
@lukebjerring @Hexcles, cool, I didn't know design work had started :D I'm not able to comment on the doc, can you fiddle with permissions? |
@lukebjerring @Hexcles I can't even see the doc. This ain't great as a design doc for a change to an openly run project… |
I apologize on behalf of Luke. He clicked a wrong button in the permission dialog. The link should be working now. (The doc is barely started and only has a skeleton. I plan to flesh it out in the next couple days.) |
This is working towards web-platform-tests/wpt.fyi#57 (displaying reftest screenshots on wpt.fyi). This change includes the hashes of screenshots (failing reftests only for now, but it can be easily extended to include all screenshots later) in a new field in test results:

    test:
      status: ...
      screenshots:
        url_of_test: sha1:checksum
        url_of_ref1: sha1:checksum
        ...

(Because of an implementation detail, we currently only include the last two screenshots taken during a complex reftest.) Another important behaviour change is that internally we are now hashing the image *after* base64-decoding it, but this change should not affect any existing infra since checksums were not externally visible.
…report JSON along with a new formatter for screenshot data URIs, a=testonly

Automatic update from web-platform-tests

[wptrunner] Include screenshot hashes in JSON

This is working towards web-platform-tests/wpt.fyi#57 (displaying reftest screenshots on wpt.fyi). This change includes the hashes of screenshots (failing reftests only for now, but it can be easily extended to include all screenshots later) in a new field in test results:

    test:
      status: ...
      screenshots:
        url_of_test: sha1:checksum
        url_of_ref1: sha1:checksum
        ...

(Because of an implementation detail, we currently only include the last two screenshots taken during a complex reftest.) Another important behaviour change is that internally we are now hashing the image *after* base64-decoding it, but this change should not affect any existing infra since checksums were not externally visible.

--

Add a new formatter for screenshots

Add --log-wptscreenshot to output screenshots not found in cache into the line-based format (one data URI per line) expected by wpt.fyi.

--

Consolidate with the existing reftest_screenshots

We don't need a new field in `extra`. And the converter is moved from the formatter into executors/base.py to avoid processing the same log multiple times. In addition, strip server from test URLs and fix data URIs.

--

wpt-commits: 2403a22349e1c5363595765c6462247a27a0308d, 68c4978923670012ec96fbeb840469545df0231d, d1daf5e3d3f0526a067e7df19fc0fbd768dcfaa5
wpt-pr: 15570
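A sketch of consuming that --log-wptscreenshot output, assuming only what the commit message above states (one data URI per line); the function name is illustrative.

```python
import base64

def iter_screenshots(log_path):
    """Yield decoded PNG bytes from a wptscreenshot log, one data URI per line."""
    prefix = "data:image/png;base64,"
    with open(log_path) as log:
        for line in log:
            line = line.strip()
            if line.startswith(prefix):
                yield base64.b64decode(line[len(prefix):])
```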
Closing as fixed. Woohoo! |
This is the frontend part of web-platform-tests/results-collection#217. If we record the screenshots in the runners and store them somewhere, then the frontend would also need to show them, and possibly diffs between them.
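For the diff part, a minimal sketch assuming Pillow, in the spirit of reftest-analyzer's pixel highlighting (assumes both screenshots have the same dimensions):

```python
from PIL import Image, ImageChops

def diff_image(test_path, ref_path):
    """Return an image highlighting pixel differences between two screenshots."""
    test = Image.open(test_path).convert("RGBA")
    ref = Image.open(ref_path).convert("RGBA")
    return ImageChops.difference(test, ref)  # black wherever the pixels match
```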