
Larger-than-ideal B_Solid times on github on my integrated GPU (Intel HD Graphics 530) #2178

Closed
mstange opened this issue Dec 6, 2017 · 12 comments

Comments

@mstange (Contributor) commented Dec 6, 2017

[Screenshot: screen shot 2017-12-06 at 1 09 25 pm]

@kvark (Member) commented Dec 6, 2017

Given that the overdraw of the opaque pass is 1.0 and of the transparent pass is 0.21, the large GPU time of "B_Solid" means we are hitting the rasterizer/depth-test bottleneck in the fixed-function part of the hardware. This can happen when we draw too many opaque rectangles whose pixels are mostly discarded by the depth test and never reach the fragment shaders.
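For reference, the overdraw figure is just pixels submitted divided by pixels on screen; the point here is that depth-rejected pixels still occupy the rasterizer even when the *shaded* overdraw stays at 1.0. A minimal sketch (the scene below is illustrative, not the actual capture):

```rust
/// Pixels submitted to the rasterizer per screen pixel. Depth-rejected
/// pixels still pass through the fixed-function rasterizer even though
/// they never reach the fragment shader.
fn rasterized_overdraw(rect_areas: &[u64], screen_pixels: u64) -> f64 {
    let submitted: u64 = rect_areas.iter().sum();
    submitted as f64 / screen_pixels as f64
}

fn main() {
    let screen: u64 = 2560 * 1440;
    // Illustrative scene: six full-screen opaque rects stacked on top of
    // each other, of which only the front-most one gets shaded.
    let rects = [screen; 6];
    println!(
        "rasterized {:.1}x, shaded 1.0x",
        rasterized_overdraw(&rects, screen)
    ); // rasterized 6.0x, shaded 1.0x
}
```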

A similar situation is described in https://bugzilla.mozilla.org/show_bug.cgi?id=1419863

Investigation of a RenderDoc capture shows that we have 3 large full-screen-ish rectangles: one of the whole screen, one of the content plus the scroll bar, and one of the content.

Interestingly, each of those is rendered twice (resulting in 6 full-screen rects). Hopefully, this can be addressed on the Gecko side by looking at the background setting and discarding it if the content is fully opaque. If not, we'll need to come up with better culling heuristics in WR.

@kvark (Member) commented Dec 6, 2017

Note that something needs to split the container rectangles to avoid overdraw. Whether it's WR or Gecko is still to be figured out.
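One way such splitting could work, sketched as a hypothetical helper (not WR's actual API): subtract the opaque content rect from the container rect, emitting up to four edge strips instead of one full-screen quad.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect { x0: i32, y0: i32, x1: i32, y1: i32 }

impl Rect {
    fn is_empty(&self) -> bool { self.x0 >= self.x1 || self.y0 >= self.y1 }
}

/// Split `outer` around `inner` (assumed fully contained in `outer`),
/// returning up to four strips that together cover `outer` minus `inner`.
/// Drawing the strips instead of the whole container avoids re-rasterizing
/// the pixels already covered by the opaque content.
fn subtract(outer: Rect, inner: Rect) -> Vec<Rect> {
    vec![
        Rect { x0: outer.x0, y0: outer.y0, x1: outer.x1, y1: inner.y0 }, // top
        Rect { x0: outer.x0, y0: inner.y1, x1: outer.x1, y1: outer.y1 }, // bottom
        Rect { x0: outer.x0, y0: inner.y0, x1: inner.x0, y1: inner.y1 }, // left
        Rect { x0: inner.x1, y0: inner.y0, x1: outer.x1, y1: inner.y1 }, // right
    ]
    .into_iter()
    .filter(|r| !r.is_empty())
    .collect()
}

fn main() {
    let window = Rect { x0: 0, y0: 0, x1: 2560, y1: 1440 };
    // Hypothetical content area, leaving a toolbar strip and a scroll bar.
    let content = Rect { x0: 0, y0: 100, x1: 2544, y1: 1440 };
    for strip in subtract(window, content) {
        println!("{:?}", strip);
    }
}
```

Whether this splitting belongs in WR's batching or in Gecko's display-list construction is exactly the open question above.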

Another thing that would help is using a separate document for the chrome stuff.

@glennw (Member) commented Dec 7, 2017

Briefly recounting what I've been discussing on IRC with @kvark. That very roughly looks like ~2ms in the opaque pass to me. On an integrated GPU, at ~4k resolution, that seems like a possible / reasonable amount of time to fill a rect that large.

It'd be great if we find that we are suffering costs from rejecting those extra large quads (there's probably some cost at least), but I'd be surprised if we saw a massive GPU time win over what we're currently seeing for that resolution and that GPU.

One possible improvement to this is using partial raster / present - it's certainly the case that on almost all websites you view at 4k resolution, the sides are largely empty.

So, I think it makes sense to do some measurements to see if we're paying much cost for rejecting those extra quads, but I'm not super-surprised that it's ~2ms to fill a 4k rect with the opaque pass.

@glennw (Member) commented Dec 7, 2017

Following up on that, if we do the following, which are all relatively straightforward:

  • Reducing the time spent in clips (clip shader is really expensive still).
  • Fixing the batching (getting close to being able to share a lot more shaders).
  • Reducing the time spent in border rendering (the border shader is really expensive for common-path borders).
  • Considering partial present / raster extensions to avoid drawing to the outer edges of this page.

I expect that we'd probably be at ~3.5 - 4ms for this page at 4k resolution, which seems quite close to optimal, I think.

Thoughts?

@glennw (Member) commented Dec 7, 2017

Oh, and also using dual-source blending for the subpx text rendering should give a significant win for this page, perhaps another 0.5ms off the profile.
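For context, the win comes from collapsing the two draw passes that subpixel text otherwise needs: with dual-source blending (e.g. `glBlendFunc(GL_SRC1_COLOR, GL_ONE_MINUS_SRC1_COLOR)` from ARB_blend_func_extended), the fragment shader emits both the text color and a per-channel coverage mask, and the fixed-function blender finishes the job in one pass. The per-channel math the blender performs, sketched as plain Rust (not WR's shader code):

```rust
/// Per-channel subpixel blend: out = text * mask + dst * (1 - mask).
/// With dual-source blending, `text` is fragment output 0 and `mask` is
/// fragment output 1; the blend unit evaluates this in a single pass.
fn subpixel_blend(text: [f32; 3], mask: [f32; 3], dst: [f32; 3]) -> [f32; 3] {
    let mut out = [0.0f32; 3];
    for i in 0..3 {
        out[i] = text[i] * mask[i] + dst[i] * (1.0 - mask[i]);
    }
    out
}

fn main() {
    // White text with per-channel coverage over a black background.
    let out = subpixel_blend([1.0; 3], [0.5, 0.4, 0.3], [0.0; 3]);
    println!("{:?}", out);
}
```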

@kvark (Member) commented Dec 7, 2017

@glennw raised a concern that maybe it's not about rasterization/depth-test but rather the sheer number of pixels we draw in the opaque pass. We decided to benchmark it separately for a baseline, which corresponds to an ideal case where we only draw every pixel once and the pixel shader does nothing.

I made this little tool - https://github.com/kvark/gl-bench
Try running it on your setup, exit with Esc and see the console output.

I'm getting this on my Intel 520 with Linux/GLX:

Avg draw time: 0.56 ms for 2560x1440 resolution

A true 4K screen (3840x2160) has 2.25 times as many pixels, so the estimate would be 1.26 ms just to fill it on that GPU.

@glennw (Member) commented Dec 7, 2017

Thanks @kvark for setting that up! This is quite promising - it may well mean that my assumption that we aren't paying any real z-reject cost was wrong.

We should further investigate what happens if we remove those extra full pass opaque quads (even just removing them manually as a test case to benchmark).

@kvark (Member) commented Dec 7, 2017

@mstange (Contributor, Author) commented Dec 7, 2017

I've added the data from my machine. I used RDM to switch to a non-retina resolution (2880x1800 at native 1:1 scale) so that the megapixel arithmetic worked out, because the tool isn't HiDPI-aware. I also used gfxCardStatus to pin my machine to the integrated GPU.

@kvark (Member) commented Dec 7, 2017

The numbers show that the cost of rejecting a quad is roughly 7% of the cost of drawing it, and that the cost of drawing it is significantly smaller than what we spend in the opaque pass. This means we don't yet understand where all the time is going.
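A toy model makes the gap concrete. All the inputs come from earlier in this thread (the ~1.26 ms 4K fill estimate, the 6 full-screen rects of which one survives the depth test, the ~7% reject cost); the model itself is an illustration, not a measurement:

```rust
/// Toy model: one surviving full-screen fill, plus N depth-rejected
/// full-screen quads each costing `reject_factor` of a full fill.
fn modeled_opaque_ms(fill_ms: f64, rejected_quads: f64, reject_factor: f64) -> f64 {
    fill_ms + rejected_quads * reject_factor * fill_ms
}

fn main() {
    let ms = modeled_opaque_ms(1.26, 5.0, 0.07);
    println!("modeled opaque cost: {:.2} ms", ms); // 1.70
}
```

Even with the rejected quads counted in, the modeled total comes out below the ~2 ms ballpark discussed above, which is the unexplained remainder.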

@dralley commented May 31, 2019

FWIW, the Bugzilla bug is marked "resolved fixed". Based on the discussion I'm unsure whether it is the same issue, but this one may be able to be closed.

@kvark (Member) commented May 31, 2019

Thank you!

@kvark kvark closed this May 31, 2019