Better inner and outer rect calculation for ClipChains #2062
|
Here is how this looks on the W3 testcase from #1817.
I'm still looking at the regression in CPU time, but I'd love to get general comments on this approach. I did a Gecko try push as well and the only issues I see are some GTest failures, which I think are unrelated. |
|
I updated the reference image for one test that had a small number of subpixel differences. |
|
I'll look at this in detail in the morning. One question though - do you have a sense of what the profile numbers would be like with the improved inner / outer rect calculation and filtering, but without the shader changes? I wonder if the majority of the wins might be from the CPU-side filtering? (I thought that the clip masks were already intersected with the screen rect of the clip node - since we support the clip mask being a portion of the primitive rect - but maybe that's not working / slightly different). |
|
@glennw So I took another look at this. I think we need to do this with the change to the shader. The case (and I suspect it might be the important case) is when we can avoid masking at all. This happens when the screen space rectangle of the ClipChain is sufficient to clip the primitive. In this case we can just skip creating the task entirely and rely on the changes to the fragment shader. Maybe there is a better way to do this, but I don't think we can use local geometry in this case. |
|
I pushed a new version of this PR which moves outer rect calculation to the ClipScrollTree update. This moves a little bit of work from the per-primitive parts to the per-ClipScrollNode part. I'm not seeing a CPU speedup for the W3C case, but I think that this is cleaner in general. I am also thinking about extending this sort of thing to handle some situations where we can generate shorter ClipChains in general. |
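To illustrate the idea of computing the outer rect once per node during the tree update, rather than once per primitive, here is a minimal sketch. It is not WebRender's actual code: `Rect`, `Node`, `combined_outer` and `update_tree` are hypothetical names, and the sketch assumes every parent is stored before its children.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    /// Returns the overlap of two rects, or None if they do not overlap.
    fn intersection(&self, other: &Rect) -> Option<Rect> {
        let r = Rect {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        };
        if r.x0 < r.x1 && r.y0 < r.y1 { Some(r) } else { None }
    }
}

/// Hypothetical stand-in for a clip node in the tree.
struct Node {
    parent: Option<usize>,
    /// This node's own clip bounds in screen space.
    screen_rect: Rect,
    /// Bounds of this node intersected with all ancestors.
    /// None means everything under this node is clipped out.
    combined_outer: Option<Rect>,
}

/// One pass over the tree (assumption: parents precede children),
/// so each primitive can later read `combined_outer` directly.
fn update_tree(nodes: &mut [Node], screen: Rect) {
    for i in 0..nodes.len() {
        let inherited = match nodes[i].parent {
            Some(p) => nodes[p].combined_outer,
            None => Some(screen),
        };
        let own = nodes[i].screen_rect;
        let combined = inherited.and_then(|r| r.intersection(&own));
        nodes[i].combined_outer = combined;
    }
}
```

With this shape, the per-primitive work reduces to one intersection against the node's precomputed `combined_outer`, which is the trade the comment above describes.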
|
|
|
Thanks @mrobinson, impressive work! Reviewed 22 of 22 files at r1.
webrender/res/prim_shared.glsl, line 615 at r1 (raw file):
why are we doing this?
webrender/res/prim_shared.glsl, line 889 at r1 (raw file):
any reason not to use
webrender/res/prim_shared.glsl, line 893 at r1 (raw file):
nit: could use swizzling
webrender/src/clip.rs, line 198 at r1 (raw file):
isn't setting this to
webrender/src/clip.rs, line 203 at r1 (raw file):
similarly, setting
webrender/src/clip.rs, line 216 at r1 (raw file):
I don't think the early out is useful here since all the function does is iterating the list twice
webrender/src/clip_scroll_node.rs, line 309 at r1 (raw file):
could just do
webrender/src/clip_scroll_node.rs, line 352 at r1 (raw file):
no need for
webrender/src/clip_scroll_node.rs, line 359 at r1 (raw file):
hmm, the old code doesn't modify
webrender/src/frame_builder.rs, line 1536 at r1 (raw file):
there is already
webrender/src/prim_store.rs, line 1256 at r1 (raw file):
I don't think
webrender/src/prim_store.rs, line 1266 at r1 (raw file):
leaving a note for myself that this is related to the
webrender/src/prim_store.rs, line 1267 at r1 (raw file):
nit: let's move this assignment after the
webrender/src/prim_store.rs, line 1278 at r1 (raw file):
nit: could be
webrender/src/renderer.rs, line 1106 at r1 (raw file):
I'd find it easier to understand if it was named "target_height" instead.
webrender/src/util.rs, line 442 at r1 (raw file):
would this guarantee that origin + size is not going to end up at infinity?
Comments from Reviewable |
|
OK, I think I have worked out in my head what my concern is with this approach. I get easily confused by the various permutations, so I'll draw up a diagram as reference and try to explain shortly - it's quite likely I am not thinking it all through properly :) |
|
OK, my thoughts below. I pre-emptively apologize if the stuff below sounds a bit short - I'm basically just trying to list a series of statements, so that we can easily discuss which bits below are correct and which bits are wrong, or missing some context / information. Let me know if that makes any sense (or completely misses the point!) - we can also set up a Vidyo / Skype call to discuss in detail, if that's easier?
This is a simplified set of examples - but I think it extends to cover more complex cases (e.g. when there are multiple clips in a clip chain).
Red blocks are clips. Green blocks are primitives. In each of the four scenarios, we have (1) primitive completely inside clip (2) primitive partially clipped (3) primitive not affected by clip.
In (A) both clip and primitive are axis-aligned - the coordinate systems are considered compatible.
I think the following statements are true:
If the above is correct, what we get from this PR is to skip the clip mask in (A) and (B) when we detect that the clip region of interest intersects with the primitive. However, in the case of (A), we can apply the same logic as (D) - that is, the coordinate systems are compatible, and therefore we can skip the clip mask and apply this as a local clip in the vertex shader.
The point above is important, for two reasons:
So, it does seem like this is a potential win for the intersecting case in scenario (B). My guess is that we almost never actually see this in real world, and that the benefits we're seeing are because we're not filtering clip masks out (and into a local clip when compatible) as much as we could be. There's also a slight downside that we have extra instructions running for every fragment shader to apply that screen space clip rect.
Apologies for the wall of text. Thoughts? |
|
Review status: all files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.
webrender/res/prim_shared.glsl, line 615 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
This is because we need to flip the y axis when drawing the screen. I've left a comment here, because it is indeed pretty confusing.
webrender/res/prim_shared.glsl, line 889 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Nope. No reason. I'll make that change.
webrender/res/prim_shared.glsl, line 893 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Nice catch. This was due to organic evolution of the code.
webrender/src/clip.rs, line 198 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
It is equivalent, but I think the extra variables make the code easier to read here.
webrender/src/clip.rs, line 203 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Ditto.
webrender/src/clip.rs, line 216 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Okay. That makes sense. I'll also do all the work in a single loop since I think that's a bit cleaner.
webrender/src/clip_scroll_node.rs, line 309 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Yep! With an additional change which is to make this member a DeviceRect (which it should have been anyway).
webrender/src/clip_scroll_node.rs, line 352 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Thanks!
webrender/src/clip_scroll_node.rs, line 359 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Oh. It is also modified below, but I had forgotten to remove the previous statement. I'll fix that. Nice catch!
webrender/src/frame_builder.rs, line 1536 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
There is self.screen_size, but I'll add a method to generate a screen_rect easily from that. That will allow us to avoid passing the parameter here.
webrender/src/prim_store.rs, line 1256 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
So we need to represent two cases:
Currently we use None to mean when the outer bounds are undefined. In this case, they cannot be undefined (which is why this parameter is not an Option), so I have used the empty rect here to mean that we are entirely clipped out.
webrender/src/prim_store.rs, line 1267 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Okay.
webrender/src/prim_store.rs, line 1278 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Okay. That's a bit nicer.
webrender/src/renderer.rs, line 1106 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
Okay. No problem.
webrender/src/util.rs, line 442 at r1 (raw file): Previously, kvark (Dzmitry Malyshau) wrote…
I think that dividing by two does this, but I have to admit I am not certain. I followed the same strategy used for other implementations of |
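On the util.rs question about whether origin + size can reach infinity, the arithmetic can be checked with a small sketch. This is an assumption about what "dividing by two" refers to, not the code under review: the `Rect` type and `max_rect` function here are hypothetical, with the origin placed at half the most negative `f32` and the size set to the largest finite `f32`.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    origin: (f32, f32),
    size: (f32, f32),
}

impl Rect {
    /// Hypothetical "covers everything" rect. Halving the origin keeps the
    /// far corner (origin + size) at f32::MAX / 2, which is still finite.
    fn max_rect() -> Rect {
        Rect {
            origin: (f32::MIN / 2.0, f32::MIN / 2.0),
            size: (f32::MAX, f32::MAX),
        }
    }

    /// Far corner along x: the sum the review comment asks about.
    fn max_x(&self) -> f32 {
        self.origin.0 + self.size.0
    }

    /// Far corner along y.
    fn max_y(&self) -> f32 {
        self.origin.1 + self.size.1
    }
}
```

Under these assumptions the far corner is finite. The guarantee only covers the rect as constructed: translating or scaling it by a large enough amount could still overflow.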
|
@glennw
Yes, except for the fact A is already classified as D (without the need of this PR), as you mentioned later:
So yes, the PR only addresses use-case (B), and it does add a bit of overhead for the general case. How much, though, is still to be figured out.
I don't think this PR makes us process more fragments than before. Instead of drawing a bunch of stuff into the clip mask, we are now drawing fewer fragments of the main primitive. So, it appears to me that we'd be processing fewer fragments for (B) with this PR, and the same number of fragments for (A), (C), and (D) with a few added instructions.
Well, that part needs some elaboration :) Reason we can't do it is because the local clip is in the local space, and the clip we are trying to process here is in screen space. So vertex shader would only be able to help if we had |
Totally agree! Those are really nice. I would love to put together some documentation with these types of diagrams.
This brings up a good point which is to test the impact of using discard instead of 0 alpha. |
Another alternative is to have scissor-enabled batches. This would work best in terms of number of fragments and their processing cost, would be fairly straightforward to put in (least intrusive of all proposals) but would introduce batch breaks. If what @glennw was saying is correct ("we almost never actually see this in real world"), then we can totally afford this. |
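To make the batch-break cost of scissor-enabled batches concrete, here is a rough sketch. It assumes a simplified batching model in which consecutive draw items merge only when their batch key matches; `Item`, `Scissor` and `count_batches` are hypothetical names, not WebRender types.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Scissor {
    x: i32,
    y: i32,
    w: i32,
    h: i32,
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Item {
    /// Identifies the shader / pipeline state the item needs.
    shader: u32,
    /// Screen-space scissor rect, if the item needs one.
    scissor: Option<Scissor>,
}

/// Counts batches when consecutive items with equal keys are merged.
/// With `use_scissor` set, the scissor rect becomes part of the key,
/// so a change of scissor forces a new batch.
fn count_batches(items: &[Item], use_scissor: bool) -> usize {
    let mut batches = 0;
    let mut prev: Option<(u32, Option<Scissor>)> = None;
    for item in items {
        let key = (item.shader, if use_scissor { item.scissor } else { None });
        if prev != Some(key) {
            batches += 1;
            prev = Some(key);
        }
    }
    batches
}
```

If primitives needing a scissor are rare, as suggested above, the two counts stay close and the extra breaks are cheap. If they are common and interleaved, the batch count grows with every change of scissor rect.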
|
I just tested this, and interestingly enough using an early branch/ |
We calculate the real inner and outer boundaries of clips in screen space and use it to discard clips and generate masks that are a little smaller in some cases. This results in a small performance improvement on pages with complicated clipping.
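A minimal sketch of how inner and outer screen-space rects could drive these decisions follows. The types and the `classify` function are illustrative assumptions rather than the PR's implementation: `outer` bounds the area a clip can let through, and `inner` is an area the clip is known to leave fully visible.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    /// Returns the overlap of two rects, or None if they do not overlap.
    fn intersection(&self, other: &Rect) -> Option<Rect> {
        let r = Rect {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        };
        if r.x0 < r.x1 && r.y0 < r.y1 { Some(r) } else { None }
    }

    /// True if `other` lies entirely within this rect.
    fn contains_rect(&self, other: &Rect) -> bool {
        other.x0 >= self.x0 && other.y0 >= self.y0 && other.x1 <= self.x1 && other.y1 <= self.y1
    }
}

/// Hypothetical clip: `inner` is None when no fully visible area is known.
struct Clip {
    outer: Rect,
    inner: Option<Rect>,
}

#[derive(Debug, PartialEq)]
enum ClipResult {
    /// Nothing of the primitive survives the clips.
    Culled,
    /// Clipping to this rect alone is enough; no mask is required.
    Unmasked(Rect),
    /// A mask is required, but only over this (possibly smaller) rect.
    Masked(Rect),
}

fn classify(prim: Rect, clips: &[Clip]) -> ClipResult {
    // Pass 1: shrink the primitive's screen rect by every outer rect.
    let mut visible = prim;
    for clip in clips {
        match visible.intersection(&clip.outer) {
            Some(r) => visible = r,
            None => return ClipResult::Culled,
        }
    }
    // Pass 2: a clip can be discarded when its inner rect covers what is left.
    let needs_mask = clips.iter().any(|clip| match clip.inner {
        Some(inner) => !inner.contains_rect(&visible),
        None => true,
    });
    if needs_mask { ClipResult::Masked(visible) } else { ClipResult::Unmasked(visible) }
}
```

In this sketch a plain rectangular clip has `inner` equal to `outer`, so it never forces a mask on its own. A rounded clip forces one only when the visible part of the primitive reaches outside its inner rect, and the mask then covers just the visible rect.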
|
I've pushed a new version of this patch which seems to slightly improve GPU performance and decrease the regression to CPU performance. It also makes WebRender a bit smarter about creating shorter ClipChains. (Before and After images not preserved.) I'm pretty sure the compositor differences are just noise here. |
|
@mrobinson In terms of GPU timing - could you try with vsync disabled? That will certainly be the cause of the compositor timing being weird (it includes the wait for vsync time), and also has a significant effect on the GPU timer queries (on my machine, anyway). |
|
I'll have a proper read through this today (thanks for the detailed replies!) but I do quite like the idea of scissor enabled batches - it'd be interesting to see what effect that has on timing (if it is indeed quick to hack in and test). |
|
|
That's correct - I didn't mean to imply that we're drawing more fragments than previously. What I meant was that if we're using this technique to mask out fragments in the (A) case, we'd be drawing more fragments than if we were to use the local clip rect in that case. But if I understand the PR, we're not processing less fragments either - since we end up still running the FS on all the pixels, just masking out the result at the end of the shader?
Yea, I was discussing that in the context of (A). Certainly if we are encountering a lot of the (B) cases, we either need this technique or clip masks (assuming we don't have gl_ClipDistance etc). To try and sum that up:
There's a lot of ifs, buts and maybes there - perhaps you guys know off hand if some of those guesses above are correct or not? |
|
@glennw My understanding is that we are already handling (A) optimally, since the local clip does the job. Let's verify! |
|
If I understand correctly, our |
|
@mrobinson I was working on some MotionMark benchmarks today. While looking at #1648 it seemed that we were drawing a huge number of redundant clip masks. I took your PR here, and removed the parts we're still discussing (the GPU parts). You can see what I ended up with here - glennw@c9ce44b. The good news is that those improvements to clip chain filtering make a huge improvement to #1648. These changes remove all the redundant clip masks, and take the score on my machine from 165 to 472. Would you be happy to follow up on that and we could land this PR with the CPU-side improvements now, and then continue investigating the GPU-side improvements as a separate issue? cc @kvark |
|
Sure. I'll try to pull apart the CPU and GPU bits and open a new PR to follow up on the screen space clipping idea. |

mrobinson commented Nov 20, 2017 (edited by larsbergstrom)
We calculate the real inner and outer boundaries of clips in screen
space and use it to discard clips and generate masks that are a little
smaller in some cases. This results in a small performance improvement
on pages with complicated clipping.