Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better inner and outer rect calculation for ClipChains #2062

Closed
wants to merge 1 commit into from

Conversation

@mrobinson
Copy link
Member

mrobinson commented Nov 20, 2017

We calculate the real inner and outer boundaries of clips in screen
space and use it to discard clips and generate masks that are a little
smaller in some cases. This results in a small performance improvement
on pages with complicated clipping.


This change is Reviewable

@mrobinson mrobinson requested a review from glennw Nov 20, 2017
@mrobinson
Copy link
Member Author

mrobinson commented Nov 20, 2017

Here is how this looks on the W3 testcase from #1817.

Before                                  After
CPU: 1.54                               CPU: 1.87
Compositor: 15.39                       Compositor: 15.33
GPU: 11.95                              GPU: 10.66

I'm still looking at the regression in CPU time, but I'd love to get general comments on this approach.

I did a Gecko try push as well and the only issues I see are some GTest failures, which I think are unrelated.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3e470a85a5a66cf9e3ae4a97b70e85ba4b35da67

@mrobinson mrobinson force-pushed the mrobinson:optimize-clipping branch from 5cb1272 to 117419e Nov 20, 2017
@mrobinson
Copy link
Member Author

mrobinson commented Nov 20, 2017

I updated the reference image for one test that had a small number of subpixel differences.

@glennw
Copy link
Member

glennw commented Nov 20, 2017

I'll look at this in detail in the morning.

One question though - do you have a sense of what the profile numbers would be like with the improved inner / outer rect calculation and filtering, but without the shader changes?

I wonder if the majority of the wins might be from the CPU-side filtering? (I thought that the clip masks were already intersected with the screen rect of the clip node - since we support the clip mask being a portion of the primitive rect - but maybe that's not working / slightly different).

@mrobinson
Copy link
Member Author

mrobinson commented Nov 20, 2017

@glennw So I took another look at this. I think we need to do this with the change to the shader. The case (and I suspect it might be the important case) is when we can avoid masking at all. This happens when the screen space rectangle of the ClipChain is sufficient to clip the primitive. in this case we can just skip creating the task entirely and rely on the changes to the fragment shader. Maybe there is a better way to do this, but I don't think we can use local geometry in this case.

@mrobinson mrobinson force-pushed the mrobinson:optimize-clipping branch from 117419e to 7b2abb1 Nov 20, 2017
@mrobinson
Copy link
Member Author

mrobinson commented Nov 20, 2017

I pushed a new version of this PR which moves outer rect calculation to the ClipScrollTree update. This moves a little bit of work from the per-primitive parts to the per-ClipScrollNode part. I'm not seeing a CPU speedup for the W3C case, but I think that this cleaner in general.

I am also thinking about extending this sort of thing to handling some situations where we can generate shorter ClipChains in general.

@bors-servo
Copy link
Contributor

bors-servo commented Nov 20, 2017

The latest upstream changes (presumably #2052) made this pull request unmergeable. Please resolve the merge conflicts.

@kvark kvark self-requested a review Nov 20, 2017
@kvark
Copy link
Member

kvark commented Nov 20, 2017

Thanks @mrobinson, impressive work!
I expressed the current questions/concerns. Will want to take another look after the first wave is addressed.


Reviewed 22 of 22 files at r1.
Review status: all files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


webrender/res/prim_shared.glsl, line 615 at r1 (raw file):

    vScreenClipRect.xy += task_offset;
    if (task.size == vec2(0.0)) {
        vScreenClipRect.y = (uBufferHeight - vScreenClipRect.y) - vScreenClipRect.w;

why are we doing this?


webrender/res/prim_shared.glsl, line 889 at r1 (raw file):

}

float apply_rectangular_screen_space_clip(vec4 screenSpaceClipRect) {

any reason not to use vScreenClipRect directly here?


webrender/res/prim_shared.glsl, line 893 at r1 (raw file):

    // is inside the screen space clipping box.
    vec2 topLeft = screenSpaceClipRect.xy;
    vec2 bottomRight = screenSpaceClipRect.xy + vec2(screenSpaceClipRect.z, screenSpaceClipRect.w);

nit: could use swizzling screenSpaceClipRect.zw


webrender/src/clip.rs, line 198 at r1 (raw file):

        }

        let outer = match can_calculate_outer_rect {

isn't setting this to true equivalent to saying local_outer = None?


webrender/src/clip.rs, line 203 at r1 (raw file):

        };

        let inner = match can_calculate_inner_rect {

similarly, setting local_inner to zero() should do the same


webrender/src/clip.rs, line 216 at r1 (raw file):

        resource_cache: &mut ResourceCache,
    ) {
        if self.clips.is_empty() {

I don't think the early out is useful here since all the function does is iterating the list twice


webrender/src/clip_scroll_node.rs, line 309 at r1 (raw file):

                        self.reference_frame_relative_scroll_offset,
                    scroll_offset: self.scroll_offset(),
                    combined_clip_outer_bounds: LayerRect::new(

could just do self.combined_clip_outer_bounds.to_f32()


webrender/src/clip_scroll_node.rs, line 352 at r1 (raw file):

            clip_sources.get_screen_bounds(&self.world_viewport_transform, device_pixel_ratio);

        let combined_outer_screen_rect = match screen_outer_rect.as_ref() {

no need for as_ref() in a match


webrender/src/clip_scroll_node.rs, line 359 at r1 (raw file):

            None => state.combined_outer_clip_bounds,
        };
        state.combined_outer_clip_bounds = combined_outer_screen_rect;

hmm, the old code doesn't modify state.combined_outer_clip_bounds. Why are we doing it now?


webrender/src/frame_builder.rs, line 1536 at r1 (raw file):

        device_pixel_ratio: f32,
        scene_properties: &SceneProperties,
        screen_rect: &DeviceIntRect,

there is already self.screen_rect, can we use it?


webrender/src/prim_store.rs, line 1256 at r1 (raw file):

                    screen_inner_rect,
                    combined_outer_screen_rect:
                        combined_outer_rect.unwrap_or_else(DeviceIntRect::zero),

I don't think zero() is appropriate replacement for the lack of the outer bound


webrender/src/prim_store.rs, line 1266 at r1 (raw file):

        // If everything is clipped out, then we don't need to render this primitive.
        let combined_outer_rect = match combined_outer_rect {
            Some(rect) if !rect.is_empty() => {

leaving a note for myself that this is related to the zero() assignment somewhere above


webrender/src/prim_store.rs, line 1267 at r1 (raw file):

        let combined_outer_rect = match combined_outer_rect {
            Some(rect) if !rect.is_empty() => {
                metadata.screen_rect = Some(rect);

nit: let's move this assignment after the match, so that the match becomes non-mutating


webrender/src/prim_store.rs, line 1278 at r1 (raw file):

            .chain(ClipChainNodeIter { current: extra_clip })
            .filter_map(|node| {
                if !node.screen_inner_rect.is_empty() {

nit: could be combined_inner_rect = if ... {} else {..};


webrender/src/renderer.rs, line 1106 at r1 (raw file):

        projection: &Transform3D<f32>,
        mode: M,
        buffer_height: u32,

I'd find it easier to understand if it was named "target_height" instead.


webrender/src/util.rs, line 442 at r1 (raw file):

        LayerRect::new(
            LayerPoint::new(f32::MIN / 2.0, f32::MIN / 2.0),
            LayerSize::new(f32::MAX, f32::MAX),

would this guarantee that origin + size is not going to end up at infinity?


Comments from Reviewable

@glennw
Copy link
Member

glennw commented Nov 21, 2017

OK, I think I have worked out in my head what my concern is with this approach. I get easily confused by the various permutations, so I'll draw up a diagram as reference and try to explain shortly - it's quite likely I am not thinking it all through properly :)

@glennw
Copy link
Member

glennw commented Nov 21, 2017

OK, my thoughts below. I pre-emptively apologize if the stuff below sounds a bit short - I'm basically just trying to list a series of statements, so that we can easily discuss which bits below are correct and which bits are wrong, or missing some context / information. Let me know if that makes any sense (or completely misses the point!) - we can also set up a Vidyo / Skype call to discuss in detail, if that's easier?

This is a simplified set of examples - but I think it extends to cover more complex cases (e.g. when there are multiple clips in a clip chain).

Red blocks are clips. Green blocks are primitives.

untitled diagram

In each of the four scenarios, we have (1) primitive completely inside clip (2) primitive partially clipped (3) primitive not affected by clip.

In (A) both clip and primitive are axis-aligned - the coordinate systems are considered compatible.
In (B) the primitive has a rotation - the coordinate systems are considered incompatible.
In (C) the clip has a rotation - the coordinate systems are considered incompatible.
In (D) both clip and primitive are rotated by a parent node - the coordinate systems are considered compatible.

I think the following statements are true:
- This change helps with avoiding a clip mask in scenario (A) and (B).
- (C) still requires a mask, but only in the overlapping case. If we detect the primitive is inside/outside the clip region, we can skip the mask. We can do a cheap check first, with AABB checks. If the potential clip mask is large enough to warrant it, and AABB check fails, we can use SAT or similar.
- (D) doesn't require a mask with or without this change (since we can apply it as a local clip in the vertex shader).
- In both (A) and (B) we can detect that the primitive is inside or outside the region (using AABB / SAT), we can skip the clip mask for those on CPU.

If the above is correct, what we get from this PR is to skip the clip mask in (A) and (B) when we detect that the clip region of interest intersects with the primitive.

However, in the case of (A), we can apply the same logic as (D) - that is, the coordinate systems are compatible, and therefore we can skip the clip mask and apply this as a local clip in the vertex shader.

The point above is important, for two reasons:
(1) Scenario (A) is certainly the vast majority of cases we will encounter on real web pages (but it's still important to handle (B) efficiently).
(2) In terms of GPU time, we are always limited by the number of fragments we process. This solution still invokes the fragment shader for those extra pixels (when drawing the primitive). Although it does save drawing a clip mask, I think we can do better by skipping the clip mask and applying a local clip in the vertex shader in these cases - this means we don't invoke the FS at all for the clipped fragments.

So, it does seem like this is a potential win for the intersecting case in scenario (B). My guess is that we almost never actually see this in real world, and that the benefits we're seeing are because we're not filtering clip masks out (and into a local clip when compatible) as much as we could be. There's also a slight downside that we have extra instructions running for every fragment shader to apply that screen space clip rect.

Apologies for the wall of text. Thoughts?

@mrobinson
Copy link
Member Author

mrobinson commented Nov 21, 2017

Review status: all files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


webrender/res/prim_shared.glsl, line 615 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

why are we doing this?

This is because we need to flip the y axis when drawing the screen. I've left a comment here, because it is indeed pretty confusing.


webrender/res/prim_shared.glsl, line 889 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

any reason not to use vScreenClipRect directly here?

Nope. No reason. I'll make that change.


webrender/res/prim_shared.glsl, line 893 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

nit: could use swizzling screenSpaceClipRect.zw

Nice catch. This was due to organic evolution of the code.


webrender/src/clip.rs, line 198 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

isn't setting this to true equivalent to saying local_outer = None?

It is equivalent, but I think the extra variables make the code easier to read here.


webrender/src/clip.rs, line 203 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

similarly, setting local_inner to zero() should do the same

Ditto.


webrender/src/clip.rs, line 216 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

I don't think the early out is useful here since all the function does is iterating the list twice

Okay. That makes sense. I'll also do all the work in a single loop since I think that's a bit cleaner.


webrender/src/clip_scroll_node.rs, line 309 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

could just do self.combined_clip_outer_bounds.to_f32()

Yep! With an addition change which is to make this member a DeviceRect (which it should have been anyway).


webrender/src/clip_scroll_node.rs, line 352 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

no need for as_ref() in a match

Thanks!


webrender/src/clip_scroll_node.rs, line 359 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

hmm, the old code doesn't modify state.combined_outer_clip_bounds. Why are we doing it now?

Oh. It is also modified below, but I had forgotten to remove the previous statement. I'll fix that. Nice catch!


webrender/src/frame_builder.rs, line 1536 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

there is already self.screen_rect, can we use it?

There is self.screen_size, but I'll add a method to generate a screen_rect easily from that. That will allow us to avoid passing the parameter here.


webrender/src/prim_store.rs, line 1256 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

I don't think zero() is appropriate replacement for the lack of the outer bound

So we need to represent two cases:

  1. When the outer bounds are undefined.
  2. When the outer bounds are an empty rect (ie clipped out entirely).

Currently we use None to mean when the outer bounds are undefined. In this case, they cannot be undefined (which is why this parameter is not an Option), so I have used the empty rect here to mean that we are entirely clipped out.


webrender/src/prim_store.rs, line 1267 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

nit: let's move this assignment after the match, so that the match becomes non-mutating

Okay.


webrender/src/prim_store.rs, line 1278 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

nit: could be combined_inner_rect = if ... {} else {..};

Okay. That's a bit nicer.


webrender/src/renderer.rs, line 1106 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

I'd find it easier to understand if it was named "target_height" instead.

Okay. No problem.


webrender/src/util.rs, line 442 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

would this guarantee that origin + size is not going to end up at infinity?

I think that dividing by two does this, but I have to admit I am not certain. I followed the same strategy used for other implementations of max_rect().


Comments from Reviewable

@kvark
Copy link
Member

kvark commented Nov 21, 2017

@glennw
Great pictures showing different classes of use-cases for clipping! 👍

I think the following statements are true:

Yes, except for the fact A is already classified as D (without the need of this PR), as you mentioned later:

However, in the case of (A), we can apply the same logic as (D) - that is, the coordinate systems are compatible, and therefore we can skip the clip mask and apply this as a local clip in the vertex shader.

So yes, the PR only addresses use-case (B), and it does add a bit of overhead for the general case. How much though, is still to be figure out.

(2) In terms of GPU time, we are always limited by the number of fragments we process. This solution still invokes the fragment shader for those extra pixels (when drawing the primitive). Although it does save drawing a clip mask,

I don't think this PR makes us process more fragments than before. Instead of drawing a bunch of stuff into the clip mask, we are now drawing less fragments of the main primitive. So, it appears to me that we'd be processing less fragments for (B) with this PR, and same number of fragments for (A), (C), and (D) with a few added instructions.

I think we can do better by skipping the clip mask and applying a local clip in the vertex shader in these cases - this means we don't invoke the FS at all for the clipped fragments.

Well, that part needs some elaboration :) Reason we can't do it is because the local clip is in the local space, and the clip we are trying to process here is in screen space. So vertex shader would only be able to help if we had gl_ClipDistance, and even then it would likely to turn out into simple discard() calls on the modern HW, which we can do already ourselves.

@mrobinson
Copy link
Member Author

mrobinson commented Nov 21, 2017

Great pictures showing different classes of use-cases for clipping! 👍

Totally agree! Those are really nice. I would love to put together some documentation with these types of diagrams.

Well, that part needs some elaboration :) Reason we can't do it is because the local clip is in the local space, and the clip we are trying to process here is in screen space. So vertex shader would only be able to help if we had gl_ClipDistance, and even then it would likely to turn out into simple discard() calls on the modern HW, which we can do already ourselves.

This brings up a good point which is to test the impact of using discard instead of 0 alpha.

@kvark
Copy link
Member

kvark commented Nov 21, 2017

So vertex shader would only be able to help if we had gl_ClipDistance

Another alternative is to have scissor-enabled batches. This would work best in terms of number of fragments and their processing cost, would be fairly straightforward to put in (least intrusive of all proposals) but would introduce batch breaks. If what @glennw was saying is correct ("we almost never actually see this in real world"), then we can totally afford this.

@mrobinson
Copy link
Member Author

mrobinson commented Nov 21, 2017

I just tested this, and interestingly enough using an early branch/discard approach seems to erase all performance benefits of this PR on my hardware (Mesa DRI Intel(R) HD Graphics P530 (Skylake GT2)).

We calculate the real inner and outer boundaries of clips in screen
space and use it to discard clips and generate masks that are a little
smaller in some cases. This results in a small performance improvement
on pages with complicated clipping.
@mrobinson mrobinson force-pushed the mrobinson:optimize-clipping branch from 7b2abb1 to dd29d63 Nov 21, 2017
@mrobinson
Copy link
Member Author

mrobinson commented Nov 21, 2017

I've pushed a new version of this patch which seems to slightly improve GPU performance and decrease the regression to CPU performance. It also makes WebRender a bit smarter about creating shorter ClipChains.

Before
CPU: 1.50
Compositor: 15.15
GPU: 11.88-12.00

After
CPU: ~1.65
Compositor: 15.30
GPU: 10 - 10.77

I'm pretty sure the compositor differences are just noise here.

@glennw
Copy link
Member

glennw commented Nov 21, 2017

@mrobinson In terms of GPU timing - could you try with vsync disabled? That will certainly be the cause of the compositor timing being weird (it includes the wait for vsync time), and also has a significant effect on the GPU timer queries (on my machine, anyway).

@glennw
Copy link
Member

glennw commented Nov 21, 2017

I'll have a proper read through this today (thanks for the detailed replies!) but I do quite like the idea of scissor enabled batches - it'd be interesting to see what effect that has on timing (if it is indeed quick to hack in and test).

@bors-servo
Copy link
Contributor

bors-servo commented Nov 22, 2017

The latest upstream changes (presumably #2069) made this pull request unmergeable. Please resolve the merge conflicts.

@glennw
Copy link
Member

glennw commented Nov 22, 2017

I don't think this PR makes us process more fragments than before. Instead of drawing a bunch of stuff into the clip mask, we are now drawing less fragments of the main primitive. So, it appears to me that we'd be processing less fragments for (B) with this PR, and same number of fragments for (A), (C), and (D) with a few added instructions.

That's correct - I didn't mean to imply that we're drawing more fragments than previously. What I meant was that if we're using this technique to mask out fragments in the (A) case, we'd be drawing more fragments than if we were to use the local clip rect in that case. But if I understand the PR, we're not processing less fragments either - since we end up still running the FS on all the pixels, just masking out the result at the end of the shader?

Well, that part needs some elaboration :) Reason we can't do it is because the local clip is in the local space, and the clip we are trying to process here is in screen space. So vertex shader would only be able to help if we had gl_ClipDistance, and even then it would likely to turn out into simple discard() calls on the modern HW, which we can do already ourselves.

Yea, I was discussing that in the context of (A). Certainly if we are encountering a lot of the (B) cases, we either need this technique or clip masks (assuming we don't have gl_ClipDistance etc).

To try and sum that up:

  • I suspect that in this test case, we're primarily encountering cases of (A), rather than (B).
  • I suspect that we're not doing the optimal thing in the case of (A) - that is, handling these as a local clip rect and avoiding mask generation where possible.
  • If the above is true, I suspect this technique would be faster than the current code, due to avoiding mask generation, but that if we were in fact handling the (A) cases with a local clip, we'd be faster again even without this technique.

There's a lot of ifs, buts and maybes there - perhaps you guys know off hand if some of those guesses above are correct or not?

@kvark
Copy link
Member

kvark commented Nov 22, 2017

@glennw My understanding is that we are already handling (A) optimally, since the local clip does the job. Let's verify!

@kvark
Copy link
Member

kvark commented Nov 22, 2017

If I understand correctly, our reftests/mask/aligned-layer-rect.yaml is case (A). I took a GPU texture and verified that we don't render any clip masks there, as expected. So we are already doing the best job there.

@glennw
Copy link
Member

glennw commented Nov 23, 2017

@mrobinson I was working on some MotionMark benchmarks today. While looking at #1648 it seemed that we were drawing a huge number of redundant clip masks.

I took your PR here, and removed the parts we're still discussing (the GPU parts). You can see what I ended up with here - glennw@c9ce44b.

The good news is that those improvements to clip chain filtering make a huge improvement to #1648. These changes remove all the redundant clip masks, and take the score on my machine from 165 to 472.

Would you be happy to follow up on that and we could land this PR with the CPU-side improvements now, and then continue investigating the GPU-side improvements as a separate issue?

cc @kvark

@mrobinson
Copy link
Member Author

mrobinson commented Nov 23, 2017

Sure. I'll try to pull apart the CPU and GPU bits and open a new PR to follow up on the screen space clipping idea.

@mrobinson mrobinson closed this Nov 27, 2017
@mrobinson mrobinson deleted the mrobinson:optimize-clipping branch Dec 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.