Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Off-screen clip mask generation #556

Merged
merged 1 commit into from Nov 16, 2016
Merged

Off-screen clip mask generation #556

merged 1 commit into from Nov 16, 2016

Conversation

@kvark
Copy link
Member

kvark commented Nov 11, 2016

This is the second major step towards #498
Edit: now actually includes the third step as well (removal of *_clip shaders).

The PR makes all the clip masks to be generated via the cached rendering tasks (of the new kind). These task work on the area of intersection between all the clip items. The resulting draw calls are being batched via the new ClipBatcher.

It supports everything that we currently support, plus the actual handling of arbitrary number of clips.


This change is Reviewable

@kvark kvark mentioned this pull request Nov 11, 2016
4 of 4 tasks complete
Cargo.lock Outdated
@@ -314,7 +314,7 @@ dependencies = [

[[package]]
name = "ipc-channel"
version = "0.6.0"
version = "0.5.1"

This comment has been minimized.

Copy link
@frewsxcv

frewsxcv Nov 11, 2016

Member

Seems unfortunate this is downgraded

This comment has been minimized.

Copy link
@kvark

kvark Nov 11, 2016

Author Member

That's actually not me, it's Glenn - bf3ac6f
not sure how it got in my PR though

This comment has been minimized.

Copy link
@glennw

glennw Nov 11, 2016

Member

There's a bug in ipc-channel 0.6 preventing us from landing the update in Servo, so we had to revert the update in WR (for now).

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

The revert to 0.5.1 is landed in WR master, so if you rebase all the changes related to ipc-channel should disappear.

// ResourceCache allocates/load the actual data
// will be simplified after the TextureCache upgrade
pub image: Option<ImageMask>,
pub device_rect: DeviceRect,

This comment has been minimized.

Copy link
@kvark

kvark Nov 12, 2016

Author Member

One problem that I see with this change is the computation of device_rect. Since layers are supposed to move, this rectangle needs to be re-computed, but there is not tracking of the layers at this point. Is there an example where the layers state is being cached and updated upon any movement? Alternatively, I can re-compute the rectangle every frame.

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Doesn't this get re-computed every frame anyway, since the generate() call is done during cull_layers()?

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

generate() is currently only done if metadata.clip_cache_info.is_none()

Copy link
Member

glennw left a comment

@kvark Not a full review yet, just some nits from a quick scan. I'll take a more in-depth look today. The general idea seems great though :)

Cargo.lock Outdated
@@ -314,7 +314,7 @@ dependencies = [

[[package]]
name = "ipc-channel"
version = "0.6.0"
version = "0.5.1"

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

The revert to 0.5.1 is landed in WR master, so if you rebase all the changes related to ipc-channel should disappear.

@@ -18,7 +18,7 @@ byteorder = "0.5"
euclid = "0.10"
fnv="1.0"
gleam = "0.2"
ipc-channel = "0.6"
ipc-channel = "0.5"

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Remove

int render_task_index;
int layer_index;
int data_index;
int pad;

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

The pad field can be removed since this is a VS only structure.

* file, You can obtain one at http://mozilla.org/MPL/2.0/. */

void main(void) {
oFragColor = vec4(1, 1, 1, 1);

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

1 -> 1.0 for strict GLESv3 compilers.

#[derive(Debug, Copy, Clone, Eq, PartialEq, Hash)]
pub struct ClipAddressRange {
pub start: GpuStoreAddress, // start GPU address
pub count: u32, // number of items, not bytes

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Perhaps rename to item_count?

@@ -36,7 +36,7 @@ pub struct RenderBackend {
next_namespace_id: IdNamespace,

resource_cache: ResourceCache,
dummy_resources: DummyResources,
_dummy_resources: DummyResources,

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Can this be removed completely?

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

agreed

&projection);
}

// Draw the clip items into the tiled alpha mask.
if true {

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Remove condition

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

eh, ok :)

@@ -567,11 +645,14 @@ struct CompileTileContext<'a> {
struct RenderTargetContext<'a> {
layer_store: &'a [StackingContext],
prim_store: &'a PrimitiveStore,
resource_cache: &'a ResourceCache,
_clip_stack: &'a ClipRegionStack,

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Unused?

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

removed now

@glennw
Copy link
Member

glennw commented Nov 14, 2016

@kvark What was the reason for having to cover the entire tile with each clip primitive again? I think we'll probably need to find a solution that doesn't require this - I think the performance overhead will probably be too high otherwise. I'll pull the branch today and do some performance tests though to check that.

@kvark
Copy link
Member Author

kvark commented Nov 14, 2016

The name became slightly misleading. The clip tile is not the tile a clip belongs to, it's actually just representing the screen space boundary of thr intersection between all clip instances, so the performance is not expected to suffer.

On Nov 13, 2016, at 19:46, Glenn Watson notifications@github.com wrote:

@kvark What was the reason for having to cover the entire tile with each clip primitive again? I think we'll probably need to find a solution that doesn't require this - I think the performance overhead will probably be too high otherwise. I'll pull the branch today and do some performance tests though to check that.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

return cci;
}

// The transformed vertex function that always covers the whole tile with the primitive

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Let's specify that the whole tile means the intersection of the clip instances here.

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

reworked the whole tile confusion, at last

@glennw
Copy link
Member

glennw commented Nov 14, 2016

Doing some testing with this today, and ran into something which I assume is a bug - unless I'm misunderstanding something:

I have a test case that contains a single div, with the follow style:

  position: absolute;
  top: 100px;
  left: 100px;
  width: 100px;
  height: 100px;
  background: red;
  border-radius: 32px;

With some logging to show what the renderer is submitting for clip cache draws, I see:

draw_target Some((TextureId { name: 1033, target: 35866 }, 0))
clear [CacheClipInstance { task_id: 12, layer_index: 0, address: GpuStoreAddress(0), pad: 0 }, 
CacheClipInstance { task_id: 13, layer_index: 0, address: GpuStoreAddress(0), pad: 0 }]
rect [CacheClipInstance { task_id: 13, layer_index: 0, address: GpuStoreAddress(0), pad: 0 }]

draw_target None
clear [CacheClipInstance { task_id: 14, layer_index: 0, address: GpuStoreAddress(0), pad: 0 }]

Which suggests that it's issuing two clear draw calls for the cache render, and also a clear on the main render target. I was expecting to see 1 clear + 1 rect in the render target, and no clip items in the main frame buffer target.

(I ensured the div is positioned completely inside one tile, just to rule out any bugs related to crossing tile boundaries).

@glennw
Copy link
Member

glennw commented Nov 14, 2016

Ah, those extra clears are related to the dummy mask item - I guess that always gets created at the moment, even if not needed?

@kvark
Copy link
Member Author

kvark commented Nov 14, 2016

Yeah, and it's also incomplete/buggy. I've been looking into fixing it, hopefully done by tomorrow. Note: it should be just one draw, I'll have a closer look tomorrow about the rest.

On Nov 13, 2016, at 21:03, Glenn Watson notifications@github.com wrote:

Ah, those extra clears are related to the dummy mask item - I guess that always gets created at the moment, even if not needed?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Copy link
Member

glennw left a comment

This looks promising! Added a few comments, feel free to reply here or perhaps they might be easier to discuss on IRC / Vidyo during the week.

@@ -49,7 +52,6 @@ uniform sampler2DArray sCache;
uniform sampler2D sLayers;
uniform sampler2D sRenderTasks;
uniform sampler2D sPrimGeometry;
uniform sampler2D sClips;

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

We could also remove Clips from the TextureSampler enum, and associated code.


float do_clip() {
// anything outside of the mask is considered transparent
bvec4 inside = lessThanEqual(

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

It's unfortunate to add the cost of this conditional to every shader - is it actually needed? Shouldn't we always be able to just sample from the clip mask?

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

We'd have to clamp to UV borders, and even then - we don't currently guarantee that those borders contain opaque data. So - possible, but non-trivial.
I don't see it as a big deal though, just a few comparisons with masking the fetch result.

vec2 final_pos = tile.screen_origin_task_origin.zw +
tile.size_target_index.xy * aPosition.xy;

gl_Position = uTransform * vec4(final_pos, 0, 1);

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Use float constants

vec2 source_uv = clamped_mask_uv * vClipMaskUvRect.zw + vClipMaskUvRect.xy;
float clip_alpha = texture(sMask, source_uv).r; //careful: texture has type A8

oFragColor = vec4(1, 1, 1, min(alpha, clip_alpha));

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Use float constants


float clip_alpha = rounded_rect(local_pos);

oFragColor = vec4(1, 1, 1, min(alpha, clip_alpha));

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Use float constants

@@ -11,6 +11,7 @@ void main(void) {
vec2 local_pos = vPos;
#endif

alpha = min(alpha, do_clip(local_pos));
//alpha = min(alpha, do_clip(local_pos));
alpha = min(alpha, do_clip());
oFragColor = vColor * vec4(1, 1, 1, alpha);

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Use float constants

}

pub struct ClipRegionStack {
layers: HashMap<StackingContextIndex, LayerInfo>,

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

It would be good if we could get rid of this layers hash map - the LayerInfo struct doesn't really seem to store anything that's not already stored in the main Layer Store. Could we instead just pass in a borrow of the Layer Store when it's needed in generate(), including the current layer id? That way, we guarantee that there is one source of truth for the stacking context transform, that includes scroll offsets etc.

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

done


pub struct ClipRegionStack {
layers: HashMap<StackingContextIndex, LayerInfo>,
image_masks: Vec<ImageMask>,

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

image_masks seems to be unused - remove?

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

done

@@ -5,7 +5,7 @@
use renderer::MAX_VERTEX_TEXTURE_WIDTH;
use std::mem;

#[derive(Debug, Copy, Clone)]
#[derive(Debug, Copy, Clone, Eq, Hash, PartialEq)]

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

I'm unsure about this - maybe it's necessary but it just doesn't seem quite right. Perhaps in the cache key we could store the ImageKey instead of the actual GpuStoreAddress?

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

we may be able to store ImageKey instead of the gpu address in the cache key, but there is also ClipAddressRange in there that uses GpuStoreAddress

let image = match source {
&PrimitiveClipSource::NoClip => None,
&PrimitiveClipSource::Complex(rect, radius) => {
let address = clip_store.alloc(CLIP_DATA_GPU_SIZE);

This comment has been minimized.

Copy link
@glennw

glennw Nov 14, 2016

Member

Ideally we don't want to be calling alloc() on gpu stores during generate. The reason is that the gpu stores are not re-created during scrolling, so this will end up growing the gpu stores during scrolling. What most of the primitives do is a gpu store alloc when the prim is first added (if needed), and then just get a (mutable) reference to that address when they need to update that data. I think that would work OK in this case?

This comment has been minimized.

Copy link
@kvark

kvark Nov 14, 2016

Author Member

Ok, so this appears to be complicating things a bit (implementing now). The old generate logic is going to be split into 3 parts, starting with the GPU store allocation (part 1), and proceeding to actually filling the GPU stores during the primitives generation, if needed (part 2), and followed by the update of the device rectangle (part 3 - again, if needed).

@bors-servo
Copy link
Contributor

bors-servo commented Nov 14, 2016

The latest upstream changes (presumably #552) made this pull request unmergeable. Please resolve the merge conflicts.

@kvark kvark force-pushed the kvark:clip_stack branch from 22e4212 to 9b9918c Nov 14, 2016
@kvark
Copy link
Member Author

kvark commented Nov 14, 2016

@glennw thank you for the fantastic review!
I've addressed most of your concerns (and hope to chat about the rest few). The ClipRegionStack is no more. You can see the code is much more even now (+889 −676 loc), and layers scrolling should be properly supported (running the tests ATM, unsure if it's covered though).
The shader side of things is also clear now - with the distinct ClipArea struct used instead of piggy-backing on the Tile. The dummy task renders nothing, is only added on the first opaque primitive for the tile, and works correctly.

@kvark
Copy link
Member Author

kvark commented Nov 14, 2016

Fixed the dummy task now some more and confirmed the mozilla test suite passes for servo.

@kvark kvark force-pushed the kvark:clip_stack branch from dbc575d to ec63c19 Nov 14, 2016
@bors-servo
Copy link
Contributor

bors-servo commented Nov 14, 2016

The latest upstream changes (presumably #554) made this pull request unmergeable. Please resolve the merge conflicts.

@glennw
Copy link
Member

glennw commented Nov 14, 2016

@kvark Thanks! I'll take another look at this today.

I did some testing, and in general the performance seems quite good. A good page to do some testing with is https://github.com/servo/servo - the performance goes off a cliff as you scroll around halfway down the page. The GPU profiler shows that it is spending the majority of its time generating clip masks, so that would be worth investigating. There also seems to be some clipping artifacts when scrolling on some of the buttons.

It's possible those slowdowns / artifacts are unrelated to this change, but I haven't noticed them previously.

Copy link
Member

glennw left a comment

Left a couple of minor comments. Let's look into the GH performance issue too and find out what is causing that to be slow. Then we can either add a follow up task to fix that, or fix it now if simple. Then this should be ready to merge.

@@ -156,8 +157,13 @@ impl AlphaBatchHelpers for PrimitiveStore {
PrimitiveKind::TextRun |
PrimitiveKind::Image |
PrimitiveKind::Gradient |
PrimitiveKind::BoxShadow => true,

PrimitiveKind::BoxShadow => {

This comment has been minimized.

Copy link
@glennw

glennw Nov 15, 2016

Member

Let's do this test outside the match statement, so it handles borders too.

if let Some(mask_task) = RenderTask::new_mask(self.rect, clip_info) {
current_task.children.push(mask_task);
} else {
// The primitive has clip items but their intersection is empty

This comment has been minimized.

Copy link
@glennw

glennw Nov 15, 2016

Member

Ideally we'd catch this earlier, and not even mark the primitive as visible in this case. But we can do that as a follow up (let's just add a TODO comment here about it).

@kvark
Copy link
Member Author

kvark commented Nov 15, 2016

@glennw I checked the servo page and haven't found anything outstanding in terms of what the implementation is doing. There is quite a large of an area associated with the clips (~80% screen space), so that's where the slowdown can come from. Frankly, I do not observe the "going off the cliff" effect, the FPS drops from 55 to 51 when scrolled to the middle of the document.

Given that the rounded corner radius is typically about 2-3px, up to 10, and the rectangles are pretty big, I can think of a way to drastically reduce the time spent on clip generation:

  1. make sure the target is cleared to 1.0 (opaque). This may require a separate target from the general RT cache since that one is cleared (intentionally) to 0.0.
  2. draw only 4 corners instead of the whole thing.

I'm going to rebase and resolve the last concerns.

@kvark kvark force-pushed the kvark:clip_stack branch 2 times, most recently from 9f1255e to 2942ca4 Nov 15, 2016
…tiple rounded cornered rectangles.

Removal of the _clip shader variants.
@kvark kvark force-pushed the kvark:clip_stack branch from 2942ca4 to 3967c7d Nov 15, 2016
@glennw
glennw approved these changes Nov 16, 2016
@glennw
Copy link
Member

glennw commented Nov 16, 2016

@bors-servo
Copy link
Contributor

bors-servo commented Nov 16, 2016

📌 Commit 3967c7d has been approved by glennw

@bors-servo
Copy link
Contributor

bors-servo commented Nov 16, 2016

Test exempted - status

@bors-servo bors-servo merged commit 3967c7d into servo:master Nov 16, 2016
3 checks passed
3 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
homu Test exempted
Details
bors-servo added a commit that referenced this pull request Nov 16, 2016
Off-screen clip mask generation

This is the second major step towards #498
Edit: now actually includes the third step as well (removal of `*_clip` shaders).

The PR makes all the clip masks to be generated via the cached rendering tasks (of the new kind). These task work on the area of intersection between all the clip items. The resulting draw calls are being batched via the new `ClipBatcher`.

It supports everything that we currently support, plus the actual handling of arbitrary number of clips.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/webrender/556)
<!-- Reviewable:end -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.