Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce per-instance size from 32 -> 16 bytes. #2839

Merged
merged 1 commit into from Jun 25, 2018

Conversation

@gw3583
Copy link
Collaborator

gw3583 commented Jun 22, 2018

Add a new PrimitiveHeader concept, which contains all the common
information for a given primitive (such as scroll node, render
task, z value etc).

Instances contain an index to the primitive header, and read the
information from that. This allows the instance size to be
halved. There is a slight extra vertex shader cost, due to
an extra indirection. However, this was not apparent in any
profiling I did, since the vertex shader time typically makes
up a very small amount of overall time. Additionally, the
primitive header texels are likely to be in the texture cache
for each glyph that is fetched. A possible future improvement
is to remove that extra indirection by writing some of that
data, such as the render task address, directly into the primitive
header.

On sites such as nytimes.com or wikipedia.org, where there are
typically a large number of glyphs, this saves several hundred
kB per frame of data being sent to the GPU, which improves the
compositor / driver CPU time.

Also remove the local_clip_chains vector and texture. Now, we just
calculate this as part of the primitive header and include it
there. This can save significant amounts of texture upload on
pages that have a lot of clip chains or clip nodes, relative to
the visible primitive count.

In future, both the PrimitiveHeader and instance structures can
easily be further compressed again, but this can be done as a
follow up.

Finally, although this is a useful optimization by itself, it's
mostly prep work for some changes I have planned related to
how we store and upload scroll nodes. Hopefully these changes will
be both an optimization and make it easier to fix some of the
correctness issues we have with nested 3d transforms.


This change is Reviewable

@gw3583
Copy link
Collaborator Author

gw3583 commented Jun 22, 2018

r? @kvark

Pending try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b28c8ddb9153725de59d6aa3745e368a138a45c9

(thus far there appears to be one failure in R8 - I'm not sure yet if that's related to this PR or something else).

@staktrace
Copy link
Contributor

staktrace commented Jun 22, 2018

I haven't seen that R8 failure anywhere else so I would lean towards it being related to this PR.

@kvark
Copy link
Member

kvark commented Jun 23, 2018

In general, this makes sense to me. We have thousands of instances, and there is a ton of redundancy coming from our primitive/segment division, so we can move some per-instance data out.

The cost is slightly worrying still, given that the VS can't hide the fetching latency by doing anything else - everything depends on that header.

However, this was not apparent in any
profiling I did, since the vertex shader time typically makes
up a very small amount of overall time.

Did you see any improvement?


Reviewed 17 of 17 files at r1.
Review status: all files reviewed, 3 unresolved discussions (waiting on @gw3583)


webrender/res/ps_text_run.glsl, line 129 at r1 (raw file):

void main(void) {
    int prim_header_address = aData.x;
    int glyph_index = aData.y;

definitely feels like we can shave a few more bytes off an instance in the future


webrender/src/gpu_types.rs, line 211 at r1 (raw file):

#[cfg_attr(feature = "replay", derive(Deserialize))]
pub struct PrimitiveHeaderF {
    pub local_rect: LayoutRect,

I wonder if we could just quantize those, compressing and letting us have all the header info in the same texture


webrender/src/renderer.rs, line 310 at r1 (raw file):

            TextureSampler::Dither => TextureSlot(8),
            TextureSampler::SharedCacheA8 => TextureSlot(9),
            TextureSampler::PrimitiveHeadersF => TextureSlot(11),

where did 10 go?


Comments from Reviewable

Add a new PrimitiveHeader concept, which contains all the common
information for a given primitive (such as scroll node, render
task, z value etc).

Instances contain an index to the primitive header, and read the
information from that. This allows the instance size to be
halved. There is a slight extra vertex shader cost, due to
an extra indirection. However, this was not apparent in any
profiling I did, since the vertex shader time typically makes
up a very small amount of overall time. Additionally, the
primitive header texels are likely to be in the texture cache
for each glyph that is fetched. A possible future improvement
is to remove that extra indirection by writing some of that
data, such as the render task address, directly into the primitive
header.

On sites such as nytimes.com or wikipedia.org, where there are
typically a large number of glyphs, this saves several hundred
kB per frame of data being sent to the GPU, which improves the
compositor / driver CPU time.

Also remove the local_clip_chains vector and texture. Now, we just
calculate this as part of the primitive header and include it
there. This can save significant amounts of texture upload on
pages that have a lot of clip chains or clip nodes, relative to
the visible primitive count.

In future, both the PrimitiveHeader and instance structures can
easily be further compressed again, but this can be done as a
follow up.

Finally, although this is a useful optimization by itself, it's
mostly prep work for some changes I have planned related to
how we store and upload scroll nodes. Hopefully these changes will
be both an optimization and make it easier to fix some of the
correctness issues we have with nested 3d transforms.
@gw3583 gw3583 force-pushed the gw3583:prim-headers-2 branch from 2a8355f to 7b5a93d Jun 24, 2018
@gw3583
Copy link
Collaborator Author

gw3583 commented Jun 24, 2018

Review status: 16 of 17 files reviewed, 3 unresolved discussions (waiting on @kvark and @gw3583)


webrender/res/ps_text_run.glsl, line 129 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

definitely feels like we can shave a few more bytes off an instance in the future

Yup, plenty of room for further compression.


webrender/src/gpu_types.rs, line 211 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

I wonder if we could just quantize those, compressing and letting us have all the header info in the same texture

I think that's quite possible as a follow up, yes.


webrender/src/renderer.rs, line 310 at r1 (raw file):

Previously, kvark (Dzmitry Malyshau) wrote…

where did 10 go?

Oops, forgot to reorder when I removed the clip rects texture. Fixed.


Comments from Reviewable

@gw3583
Copy link
Collaborator Author

gw3583 commented Jun 24, 2018

@kvark Addressed the review comments. In terms of profiling, I typically see a very small cost in VS time, and a significant saving in render backend / compositor time (for example, on nytimes.com - VS time is now 5.4% of the scene, previously 5.2%, but compositor and backend time both drop by ~0.3ms, which is a significant portion of the total CPU time).

@staktrace I verified that the failure in R8 is occurring in a nightly build when WR is enabled - I'm not sure how that occurred, but it seems unrelated to this PR.

@gw3583
Copy link
Collaborator Author

gw3583 commented Jun 24, 2018

TC failure appears to be a network issue:

[taskcluster:error] Error: Error: (HTTP code 500) server error - {"message":"Get https://registry-1.docker.io/v2/staktrace/webrender-test/manifests/freetype28: received unexpected HTTP status: 503 Service Unavailable"}
@kvark
Copy link
Member

kvark commented Jun 25, 2018

:lgtm:


Reviewed 1 of 1 files at r2.
Review status: all files reviewed, 2 unresolved discussions (waiting on @kvark)


Comments from Reviewable

@kvark
Copy link
Member

kvark commented Jun 25, 2018

@bors-servo
Copy link
Contributor

bors-servo commented Jun 25, 2018

📌 Commit 7b5a93d has been approved by kvark

@bors-servo
Copy link
Contributor

bors-servo commented Jun 25, 2018

Testing commit 7b5a93d with merge 19a2a26...

bors-servo added a commit that referenced this pull request Jun 25, 2018
Reduce per-instance size from 32 -> 16 bytes.

Add a new PrimitiveHeader concept, which contains all the common
information for a given primitive (such as scroll node, render
task, z value etc).

Instances contain an index to the primitive header, and read the
information from that. This allows the instance size to be
halved. There is a slight extra vertex shader cost, due to
an extra indirection. However, this was not apparent in any
profiling I did, since the vertex shader time typically makes
up a very small amount of overall time. Additionally, the
primitive header texels are likely to be in the texture cache
for each glyph that is fetched. A possible future improvement
is to remove that extra indirection by writing some of that
data, such as the render task address, directly into the primitive
header.

On sites such as nytimes.com or wikipedia.org, where there are
typically a large number of glyphs, this saves several hundred
kB per frame of data being sent to the GPU, which improves the
compositor / driver CPU time.

Also remove the local_clip_chains vector and texture. Now, we just
calculate this as part of the primitive header and include it
there. This can save significant amounts of texture upload on
pages that have a lot of clip chains or clip nodes, relative to
the visible primitive count.

In future, both the PrimitiveHeader and instance structures can
easily be further compressed again, but this can be done as a
follow up.

Finally, although this is a useful optimization by itself, it's
mostly prep work for some changes I have planned related to
how we store and upload scroll nodes. Hopefully these changes will
be both an optimization and make it easier to fix some of the
correctness issues we have with nested 3d transforms.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/webrender/2839)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

bors-servo commented Jun 25, 2018

☀️ Test successful - status-appveyor, status-taskcluster
Approved by: kvark
Pushing 19a2a26 to master...

@bors-servo bors-servo merged commit 7b5a93d into servo:master Jun 25, 2018
2 of 4 checks passed
2 of 4 checks passed
Taskcluster (pull_request) TaskGroup: failure
Details
code-review/reviewable 2 discussions left (kvark)
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
homu Test successful
Details
@staktrace
Copy link
Contributor

staktrace commented Jun 25, 2018

@staktrace I verified that the failure in R8 is occurring in a nightly build when WR is enabled - I'm not sure how that occurred, but it seems unrelated to this PR.

@glennw did you verify this on try or on a local build? On try the R8 is 100% green without your patch and 100% failing with your patch. https://bugzilla.mozilla.org/show_bug.cgi?id=1470125#c4

@gw3583
Copy link
Collaborator Author

gw3583 commented Jun 25, 2018

@staktrace I did only verify with a local reftest run. So, (unless I messed up my testing somewhere) this means:

  • That test fails on a nightly build running reftests locally, but not on CI.
  • This patch makes that test failure also occur on CI.

Taking a closer look at the test, it uses scrollLeft - could it be that the screenshot is being taken in Gecko + WR for the reftest before that scroll applies? It seems unlikely, but that may explain why CI reports it as succeeding?

I'll investigate further today and see if I can work out what is happening.

gw3583 added a commit to gw3583/webrender that referenced this pull request Jun 26, 2018
This fixes a reftest failure in Gecko that was introduced by
the patch in servo#2839.

Make sure that the tight local clip rect used for image and
gradient tiles also includes the local clip rect provided
by the clip-chain.
@gw3583
Copy link
Collaborator Author

gw3583 commented Jun 26, 2018

@staktrace OK, so there are two bugs in play here, which was confusing me. There's a genuine timing bug where the DLs sent to WR aren't always the same for this test case. That was masking that there was also a genuine bug in this patch, related to local clip rects on images that are large enough to require tiling. The fix is #2843.

@gw3583 gw3583 deleted the gw3583:prim-headers-2 branch Jun 26, 2018
bors-servo added a commit that referenced this pull request Jun 26, 2018
Fix local clip rect for tiled images.

This fixes a reftest failure in Gecko that was introduced by
the patch in #2839.

Make sure that the tight local clip rect used for image and
gradient tiles also includes the local clip rect provided
by the clip-chain.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/webrender/2843)
<!-- Reviewable:end -->
@staktrace
Copy link
Contributor

staktrace commented Jun 26, 2018

Thanks for chasing this down! I've confirmed that #2843 fixes it on my try pushes. It's odd that the test is sending different DLs to WR though; I guess it might decide to flush a paint before the scroll but the reftest snapshot shouldn't happen until after the scroll.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.