Reduce per-instance size from 32 -> 16 bytes. #2839
r? @kvark Pending try run: (thus far there appears to be one failure in R8 - I'm not sure yet if that's related to this PR or something else).

I haven't seen that R8 failure anywhere else so I would lean towards it being related to this PR.

In general, this makes sense to me. We have thousands of instances, and there is a ton of redundancy coming from our primitive/segment division, so we can move some per-instance data out. The cost is slightly worrying still, given that the VS can't hide the fetching latency by doing anything else - everything depends on that header.
Did you see any improvement?

Reviewed 17 of 17 files at r1.

webrender/res/ps_text_run.glsl, line 129 at r1:
definitely feels like we can shave a few more bytes off an instance in the future

webrender/src/gpu_types.rs, line 211 at r1:
I wonder if we could just quantize those, compressing and letting us have all the header info in the same texture

webrender/src/renderer.rs, line 310 at r1:
where did 10 go?
Add a new PrimitiveHeader concept, which contains all the common information for a given primitive (such as scroll node, render task, z value etc).

Instances contain an index to the primitive header, and read the information from that. This allows the instance size to be halved. There is a slight extra vertex shader cost, due to an extra indirection. However, this was not apparent in any profiling I did, since the vertex shader time typically makes up a very small amount of overall time. Additionally, the primitive header texels are likely to be in the texture cache for each glyph that is fetched. A possible future improvement is to remove that extra indirection by writing some of that data, such as the render task address, directly into the primitive header.

On sites such as nytimes.com or wikipedia.org, where there are typically a large number of glyphs, this saves several hundred kB per frame of data being sent to the GPU, which improves the compositor / driver CPU time.

Also remove the local_clip_chains vector and texture. Now, we just calculate this as part of the primitive header and include it there. This can save significant amounts of texture upload on pages that have a lot of clip chains or clip nodes, relative to the visible primitive count.

In future, both the PrimitiveHeader and instance structures can easily be further compressed again, but this can be done as a follow up.

Finally, although this is a useful optimization by itself, it's mostly prep work for some changes I have planned related to how we store and upload scroll nodes. Hopefully these changes will be both an optimization and make it easier to fix some of the correctness issues we have with nested 3d transforms.
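
To make the size trade-off concrete, here is a rough sketch of the idea in Rust. The struct layouts and field names are assumptions made up for illustration (they are not WebRender's actual definitions), and the glyph and primitive counts are invented; only the 32 -> 16 byte instance sizes come from the PR.

```rust
use std::mem::size_of;

// Hypothetical layouts, for illustration only. Field names are assumptions.

/// Before: every instance repeats data that is common to its primitive.
#[repr(C)]
struct FatInstance {
    specific_prim_address: i32,
    task_address: i32,
    clip_task_address: i32,
    scroll_node_id: i32,
    z: i32,
    segment_index: i32,
    flags: i32,
    user_data: i32,
}

/// Common per-primitive data, stored once and fetched by index in the shader.
#[repr(C)]
struct PrimitiveHeader {
    local_rect: [f32; 4],
    local_clip_rect: [f32; 4],
    task_address: i32,
    scroll_node_id: i32,
    z: i32,
    specific_prim_address: i32,
}

/// After: an index into the header array plus the truly per-instance data.
#[repr(C)]
struct SlimInstance {
    prim_header_index: i32,
    clip_task_address: i32,
    segment_and_flags: i32,
    user_data: i32,
}

/// Net bytes saved per frame: instance savings minus the cost of the headers.
fn net_bytes_saved(instance_count: i64, header_count: i64) -> i64 {
    let per_instance = (size_of::<FatInstance>() - size_of::<SlimInstance>()) as i64;
    instance_count * per_instance - header_count * size_of::<PrimitiveHeader>() as i64
}

fn main() {
    // Assumed workload: 20,000 glyph instances spread over 500 text runs.
    println!("fat instance:  {} bytes", size_of::<FatInstance>());
    println!("slim instance: {} bytes", size_of::<SlimInstance>());
    println!("net saving:    {} bytes", net_bytes_saved(20_000, 500));
}
```

Under these assumed numbers the saving lands in the "several hundred kB" range mentioned above, because many glyph instances share one header. A page with roughly one instance per primitive would see little or no benefit from this layout.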
Review status: 16 of 17 files reviewed, 3 unresolved discussions (waiting on @kvark and @gw3583)

webrender/res/ps_text_run.glsl, line 129 at r1:
Previously, kvark (Dzmitry Malyshau) wrote…
Yup, plenty of room for further compression.

webrender/src/gpu_types.rs, line 211 at r1:
Previously, kvark (Dzmitry Malyshau) wrote…
I think that's quite possible as a follow up, yes.

webrender/src/renderer.rs, line 310 at r1:
Previously, kvark (Dzmitry Malyshau) wrote…
Oops, forgot to reorder when I removed the clip rects texture. Fixed.

@kvark Addressed the review comments. In terms of profiling, I typically see a very small cost in VS time, and a significant saving in render backend / compositor time (for example, on nytimes.com - VS time is now 5.4% of the scene, previously 5.2%, but compositor and backend time both drop by ~0.3ms, which is a significant portion of the total CPU time).

@staktrace I verified that the failure in R8 is occurring in a nightly build when WR is enabled - I'm not sure how that occurred, but it seems unrelated to this PR.

TC failure appears to be a network issue:

Reviewed 1 of 1 files at r2.

@bors-servo r+


@glennw did you verify this on try or on a local build? On try the R8 is 100% green without your patch and 100% failing with your patch. https://bugzilla.mozilla.org/show_bug.cgi?id=1470125#c4

@staktrace I did only verify with a local reftest run. So, (unless I messed up my testing somewhere) this means:
Taking a closer look at the test, it uses I'll investigate further today and see if I can work out what is happening.

This fixes a reftest failure in Gecko that was introduced by the patch in servo#2839. Make sure that the tight local clip rect used for image and gradient tiles also includes the local clip rect provided by the clip-chain.

@staktrace OK, so there are two bugs in play here, which was confusing me. There's a genuine timing bug where the DLs sent to WR aren't always the same for this test case. That was masking that there was also a genuine bug in this patch, related to local clip rects on images that are large enough to require tiling. The fix is #2843.

Fix local clip rect for tiled images. This fixes a reftest failure in Gecko that was introduced by the patch in #2839. Make sure that the tight local clip rect used for image and gradient tiles also includes the local clip rect provided by the clip-chain.
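
As a rough illustration of what "also includes the local clip rect provided by the clip-chain" could mean, the sketch below intersects a tile's tight clip rect with a clip-chain rect. The `Rect` type, the `combined_tile_clip` helper, and the coordinates are hypothetical; this is not the code from #2843, only a minimal example of combining two clips by intersection.

```rust
/// Hypothetical axis-aligned rect in local space (not WebRender's type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    /// Returns the overlapping area, or None if the rects do not overlap.
    fn intersection(&self, other: &Rect) -> Option<Rect> {
        let r = Rect {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        };
        if r.x0 < r.x1 && r.y0 < r.y1 {
            Some(r)
        } else {
            None
        }
    }
}

/// Combine the tight per-tile clip with the clip supplied by the clip chain.
/// Using only `tight_clip` would ignore any clipping the chain applies.
fn combined_tile_clip(tight_clip: &Rect, clip_chain_clip: &Rect) -> Option<Rect> {
    tight_clip.intersection(clip_chain_clip)
}

fn main() {
    let tight = Rect { x0: 0.0, y0: 0.0, x1: 100.0, y1: 100.0 };
    let chain = Rect { x0: 50.0, y0: 50.0, x1: 200.0, y1: 200.0 };
    println!("{:?}", combined_tile_clip(&tight, &chain));
}
```

A tile that falls entirely outside the clip-chain rect yields `None` in this sketch, which a caller could treat as "nothing to draw for this tile".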

Thanks for chasing this down! I've confirmed that #2843 fixes it on my try pushes. It's odd that the test is sending different DLs to WR though; I guess it might decide to flush a paint before the scroll but the reftest snapshot shouldn't happen until after the scroll.

gw3583 commented Jun 22, 2018 (edited by larsbergstrom)