
Lots of cpu (~100ms) used in the render backend on this test case #1031

Closed
jrmuizel opened this issue Mar 29, 2017 · 9 comments
jrmuizel (Contributor) commented Mar 29, 2017

https://people-mozilla.org/~jmuizelaar/implementation-tests/dl-test.yaml

This is a test case generated from Gecko running https://people-mozilla.org/~jmuizelaar/implementation-tests/dl-test.html.

The test case is not representative, and Gecko still uses stacking contexts far more than necessary, but it would be good if webrender could handle this better.

jrmuizel (Contributor, author) commented Mar 29, 2017

Make sure to run with the --rebuild option. This matches the original html because the scene is actually changing.

jrmuizel changed the title from "Lots of cpu (~15ms) used in the render backend on this test case" to "Lots of cpu (~100ms) used in the render backend on this test case" on Mar 29, 2017
jrmuizel (Contributor, author) commented Mar 30, 2017

#1041 helps with this test case.

jrmuizel (Contributor, author) commented Mar 31, 2017

#1043 will help too.

glennw (Member) commented Mar 31, 2017

I did some profiling on this today. First, the overall profile counters:

  • ~8300 primitives, of which ~5000 are visible.
  • ~8300 stacking contexts.

[screenshot: profile counters (profile2)]

Things to note:

  • CPU backend time is clearly the problem.
  • Compositor time and GPU time are fine.
  • Total vertex count is ~31k, which should be no problem for any GPU these days.
  • Total draw calls is 1 - batching works! 😄

Looking at a CPU profile of the backend thread:

[screenshot: CPU profile of the backend thread (profile1)]

  • Both scene and frame build are slow, due to the sheer number of stacking contexts.
  • The majority of the time is dealing with the scene - traversing the stacking context tree and extracting primitives.
  • flatten_stacking_context itself is ~18ms in this test case to traverse the ~8300 stacking contexts.

There are a few obvious things we can do to improve the speed of this scene:

  1. Don't create as many stacking contexts 😀
  2. Improve memory allocation patterns and structure sizes.
  • I would like to experiment with a custom allocator. Many of the structures we use to build a frame follow a very simple allocation and access pattern. I believe this could give us quite large performance wins.
  • Retain more of the allocated structures between frames. This is probably not required if a custom allocator works.
  3. Run some of this workload across CPU threads.
  • In many scenes, multi-threading the stacking context traversal isn't a big win, due to how shallow the stacking context tree typically is. In this case, it would be a large win. It's not clear to me how much work this involves, but the majority of the tree traversal is data independent of other parts of the tree (the clip_scroll tree is the current exception, but this should be fixable).
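The custom-allocator idea in point 2 might look roughly like the following sketch: a typed per-frame pool whose `reset` drops the frame's values but keeps the backing allocation, so steady-state frames allocate nothing. All names here (`FramePool`, `alloc`, `reset`) are illustrative assumptions, not webrender code.

```rust
/// Hypothetical per-frame pool for one value type (not webrender's API).
/// Push values during frame build; `reset` at frame end keeps capacity,
/// so the next frame allocates nothing until it outgrows the last one.
struct FramePool<T> {
    items: Vec<T>,
}

impl<T> FramePool<T> {
    fn new() -> Self {
        FramePool { items: Vec::new() }
    }

    /// Allocate a value into the pool and return its index (a cheap handle).
    fn alloc(&mut self, value: T) -> usize {
        self.items.push(value);
        self.items.len() - 1
    }

    fn get(&self, index: usize) -> &T {
        &self.items[index]
    }

    /// Drop this frame's values but retain the heap allocation.
    fn reset(&mut self) {
        self.items.clear();
    }

    fn capacity(&self) -> usize {
        self.items.capacity()
    }
}

fn main() {
    let mut pool: FramePool<[f32; 4]> = FramePool::new();
    for i in 0..1000 {
        pool.alloc([i as f32; 4]);
    }
    let cap_after_frame_1 = pool.capacity();
    pool.reset();
    // The second frame reuses the same backing storage: no reallocation.
    for i in 0..1000 {
        pool.alloc([i as f32; 4]);
    }
    assert_eq!(pool.capacity(), cap_after_frame_1);
    println!("capacity retained across frames: {}", pool.capacity());
}
```

Handles are plain indices, which also sidesteps borrow-checker friction when frame structures reference each other.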
glennw (Member) commented Mar 31, 2017

The other option, of course, is incremental display list updates.

This is probably my preferred approach: take the concepts that iframes use to embed a display list within a parent display list, and extend them into a more general mechanism. That would allow us to build and modify display lists in small chunks, and would also let the calling code choose a display list "chunk size" granularity that is appropriate for the data.

Perhaps we should discuss this in more detail to see if it's worth prototyping?
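One possible shape for such a chunked display list, purely as a hypothetical sketch (none of these types exist in webrender): the parent keeps chunk IDs in paint order, and replacing one chunk bumps only that chunk's epoch, so unchanged chunks could be served from a cache instead of being rebuilt.

```rust
use std::collections::HashMap;

type ChunkId = u64;

// Stand-in for a real display item.
#[derive(Clone, PartialEq, Debug)]
struct DisplayItem(String);

/// Illustrative chunked display list: chunks are independently
/// replaceable, and `order` records their paint order.
struct ChunkedDisplayList {
    order: Vec<ChunkId>,
    chunks: HashMap<ChunkId, Vec<DisplayItem>>,
    epochs: HashMap<ChunkId, u64>, // bumped each time a chunk changes
}

impl ChunkedDisplayList {
    fn new() -> Self {
        ChunkedDisplayList {
            order: Vec::new(),
            chunks: HashMap::new(),
            epochs: HashMap::new(),
        }
    }

    /// Insert or replace one chunk; other chunks are untouched.
    fn set_chunk(&mut self, id: ChunkId, items: Vec<DisplayItem>) {
        if !self.chunks.contains_key(&id) {
            self.order.push(id);
        }
        self.chunks.insert(id, items);
        *self.epochs.entry(id).or_insert(0) += 1;
    }

    /// Flatten all chunks in paint order. A real implementation would
    /// skip chunks whose epoch hasn't changed since the last frame.
    fn flatten(&self) -> Vec<DisplayItem> {
        self.order
            .iter()
            .flat_map(|id| self.chunks[id].iter().cloned())
            .collect()
    }
}

fn main() {
    let mut dl = ChunkedDisplayList::new();
    dl.set_chunk(1, vec![DisplayItem("header".into())]);
    dl.set_chunk(2, vec![DisplayItem("body".into())]);
    // Updating chunk 2 leaves chunk 1's epoch (and cached output) alone.
    dl.set_chunk(2, vec![DisplayItem("body-v2".into())]);
    assert_eq!(dl.epochs[&1], 1);
    assert_eq!(dl.epochs[&2], 2);
    let flat = dl.flatten();
    assert_eq!(flat.len(), 2);
    assert_eq!(flat[1].0, "body-v2");
    println!("flattened {} items", flat.len());
}
```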

glennw (Member) commented Mar 31, 2017

Segmenting large display lists into chunks also provides a natural boundary where it makes sense to build each chunk on a worker thread...
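As a rough illustration of that idea (not webrender code): since chunk contents are data-independent, each chunk can be built on its own worker thread with `std::thread::scope` and the results stitched back together in paint order. `build_chunk` is a hypothetical stand-in for flattening one chunk.

```rust
// Stand-in for flattening one display-list chunk; any pure,
// per-chunk computation would do here.
fn build_chunk(seed: u32) -> Vec<u32> {
    (0..100).map(|i| seed.wrapping_mul(31).wrapping_add(i)).collect()
}

/// Build every chunk on its own scoped worker thread, then collect
/// the results in the original (paint) order.
fn build_parallel(seeds: &[u32]) -> Vec<Vec<u32>> {
    std::thread::scope(|s| {
        let handles: Vec<_> = seeds
            .iter()
            .map(|&seed| s.spawn(move || build_chunk(seed)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let chunks = build_parallel(&[1, 2, 3, 4]);
    assert_eq!(chunks.len(), 4);
    // The parallel result matches a sequential build, order preserved.
    for (i, &seed) in [1u32, 2, 3, 4].iter().enumerate() {
        assert_eq!(chunks[i], build_chunk(seed));
    }
    println!("built {} chunks in parallel", chunks.len());
}
```

A production version would use a thread pool rather than spawning per chunk, but the ordering guarantee is the same.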

nical (Collaborator) commented Apr 21, 2017

I looked at this test case through callgrind, and it shows that we spend a lot of time reallocating vectors. I have a prototype patch that reuses the frame's vectors when rebuilding the frame, which yields a 12% improvement in the CPU time spent in the render backend on this test case; I'll clean it up and submit it next week. It's not much, but it's a start, and it should be orthogonal to the problem of processing this many items.
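The reuse pattern described above might look roughly like this (illustrative types only, not the actual patch): move the old frame's vectors out with `std::mem::take`, clear them, and hand the retained capacity to the new frame, so the rebuild reallocates nothing.

```rust
// Hypothetical stand-ins for a frame's primitive and clip arrays.
struct Frame {
    prims: Vec<u32>,
    clips: Vec<u32>,
}

/// Rebuild a frame, recycling the old frame's vectors: `clear` drops
/// the contents but keeps the heap allocations for the new frame.
fn rebuild(old: &mut Frame) -> Frame {
    let mut prims = std::mem::take(&mut old.prims);
    let mut clips = std::mem::take(&mut old.clips);
    prims.clear();
    clips.clear();
    // ... refill `prims` and `clips` for the new frame here,
    //     without touching the allocator ...
    Frame { prims, clips }
}

fn main() {
    let mut frame = Frame {
        prims: Vec::with_capacity(4096),
        clips: Vec::with_capacity(1024),
    };
    frame.prims.extend(0..100);
    frame.clips.extend(0..10);
    let old_cap = frame.prims.capacity();
    let new_frame = rebuild(&mut frame);
    // The new frame inherits the old allocations, emptied.
    assert!(new_frame.prims.is_empty());
    assert_eq!(new_frame.prims.capacity(), old_cap);
    println!("recycled capacity: {}", new_frame.prims.capacity());
}
```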

nical (Collaborator) commented Apr 24, 2017

After recycling some vectors in FrameBuilder and PrimitiveStore, there is still some time spent growing data structures, but most of it looks related to the enormous number of scroll layers and clips in this test case. It would be easy to pre-allocate or recycle most of them, but it may not be a good idea to complicate the code, even a tiny bit, to save another 2–5% on this test case if it isn't representative of any real workload. Thoughts?

nical (Collaborator) commented Jul 13, 2018

The links for the test cases seem to be dead and async scene building probably fixed a big part of the issue. Closing, but feel free to reopen if there's any test case to act on.

@nical nical closed this Jul 13, 2018