
2019 Toronto Thursday


Planning - part 2

Assigned tasks to April, May, June and then to Q3 and Q4 of 2019.

GLES2 limitations

(jrmuizel, gw, kvark, nical)

FWIW, we should expect 75%+ of Android devices to support ES3.

No texture arrays:

  • use big 4k by 4k textures
    • Some very old devices are limited to 2k, but they should be vanishingly few.
  • can still use the rect packer over smaller 512x512 portions of it (see the sketch after this list)
  • if out of space, allocate a new texture and break batches; hopefully not often
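
A minimal Rust sketch of the atlas idea, assuming a 4096x4096 texture split into 512x512 regions; the types and the trivial region allocator are illustrative stand-ins, not WebRender's actual texture cache code:

```rust
// Illustrative sketch only (not WebRender's actual allocator): emulate a
// texture array on GLES2 by carving a single large texture into fixed
// 512x512 regions, each of which would be managed by its own rect packer.

const ATLAS_SIDE: u32 = 4096;  // big 4k x 4k texture (2k on very old devices)
const REGION_SIDE: u32 = 512;  // sub-region handled by its own packer

struct AtlasRegion {
    origin: (u32, u32), // texel offset of this region in the big texture
    in_use: bool,
}

struct Atlas {
    regions: Vec<AtlasRegion>,
}

impl Atlas {
    fn new() -> Self {
        let per_row = ATLAS_SIDE / REGION_SIDE;
        let regions = (0..per_row * per_row)
            .map(|i| AtlasRegion {
                origin: ((i % per_row) * REGION_SIDE, (i / per_row) * REGION_SIDE),
                in_use: false,
            })
            .collect();
        Atlas { regions }
    }

    /// Hand out a free 512x512 region; `None` means the caller has to
    /// allocate a new texture and break the batch (hopefully rare).
    fn allocate_region(&mut self) -> Option<(u32, u32)> {
        let region = self.regions.iter_mut().find(|r| !r.in_use)?;
        region.in_use = true;
        Some(region.origin)
    }
}

fn main() {
    let mut atlas = Atlas::new();
    // What a shader would have addressed as "layer N" of a texture array
    // becomes a texel offset into the shared texture.
    let origin = atlas.allocate_region().expect("atlas is full");
    println!("allocated region at {:?}", origin);
}
```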

Around 50% of ES2 devices support half-float textures.

Vertex texel fetch:

  • Mali-450 (possibly an Amazon device) doesn't support vertex texture units.

Nothing supports instancing:

  • instead of using instanced attributes, copy them over per vertex (see the sketch after this list)
  • roughly 4x memory/bandwidth requirements for vertex buffers; could be improved
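
A sketch of that fallback under the assumption that each primitive is a 4-vertex quad; the InstanceData and Vertex layouts are made up for illustration, and show where the ~4x duplication comes from:

```rust
// Sketch of the "no instancing" fallback: the per-instance record is
// duplicated into each of the quad's 4 vertices, which is where the roughly
// 4x memory/bandwidth cost comes from. Layouts are illustrative.

#[derive(Clone, Copy)]
struct InstanceData {
    rect: [f32; 4],  // x, y, w, h of the primitive
    data_index: i32, // index into per-primitive data
}

#[derive(Clone, Copy)]
struct Vertex {
    position: [f32; 2],     // corner of the unit quad
    instance: InstanceData, // duplicated per vertex instead of per instance
}

fn expand_instances(instances: &[InstanceData]) -> Vec<Vertex> {
    const CORNERS: [[f32; 2]; 4] = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]];
    let mut vertices = Vec::with_capacity(instances.len() * 4);
    for inst in instances {
        for corner in CORNERS {
            vertices.push(Vertex { position: corner, instance: *inst });
        }
    }
    vertices
}

fn main() {
    let instances = vec![InstanceData { rect: [0.0, 0.0, 100.0, 50.0], data_index: 7 }];
    let vertices = expand_instances(&instances);
    assert_eq!(vertices.len(), instances.len() * 4);
    assert_eq!(vertices[0].instance.data_index, 7);
    assert_eq!(vertices[0].instance.rect[2], 100.0);
}
```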

DL interning - part 2

(kvark, gw)

Idea: hide interning semantics as an implementation detail. Expose the API as device.createSomeObject() -> objectHandle. If the implementation decides to return an old handle, the client (Gecko in our case) doesn't care. Could be done gradually, object by object.
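
A rough sketch of that shape; Device, ObjectDescriptor, ObjectHandle, and create_some_object are illustrative names rather than the real WebRender API, and the point is just that interning happens behind the create call:

```rust
use std::collections::HashMap;

// Sketch of hiding interning behind a create() call: equal descriptors get
// the same handle back, but the caller never has to know.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ObjectHandle(u32);

#[derive(Clone, PartialEq, Eq, Hash)]
struct ObjectDescriptor {
    // whatever uniquely describes the object being created
    payload: String,
}

#[derive(Default)]
struct Device {
    interned: HashMap<ObjectDescriptor, ObjectHandle>,
    next_id: u32,
}

impl Device {
    /// The client just calls create; whether the handle is fresh or
    /// recycled is an implementation detail.
    fn create_some_object(&mut self, desc: ObjectDescriptor) -> ObjectHandle {
        let next_id = &mut self.next_id;
        *self.interned.entry(desc).or_insert_with(|| {
            let handle = ObjectHandle(*next_id);
            *next_id += 1;
            handle
        })
    }
}

fn main() {
    let mut device = Device::default();
    let a = device.create_some_object(ObjectDescriptor { payload: "clip".into() });
    let b = device.create_some_object(ObjectDescriptor { payload: "clip".into() });
    assert_eq!(a, b); // same descriptor, same handle -- the client doesn't care
}
```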

Scene building threads

(nical, kvark)

Route low/high priority scene builds to a scene builder (SB) thread per document. Essentially doing the same thing as now, just more granular and explicit.

Still needs a way to synchronize scene builds for both documents when resizing the Gecko window.
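
One possible shape for this, sketched with a channel per document; DocumentId, SceneBuildRequest, and SceneBuilderPool are assumed names for illustration, not the actual WebRender types:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Sketch of "one scene builder thread per document": each document gets its
// own channel and worker, so builds for different documents no longer queue
// behind each other.

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct DocumentId(u32);

struct SceneBuildRequest {
    document: DocumentId,
    high_priority: bool,
}

#[derive(Default)]
struct SceneBuilderPool {
    senders: HashMap<DocumentId, Sender<SceneBuildRequest>>,
}

impl SceneBuilderPool {
    /// Route the request to the document's own builder thread, spawning it
    /// lazily on first use.
    fn submit(&mut self, request: SceneBuildRequest) {
        let sender = self.senders.entry(request.document).or_insert_with(|| {
            let (tx, rx) = channel::<SceneBuildRequest>();
            thread::spawn(move || {
                for req in rx {
                    // Placeholder for the actual scene build.
                    println!("building scene for {:?} (high priority: {})",
                             req.document, req.high_priority);
                }
            });
            tx
        });
        let _ = sender.send(request);
    }
}

fn main() {
    let mut pool = SceneBuilderPool::default();
    pool.submit(SceneBuildRequest { document: DocumentId(1), high_priority: false });
    pool.submit(SceneBuildRequest { document: DocumentId(2), high_priority: true });
    // Synchronizing builds across documents (e.g. on window resize) still
    // needs a separate mechanism, as noted above.
    thread::sleep(std::time::Duration::from_millis(50));
}
```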

Frame scheduling

Situation: we don't get more than 2 frames ahead.

Problem: if we fire 2 frames in a row, we won't have enough time for the 3rd frame to make it through the pipeline. Big stall.

Option: don't limit the pipeline to 2 frames

  • coalesce display lists on the WR side when needed instead of throttling (see the sketch after this list)
  • conflicts with WebGL requirements
  • can still throttle in Gecko, just to a higher number of pipeline stages
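
A sketch of coalescing on the WR side that keeps at most one pending display list per pipeline; PipelineId and the epoch field are stand-ins for the real types:

```rust
use std::collections::HashMap;

// Sketch of coalescing instead of throttling the content process: if a
// pipeline submits a new display list before the previous one was consumed,
// the older one is simply replaced.

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct PipelineId(u32);

struct DisplayList {
    epoch: u32,
    // ... built display items would live here
}

#[derive(Default)]
struct PendingDisplayLists {
    latest: HashMap<PipelineId, DisplayList>,
}

impl PendingDisplayLists {
    /// Newer display lists overwrite older unconsumed ones, so the queue
    /// never grows beyond one entry per pipeline.
    fn push(&mut self, pipeline: PipelineId, dl: DisplayList) {
        self.latest.insert(pipeline, dl);
    }

    /// Drain everything at the start of the next scene build.
    fn take(&mut self) -> Vec<(PipelineId, DisplayList)> {
        self.latest.drain().collect()
    }
}

fn main() {
    let mut pending = PendingDisplayLists::default();
    pending.push(PipelineId(1), DisplayList { epoch: 1 });
    pending.push(PipelineId(1), DisplayList { epoch: 2 }); // coalesced
    let drained = pending.take();
    assert_eq!(drained.len(), 1);
    assert_eq!(drained[0].1.epoch, 2);
}
```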

Q: why do we even have a renderer thread?

  • we go through the compositor because of a language barrier (an implementation detail)
  • no real reason, just convenient to implement

Idea: don't go through the render backend (RB) when asking for a scene build:

  • texture cache isn't needed
  • fonts can be shared

Tasks:

  1. serialize DL creation with the end of scene building
  2. remove the RB visit

The RB needs to be vsync-synchronized because it uses the results of input sampling.

WebGL:

  • the fewer frames in flight, the better for latency
  • not very clean: has half a frame in flight
  • transaction = drawn frame + fence
  • we only pass the transaction along once the fence is reached (sketched after this list)
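
A sketch of that hand-off, with a toy Fence trait standing in for a real GL sync object (a production version would poll something like glClientWaitSync):

```rust
// Sketch of the WebGL hand-off: a transaction carries the drawn frame plus a
// fence, and is only forwarded once the fence has been reached.

trait Fence {
    fn is_signaled(&self) -> bool;
}

struct Transaction<F: Fence> {
    frame_id: u64, // stands in for the drawn WebGL frame
    fence: F,
}

/// Return the transactions whose fences have signaled; the rest stay queued.
fn take_ready<F: Fence>(queue: &mut Vec<Transaction<F>>) -> Vec<Transaction<F>> {
    let mut ready = Vec::new();
    let mut still_waiting = Vec::new();
    for txn in queue.drain(..) {
        if txn.fence.is_signaled() {
            ready.push(txn);
        } else {
            still_waiting.push(txn);
        }
    }
    *queue = still_waiting;
    ready
}

// Toy fence for demonstration purposes.
struct TestFence(bool);
impl Fence for TestFence {
    fn is_signaled(&self) -> bool {
        self.0
    }
}

fn main() {
    let mut queue = vec![
        Transaction { frame_id: 1, fence: TestFence(true) },
        Transaction { frame_id: 2, fence: TestFence(false) },
    ];
    let ready = take_ready(&mut queue);
    assert_eq!(ready.len(), 1);
    assert_eq!(ready[0].frame_id, 1);
    assert_eq!(queue.len(), 1); // frame 2 waits for its fence
}
```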

Idea: the best way to budget frames and do pipelining is to have heuristics that predict frame consistency.

  • but we don't really want to rely on heuristics; the web is too complex
  • but we already have a heuristic to estimate the time from input sampling to VSync...

Q: how do we reproduce the scheduling problems in general?

Time is only sampled at the start of the compositor cycle, so by the time inputs are sampled, we are living in the past.

Chrome approach: DL building starts at -1 vsync, rendering starts 5ms before the vsync. Current WR approach: DL building starts at -2 vsync, rendering starts at -1 vsync.

Note: Chrome has lower latency but not necessarily higher throughput. The goal is to make the input latency stable (not necessarily constant).

Idea: strictly speaking, neither of these periods before vsync-0 is tied to a vsync. We need heuristics to know when to start that work so that it finishes on the GPU before the vsync. We need to:

  1. detach them from the refresh driver; at first, keep them fixed to the current numbers
  2. start making the heuristics more flexible, based on previous frames (see the sketch after this list)
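
A sketch of such a heuristic that predicts the lead time from the worst of the last few frames plus a small margin; the window size and the fallback numbers are placeholders, not measured values:

```rust
use std::time::Duration;

// Sketch of the proposed heuristic: estimate how long the work before the
// next vsync will take from the previous few frames, and wake up that much
// (plus a safety margin) ahead of vsync.

struct FrameTimePredictor {
    recent: Vec<Duration>, // durations of the last few frames
    capacity: usize,
    margin: Duration, // safety margin; an error within ~1ms is acceptable
}

impl FrameTimePredictor {
    fn new() -> Self {
        FrameTimePredictor {
            recent: Vec::new(),
            capacity: 8,
            margin: Duration::from_millis(1),
        }
    }

    fn record(&mut self, frame_duration: Duration) {
        if self.recent.len() == self.capacity {
            self.recent.remove(0);
        }
        self.recent.push(frame_duration);
    }

    /// How far before the vsync to start the work. Falls back to a fixed
    /// value (today's behavior) until we have history.
    fn lead_time(&self) -> Duration {
        if self.recent.is_empty() {
            return Duration::from_millis(8); // fixed to the current numbers at first
        }
        let worst = *self.recent.iter().max().unwrap();
        worst + self.margin
    }
}

fn main() {
    let mut predictor = FrameTimePredictor::new();
    predictor.record(Duration::from_millis(5));
    predictor.record(Duration::from_millis(6));
    println!("start work {:?} before vsync", predictor.lead_time()); // ~7ms
}
```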

Problem: we only know how to wake up threads on the refresh driver at the moment

  • solution doesn't have to be exact: an error within 1ms is still acceptable
  • need to look up the way Chrome does it

Tiling with direct composition

(mstange, jrmuizel, gw, kvark)

An example Intel-based MacBook has:

  • 720k of L2
  • 8M of L3

The total byte size of the screen buffer is 20M; it doesn't fit into the L3 cache, causing us to wait for RAM a lot. Solution:

  • draw to tiles instead of blitting from the full screen into tiles
  • either blit or direct-composite the tiles onto the screen
  • don't wait for a picture to repeat itself for a few frames; always take the tiling code path

Q: What is the best tile size?

  • having a tile fit in 256K keeps us fully within the L2 cache, which has some benefits
  • current tiles are 4x bigger (1024x256) and still fit in the L3 cache; we could make them 2-4x bigger if we wanted (see the arithmetic after this list)
  • small tiles cause a lot of batches
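
The arithmetic behind these numbers, assuming 4 bytes per pixel (RGBA8):

```rust
// Quick arithmetic behind the tile-size discussion, assuming RGBA8 tiles and
// the cache sizes quoted above.

const BYTES_PER_PIXEL: usize = 4;
const L2_BUDGET: usize = 256 * 1024;      // budget to stay fully inside L2
const L3_BYTES: usize = 8 * 1024 * 1024;  // example MacBook L3

fn tile_bytes(width: usize, height: usize) -> usize {
    width * height * BYTES_PER_PIXEL
}

fn main() {
    // A 256x256 tile is exactly 256K: fits the L2 budget.
    assert_eq!(tile_bytes(256, 256), L2_BUDGET);

    // The current 1024x256 tiles are 4x bigger (1M): out of L2, well within L3.
    let current = tile_bytes(1024, 256);
    assert_eq!(current, 4 * L2_BUDGET);
    assert!(current > L2_BUDGET && current < L3_BYTES);

    println!("256x256 tile: {}K, 1024x256 tile: {}K",
             tile_bytes(256, 256) / 1024, current / 1024);
}
```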

Q: why does drawing many instances of full-tile blends not scale linearly in GPU time?

  • there is a fixed cost to load the initial framebuffer color and to write it back out at the end
  • just like with tilers on mobile!