antarctica perf on small devices / tiling gpus #2288

Closed
@robclark

This issue is to track performance of the new render engine, in particular on small GPUs and tiling GPUs. (I am working on freedreno, the FOSS gallium driver for the adreno GPU on qualcomm snapdragon SoCs, ie. "mobile devices".)

tl;dr: with older versions of stk, on most levels I'd see ~40fps at 1080p on apq8084, aka snapdragon 805 (adreno 420). The legacy render engine has fallen into disrepair, but freedreno now supports enough gl3.1 features to run the new render engine (with MESA_GL_VERSION_OVERRIDE=3.1 MESA_GLSL_VERSION_OVERRIDE=140). However, the performance is much worse, more like 5-10fps. (That is just an estimated figure; due to #2287 I haven't been able to get good numbers.)

In the course of debugging driver issues with the new render engine, I've noticed a few things. So I think there is some low hanging fruit when it comes to optimizing for mobile devices.

In particular, tiling GPUs like adreno make up for lower memory bandwidth (compared to discrete GPUs with VRAM) by combining multiple draw calls to the same render target so that they operate in a (relatively small) internal tile buffer. Ie. on each render target switch, they execute draws N->M for tile0, then draws N->M for tile1, and so on. So the cost of a draw call is small compared to the cost of a render target switch (which forces moving things out to system memory and back). That makes the increased number of rendering passes problematic. In addition, the increased size (in bits-per-pixel) of the intermediate render targets is problematic, since it costs precious memory bandwidth.

note: adreno has a "hw binning pass" feature, which sorts geometry by which tiles it affects (this is implemented for adreno 3xx but not 4xx yet). It would affect things to some degree, basically lowering VS cost. On apq8064 (a320), this was worth a ~20-30% fps boost. In general, due to the 3x larger tile buffer size, it would matter 3x less on a420, except for the larger bpp intermediate render target buffers. So there is still some driver work which could improve things (but I think not by more than 20-30% at most).
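To make the interaction between tile buffer size and bits-per-pixel concrete, here is a rough sketch of how the tile count might scale. The 512 KiB tile buffer size and the 32-bit depth/stencil attachment are assumptions for illustration only, as is the idea that the whole tile buffer is available to the attachments; real hardware has alignment rules that this ignores.

```python
import math

def tiles_needed(width, height, bytes_per_pixel, tile_buffer_bytes):
    """Rough number of tiles needed to cover a render target.

    Simplification: ignores tile alignment rules and assumes the whole
    tile buffer is usable for the attachments counted in bytes_per_pixel.
    """
    pixels_per_tile = tile_buffer_bytes // bytes_per_pixel
    return math.ceil(width * height / pixels_per_tile)

WIDTH, HEIGHT = 1920, 1080

# Hypothetical tile buffer sizes; only the 3x ratio comes from the text.
SMALL_TILE_BUFFER = 512 * 1024
LARGE_TILE_BUFFER = 3 * SMALL_TILE_BUFFER

# Assumed attachment layouts (color + 32-bit depth/stencil).
RGBA8_PLUS_DEPTH = 4 + 4
RGBA16F_PLUS_DEPTH = 8 + 4

for label, size in (("small", SMALL_TILE_BUFFER), ("large", LARGE_TILE_BUFFER)):
    for fmt, bpp in (("RGBA8", RGBA8_PLUS_DEPTH), ("RGBA16F", RGBA16F_PLUS_DEPTH)):
        print(label, fmt, tiles_needed(WIDTH, HEIGHT, bpp, size))
```

Under these assumptions, moving from 32-bit color to GL_RGBA16F raises the tile count by roughly half on either tile buffer size, which is why the wider intermediate targets eat into the benefit of the larger tile buffer.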

From what I can tell, the current render engine does the following passes (I'm not totally familiar with what these passes do, I'm just looking from the driver upwards at what is going on):

  1. render to GL_RGBA16F @ native resolution - quite a lot of draw calls, building up entire geometry (minus textures from the look of it)
  2. mrt render to color0=GL_R11F_G11F_B10F, color1=GL_R11F_G11F_B10F, both at native resolution.. only a few draw calls, so not building up geometry..
  3. render to GL_RGBA16F @ native resolution - quite a lot of draw calls, building up entire geometry w/ textures this time.. looks like in some cases using result of step 2 as one of the src textures..
  4. "tonemap" pass (judging from comments in shader), from GL_RGBA16F to GL_RGBA16F (at native resolution)
  5. ??? pass, from GL_RGBA16F -> GL_SRGB8_ALPHA8 (at native resolution)
  6. ??? pass, from GL_SRGB8_ALPHA8 -> GL_RGB (at native resolution)
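As a back-of-the-envelope check on the six passes listed above, the sketch below adds up how many bytes of color data each pass has to write out to system memory at 1080p. It only counts color stores (no depth, no texture reads, no tile restores), and treating the final GL_RGB target as 32 bits per pixel is an assumption about how the window surface is stored.

```python
# Bytes per pixel of each color format named in the pass list.
BYTES_PER_PIXEL = {
    "GL_RGBA16F": 8,         # 4 x 16-bit float
    "GL_R11F_G11F_B10F": 4,  # packed 11/11/10-bit float
    "GL_SRGB8_ALPHA8": 4,    # 4 x 8-bit
    "GL_RGB": 4,             # assumption: window surface stored as 32 bpp
}

# (pass number, color targets written), following the list above.
PASSES = [
    (1, ["GL_RGBA16F"]),
    (2, ["GL_R11F_G11F_B10F", "GL_R11F_G11F_B10F"]),
    (3, ["GL_RGBA16F"]),
    (4, ["GL_RGBA16F"]),
    (5, ["GL_SRGB8_ALPHA8"]),
    (6, ["GL_RGB"]),
]

# Hypothetical reduced pipeline: passes 4-6 dropped, pass 3 writes GL_RGB.
REDUCED_PASSES = [
    (1, ["GL_RGBA16F"]),
    (2, ["GL_R11F_G11F_B10F", "GL_R11F_G11F_B10F"]),
    (3, ["GL_RGB"]),
]

def color_store_bytes(width, height, targets):
    """Bytes written to system memory when one pass's color targets are stored."""
    return width * height * sum(BYTES_PER_PIXEL[t] for t in targets)

def frame_store_bytes(width, height, passes):
    """Total color store traffic per frame for a list of passes."""
    return sum(color_store_bytes(width, height, t) for _, t in passes)

print(frame_store_bytes(1920, 1080, PASSES))          # six passes
print(frame_store_bytes(1920, 1080, REDUCED_PASSES))  # three passes
```

With these assumptions the six-pass pipeline writes about 83 MB of color data per frame and the reduced one about half of that.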

UPDATE: actually it is much worse than 6 passes.. it ends up becoming 11 passes: http://hastebin.com/gipanojuto.js
I think there may be some silly stuff going on, like BindFB(somefb); Clear(); BindFB(otherfb); Clear();..

UPDATE2: I haven't reverse-engineered enough of the perf counters to do TIME_ELAPSED queries (and profiling w/ timestamp queries would require the same thing, plus give nonsense results for a tiling gpu). But I've instrumented the kernel to log the time at submit plus the time of the irq back from the gpu. It isn't a perfect setup, so take the numbers with a few grains of salt, but it should give us a rough idea of the cost of the various passes. The costs aren't quite in line with what I expected. Otoh this is with snapdragon 805, which has quite good memory bandwidth compared to other SoC gpus (on the order of ~25GB/s): http://hastebin.com/ayuzihitij.md. Basically, assuming I've managed to line things up properly, the first pass (lighting prepass?) is ~11ms. The second pass (to r11fg11fb10f) is a bit surprising at ~33ms. It's possible that number is bogus, or possibly there is something fairly inefficient about that format w/ this gpu? The third pass is ~25ms. After that, there is some silliness with switching render targets and clearing but not actually doing any rendering, which seems to cost ~10ms altogether. And after that the hdr related passes cost ~24ms altogether. I'm not super confident that those are all correct, but the 1st and 3rd passes seem plausible. I still think a big win would be an option to disable hdr and skip passes 4->6 (and rendering pass 3 directly to the final RGB8 would probably reduce its cost too).
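Taking the rough timings above at face value, and assuming the passes run back to back with no overlap (an assumption, not something the measurements establish), the frame time and frame rate can be estimated as follows. The dictionary keys are just labels chosen for this sketch.

```python
# Rough per-pass GPU times in milliseconds, copied from the estimates above.
PASS_COST_MS = {
    "pass 1": 11,
    "pass 2": 33,
    "pass 3": 25,
    "empty clears": 10,   # render target switches + clears with no draws
    "hdr passes": 24,     # passes 4->6
}

def estimated_fps(costs_ms, skip=()):
    """Frames per second if the listed costs simply add up, minus skipped entries."""
    total_ms = sum(ms for name, ms in costs_ms.items() if name not in skip)
    return 1000.0 / total_ms

print(round(estimated_fps(PASS_COST_MS), 1))
print(round(estimated_fps(PASS_COST_MS, skip=("hdr passes",)), 1))
print(round(estimated_fps(PASS_COST_MS, skip=("hdr passes", "empty clears")), 1))
```

The total comes to 103ms per frame, or a bit under 10fps, which is at least consistent with the 5-10fps estimate. Dropping the hdr passes alone gives about 12.7fps in this model, and also dropping the empty clears about 14.5fps, before counting any saving from rendering pass 3 to a narrower format.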

For a tiler, it is better to start off a render pass with glClear() if you don't need the previous buffer contents. This tells the driver that it is not necessary to move the previous tile contents back from system memory into the internal tile buffer. Steps 1-3 start with a glClear(), which is good (but only of color; it would be better to clear depth+stencil too). The remaining steps don't start with a clear, which costs additional memory bandwidth as we move tile contents back from main memory, only for them to be immediately overwritten.
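The rule above can be illustrated with a toy model of the decision a tiler driver has to make at the start of a pass. This is not how freedreno is implemented; the command tuples and attachment names are made up for the sketch, and it assumes every clear covers the whole buffer.

```python
def buffers_to_restore(attachments, commands):
    """Return the attachments whose old contents must be loaded into the tile buffer.

    attachments: names of the buffers bound for this pass.
    commands: ("clear", set_of_buffers) or ("draw", None) tuples in issue order.
    A buffer needs restoring if a draw happens before it has been cleared.
    """
    cleared = set()
    for op, arg in commands:
        if op == "clear":
            cleared |= set(arg)
        elif op == "draw":
            return set(attachments) - cleared
    return set()  # no draws at all, so nothing reads the old contents

ALL = {"color", "depth", "stencil"}

# Like steps 1-3: only color is cleared before drawing.
print(buffers_to_restore(ALL, [("clear", {"color"}), ("draw", None)]))

# Clearing depth+stencil as well avoids every restore.
print(buffers_to_restore(ALL, [("clear", ALL), ("draw", None)]))

# Like steps 4-6: drawing starts without any clear.
print(buffers_to_restore({"color"}, [("draw", None)]))
```

In this model the color-only clear still leaves depth and stencil to be restored, the full clear leaves nothing, and a pass that starts drawing straight away pays for a restore of its color buffer.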

NOTE: It might be interesting to have some sort of extension that lets the engine do a dummy clear, which skips the actual clear but tells the tiler driver that the previous tile contents are not required. Not sure if there is already such a thing (but if so, probably gles only and not gl). If there is interest in such a thing, I could propose a mesa extension.

I think, for a low end / memory bandwidth constrained system, having an option to skip passes 4-6 would be a good thing (and instead have pass 3 render directly to GL_RGB). In addition, doing passes 1-2 at a lower resolution and/or lower precision would seem useful. I would expect the combination of the two would get us close to previous performance levels.
