antarctica perf on small devices / tiling gpus #2288

Open
robclark opened this Issue Aug 15, 2015 · 23 comments

@robclark
Contributor

This issue is to track performance of the new render engine, in particular on small gpu's and tiling gpu's. (I am working on freedreno, the FOSS gallium driver for adreno gpu's on qualcomm snapdragon SoC's, ie. "mobile devices".)

tl;dr: with older versions of stk, on most levels I'd see ~40fps at 1080p on apq8084, aka snapdragon 805 (adreno 420).. the legacy render engine has fallen into disrepair, but now freedreno supports enough features from gl3.1 to run the new render engine (with MESA_GL_VERSION_OVERRIDE=3.1 MESA_GLSL_VERSION_OVERRIDE=140). However the performance is much worse, more like 5-10fps. (That is just an estimated figure; due to #2287 I haven't been able to get good numbers.)

In the course of debugging driver issues with the new render engine, I've noticed a few things. So I think there is some low hanging fruit when it comes to optimizing for mobile devices.

In particular, tiling gpu's like adreno overcome lesser memory bandwidth (compared to discrete gpu's with VRAM) by combining multiple draw calls to the same render target so that they operate in a (relatively small) internal tile buffer. Ie. on each render target switch, they execute draws N->M for tile0, then draws N->M for tile1, and so on. So the cost of a draw call is small compared to the cost of a render target switch (which forces moving things out to system memory and back), and the increased number of rendering passes is problematic. In addition, the increased size (in terms of bits-per-pixel) of intermediate render targets is problematic, since it costs precious memory bandwidth.
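To put rough numbers on the bits-per-pixel point, here is a back-of-the-envelope sketch. The helper name `tiles_needed` and the 1536 KiB tile buffer size used in the note below are illustrative assumptions, not figures from this thread, and real drivers pick rectangular, aligned bins, so treat the result as a lower bound only.

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound on the number of tiles (bins) needed to cover a render target.
 * Assumption: the tile buffer can be filled byte-for-byte; real hardware
 * needs aligned rectangular bins, so actual counts can be higher. */
uint64_t tiles_needed(uint32_t width, uint32_t height,
                      uint32_t bytes_per_pixel,   /* color + depth/stencil */
                      uint64_t tile_buffer_bytes) /* hypothetical size */
{
    assert(tile_buffer_bytes > 0);
    uint64_t total = (uint64_t)width * height * bytes_per_pixel;
    /* Integer ceiling division. */
    return (total + tile_buffer_bytes - 1) / tile_buffer_bytes;
}
```

With an assumed 1536 KiB tile buffer at 1920x1080, an 8 byte/pixel color target plus 4 bytes of depth/stencil (12 bytes/pixel) needs at least 16 tiles, while a 4 byte/pixel color target plus depth/stencil (8 bytes/pixel) needs at least 11. Every tile has to be written out to system memory on each render target switch, which is why both pass count and bpp matter.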

note: adreno has a "hw binning pass" feature, which sorts geometry by which tiles it affects (which is implemented for adreno 3xx but not 4xx yet).. this would affect things to some degree, basically lowering VS cost.. on apq8064 (a320), this was worth a ~20-30% fps boost. In general, due to the 3x larger tile buffer size, it would matter 3x less on a420, except for the larger bpp intermediate render target buffers. So there is still some driver work which could improve things (but I think not more than 20-30%, tops).

From what I can tell, the current render engine does the following passes (and I'm not totally familiar with what these passes do, I'm just looking from the driver upwards at what is going on):

  1. render to GL_RGBA16F @ native resolution - quite a lot of draw calls, building up entire geometry (minus textures from the look of it)
  2. mrt render to color0=GL_R11F_G11F_B10F, color1=GL_R11F_G11F_B10F, both at native resolution.. only a few draw calls, so not building up geometry..
  3. render to GL_RGBA16F @ native resolution - quite a lot of draw calls, building up entire geometry w/ textures this time.. looks like in some cases using result of step 2 as one of the src textures..
  4. "tonemap" pass (judging from comments in shader), from GL_RGBA16F to GL_RGBA16F (at native resolution)
  5. ??? pass, from GL_RGBA16F -> GL_SRGB8_ALPHA8 (at native resolution)
  6. ??? pass, from GL_SRGB8_ALPHA8 -> GL_RGB (at native resolution)
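As a rough illustration of what the six passes listed above write per frame, the sketch below sums the color output of each pass. The per-format sizes follow from the format names (GL_RGBA16F is 8 bytes/pixel, GL_R11F_G11F_B10F and GL_SRGB8_ALPHA8 are 4). The 4 bytes assumed for the unsized GL_RGB target is a guess about how it is stored, and depth/stencil traffic and texture reads are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pass_output {
    const char *target;       /* format(s) written by the pass */
    uint32_t bytes_per_pixel; /* summed over all color attachments */
};

/* The six passes from the list above, color outputs only. */
static const struct pass_output passes[] = {
    { "1: GL_RGBA16F",                        8 },
    { "2: 2x GL_R11F_G11F_B10F (MRT)",        8 },
    { "3: GL_RGBA16F",                        8 },
    { "4: GL_RGBA16F",                        8 },
    { "5: GL_SRGB8_ALPHA8",                   4 },
    { "6: GL_RGB (assumed stored as 4 bytes)", 4 },
};

/* Bytes of color data written to system memory per frame. */
uint64_t frame_color_write_bytes(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    uint64_t per_pixel = 0;
    for (size_t i = 0; i < sizeof(passes) / sizeof(passes[0]); i++)
        per_pixel += passes[i].bytes_per_pixel;
    return (uint64_t)width * height * per_pixel;
}
```

At 1920x1080 that comes to 40 bytes/pixel, or roughly 83 MB of color writes per frame, before counting any reads.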

UPDATE: actually it is much worse than 6 passes.. it ends up becoming 11 passes: http://hastebin.com/gipanojuto.js
I think there may be some silly stuff going on, like BindFB(somefb); Clear(); BindFB(otherfb); Clear();..

UPDATE2: I haven't reversed enough about the perf counters to do TIME_ELAPSED queries (and profiling w/ timestamp queries would require the same thing.. plus give nonsense results for a tiling gpu).. but I've instrumented the kernel to log time at submit plus at time of irq back from gpu.. not perfect, but should give us some rough idea of the cost of various passes. A few grains of salt should possibly be taken, since it isn't a perfect setup. The costs aren't quite in line with what I expected.. otoh this is with snapdragon 805 (which has quite good memory bandwidth compared to other SoC gpus.. on the order of ~25G/s..): http://hastebin.com/ayuzihitij.md. Basically, it looks like, assuming I've managed to line up things properly, the first pass (lighting prepass?) is ~11ms.. the second pass (to r11fg11fb10f) is a bit surprising at ~33ms.. possible that # is bogus.. or possible there is something fairly inefficient about that format w/ this gpu?? The third pass is ~25ms. After that, there is some silliness with switching render target and clearing but not actually doing any rendering, which seems to cost together ~10ms. And after that the hdr related passes cost together ~24ms. I'm not super confident that those are all correct but the 1st and 3rd passes seem plausible. I still think a big win would be an option to disable hdr (and skip passes 4->6) (and rendering pass 3 to final RGB8 would probably reduce its cost too).
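For a quick sanity check, the rough per-pass timings above can be added up and turned into a frame rate. This sketch assumes the passes run back to back and that the GPU is the only bottleneck. The function names are made up for illustration and the inputs are the approximate figures quoted above, not new measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate per-pass GPU times in ms, as quoted above:
 * pass 1, pass 2, pass 3, the empty clear/switch passes, HDR passes 4-6. */
const double measured_ms[5] = { 11.0, 33.0, 25.0, 10.0, 24.0 };

/* Sum of the first `count` entries. */
double total_ms(const double *pass_ms, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += pass_ms[i];
    return sum;
}

/* Frames per second for a given frame time in milliseconds. */
double fps_from_frame_ms(double frame_ms)
{
    assert(frame_ms > 0.0);
    return 1000.0 / frame_ms;
}
```

The five figures add up to 103 ms, or a bit under 10 fps, which is in the same range as the 5-10fps estimate. Dropping only the ~24 ms of HDR passes would leave 79 ms, a bit under 13 fps, so skipping HDR alone would not get back to the old numbers under these assumptions.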

For a tiler, it is better to start off a render pass with glClear() if you don't need previous buffer contents.. this tells the driver that it is not necessary to move previous tile contents back from system memory into internal tile buffer. Steps 1-3 start with a glClear() which is good (but only color.. would be better to clear depth+stencil too). The remaining steps don't start with a clear, which costs additional memory bandwidth as we move tile contents back from main memory, only to be immediately overwritten.
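A minimal model of the cost described above, assuming a pass that covers the whole render target: the tile contents are always written out once, and are read back in first only when the pass does not start with a clear. The function name is hypothetical and depth/stencil are left out for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved between the tile buffer and system memory for one
 * full-screen pass. Simplified model: one store of the whole surface,
 * plus one load of the whole surface if the pass does not begin with
 * a clear (the driver must then assume old contents are needed). */
uint64_t pass_traffic_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel,
                            int starts_with_clear)
{
    assert(bytes_per_pixel > 0);
    uint64_t surface = (uint64_t)width * height * bytes_per_pixel;
    uint64_t restore = starts_with_clear ? 0 : surface;
    return restore + surface;
}
```

In this model, passes 4-6 at 1920x1080 (8, 4 and 4 bytes/pixel) load about 33 MB per frame that is immediately overwritten. On the GL side, the fix amounts to calling `glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT)` right after binding the framebuffer, whenever the previous contents are not needed.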

NOTE: It might be interesting to have some sort of extension to allow engine to do a dummy clear, which skips actual clear but tells tiler driver that previous tile contents are not required. Not sure if there is already such a thing (but if so, probably gles only and not gl). If there is interest in such a thing, I could propose a mesa extension.

I think, for a low end / memory bandwidth constrained system, having an option to skip passes 4-6 would be a good thing (and instead have pass 3 render directly to GL_RGB). In addition, doing passes 1-2 at a lower resolution and/or lower precision would seem useful. I would expect the combination of the two would get us close to previous performance levels.

@robclark
Contributor

note: I might be able to rig up something to snapshot some timer at each render target switch to get some better idea of costs for each pass.. but will take me some time and not sure how much time I'll have in the upcoming week. But pretty sure memory bandwidth cost is the key here, so reducing passes and reducing size of passes will be a win.

@yzsolt
yzsolt commented Aug 16, 2015

I'm not associated with this project, but I'm building an OpenGL renderer in my spare time and I think these steps describe an HDR renderer's pipeline. They work with HDR float colors in the framebuffer, then they convert to linear (not gamma-corrected) LDR in step 5 (that should be tone mapping), and the last step should be the actual linear -> gamma corrected color space step (which is done by the driver probably). I don't think there's a simpler way to do HDR rendering.

@Ancurio
Ancurio commented Aug 16, 2015

NOTE: It might be interesting to have some sort of extension to allow engine to do a dummy clear, which skips actual clear but tells tiler driver that previous tile contents are not required. Not sure if there is already such a thing (but if so, probably gles only and not gl). If there is interest in such a thing, I could propose a mesa extension.

EXT_discard_framebuffer ?

@robclark
Contributor

@yzsolt ahh, so I guess then an option to disable HDR is part of what I want..

@Ancurio ahh, looks like I should implement EXT_discard_framebuffer in mesa/gallium then. (Should be useful for vc4 too, the gallium driver for r-pi.) Although I guess we still need a GL version of that extension.

@Ancurio
Ancurio commented Aug 16, 2015

@robclark yeah, the core mesa implementation already exists but is limited to GLES, IIRC because it depends on the GLES specific FBO extension, so a MESA version for desktop GL might be necessary. Too bad this isn't core GLES2, otherwise the compat ext would have covered it already ;)

@robclark
Contributor

@Ancurio yeah, just looking at how best to wire it up in mesa state_tracker.. I think I can probably re-use some of the groundwork that Eric Anholt put in place to support discarding depth/stencil after eglSwapBuffers()

I think we might need EXT_discard_framebuffer2 (or something like that) also, since it seems like existing extension restricts things to GL_COLOR_ATTACHMENT0 for no good reason (afaict)

@Ancurio
Ancurio commented Aug 16, 2015

@robclark

I think we might need EXT_discard_framebuffer2 (or something like that) also, since it seems like existing extension restricts things to GL_COLOR_ATTACHMENT0 for no good reason (afaict)

Now that you say it, that is really weird, issue 1 in the ext spec seems to imply the opposite. I wonder if that was an accident.

Edit: I just noticed core GLES 3.0 has glInvalidateFramebuffer, which looks to be a proper successor to the EXT one, and it explicitly allows clearing of any color attachments. Sorry for the confusion.

@robclark
Contributor

hmm,

    RESOLVED: We'll use a sized list of framebuffer attachments.  This
    will give us some future-proofing for when MRTs and multisampled
    FBOs are supported.

maybe they planned to add it but forgot? I'm not really sure. Currently mesa will throw an error for GL_COLOR_ATTACHMENTn for n>1

@Ancurio
Ancurio commented Aug 16, 2015

@robclark Here is the core Mesa plumbing for glInvalidate(Sub)Framebuffer, and it does let through any COLORn attachment. And thanks to the compat ext this should already be available in desktop GL right?

@robclark
Contributor

@Ancurio interesting.. looks like that extension (ARB_invalidate_subdata) was put in place for vram migration, but I guess it is basically the same idea.. it still has the GL_COLOR_ATTACHMENT0 restriction for gles, but not for desktop gl, so I guess that is fine for us.

@Ancurio
Ancurio commented Aug 16, 2015

@robclark

it still has the GL_COLOR_ATTACHMENT0 restriction for gles

Where do you take that from? The GLES 3.0 doc page I linked above says GL_COLOR_ATTACHMENTi, and the mesa plumbing doesn't distinguish between GLES and GL for color attachments either.

@robclark
Contributor

@Ancurio from https://www.opengl.org/registry/specs/ARB/invalidate_subdata.txt .. although possibly that just isn't updated since gles2?

@Ancurio
Ancurio commented Aug 16, 2015

@robclark Ah, I see. Grepping gl2.h, it seems core GLES2.0 doesn't have MRT at all (only COLOR0 enum is defined), and needs a dedicated MRT extension to get the other attachments (which might also be why EXT_discard_framebuffer never referenced anything past COLOR0). Maybe they didn't want to complicate ARB_invalidate_subdata with yet another optional dep on the MRT extension..

In any case, that's a GLES2.0 specific limitation, it doesn't apply in 3.0.

@robclark
Contributor

@Ancurio makes sense.. I frequently get lost in the interaction between different extensions ;-)

Anyways, just sent an RFC for wiring InvalidateFramebuffer/DiscardFramebuffer up to something useful:

http://lists.freedesktop.org/archives/mesa-dev/2015-August/091883.html

The idea is stk sprinkles glInvalidateFramebuffer() calls around (ie. after last draw before render target switch or SwapBuffers() for depth/stencil when those don't need to be preserved, and before first draw after rt switch or SwapBuffers() for color when those don't need to be restored).. seems like there should be potential for the discrete desktop gpu's to possibly optimize out some vram migration in some cases with these too.
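A sketch of how an engine might build the attachment list for such a call. The helper `build_invalidate_list` is hypothetical and is not stk or mesa code. The enum values are copied from the Khronos headers only so the snippet compiles without a GL context. Note that the default framebuffer uses GL_COLOR / GL_DEPTH / GL_STENCIL, while an FBO uses the GL_*_ATTACHMENT names.

```c
#include <assert.h>
#include <stddef.h>

/* Values as defined in the Khronos GL headers. */
#ifndef GL_COLOR
#define GL_COLOR   0x1800
#define GL_DEPTH   0x1801
#define GL_STENCIL 0x1802
#endif
#ifndef GL_COLOR_ATTACHMENT0
#define GL_COLOR_ATTACHMENT0  0x8CE0
#define GL_DEPTH_ATTACHMENT   0x8D00
#define GL_STENCIL_ATTACHMENT 0x8D20
#endif

typedef unsigned int gl_enum; /* stand-in for GLenum */

/* Hypothetical helper: fills `out` with the attachments whose contents
 * are no longer needed and returns how many entries were written. */
size_t build_invalidate_list(int is_default_fb, unsigned num_color,
                             int depth, int stencil,
                             gl_enum *out, size_t cap)
{
    size_t n = 0;
    assert(out != NULL);
    if (is_default_fb) {
        /* The window-system framebuffer has a single color name here. */
        if (num_color > 0 && n < cap)
            out[n++] = GL_COLOR;
    } else {
        for (unsigned i = 0; i < num_color && n < cap; i++)
            out[n++] = GL_COLOR_ATTACHMENT0 + i;
    }
    if (depth && n < cap)
        out[n++] = is_default_fb ? GL_DEPTH : GL_DEPTH_ATTACHMENT;
    if (stencil && n < cap)
        out[n++] = is_default_fb ? GL_STENCIL : GL_STENCIL_ATTACHMENT;
    return n;
}
```

The resulting list would then be passed as `glInvalidateFramebuffer(GL_FRAMEBUFFER, n, list)`: for depth/stencil after the last draw of a pass, and for color before the first draw of a pass that overwrites everything.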

@robclark
Contributor

Fwiw, I've pushed a few perf tweaks to:
https://github.com/robclark/stk-code/commits/perf-hacks
(not really intending that as a pull req, but I think at least some of the ideas in there are good)

This gets rid of the stalls and a couple unnecessary render passes (see robclark@2ae4b1e).

Between those, I get a couple more fps (from ~11fps to ~14fps). Still a ways to go.. I am thinking an option to do the lighting pre-pass stuff at a lower resolution would be beneficial..

@deveee
Member
deveee commented Aug 29, 2016

@robclark I already made some improvements for embedded devices:

  • On devices that support only a single resolution, you can use the "scale_rtts_factor" parameter in config.xml. It scales down the RTTs, so you can set it to, for example, 0.75 when you want 1440x810 quality on a 1920x1080 display. In this case performance is similar to a real 1440x810 resolution. Note that advanced lighting must be enabled to use it; otherwise it must be set to 1.0.
  • You can set the "max_texture_size" parameter in config.xml to 128 or 256, which saves a lot of RAM when HD textures are disabled in options.
  • I made a port to OpenGL ES 3.0, which works fine with intel mesa drivers and with software rendering on linux, though it still needs some tweaks to make it work with other drivers (at least GL_BGRA is available only as an extension). If you want to use it on linux, you need to use cmake .. -DUSE_GLES2=1

@robclark
Contributor

@deveee ohh, nice.. is this on master already, or on a branch somewhere?

(I don't mind too much about GLES vs GL, since with gallium I have gl3.1 in addition to gles3.0 currently.. but scale_rtts_factor sounds quite useful)

@deveee
Member
deveee commented Aug 30, 2016

@robclark
Contributor

hmm, changing max_texture_size to 256 (was 512) cuts fps in about half.. although didn't have a chance to look at what was going on there.

scale_rtts_factor=0.75 does indeed speed things up, and even makes things playable at 1920x1080 (previously I'd been dropping down to 1280x720.. although oddly 1080p w/ scale=0.75 is ~50% faster than 1280x720 with scale=1.0..)
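To see why that result is odd, compare the RTT pixel counts in the two configurations. This sketch assumes scale_rtts_factor is applied to each dimension and rounded to the nearest integer. How stk actually rounds is not stated in this thread, and both helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* One RTT dimension after scaling (rounding mode is an assumption). */
uint32_t scaled_dim(uint32_t dim, double factor)
{
    assert(factor > 0.0);
    return (uint32_t)(dim * factor + 0.5);
}

/* Number of pixels in a full-screen RTT at the given scale factor. */
uint64_t rtt_pixels(uint32_t width, uint32_t height, double factor)
{
    return (uint64_t)scaled_dim(width, factor) * scaled_dim(height, factor);
}
```

1920x1080 at 0.75 gives 1440x810 RTTs, i.e. 1,166,400 pixels, while 1280x720 at 1.0 gives 921,600. So the faster configuration is actually pushing about 27% more RTT pixels, which suggests something other than raw RTT size is involved.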

I wonder if it would be possible to scale down the pre-pass render targets (light/ssao/etc.. I'm not too sure how that works), but keep the final render pass at full resolution? I guess that would reduce the precision of light/shadow/etc, but that seems like it would be less noticeable.

@deveee
Member
deveee commented Aug 30, 2016

The max_texture_size parameter shouldn't affect performance at all. It just scales down the textures, which saves memory. Here is quick test on my machine in lighthouse track:

  • enabled HD textures: RAM usage: 800 MB, 30 fps
  • disabled HD textures, max_texture_size=128: RAM usage: 460 MB, 32 fps

The scale_rtts_factor already scales down all RTTs used in post-processing, so this indeed may be a bit faster (but lower quality). The only thing it doesn't scale is shadows, because those are handled by a different parameter (shadows_resolution) which is available in the GUI.
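As a rough illustration of the max_texture_size memory saving mentioned above, the sketch below computes the size of one square texture with a full mipmap chain. It assumes uncompressed texels at a fixed byte size. The thread does not say which formats or compression stk uses, so the absolute numbers are only indicative, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes used by a square texture of the given size, including every
 * mipmap level down to 1x1. Assumes an uncompressed format. */
uint64_t texture_bytes(uint32_t size, uint32_t bytes_per_texel)
{
    assert(size > 0);
    uint64_t total = 0;
    for (;;) {
        total += (uint64_t)size * size * bytes_per_texel;
        if (size == 1)
            break;
        size /= 2; /* next mip level */
    }
    return total;
}
```

With 4 bytes per texel, a 512x512 texture takes 1,398,100 bytes and a 128x128 one takes 87,380 bytes, a 16x reduction per texture. That is consistent with a large drop in RAM usage without any expected change in frame rate.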

@robclark
Contributor

my guess is that max_texture_size ends up somehow doing something bad on tilers (ie. triggering blits mid-batch.. which my new re-order code in freedreno should in theory deal with but maybe there are some cases I am missing). I'll need to have a closer look and see what is going on, since it is a puzzling result. (But I switched to a different setup to work on different issues.. hopefully I'll have some time over the weekend. And also some time to check out shadows_resolution. Thx)

@SuicSoft
Contributor
SuicSoft commented Oct 27, 2016 edited

@robclark Does your perf-hacks branch also work on desktop / laptop (and improve performance on them)? I am getting 17 - 24 FPS on Cocoa Temple on my laptop.

If you exclude that hack for ARM I think you should make a PR (if it works on desktop)

Also, at what graphical settings are you running STK?

@robclark
Contributor

@SuicSoft, PR is already merged.. I think one of the patches should slightly (ie. fraction of a fps) help desktop. Cocoa Temple is pretty heavyweight so I think 17-24fps is probably about what you can expect. (It is far too slow to be playable, and I have out-of-memory problems with that level on the ARM devices I do have.. maybe with snapdragon 820 / adreno 530 it might start to become playable.)

The 'scale_rtts_factor' thing that @deveee mentioned should help.
