antarctica perf on small devices / tiling gpus #2288
note: I might be able to rig up something to snapshot a timer at each render target switch to get a better idea of the cost of each pass.. but that will take me some time, and I'm not sure how much time I'll have in the upcoming week. But I'm pretty sure memory bandwidth cost is the key here, so reducing the number of passes and the size of passes will be a win.
I'm not associated with this project, but I'm building an OpenGL renderer in my spare time, and I think these steps describe an HDR renderer's pipeline. They work with HDR float colors in the framebuffer, then convert to linear (not gamma-corrected) LDR in step 5 (that should be tone mapping), and the last step should be the actual linear -> gamma-corrected color space conversion (which is probably done by the driver). I don't think there's a simpler way to do HDR rendering.
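To make those two final steps concrete, here is a minimal sketch (not this project's code; the Reinhard operator and gamma 2.2 are just placeholder choices) of tone mapping an HDR value to linear LDR and then gamma-encoding it:

```c
#include <math.h>

/* Sketch only: the two final stages described above.
 * Reinhard tone mapping and gamma 2.2 are placeholder choices. */

/* step 5: HDR linear float -> LDR linear value in [0,1] (tone mapping) */
static float tonemap_reinhard(float hdr)
{
    return hdr / (1.0f + hdr);
}

/* final step: linear LDR -> gamma-corrected value for display */
static float gamma_encode(float ldr_linear)
{
    return powf(ldr_linear, 1.0f / 2.2f);
}
```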
@robclark yeah, the core mesa implementation already exists but is limited to GLES, IIRC because it depends on the GLES-specific FBO extension, so a MESA version for desktop GL might be necessary. Too bad this isn't core GLES2, otherwise the compat ext would have covered it already ;)
@Ancurio yeah, just looking at how best to wire it up in the mesa state_tracker.. I think I can probably re-use some of the groundwork that Eric Anholt put in place to support discarding depth/stencil after eglSwapBuffers(). I think we might need …
Now that you say it, that is really weird; issue 1 in the ext spec seems to imply the opposite. I wonder if that was an accident. Edit: I just noticed core GLES 3.0 has glInvalidateFramebuffer, which looks to be a proper successor to the EXT one, and it explicitly allows clearing any of the color attachments. Sorry for the confusion.
hmm, maybe they planned to add it but forgot? I'm not really sure. Currently mesa will throw an error for GL_COLOR_ATTACHMENTn for n>1
@Ancurio interesting.. looks like that extension (…
Where did you get that from? The GLES 3.0 doc page I linked above says GL_COLOR_ATTACHMENTi, and the mesa plumbing doesn't distinguish between GLES and GL for color attachments either.
@Ancurio from https://www.opengl.org/registry/specs/ARB/invalidate_subdata.txt .. although possibly that just hasn't been updated since gles2?
@robclark Ah, I see. Grepping … In any case, that's a GLES 2.0-specific limitation; it doesn't apply in 3.0.
@Ancurio makes sense.. I frequently get lost in the interactions between different extensions ;-) Anyways, I just sent an RFC for wiring InvalidateFramebuffer/DiscardFramebuffer up to something useful: http://lists.freedesktop.org/archives/mesa-dev/2015-August/091883.html The idea is that stk sprinkles glInvalidateFramebuffer() calls around (ie. after the last draw before a render target switch or SwapBuffers() for depth/stencil when those don't need to be preserved, and before the first draw after an rt switch or SwapBuffers() for color when it doesn't need to be restored).. seems like there should be potential for discrete desktop gpu's to optimize out some vram migration in some cases with these too.
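To illustrate the call pattern being described (a rough sketch only, not stk or mesa code; `scene_fbo` and `next_fbo` are made-up names):

```c
/* After the last draw into a render target, depth/stencil won't be read
 * again, so the tiler doesn't need to write those tiles back out: */
glBindFramebuffer(GL_FRAMEBUFFER, scene_fbo);
/* ... draw calls ... */
const GLenum discard_ds[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discard_ds);

/* Before the first draw into the next target, its old color contents
 * aren't needed, so the tiler can skip restoring them into the tile buffer: */
glBindFramebuffer(GL_FRAMEBUFFER, next_fbo);
const GLenum discard_color[] = { GL_COLOR_ATTACHMENT0 };
glInvalidateFramebuffer(GL_FRAMEBUFFER, 1, discard_color);
/* ... draw calls ... */
```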
Fwiw, I've pushed a few perf tweaks to: … This gets rid of the stalls and a couple of unnecessary render passes (see robclark@2ae4b1e). Between those, I get a couple more fps (from ~11fps to ~14fps). Still a ways to go.. I am thinking an option to do the lighting pre-pass stuff at a lower resolution would be beneficial..
@robclark I already made some improvements for embedded devices: …
@deveee ohh, nice.. is this on master already, or on a branch somewhere? (I don't mind too much about GLES vs GL, since with gallium I have gl3.1 in addition to gles3.0 currently.. but scale_rtts_factor sounds quite useful)
Yes, it's already on master. It's only available as a parameter in the config file though.
hmm, changing max_texture_size to 256 (was 512) cuts fps roughly in half.. although I haven't had a chance to look at what is going on there. scale_rtts_factor=0.75 does indeed speed things up, and even makes things playable at 1920x1080 (previously I'd been dropping down to 1280x720.. although oddly 1080p w/ scale=0.75 is ~50% faster than 1280x720 with scale=1.0..) I wonder if it would be possible to scale down the pre-pass render targets (light/ssao/etc.. I'm not too sure how that works), but keep the final render pass at full resolution? I guess that would reduce the precision of light/shadow/etc, but that seems like it would be less noticeable.
The max_texture_size parameter shouldn't affect performance at all. It just scales down the textures, which saves memory. Here is a quick test on my machine on the lighthouse track: …
The scale_rtts_factor already scales down all RTTs used in post-processing, so this indeed may be a bit faster (but lower quality). It doesn't scale shadows only because they are handled by a different parameter (shadows_resolution), which is available in the GUI.
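For illustration only (not STK's actual code; `screen_w`/`screen_h` are assumed to hold the window size), a post-processing RTT scaled in the spirit of scale_rtts_factor might be set up like this:

```c
float scale = 0.75f;                          /* e.g. scale_rtts_factor */
GLsizei rtt_w = (GLsizei)(screen_w * scale);
GLsizei rtt_h = (GLsizei)(screen_h * scale);

GLuint rtt, fbo;
glGenTextures(1, &rtt);
glBindTexture(GL_TEXTURE_2D, rtt);
/* allocate the scaled color target (a packed-float HDR-ish format) */
glTexImage2D(GL_TEXTURE_2D, 0, GL_R11F_G11F_B10F,
             rtt_w, rtt_h, 0, GL_RGB, GL_FLOAT, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, rtt, 0);
/* the final pass can still render to the full-resolution backbuffer */
```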
my guess is that max_texture_size ends up somehow doing something bad on tilers (ie. triggering blits mid-batch.. which my new re-order code in freedreno should in theory deal with, but maybe there are some cases I am missing). I'll need to have a closer look and see what is going on, since it is a puzzling result. (But I switched to a different setup to work on different issues.. hopefully I'll have some time over the weekend, and also some time to check out shadows_resolution. Thx)
@robclark Does your perf_hacks branch also work on desktop / laptop (and improve performance there)? I am getting 17-24 FPS on Cocoa Temple on my laptop. If you exclude that hack for ARM, I think you should make a PR (if it works on desktop). Also, at what graphical settings are you running STK?
@SuicSoft, the PR is already merged.. I think one of the patches should slightly (ie. a fraction of an fps) help desktop. Cocoa Temple is pretty heavyweight so I think 17-24fps is probably about what you can expect. (It is far too slow to be playable, and I have out-of-memory problems with that level on the ARM devices I do have.. maybe with snapdragon 820 / adreno 530 it might start to become playable.) The 'scale_rtts_factor' thing that @deveee mentioned should help.
This issue is to track performance of the new render engine, in particular on small GPUs and tiling GPUs. (I am working on freedreno, the FOSS gallium driver for the adreno GPU in qualcomm snapdragon SoCs, ie. "mobile devices".)
tl;dr: with older versions of stk, on most levels I'd see ~40fps at 1080p on apq8084, aka snapdragon 805 (adreno 420).. the legacy render engine has fallen into disrepair, but now freedreno supports enough features from gl3.1 to run the new render engine (with `MESA_GL_VERSION_OVERRIDE=3.1 MESA_GLSL_VERSION_OVERRIDE=140`). However the performance is much worse, more like 5-10fps. (That is just an estimated figure; due to #2287 I haven't been able to get good numbers.)

In the course of debugging driver issues with the new render engine, I've noticed a few things, so I think there is some low hanging fruit when it comes to optimizing for mobile devices.
In particular, tiling GPUs like adreno overcome lower memory bandwidth (compared to discrete GPUs with VRAM) by combining multiple draw calls to the same render target so that they operate in a (relatively small) internal tile buffer. Ie. on each render target switch, they execute draws N->M for tile0, then draws N->M for tile1, and so on (roughly as sketched below). So the per-draw-call cost is low compared to the cost of a render target switch (which forces moving things out to system memory and back), and the increased number of rendering passes is therefore problematic. In addition, the increased size (in terms of bits-per-pixel) of the intermediate render targets is problematic, since it costs precious memory bandwidth.
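Roughly speaking (hand-wavy pseudocode, not any particular driver's real code), each render pass on a tiler looks like the loop below, which is why extra passes translate directly into extra memory traffic:

```c
/* pseudocode: how a tiler executes one render pass (one render target) */
for (int t = 0; t < num_tiles; t++) {
    load_tile(t);            /* restore tile from system memory, unless the
                                pass starts with a clear / invalidate */
    for (int d = first_draw; d <= last_draw; d++)
        execute_draw(d, t);  /* all draws replay against the small on-chip
                                tile buffer, so per-draw cost is cheap */
    store_tile(t);           /* resolve tile back out to system memory */
}
/* every render target switch ends one such pass and starts another */
```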
note: adreno has a "hw binning pass" feature, which sorts geometry by which tiles it affects (implemented for adreno 3xx but not 4xx yet).. this would affect things to some degree, basically lowering VS cost.. on apq8064 (a320), it was worth a ~20-30% fps boost. In general, due to the 3x larger tile buffer size, it would matter 3x less on a420, except for the larger-bpp intermediate render target buffers. So there is some driver work still which could improve things (but I think not more than 20-30% at most).
From what I can tell, the current render engine does the following passes (I'm not totally familiar with what these passes do, I'm just looking from the driver upwards at what is going on): …
UPDATE: actually it is much worse than 6 passes.. it ends up becoming 11 passes: http://hastebin.com/gipanojuto.js
I think there may be some silly stuff going on, like BindFB(somefb); Clear(); BindFB(otherfb); Clear();..
UPDATE2: I haven't reverse engineered enough of the perf counters to do TIME_ELAPSED queries (and profiling w/ timestamp queries would require the same thing.. plus give nonsense results for a tiling gpu).. but I've instrumented the kernel to log the time at submit plus the time of the irq back from the gpu.. not perfect, but it should give us some rough idea of the cost of the various passes. Take it with a few grains of salt, since it isn't a perfect setup. The costs aren't quite in line with what I expected.. otoh this is with snapdragon 805 (which has quite good memory bandwidth compared to other SoC gpus.. on the order of ~25G/s..): http://hastebin.com/ayuzihitij.md. Basically, assuming I've managed to line things up properly: the first pass (lighting prepass?) is ~11ms.. the second pass (to r11fg11fb10f) is a bit surprising at ~33ms.. possibly that number is bogus, or possibly there is something fairly inefficient about that format w/ this gpu?? The third pass is ~25ms. After that, there is some silliness with switching render targets and clearing but not actually doing any rendering, which together seems to cost ~10ms. And after that the hdr related passes together cost ~24ms. I'm not super confident that those are all correct, but the 1st and 3rd passes seem plausible. I still think a big win would be an option to disable hdr (and skip passes 4->6), and rendering pass 3 to a final RGB8 target would probably reduce its cost too.
For a tiler, it is better to start off a render pass with glClear() if you don't need the previous buffer contents.. this tells the driver that it is not necessary to move the previous tile contents back from system memory into the internal tile buffer. Steps 1-3 start with a glClear(), which is good (but only for color.. it would be better to clear depth+stencil too). The remaining steps don't start with a clear, which costs additional memory bandwidth as we move tile contents back from main memory, only to immediately overwrite them.
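A sketch of the tiler-friendly pattern (`pass_fbo` is a made-up name): clear everything that doesn't need to be preserved right after binding, before any draw:

```c
glBindFramebuffer(GL_FRAMEBUFFER, pass_fbo);
glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
/* clearing color + depth + stencil up front tells the driver it never
 * has to restore the previous tile contents from system memory */
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
/* ... draw calls for this pass ... */
```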
NOTE: It might be interesting to have some sort of extension that allows the engine to do a dummy clear, which skips the actual clear but tells the tiler driver that the previous tile contents are not required. Not sure if there is already such a thing (but if so, it's probably gles only and not gl). If there is interest in such a thing, I could propose a mesa extension.
I think, for a low end / memory bandwidth constrained system, having an option to skip passes 4-6 would be a good thing (and instead have pass 3 render directly to GL_RGB). In addition, doing passes 1-2 at a lower resolution and/or lower precision seems like it would be useful. I would expect the combination of the two to get us close to previous performance levels.
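As a sketch of what that option could look like (made-up names, not STK's code): pick the scene color format based on whether HDR is enabled, and only run the tone-mapping passes in the HDR case:

```c
/* hdr_enabled, w, h are assumed to come from user config / window size */
GLenum internal_fmt = hdr_enabled ? GL_R11F_G11F_B10F : GL_RGB8;
GLenum pixel_type   = hdr_enabled ? GL_FLOAT           : GL_UNSIGNED_BYTE;

glBindTexture(GL_TEXTURE_2D, scene_color_tex);   /* made-up texture name */
glTexImage2D(GL_TEXTURE_2D, 0, internal_fmt, w, h, 0, GL_RGB, pixel_type, NULL);

/* later, in the frame loop:
 *   if (hdr_enabled) run the tone-mapping / hdr passes (passes 4-6);
 *   else the scene pass output is already displayable RGB8. */
```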