Experiment with a raytracing approach #185
Comments
I suspect this will not be a performance improvement. I experimented with doing the render loop in the fragment shader rather than a separate compute kernel, and performance was significantly worse. I think the main thing that's going on is that fragments from different tiles can get scheduled to the same simd group (or subgroup, wave, or warp, whichever terminology you prefer), which kills you on both thread divergence and global memory read patterns (in piet-metal, each render thread reads the same global memory location in lockstep). With a compute kernel, you get tremendous benefit from being able to schedule the work explicitly.

Incidentally, I also tried encoding "simple" tiles into the vertex->fragment data and using a point sprite (moving all global memory reads to the vertex shader), and the performance of that was also terrible. For this I don't have as much insight, as it seemed reasonable to me that the global memory reads were the expensive part. One hypothesis is that it gets killed by register pressure, which in turn drives down simd utilization.

All this is worth continuing to experiment with, of course.
Interesting, thanks. It really sounds like all this stuff is going to be very hardware-dependent. I think doing the work in compute could be interesting as an alternative mode for Pathfinder on some hardware. But on mobile I would like to see pixel-local storage tried as a solution first.
Not to keep flogging the same point, but part of the appeal of compute is that it is not very hardware-dependent. What actually happens during rasterization varies wildly from GPU to GPU; see this video for some nice visualizations. Similarly, I expect the relative performance of things like the blend unit vs. ALU to vary widely too. By comparison, I think it's possible to wrap one's head around the performance of compute: schedule your block sizes so you get good simd utilization without too much register pressure, make your memory access patterns respect the memory hierarchy, avoid divergence, and you should be in pretty good shape.

This issue is also a good chance to point out that these systems resemble "Random-access rendering of general vector graphics" a bit more than MPVG: fixed-size tiling rather than complex data structures, and so on. I believe the RAVG work suffered from the issues of putting the render loop in the fragment shader; to a large extent you can see piet-metal as adapting its basic ideas to compute kernels.

My gut feeling is that when this all washes out, the best performance will be: if compute, then (a refined version of) piet-metal; if non-compute GPU, then Pathfinder 3. It's always possible to tune things, but I don't see anything yet that invalidates this idea. Of course people should also be evaluating Spinel; I've talked recently with some of the folks involved but still don't have anything remotely resembling an understanding of its performance relative to the rest of the literature.
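The scheduling point above can be illustrated with a toy CPU model. With an explicit compute dispatch you can map one workgroup per tile, so every thread in a wave fetches the same per-tile command list in lockstep; the hardware rasterizer is free to pack fragments from different tiles into one wave, making those fetches divergent. All sizes and the pixel-to-tile mappings below are illustrative, not any real GPU's behavior or either project's actual layout.

```python
# Toy model: count how many distinct per-tile command lists a simd group
# ("wave") must read under two different thread-scheduling policies.
# WAVE and TILE_PIXELS are hypothetical sizes chosen for the illustration.

WAVE = 8          # threads per simd group (illustrative)
TILE_PIXELS = 8   # pixels per tile, set equal to WAVE for a clean example

def tiles_touched_per_wave(pixel_to_tile):
    """For each wave of WAVE consecutive threads, count distinct tiles read."""
    waves = [pixel_to_tile[i:i + WAVE]
             for i in range(0, len(pixel_to_tile), WAVE)]
    return [len(set(wave)) for wave in waves]

# Compute-style mapping: threads are grouped tile by tile (explicit schedule).
compute_order = [p // TILE_PIXELS for p in range(32)]
# Rasterizer-style mapping: fragments from 8 neighboring tiles interleaved.
raster_order = [p % 8 for p in range(32)]

print(tiles_touched_per_wave(compute_order))  # every wave reads 1 tile's list
print(tiles_touched_per_wave(raster_order))   # every wave reads 8 tiles' lists
```

In the compute mapping every thread in a wave reads the same command list (a uniform, lockstep access), while in the interleaved mapping each wave has to service eight different lists, which is the divergence and memory-pattern cost described above.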
Well, if compute is actually faster, then we should probably add it to Pathfinder as an optional path; almost all the tiling code is equally applicable to a compute-based rasterizer, after all. I suppose my hesitancy about compute primarily comes from the fact that it's fundamentally a bet against all the features the hardware has specifically added for our use case, especially pixel-local storage. Pixel-local storage is designed precisely to address the main performance advantage of compute: being able to blend programmatically in fast local memory. Once you remove that advantage, you're basically betting that you can do better in software than the rasterization hardware can, which is not a bet I'm comfortable making long-term.

One of the takeaways I've had from all of this vector graphics work is that 2D vector rasterization isn't really that different from 3D rasterization; the hardware rasterizer is essentially a 2D rasterizer to begin with. Note that one advantage of using the hardware rasterizer is that you can scope the execution of the fragment shader to arbitrary subtile rects, taking advantage of the dynamic scheduling the hardware rasterizer offers. This isn't really worth doing in compute: you could rig up some sort of complex subtiling scheme to approximate it, but it will always be slow relative to the HW rasterizer. With compute you're betting that the efficiency gains of having one pass and doing blending in local memory will outweigh this work-inefficiency. Maybe so right now, but pixel-local storage shrinks that advantage significantly.

If Intel switches to a mobile-like tiling architecture, as I've heard they may, and gains pixel-local storage, then desktop will also be able to achieve fast blending… (On the other hand, I've heard from folks who work with Adreno that shader framebuffer fetch currently has a big performance overhead on Qualcomm hardware for some reason. Avoiding this kind of thing is a nice advantage of compute.)
I filed #187 about using pixel-local storage/shader framebuffer fetch for a single pass.
We basically have this now in the D3D11 backend. Closing.
It might make sense to try encoding all the fills for each tile in a texture and stepping through them in a loop in the tile fragment shader, blending each pixel in software. This is basically what an early version of WebRender did, and what @raphlinus' piet-metal does. The advantage is that we have one rendering step instead of two. The disadvantage is that we can no longer use the hardware rasterizer's fragment scheduler, and we may therefore have worse thread utilization.
My hypothesis, which has a very good chance of being wrong, is that this will be slower than current Pathfinder on most high-performance desktop hardware and mobile hardware, assuming we take advantage of pixel-local storage on the latter. However, it may be faster on some lower-power desktop GPUs, such as Intel. This is likely to be very hardware-dependent.
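The single-pass scheme described in this issue can be sketched on the CPU: each tile carries a list of fills (standing in for the per-tile fill texture), and the "fragment shader" loops over them, blending into a local register so the framebuffer is written exactly once per pixel. This is a minimal illustration of the idea, not Pathfinder's or piet-metal's actual data structures; the field names and the coverage callback are hypothetical.

```python
# CPU sketch of per-tile software blending: iterate a tile's fill list and
# accumulate premultiplied source-over blends in a local register (the analog
# of pixel-local storage), then do a single framebuffer write at the end.

def blend_over(src, dst):
    """Premultiplied-alpha source-over blend, done 'in software'."""
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    inv = 1.0 - sa
    return (sr + dr * inv, sg + dg * inv, sb + db * inv, sa + da * inv)

def shade_tile_pixel(fills, px, py):
    """One 'fragment': step through the tile's fills, blending locally."""
    color = (0.0, 0.0, 0.0, 0.0)  # local accumulator; never touches memory
    for fill in fills:
        cov = fill["coverage"](px, py)  # 0..1 coverage at this pixel
        r, g, b, a = fill["rgba"]       # straight-alpha fill color
        a *= cov
        # Premultiply, then blend over the running accumulator.
        color = blend_over((r * a, g * a, b * a, a), color)
    return color  # the single framebuffer write for this pixel

# Example: an opaque red fill, then a half-transparent blue fill on top.
red = {"rgba": (1.0, 0.0, 0.0, 1.0), "coverage": lambda x, y: 1.0}
blue = {"rgba": (0.0, 0.0, 1.0, 0.5), "coverage": lambda x, y: 1.0}
print(shade_tile_pixel([red, blue], 0, 0))  # -> (0.5, 0.0, 0.5, 1.0)
```

The trade-off discussed above shows up directly in this shape: the loop length varies per tile (so waves mixing tiles diverge), but all blending happens in the `color` register with one write at the end, which is exactly what pixel-local storage gives the hardware-rasterizer path.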