
Experiment with a raytracing approach #185

Closed
pcwalton opened this issue Jun 5, 2019 · 6 comments

@pcwalton (Collaborator) commented Jun 5, 2019

It might make sense to try encoding all the fills for each tile in a texture and stepping through them in a loop in the tile fragment shader, blending each pixel in software. This is basically what an early version of WebRender did, and what @raphlinus' piet-metal does. The advantage is that we have one rendering step instead of two. The disadvantage is that we can no longer use the hardware rasterizer's fragment scheduler, and we may therefore have worse thread utilization.
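
To make the idea concrete, here is a loose CPU sketch (not Pathfinder code) of that single pass: each tile stores its encoded fills, and one "fragment shader" loop walks them, accumulating coverage and blending in software. The `(x0, x1, winding)` fill encoding and both function names are simplifications invented for illustration.

```python
def shade_pixel(fills, px):
    """Accumulate signed horizontal coverage for the pixel span [px, px+1),
    looping over every fill encoded for this tile -- the work the single-pass
    fragment shader would do per pixel."""
    coverage = 0.0
    for x0, x1, winding in fills:
        # Fraction of the pixel's span covered by the fill's span [x0, x1).
        overlap = max(0.0, min(x1, px + 1.0) - max(x0, px))
        coverage += winding * overlap
    return min(abs(coverage), 1.0)  # non-zero fill rule, clamped

def blend_over(dst, src, alpha):
    """Software source-over blend, as the in-shader loop would apply it."""
    return tuple(s * alpha + d * (1.0 - alpha) for s, d in zip(src, dst))
```

For example, a single fill covering the left half of a pixel yields 50% coverage, and two opposite-winding fills cancel to zero, which is the behavior the two-pass approach currently gets from the fill pass plus compositing.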

My hypothesis, which has a very good chance of being wrong, is that this will be slower than current Pathfinder on most high-performance desktop hardware and mobile hardware, assuming we take advantage of pixel-local storage on the latter. However, it may be faster on some lower-power desktop GPUs, such as Intel. This is likely to be very hardware-dependent.

@raphlinus commented Jun 6, 2019

I suspect this will not be a performance improvement. I experimented with doing the render loop in the fragment shader rather than a separate compute kernel, and performance was significantly worse. I think the main thing that's going on is that fragments from different tiles can get scheduled to the same simd group (or subgroup or wave or warp, whichever terminology you prefer), and this kills you on both thread divergence and global memory read patterns (in piet-metal, each render thread reads the same global memory location in lockstep). With a compute kernel, you get tremendous benefit from being able to schedule the work explicitly.

Incidentally, I also tried encoding "simple" tiles into the vertex->fragment data and using a point sprite (moving all global memory reads to the vertex shader), and the performance of that was also terrible. I don't have as much insight into this one, since it seemed reasonable to me that the global memory reads were the expensive part. One hypothesis is that it gets killed by register pressure, which in turn drives down simd utilization.

All this is worth continuing to experiment with, of course.

@pcwalton (Collaborator, Author) commented Jun 6, 2019

Interesting, thanks. Really sounds like all this stuff is going to be very hardware-dependent.

I think doing the work in compute could be interesting as an alternative mode for Pathfinder on some hardware. But on mobile I would like to see pixel-local storage tried as a solution first.

@raphlinus commented Jun 6, 2019

Not to keep flogging the same point, but part of the appeal of compute is that it is not very hardware-dependent. What actually happens during rasterization is wildly divergent from GPU to GPU, see this video for some nice visualizations. Similarly, I expect the relative performance of things like blend unit vs ALU to also vary widely. By comparison, I think it's possible to wrap one's head around the performance of compute - schedule your block sizes so you get good simd utilization and not too much register pressure, make your memory access patterns respect the memory hierarchy, avoid divergence, and you should be pretty good.
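
The "block sizes vs register pressure" tradeoff mentioned above amounts to back-of-the-envelope occupancy arithmetic. The register-file size, warp limit, and kernel figures below are illustrative NVIDIA-like numbers, not anything measured in this thread.

```python
def resident_warps(regs_per_thread, threads_per_block,
                   regs_per_sm=65536, warp_limit=64, warp_size=32):
    """Warps resident per SM when occupancy is limited only by register use.
    Fewer resident warps means less latency hiding for memory reads."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * threads_per_block
    blocks = regs_per_sm // regs_per_block
    return min(blocks * warps_per_block, warp_limit)

# A lean kernel keeps the SM full; quadrupling per-thread register use
# cuts the resident warp count by 4x.
full = resident_warps(32, 256)   # 64 warps resident
heavy = resident_warps(128, 256) # 16 warps resident
```

The point is that this kind of estimate is something you can actually reason about and tune, whereas the fixed-function rasterizer's scheduling behavior is opaque and differs across vendors.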

Also this issue is a good chance to point out that these systems resemble Random-access rendering of general vector graphics a bit more than MPVG - fixed size tiling rather than complex data structures, etc. I believe the RAVG work suffered from the issues of putting render in the fragment shader; to a large extent you can see piet-metal as adapting the basic ideas to compute kernels.

My gut feeling is that when this washes out, the best performance will be "if compute, then (a refined version of) piet-metal; if non-compute GPU, then Pathfinder 3". It's always possible to tune things, but I don't see anything yet that invalidates this idea. Of course people should also be evaluating Spinel - I've talked recently with some of the folks involved but still don't have anything remotely resembling an understanding of its performance with respect to the rest of the literature.

@pcwalton (Collaborator, Author) commented Jun 6, 2019

Well, if compute is actually faster then we should probably add it to Pathfinder as an optional path. Almost all the tiling code is equally applicable to a compute-based rasterizer, after all.

I suppose my hesitancy about compute primarily comes from the fact that it's fundamentally a bet against all the features that the hardware has specifically added for our use case, especially pixel-local storage. Pixel-local storage is precisely designed to address the main performance advantage of compute: being able to programmatically blend in fast local memory. Once you remove that advantage, you're basically betting that you can do better in software than the rasterization hardware can do, which is not a bet I'm comfortable making long-term.

One of the takeaways I've had from all of this vector graphics work is that 2D vector rasterization isn't really that different from 3D rasterization. The hardware rasterizer is essentially a 2D rasterizer to begin with.

Note that one of the advantages of using the hardware rasterizer is that you can scope the execution of the fragment shader to arbitrary subtile rects, taking advantage of the dynamic scheduling that the hardware rasterizer offers. This is something that isn't really worth doing in compute; it is possible to rig up some sort of complex subtiling scheme to approximate it, but it'll always be slow relative to the HW rasterizer. With compute you're betting that the efficiency gains of having one pass and doing blending in local memory will outweigh this work-inefficiency. Maybe so right now, but pixel-local storage shrinks this advantage significantly. If Intel switches to a mobile-like tiling architecture as I've heard they may and gains pixel-local storage, then desktop will also be able to achieve fast blending…

(On the other hand, I've heard from folks who work with Adreno that shader framebuffer fetch has a big performance overhead on Qualcomm hardware right now for some reason. Avoiding this kind of thing is a nice advantage of compute.)

@pcwalton (Collaborator, Author) commented Jun 6, 2019

I filed #187 about using pixel-local storage/shader framebuffer fetch for a single pass.

@pcwalton (Collaborator, Author) commented Jul 28, 2020

We basically have this now in the D3D11 backend. Closing.

@pcwalton closed this Jul 28, 2020