GPU layout #3824
@pcwalton posted about his preliminary work to the [mozilla.dev.servo mailing list] on March 10, 2014, but those mails are not archived because of trouble with the mailing list system, so I'll reproduce them here.
Hello, I had to fix a small number of bugs to get it to run on my hardware (MacBook Pro Retina with Intel Iris 5100), on both x86 and x64, under OS X and Windows 8.1 Pro. There remain one or two small issues I hope to fix in the coming days. I can confirm that my GPU version is about 2x slower than the OpenCL parallel CPU version, despite having unified memory. Having optimized my fair share of algorithms like this in the past, I have observed the following things about the current sample algorithm:
All of the above points highlight the fact that the current comparison isn't really fair. The current algorithm and memory access pattern favor the CPU significantly and do not translate to the GPU as-is in any meaningful way (and the data might not be representative of real-world data). I believe a better approach would be to optimize both versions with real-world data.

When optimizing anything, it is imperative to have a reference implementation to improve on (which this sample is also missing). To make the comparison fair, we would have to compare an optimized CPU version (parallel or not, whichever is faster) against an optimized GPU version (possibly alongside the reference implementation, if that is not the optimized CPU version).

I have quite a few ideas for making the GPU version faster, but I also have ideas for making the CPU version faster. One thing is clear, however: CSS selector matching is very easy to parallelize, and if we can rework the memory access pattern, it might end up significantly faster than it currently is even if the CPU version ends up consistently faster. I can see the GPU winning only if the following conditions are (at least partially) met: there is a large number of DOM nodes, memory latency can be hidden, or clever use of the hardware caches gives it an edge.

What is Servo currently using? Is it a serial implementation, or a parallel implementation close to the OpenCL version? How does this sound?
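To make the "compare against a reference implementation" point concrete, here is a minimal sketch of the kind of harness the comment argues for. Everything in it is hypothetical (the `MatchFn` signature, the stand-in predicate, the function names); the key idea is that a candidate's output is checked against the reference before any timing is trusted.

```rust
use std::time::Instant;

// Hypothetical selector-matching signature: each implementation maps
// node ids to "matched" booleans for a fixed set of selectors.
type MatchFn = fn(&[u32]) -> Vec<bool>;

// Reference implementation: trivially slow but obviously correct.
// The predicate is a stand-in for real selector matching.
fn reference_match(nodes: &[u32]) -> Vec<bool> {
    nodes.iter().map(|&n| n % 3 == 0).collect()
}

// "Optimized" candidate to validate and time against the reference.
fn optimized_match(nodes: &[u32]) -> Vec<bool> {
    nodes.iter().map(|&n| n % 3 == 0).collect()
}

// Time a candidate and gate the timing on correctness.
fn bench(name: &str, f: MatchFn, nodes: &[u32], expected: &[bool]) -> bool {
    let start = Instant::now();
    let got = f(nodes);
    println!("{}: {:?}", name, start.elapsed());
    got == expected
}

fn main() {
    let nodes: Vec<u32> = (0..100_000).collect();
    let expected = reference_match(&nodes);
    assert!(bench("reference", reference_match, &nodes, &expected));
    assert!(bench("optimized", optimized_match, &nodes, &expected));
}
```

The same harness would then be fed real-world node data rather than the synthetic range used here.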
So there are enough problems with selectron that I'm not sure it's a good base to begin with anyhow. It doesn't really do cascading properly, which in the real world is 100KB of horribly branchy compiled code using virtual dispatch, because all CSS selectors cascade in different ways. (See all the implementations of …) Basically, I don't think selector matching will work on the GPU with current designs. It's just too memory-bound and branchy. There are other potential problems to look at on the GPU, however. I've had ideas for GPU inline layout and GPU text shaping. I think GPU inline layout has the potential for bigger wins, because the shaping cache tends to warm up fast, so we don't spend a lot of time there. It's also tantalizingly simple, though it's, again, quite possibly too memory-bound.
Regarding selector matching, I'd rather look at a CSS JIT like the one WebKit has. If you do want to optimize the selectron algorithm, it might be worth thinking about ways to slot it into an "asm.css"-style fast-path subset of CSS. As I said above, I don't think it will scale to the full generality of CSS cascading, so you'd need to limit it to a fast path. (Though, honestly, if we're considering an "asm.css", it might end up being so fast that there's no point running it on the GPU, since it won't be a bottleneck anymore!)
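One way to read the "asm.css" idea is as a classifier that routes each selector to either a restricted fast path or the fully general matcher. The sketch below is purely illustrative: the `Component` enum and the rule that pseudo-classes force the slow path are assumptions for the example, not a real definition of such a subset.

```rust
// Hypothetical simplified selector component; real CSS has many more forms.
#[derive(Debug)]
enum Component {
    Tag(&'static str),
    Class(&'static str),
    Id(&'static str),
    Descendant,              // the ' ' combinator
    PseudoClass(&'static str), // :hover, :nth-child(...), ...
}

// An assumed "asm.css"-style test: only tag/class/id selectors joined by
// descendant combinators qualify for the fast path; anything involving
// pseudo-classes (where cascading behavior diverges) falls back to the
// general matcher.
fn is_fast_path(selector: &[Component]) -> bool {
    selector
        .iter()
        .all(|c| !matches!(c, Component::PseudoClass(_)))
}
```

A stylesheet would be partitioned once at parse time, so the per-node matching loop only ever sees selectors already known to be fast-path safe.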
Actually, I have a WIP optimized GPU version, and so far, if we can hide part of the host -> device pre-processing copy latency, it is faster than the equivalent CPU version (copy to/from device memory plus kernel time is as fast as the CPU version; both versions take about 4.4ms on my laptop, with the GPU kernel itself taking about 2.4ms).

The GPU is surprisingly good at hiding memory latency when you tune the group size properly. When the GPU stalls on a memory read/write, it context-switches to another group on the same compute unit. This hides latency considerably and is something the CPU cannot easily do. The only thing to watch out for is keeping register and local memory usage to a minimum, to ensure as many groups as possible can run on the same compute unit. (Context switching on the GPU is very cheap, since there is no stack and registers are not spilled to main or local memory.) Also, on many platforms where memory is not unified, GPU memory can often be faster than main memory.

I'm still unsure whether it will end up faster than a fully optimized parallel CPU implementation, and properly using unified memory will not be easy on most platforms that even have it. DX12 should help a lot here, as will Mantle, etc. If we can get close to zero-copy, the GPU might very well be a win (with selectron anyway). Another factor to consider is whether the CPU can do other things while the GPU works on this. Even if the GPU version turns out slower or on par, if it frees the CPU to do other things it could still end up a win. I'll take a look at the full implementation details when I get the chance, to see how all the pieces fit together.
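The trade-off described above (fewer registers and less local memory per group means more resident groups, hence better latency hiding) is simple resource arithmetic. The sketch below illustrates it with made-up hardware limits; the constants are assumptions for the example, not real device figures, which would come from the driver.

```rust
// Illustrative occupancy estimate: how many work-groups can be resident
// on one compute unit at once. All hardware numbers below are invented
// for the example; real values come from the device/driver.
fn groups_per_compute_unit(
    regs_per_item: u32,       // registers each work-item uses
    local_mem_per_group: u32, // bytes of local memory per group
    group_size: u32,          // work-items per group
) -> u32 {
    const REG_FILE: u32 = 65_536;  // registers per compute unit (assumed)
    const LOCAL_MEM: u32 = 65_536; // bytes of local memory per CU (assumed)
    const MAX_GROUPS: u32 = 16;    // hardware cap on resident groups (assumed)

    // Whichever resource runs out first limits how many groups the
    // scheduler can keep resident, and thus how much latency it can hide.
    let by_regs = REG_FILE / (regs_per_item * group_size);
    let by_local = if local_mem_per_group == 0 {
        MAX_GROUPS
    } else {
        LOCAL_MEM / local_mem_per_group
    };
    by_regs.min(by_local).min(MAX_GROUPS)
}
```

For instance, with these assumed limits, quadrupling register usage from 32 to 128 per item (at a group size of 64) cuts the resident groups from the hardware cap of 16 down to 8, halving the pool of groups available to switch to on a stall.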
After browsing the code for selector matching and the DOM traversal that controls it for a few hours, it appears that the GPU is a very poor fit for this: you were right. The amount of work required even to get it to run will not be small. Getting any kind of decent performance out of it would mean making significant changes to the memory layout of the required data, and ultimately performance would remain poor due to the massive amount of required code and branching. Even if we kept the code small by only hosting a few selector types per kernel, we would need a high number of dispatches, which would be no better for performance. Ultimately, provided we can get it to run on the GPU at all, it is likely to be significantly slower. On the upside, it seems entirely possible to hide whatever latency a host -> device copy would incur, but it is unclear to me whether we can hide it at all in the other direction (if copying is required). It seems much more reasonable to attempt to optimize the CPU version first. Much can be done to make it more hardware-friendly, and many of these optimizations could be of use later if we revisit a GPU implementation.

I took a quick look at CSS JIT, and it seems viable and very reasonable as a technique, although generating code at runtime for this might be a bad idea. Security implications aside (do we trust the JIT, etc.), and setting aside the amount of work required to write the code generation (although presumably we could reuse the JS JIT infrastructure or other libs/LLVM), there might be safer and easier approaches in the same vein. Many platforms also prevent (or frown upon) any form of JITing, such as iOS and many embedded devices (Xbox 360, PlayStation 3, etc.), although presumably if JS is able to JIT on a given platform, we should be too.

From my understanding, CSS JIT stitches together basic selector logic. It can do this in various ways, which we can emulate to a large extent without a compiler (inlining, dynamic branching, static branching, etc.). We could initially (and easily) stitch our basic blocks into chains of function pointers to Rust code (thus 100% Rust and safe). The only disadvantage versus a JIT is that not all the selector code will reside on the same code page, nor be inlined in the case of compound selectors. The upside is that we avoid the memory/compiler overhead and everything remains simple and 100% Rust.

I have yet to look in depth at where the execution time is spent exactly. I won't have much spare time in the next 2-5 months, but I hope to profile these parts whenever I can to gain a better understanding of how everything works. Perhaps it would be best to create another issue for optimizing this on the CPU, to avoid polluting this one with non-GPU details. We could probably close this issue.
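The "chains of function pointers" idea can be sketched in a few lines of safe Rust. The `Node` struct, the matcher names, and the predicates are all hypothetical stand-ins for illustration; only the technique (compose compound selectors from function-pointer basic blocks instead of JITed code) comes from the comment above.

```rust
// Hypothetical minimal DOM node for illustration.
struct Node {
    tag: &'static str,
    classes: Vec<&'static str>,
}

// One "basic block" of selector logic: a plain Rust function pointer.
type Matcher = fn(&Node) -> bool;

fn is_div(n: &Node) -> bool {
    n.tag == "div"
}

fn has_class_warn(n: &Node) -> bool {
    n.classes.contains(&"warn")
}

// A compound selector becomes a chain of function pointers: the node
// matches only if every block in the chain accepts it. No JIT, no
// executable code pages -- just safe Rust plus one indirect call per block.
fn matches_chain(chain: &[Matcher], node: &Node) -> bool {
    chain.iter().all(|m| m(node))
}
```

In this scheme, a selector like `div.warn` would be translated once at parse time into `vec![is_div as Matcher, has_class_warn]`, so matching a node is just a walk over that vector. As the comment notes, the blocks are neither inlined nor co-located on one code page, which is the cost paid for avoiding a compiler.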
Threaded code for CSS selector matching sounds like a great idea (or even a bytecode with an interpreter?). Even if we eventually go with a JIT, it would be a great starting point and a fallback for non-JIT environments.
I started working on this some time ago, and you can see the progress so far here. I have a solid canonical NFA implementation working, and its performance is very close to that of the current implementation. Once I'm done with the optimizations I have planned, it should be just as fast as (or maybe even a bit faster than) the current implementation, and it will later serve as a fallback for pathological CSS that cannot use the GPU/DFA. It will also be the foundation for the next step: generating a DFA from our NFA. The code above isn't usable as-is, since I had to instrument Servo and I haven't branched it locally yet. All in all, I should be done with the NFA work this month if all goes well. Next steps:
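For readers unfamiliar with the NFA framing: matching a descendant selector like `div p span` against a node amounts to checking that the subject matches the node and the remaining parts appear in order among its ancestors, which an NFA does by tracking a set of live states instead of backtracking. The sketch below is illustrative only (tag-only parts, `&str` everywhere); the linked implementation necessarily handles far more selector forms.

```rust
// NFA-style matching of a descendant selector (e.g. ["div", "p", "span"])
// against a node's tag chain listed from the root down to the node itself.
// states[i] == true means the first i non-subject parts have matched.
fn nfa_matches(parts: &[&str], tags_root_to_node: &[&str]) -> bool {
    match (parts.split_last(), tags_root_to_node.split_last()) {
        (None, _) => true, // empty selector matches trivially
        (Some(_), None) => false,
        (Some((subject, rest_parts)), Some((node, ancestors))) => {
            // The subject (last part) must match the node itself.
            if subject != node {
                return false;
            }
            // Simulate the NFA over the ancestors, root first.
            let mut states = vec![false; rest_parts.len() + 1];
            states[0] = true;
            for tag in ancestors {
                // Walk states high-to-low so each tag advances a given
                // live state at most once per step.
                for i in (0..rest_parts.len()).rev() {
                    if states[i] && rest_parts[i] == *tag {
                        states[i + 1] = true;
                    }
                }
            }
            states[rest_parts.len()]
        }
    }
}
```

Because every state is a boolean, the state set is a fixed-size bit vector per selector, which is also what makes the later NFA-to-DFA step tractable: each DFA state is just one such set.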
We have a work queue for layout tasks; all we have to do is write a GPU worker thread for it. How hard could it be? ;)
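A toy version of such a worker, with a channel standing in for the layout work queue, might look like this. The `Work` enum and the worker's behavior are assumptions for the sketch; the GPU dispatch is faked with a CPU computation, where a real worker would upload the batch, run a kernel, and read results back.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical layout work item; the real work queue is more involved.
enum Work {
    MatchSelectors(Vec<u32>), // a batch of node ids to match
    Shutdown,
}

// A "GPU worker" thread that drains batches from the shared queue.
// Returns how many node ids it processed, for illustration.
fn spawn_gpu_worker(rx: mpsc::Receiver<Work>) -> thread::JoinHandle<usize> {
    thread::spawn(move || {
        let mut processed = 0;
        while let Ok(work) = rx.recv() {
            match work {
                // Stand-in for: upload batch, dispatch kernel, read back.
                Work::MatchSelectors(batch) => processed += batch.len(),
                Work::Shutdown => break,
            }
        }
        processed
    })
}
```

Batching matters here: each send should carry enough nodes to amortize the host -> device copy the earlier comments identify as the dominant cost.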
This makes the most sense on systems like AMD's APU, where CPU ↔ GPU transfers are zero-copy (though still not free, due to cache effects).
@pcwalton did some preliminary work on this.