
GPU layout #3824

Open
kmcallister opened this issue Oct 28, 2014 · 10 comments

Comments

kmcallister (Contributor) commented Oct 28, 2014

We have a work queue for layout tasks, all we have to do is write a GPU worker thread for it. How hard could it be? ;)

This makes the most sense on systems like AMD's APUs, where CPU-to-GPU transfers are zero-copy (though still not free, due to cache effects).

@pcwalton did some preliminary work on this.

tetsuharuohzeki (Member) commented Oct 28, 2014

@pcwalton posted updates on his preliminary work to the [mozilla.dev.servo mailing list] on March 10, 2014, but those mails were not archived because of trouble with the mailing-list system...

So I'll reproduce his mails here:

tetsuharuohzeki (Member) commented Oct 28, 2014

From: Patrick Walton
Date: 2014-03-10 12:02 GMT+09:00
Subject: [dev-servo] Exploratory work for layout on the GPU
To: dev-servo@lists.mozilla.org

Hi everyone,

Over the weekend I created a small repo to play around with selector matching on the GPU:

https://github.com/pcwalton/selectron

There's a rough prototype of selector matching in there, with CPU, OpenCL (GPU and CPU), and CUDA versions. I've only tried it on my MBP's GeForce GT 650M.

So far the numbers have not been particularly good: 10%-100% slower than the parallel CPU numbers, depending on the size, even without counting memory transfers. (It is assumed that anywhere we would want to deploy this would have zero-copy operation.)

From my profiling it seems as though the branchiness of selector matching is not the problem: selector matching is surprisingly straight-line if the hash tables are implemented properly. Rather the issue is that at least my GPU really wants to read in multiple DOM nodes in the same 128-byte cache line. Because it's not a realistic assumption that more than a couple of DOM nodes fit in a 128-byte cache line, 89% (!!) of instructions end up replayed because of memory reads. Artificially lowering the size of DOM nodes to unrealistic levels and packing them together (cheating as far as I'm concerned) brings the performance up again. But I don't know how to make that work in the face of a dynamically changing DOM.

I'll try soon on Kaveri, but indications are that we'll have some hurdles to overcome.

Patrick


dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
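
The cache-line problem Patrick describes above suggests packing only the fields selector matching actually reads into a compact, contiguous buffer before each dispatch (a structure-of-arrays-style layout). A minimal Rust sketch of that idea, with illustrative struct and field names (not selectron's or Servo's actual types):

```rust
// Hypothetical full DOM node: large, so only a couple fit in a 128-byte
// cache line, which is what causes the instruction replays above.
struct DomNode {
    id_hash: u32,
    class_hash: u32,
    tag_hash: u32,
    _other_fields: [u8; 120], // layout data, pointers, etc. (assumed)
}

// Compact record holding only what selector matching reads (12 bytes).
// Ten of these fit in a 128-byte cache line, so adjacent GPU threads
// reading adjacent records hit one line instead of ten.
#[derive(Clone, Copy, PartialEq, Debug)]
struct MatchInput {
    id_hash: u32,
    class_hash: u32,
    tag_hash: u32,
}

fn pack_for_matching(nodes: &[DomNode]) -> Vec<MatchInput> {
    nodes
        .iter()
        .map(|n| MatchInput {
            id_hash: n.id_hash,
            class_hash: n.class_hash,
            tag_hash: n.tag_hash,
        })
        .collect()
}
```

The cost is an extra gather pass on every restyle, which is exactly why a dynamically changing DOM makes this awkward: the packed buffer has to be rebuilt (or incrementally patched) before each GPU dispatch.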

tetsuharuohzeki (Member) commented Oct 28, 2014

From: Patrick Walton
Date: 2014-03-11 13:14 GMT+09:00
Subject: Re: [dev-servo] Exploratory work for layout on the GPU

As a follow-up, I profiled a synthetic approximation of selector matching on the AMD Kaveri (AMD A10-7850K 3.70 GHz quad-core CPU + Radeon R7 8-compute-unit GPU). This is a new "APU", released in Q1 this year, which has an integrated CPU and GPU on the same die and cache-coherent memory accesses between the two. Since CSS layout is so memory bound, this seems of interest to us.

Here are the results. The first section of the table represents the devices in my MacBook Pro; the second represents the devices on the Kaveri APU.

Number of DOM nodes: 102,400
Max DOM depth: 10 nodes
CSS styles: 32 32-bit values
DOM node size: 136 bytes
Number of CSS rules: 32 (25 ID rules + 8 tag rules)

Device | Memory | Copy/map time | Execution time
----------------------+-----------------+---------------+----------------
GPU: GeForce GT 650M | Device, copying | 8.85 ms | 3.44 ms
GPU: GeForce GT 650M | Device, mapped | 0.04 ms | 7.04 ms
CPU: Core i7 2.70GHz | Direct | 0.02 ms | 5.35 ms
----------------------+-----------------+---------------+----------------
GPU: A10-7850K APU | Device, copying | 29.81 ms | 3.31 ms
GPU: A10-7850K APU | Device, mapped | 0.02 ms | 2.42 ms
GPU: A10-7850K APU | Host, shared | 0.00 ms | 19.63 ms
CPU: A10-7850K APU | Direct | 0.02 ms | 7.62 ms
----------------------+-----------------+---------------+----------------

Some interesting conclusions that I've tentatively drawn from this:

  • The performance of host shared memory (the cache coherent stuff) is disappointing for our use case. AMD claims in the HSA manual that the shared memory bus operates at about 50% of the bandwidth available. This significantly hurts CSS selector matching, which is quite memory bound.
  • On the other hand, the cache coherent memory might be useful as a way for the GPU to "hand off" work items to the CPU, even if we don't store the whole DOM/frame tree in it.
  • The AMD GPUs, at least on the tightly integrated Kaveri APU, seem to be much more happy with strided memory accesses. This is good news for us: it means that using the GPU for selector matching on a dynamic DOM now appears more viable.
  • DMA bandwidth is really high; transfers between CPU and GPU are not really that costly on desktops if DMA is used, at least compared to the cost of the selector matching as a whole.
  • As far as I interpret the docs, AMD states that on the HSA device memory can be read from the host in a zero-copy manner (with some penalty), which could explain why the "device, mapped" section is so cheap on that system. (On the other hand, it could just be that the DMA is so fast when both components are on the same die that I don't even notice it happening.)
  • At least on the Kaveri APU, it tentatively seems that allocating the DOM and style sheets on device memory and using the GPU is faster than using the CPU for selector matching. The GPU even beats a quad-core Core i7, which is pretty neat.
  • I have not tested the dynamic scheduling stuff -- i.e. following links throughout the DOM tree -- because the driver support is not yet there for that. (AMD is planning on a Q2 release of the necessary drivers.) For this and many other reasons, take these numbers with a grain of salt.

I updated my repo with the newest code.

Patrick



nfrechette commented Jul 8, 2015

Hello,
I wanted to contribute to servo somehow and this issue caught my eye since I had discussed this in our interview. I decided to take a look at your sample code and forked it here: https://github.com/nfrechette/selectron

I had to fix a small number of bugs to get it to run on my hardware (MacBook Pro Retina with Intel Iris 5100), on both x86 and x64, under OS X and Windows 8.1 Pro. There remain a small issue or two I hope to fix in the coming days. I can confirm that I see numbers about 2x slower for my GPU version compared to the OpenCL parallel CPU version, despite having unified memory.

Having optimized my fair share of algorithms like this in the past, I have observed the following things about the current sample algorithm:

  • The DOM node is fairly large, but only the node ID and class information (12 bytes) is used as input. Both my CPU and GPU have 64-byte cache lines, meaning there is a lot of waste here. Some of the cache line will be written to at the end with the resulting styles, but only partially. On the CPU this isn't so bad if the node size is fixed and all nodes are laid out linearly in memory: the short stride required for access ensures that hardware prefetching kicks in, and we perform enough work per node that most or all of the latency should be hidden. On the GPU, this isn't so great. Each compute unit on my GPU has 7 threads, and each thread will read and write a different cache line, meaning each thread would require prefetching of a separate line. In a memory-bound algorithm like this, hardware prefetching (if present at all on my GPU) might hinder performance, since there is typically a fixed and small number of concurrent prefetches that can be in flight. On the GPU it is generally better if neighbouring threads of a compute unit access neighbouring memory locations (each compute unit has its own L1). Because all threads in a compute unit execute in lockstep, the more memory you wait for, the slower it will be. The same also applies to writing memory.
  • Sorting is slower on the GPU (with local memory) than on the CPU. This is most likely due to the uneven branching behavior of insertion sort. Perhaps a different sorting algorithm on the GPU might yield better results (for large data sets, bitonic sort is usually pretty good but I'm not sure if it would be for small sets as well). Due to the small number of elements, a brute force search might also be viable.
  • Reducing the matched structure size would help on both CPU and GPU. A rule offset could be used to save 4 bytes. On the GPU this would help us reduce the local memory used and help us go wider.
  • Using pointers to access memory on the GPU appears to be slower. By changing the access from pointers to base pointers + offsets, I got a small win.
  • Integer division is used when calculating the hash indices. The table size is currently a power of two, which is good; for non-power-of-two sizes, division on the GPU is best done on unsigned integers, which is generally slightly faster.
  • The mapped copy is slow because the kernel must zero the pages on first access. Subsequent copies to the same location are much faster. Keeping data compact would be best. Perhaps we could preallocate segments or reuse them across multiple DOMs with masks.
  • The mapped copy is marking memory as read/write. This is bad, since it will most likely prevent the driver from using write-combined memory (reading write-combined memory would kill performance, but since this memory will not be read back, a read/write mapping is unnecessary).
  • The copy profiling numbers are not entirely accurate nor are the kernel execution numbers. If unified memory isn't used, following the unmap the driver will need to transfer the data to device accessible memory. This action will be performed during/after unmap returns (it will not wait until this finishes). This can cause the DMA timing to leak into the kernel timings we measure. We need to use GPU events to extract proper timing information since ideally we would hide this latency somehow and possibly not count the DMA cost towards the total cost.
  • If the upper bound is low on the stylesheet/properties sizes, we could move them into constant memory on the GPU which will end up in a different hardware cache.
  • Each DOM node is processed in a single GPU thread but using a single SIMD lane. On my GPU registers are 32 bytes wide, meaning we end up using about 12.5% of the available ALU. It might be good to process multiple nodes per thread (say 4 or 8) and interleave the memory reads/writes with computation. This could enable us to considerably reduce the time we spend waiting on memory by hiding the latency. This would also be a win for the CPU implementation for the same reason. While using actual SIMD-wide instructions for certain things (node ID hash computation, for example) might be easy on the GPU, it might be harder or simply slower on the CPU. Careful profiling will tell. If the GPU sorting can be done entirely in SIMD, the entire sorted array would fit in 4 registers.
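
For the integer-division point above: when the table capacity is a power of two, the modulo in the hash-index computation reduces to a bitwise AND, avoiding division entirely on both CPU and GPU. A small sketch (function names are illustrative, not selectron's):

```rust
/// Index into a hash table whose capacity is a power of two:
/// `hash % capacity` becomes a single AND with `capacity - 1`.
fn bucket_index(hash: u32, capacity_pow2: u32) -> u32 {
    debug_assert!(capacity_pow2.is_power_of_two());
    hash & (capacity_pow2 - 1)
}

/// Fallback for arbitrary capacities: an unsigned modulo, which GPUs
/// generally handle slightly faster than the signed equivalent.
fn bucket_index_general(hash: u32, capacity: u32) -> u32 {
    hash % capacity
}
```

For power-of-two capacities the two functions agree, so the mask version is a drop-in replacement on the fast path.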

All of the above points highlight the fact that what we are currently comparing isn't really fair. The current algorithm and memory access pattern favor the CPU significantly and do not translate to the GPU as-is in a meaningful way (and the data might not be representative of real-world data). I believe a better way to approach this would be to optimize both with real-world data. When optimizing anything, it is imperative to have a reference implementation we are trying to improve on (which is also missing from this sample). To make this comparison fair, we would have to compare an optimized CPU version (parallel or not, depending on which is faster) and an optimized GPU version (possibly alongside the reference implementation, if not the optimized CPU version). I have quite a few ideas on how to make the GPU version faster, but I also have ideas on how to make the CPU version faster.

However, one thing is clear: CSS selector matching is very easy to parallelize, and if we can rework the memory access pattern, it might end up significantly faster than it currently is, even if the CPU version ends up consistently faster. I can see the GPU winning only if the following conditions are met (or partially met): there is a large number of DOM nodes, memory latency can be hidden, or clever use of the hardware caches gives it an edge.

What is Servo currently using? Is it a serial implementation, or a parallel implementation close to the OpenCL version?
Ideally we need to plug the WIP algorithms into Servo to ensure we are as close to the real thing as we can be. I would propose first optimizing the CPU version to make it as good as it can be before moving on to the GPU version. Ideally both versions will be kept side by side along the way to show the evolution, constantly comparing the output to the reference implementation to ensure correctness. A lot of the CPU wins will also be GPU wins. Once this is done, we can run it on various real-world input data sets and various hardware and compare apples to apples.

How does this sound?

pcwalton (Contributor) commented Jul 16, 2015

So there are enough problems with selectron that I'm not sure it's a good base to begin with anyhow—it doesn't really do cascading properly, which in the real world is 100KB of horribly branchy compiled code using virtual dispatch, because all CSS selectors cascade in different ways. (See all the implementations of to_computed_value.) :( If anything, selectron is far too nice to the GPU, because it cheats by avoiding all the complicated branchy selector matching logic, and even being horribly skewed toward the GPU in this way it still ends up much slower than the CPU.

Basically, I don't think selector matching will work on the GPU with current designs. It's just too memory bound and branchy.

There are potential other problems to look at on the GPU, however. I've had ideas for GPU inline layout and GPU text shaping. I think GPU inline layout has the potential for bigger wins, because the shaping cache tends to warm up fast and so we don't spend a lot of time there. It's also tantalizingly simple, though it's, again, quite possibly too memory bound.

pcwalton (Contributor) commented Jul 16, 2015

Regarding selector matching, I'd rather look at a CSS JIT like the one WebKit has.

If you do want to optimize the selectron algorithm, it might be worth thinking about ways to slot it into an "asm.css"-style fast path subset of CSS. As I said above, I don't think it will scale to the full generality of CSS cascading, so you'd need to limit it to a fast path. (Though, honestly, if we're considering an "asm.css" it might end up being so fast that there's no point running it on the GPU, since it won't be a bottleneck anymore!)

nfrechette commented Jul 16, 2015

Actually, I have a WIP optimized GPU version and so far, if we can hide part of the host -> device pre-processing copy latency, it is faster than the equivalent CPU version (copy to/from device memory + kernel time is as fast as the CPU version, both versions take about 4.4ms on my laptop with the GPU kernel itself taking about 2.4ms).

The GPU is surprisingly good at hiding memory latency when you tune the group size properly. When the GPU stalls on a memory read/write, it will context switch to another group on the same compute unit. This helps hide latency considerably and is something the CPU cannot easily do. The only thing to watch out for is keeping register and local memory usage to a minimum, to ensure as many groups can run on the same compute unit as possible. (Context switching on the GPU is very cheap since there is no stack and registers are not spilled to main or local memory.) Also, on many platforms where memory is not unified, GPU memory can often be faster than main memory.

I'm still unsure if it will end up faster than a fully optimized parallel CPU implementation, though, and properly using unified memory will not be easy on most platforms that even have it. DX12 should help a lot here, as will Mantle, etc. If we can get close to zero-copy, the GPU might very well be a win (with selectron, anyway).

Another factor to consider is whether or not the CPU can do other things while the GPU works on this. Even if the GPU version turns out slower or on par, if we can free the CPU to do other things it could still end up a win.

I'll make sure to take a look at the full implementation details when I get the chance to see how all the pieces fit together.

nfrechette commented Jul 22, 2015

After browsing the code for selector matching and the dom traversal that controls it for a few hours, it appears that the GPU is a very poor fit for this: you were right. The amount of work required to even get it to run will not be small. Getting any kind of decent performance out of it will imply making significant changes to the memory layout of the required data. And ultimately, the performance will remain poor due to the massive amount of required code and branching. Even if we kept the code small by only hosting a few selector types per kernel, we would require a high number of dispatches which will be no better for performance.

Ultimately, provided we can get it to run at all on the GPU, performance is likely going to be significantly slower. On the upside, it seems entirely possible to hide whatever latency would be incurred by a host -> device copy, but it is unclear to me whether we can hide it at all in the other direction (if copying is required).

It seems much more reasonable to attempt to optimize the CPU version first. Much can be done to make it more hardware friendly and many of these optimizations could be of use later if we wish to revisit a GPU implementation.

I took a quick look at CSS JIT and it seems viable and very reasonable as a technique, although generating code at runtime might be a bad idea. Setting aside the security implications (do we trust the JIT, etc.) and the amount of work required to write the code generation (presumably we could reuse the JS JIT machinery or other libs/LLVM), there might be safer and easier approaches in the same vein. Many platforms also prevent (or frown upon) any form of JITing, such as iOS and many other embedded devices (Xbox 360, PlayStation 3, etc.). Although, presumably, if JS is able to JIT on a given platform, we should be able to as well.

From my understanding, CSS JIT stitches together basic selector logic. It can do this in various ways, which we can emulate to a large extent without the need for a compiler (inlining, dynamic branching, static branching, etc.). We can initially (and easily) stitch our basic blocks into chains of function pointers to Rust code (thus 100% Rust and safe). The only disadvantage vs. a JIT is that not all the selector code will reside on the same code page, nor be inlined in the case of compound selectors. The upside is that we avoid the memory/compiler overhead and everything remains simple and 100% Rust.
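
To illustrate, the function-pointer stitching might look roughly like this: each simple selector compiles to a small matcher, and a compound selector becomes a chain of matchers run in sequence. This is only a sketch; all names here are illustrative, not Servo's actual types:

```rust
// Minimal stand-in for the element data a matcher inspects.
struct Element {
    tag: &'static str,
    id: Option<&'static str>,
    classes: Vec<&'static str>,
}

// A compiled simple selector: one function pointer plus its argument.
// Chaining these emulates what a JIT would stitch into straight-line code.
struct SimpleMatcher {
    test: fn(&Element, &'static str) -> bool,
    arg: &'static str,
}

fn match_tag(e: &Element, tag: &'static str) -> bool { e.tag == tag }
fn match_id(e: &Element, id: &'static str) -> bool { e.id == Some(id) }
fn match_class(e: &Element, c: &'static str) -> bool { e.classes.contains(&c) }

// A compound selector like `div.warning` is just a chain:
// every matcher in the chain must accept the element.
fn matches(element: &Element, chain: &[SimpleMatcher]) -> bool {
    chain.iter().all(|m| (m.test)(element, m.arg))
}
```

The trade-off is exactly the one described above: each `(m.test)(...)` call is an indirect branch a JIT would eliminate, but the whole thing stays safe Rust with no code generation.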

I have yet to look in depth at where the execution time is spent exactly. I won't have much spare time in the next 2-5 months, but I hope to profile these parts whenever I can to gain a better understanding of how everything works.

Perhaps it might be best to create another issue for optimizing this on the CPU to avoid polluting this with non-GPU details. We could probably close this issue.

pcwalton (Contributor) commented Jul 22, 2015

Threaded code for CSS selector matching sounds like a great idea (or even a bytecode with an interpreter?) Even if we eventually go with a JIT it'd be a great starting point and fallback for non-JIT environments.
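
The bytecode alternative could be as small as an opcode enum and an interpreter loop. A toy sketch, with an invented opcode set (real selectors would need combinators, pseudo-classes, and so on):

```rust
// A toy selector bytecode: each instruction tests one feature of the
// element, and the program matches only if every test passes.
enum Op {
    TagIs(&'static str),
    HasClass(&'static str),
    IdIs(&'static str),
}

struct Element {
    tag: &'static str,
    id: Option<&'static str>,
    classes: Vec<&'static str>,
}

// The interpreter: straight-line execution with early exit on failure.
// Programs are plain data, so they work in non-JIT environments and
// could later become the input to a real JIT.
fn run(program: &[Op], e: &Element) -> bool {
    for op in program {
        let ok = match op {
            Op::TagIs(t) => e.tag == *t,
            Op::HasClass(c) => e.classes.contains(c),
            Op::IdIs(i) => e.id == Some(*i),
        };
        if !ok {
            return false;
        }
    }
    true
}
```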

nfrechette commented Oct 5, 2015

I started working on this some time ago and you can see the progress so far here.

I have a solid canonical NFA implementation working, and the performance is very close to the current implementation. Once I'm done with the optimizations I have planned for it, it should be just as fast or maybe even a bit faster than the current implementation, and it will later serve as a fallback for pathological CSS that cannot use the GPU/DFA. It will also be the foundation for the next step: generating a DFA from our NFA.

The code above isn't usable as-is since I had to instrument servo and I haven't branched it yet locally. All in all I should be done with the NFA stuff this month if all goes well.

Next steps:

  • Generating and executing a DFA on the CPU
  • Simplify simple selector matching to make it suitable for GPU execution
  • Make sure the bloom filter test is GPU friendly
  • Run the DFA on the GPU
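
To give a flavor of the NFA fallback idea, here is a toy matcher for tag-only descendant selectors (e.g. `div p`) run against an element's ancestor path. The nondeterminism the NFA models is that a descendant combinator may skip any number of ancestors; for descendant-only selectors, greedily consuming components from the nearest ancestor outward is sufficient. This is purely illustrative and nothing like the real selector representation:

```rust
// `selector` lists tag components leftmost-first, e.g. ["div", "p"].
// `tags_from_root` is the element's ancestor tags plus its own tag,
// root first, e.g. ["html", "body", "div", "section", "p"].
fn matches_descendant(selector: &[&str], tags_from_root: &[&str]) -> bool {
    let (element_tag, ancestors) = match tags_from_root.split_last() {
        Some(x) => x,
        None => return false,
    };
    let (last_sel, rest) = match selector.split_last() {
        Some(x) => x,
        None => return false,
    };
    // The rightmost component must match the element itself.
    if element_tag != last_sel {
        return false;
    }
    // Consume the remaining components right-to-left against the
    // ancestor chain, nearest ancestor first; skipped ancestors are
    // the "any descendant" transitions of the NFA.
    let mut remaining = rest.len();
    for tag in ancestors.iter().rev() {
        if remaining == 0 {
            break;
        }
        if *tag == rest[remaining - 1] {
            remaining -= 1;
        }
    }
    remaining == 0
}
```

A DFA version of the same idea would precompile the selector list into a state table so each ancestor step is a single indexed lookup, which is the branch-free shape that stands a chance on the GPU.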