Crazy idea: make an nngraph 'optimizer' #60
So, managed to |
Seems like, to get this to work, we'd need each nn layer to return its uncompiled (but probably already templated) kernel source code, through some kind of generic function, so the sources can then be stitched together somehow. |
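A minimal sketch of what such a generic function might look like, assuming a hypothetical getKernelSource() method that nothing in nn actually provides today; the {{...}} placeholders stand in for whatever templating scheme a fuser would use:

```lua
require 'nn'

-- hypothetical: a module variant that can hand its per-element kernel body
-- to a fuser, which would rename the placeholders and concatenate the bodies
-- of consecutive modules into a single kernel source string
local FusableTanh, parent = torch.class('nn.FusableTanh', 'nn.Tanh')

function FusableTanh:getKernelSource()
   -- uncompiled, templated snippet; '{{in}}' and '{{out}}' are placeholders
   return [[{{out}} = tanh({{in}});]]
end
```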
There are some people working on it :) but with different approaches. |
Hugh, my take on this is that the fusion problem is a traditional compiler problem resembling loop fusion. For most (all?) non-trivial cases you have to look at the dependence analysis structure of your problem. |
@hughperkins I am sort of working on this problem, not so much on fusion of kernels themselves but on analyzing and scheduling a computation graph, and assigning computations to resources (CPUs, GPUs, remote machines) without requiring manual assignment, to yield better performance than a manual assignment or the normal scheduling of work that, say, :forward() or :backward() would yield.

However, it effectively involves a rewrite of torch-nn/cunn/nngraph, because the existing Torch nn framework does not provide enough information regarding read/write dependencies. Plus, Torch nn is written in a stateful OO manner and has no real guidelines when it comes to internal state or dependencies, which causes many problems with trying to extract a computation graph and recompose it. The current framework I feel is very much at a technological dead end and is not suited for a multi-machine world.

I'm not entirely convinced that fusion of simpler layers would yield that much performance either. Pretty much everything (for CUDA/GPU at least; very much not so for CPU, since the Torch use of OpenMP is non-sensical and there's no vectorization at all) is bandwidth bound, not FLOP bound. This is a regime for which fusion would yield gains, but much of the heavy work is not in the pointwise modules, but in more complicated things like matrix multiplication, pooling or convolution. In order to fuse computations with these things, I think that everything would have to be written in a formulation like Polymage or Halide. |
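To make the bandwidth-bound point concrete, here is a rough back-of-the-envelope comparison (illustrative numbers, not a measurement) of global-memory traffic for two chained pointwise ops run as separate kernels versus fused into one:

```lua
-- y = tanh(x); z = 2 * y   as two kernels, vs. one fused kernel z = 2 * tanh(x)
local n     = 1024 * 1024   -- elements
local bytes = 4             -- float32

-- each separate pointwise kernel reads n floats and writes n floats
local unfusedTraffic = 2 * (2 * n * bytes)
-- the fused kernel reads x once and writes z once
local fusedTraffic   = 2 * n * bytes

print(unfusedTraffic / fusedTraffic)  -- 2: fusion roughly halves the traffic
```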
Hi nicolasvasilache and wickedfoo, I think that what you're doing is way more advanced than what I'm doing at the moment, so you should probably ignore me for now ;-) I should probably read through what you guys are doing, and see how I can fit around that. |
Kernel launch overhead (I'm speaking for CUDA here, not OpenCL, which I know nothing about :) ) is definitely significant the smaller the work performed is, as in RNNs. Fusion would help here, but the problem is that there are a wide variety of kernels in use, and being able to fuse any pairs of them or combinations of them is hard; especially in cutorch, where the size of the kernel CTA varies widely for various reasons.

When the workload is so small, it's also highly sensitive to CPU/GPU synchronization points (which stall the kernel launch pipeline). One thing my solution is working on is to make it easier to avoid synchronization points (and mercilessly hunting down sync points that do exist). The Torch tensor API also makes it hard to avoid synchronization points; e.g., :mean() is a sync point that copies a value back to the CPU, and there are lots of similar functions. Lazy evaluation is not in Torch's vocabulary.

I'm still thinking of a way to address this, to keep intermediate results on the GPU where possible for reuse and perhaps conditional launch/evaluation. That's pretty far off though. It might require writing a new base tensor API and mapping the user-level Matlab-like API onto that, where the underlying layer is lazy and fusion-friendly. All of this would be tons of work though. |
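For example (a minimal cutorch sketch), pointwise calls just enqueue kernels and return, while a reduction that hands a Lua number back has to synchronize:

```lua
require 'cutorch'

local x = torch.FloatTensor(1024 * 1024):uniform():cuda()

-- these just enqueue kernels and return immediately (asynchronous)
x:mul(2.0)
x:add(1.0)

-- :mean() has to hand a plain Lua number back to the host, so it forces a
-- device->host copy and stalls the kernel launch pipeline
local m = x:mean()
print(m)
```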
Yes, this is high on my list of suspects too actually, as the bottleneck in opencl char-rnn. I think I'm going to start doing a bit more benchmarking first though, instead of guessing again and again :-P As far as addressing this particular issue, I think it's relatively easy actually. The technology already exists in cutorch; you see that parameter
(edit: or maybe just detect that the output is a tensor, and use that:
) |
Started a project for benchmarking: https://github.com/hughperkins/cltorch-benchmarking . It's cltorch-specific for now, since it uses EasyCL, which is opencl-specific. It could plausibly be generalized to cuda though. |
Ha, I'm actually the person who wrote THCReduceAll and added the outOnDevice code. Nothing uses it yet (at least, I haven't used it) :) But, you'd have to have a system to allow other parts of the code to take those values as inputs, the ability to evaluate conditionals on the GPU and run a kernel based on them (or not; the kernel could early-exit if its device-local conditional flag wasn't set, or something), and the ability to chain these expressions together in Torch/Lua land, what that would look like, etc. It's pretty clear to me that it's an entirely different language that would be exposed to the user, unless you wanted to do some pretty crazy compiler analysis. |
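As a sketch of the device-local-flag idea (illustrative OpenCL only, not an existing cutorch/cltorch kernel): the conditional value never comes back to the CPU; the dependent kernel is always launched and simply early-exits when the flag is unset:

```lua
local src = [[
kernel void maybe_halve(global const int *flag, global float *data, int n) {
    if (*flag == 0) {
        return;                    // condition not met: kernel does nothing
    }
    int i = get_global_id(0);
    if (i < n) {
        data[i] = data[i] * 0.5f;  // the conditional work
    }
}
]]
```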
Haha, ok :-)
I guess that basically, anywhere there is a scalar input or output, there needs to be an option to provide a tensor input/output instead? |
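Something like this hypothetical overload, perhaps (the tensor-output form is made up, purely to illustrate the idea):

```lua
require 'cltorch'

local x = torch.FloatTensor(1024 * 1024):uniform():cl()

-- today: the result of a full reduction comes back as a Lua number,
-- which forces a GPU->CPU copy and a sync
local s = x:sum()

-- hypothetical tensor-output variant: the result lands in a one-element
-- tensor that stays on the device, so nothing needs to sync yet
local res = torch.ClTensor(1)
torch.sum(res, x)  -- made-up overload; torch.sum has no such form today
```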
For the reduceall bit, I guess we can baby-step through this, eg starting by adding a lua-side parameter, like
Hmmm, I reckon we should move the getDeviceScratchSpace bit out of ReduceAll, and into TensorMath. That way ReduceAll is fairly clean, and TensorMath can handle the outOnDevice argument that we receive from lua. |
Some additional scratch space would be needed if the output were eventually put into a tensor. The device scratch space is used for two-pass reductions in cutorch, not just the final result. It's not just 1 float, it's # of SMs * 4 floats or something like that. This could probably be handled by performing a temp allocation if the scratch space isn't available. It's the same strategy that Thrust uses, except with the scratch space I do not churn memory allocations (with the resultant sync on cudaFree). |
I was thinking that the tensor would just be a normal tensor argument, as for eg |
The final output can be a tensor that comes from the Lua side, but the scratch space is used for more than that. The scratch space size I require is dependent upon the specifics of the GPU hardware in use for two-pass reduction. |
Two-pass reduction needs scratch space (at least in the current implementation), but I'm not sure why you need scratch space to store a standard tensor? This is how I see this working:
(Edit: compared to currently, it would be:
) |
@wickedfoo This branch contains an implementation of
Result:
|
Did a small PoC for a zero-dimensional tensor, at https://github.com/hughperkins/cltorch/tree/zerodim-tensor-poc . Seems like it doesn't cause any major fundamental conflicts:
gives:
Yes, it would be nicer if printing a zero-dimensional tensor displayed the actual value. As we can see, this is technically possible, in the worst case by copying via a float storage, and probably can be automated a bit. (Edit: relevant commit: hughperkins/cltorch@1d45998 ) |
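For illustration, reading the value back "via a float storage" might look roughly like this hypothetical helper; it assumes a one-element FloatStorage can :copy() from the ClTensor's storage, which is an assumption about the PoC branch rather than documented cltorch behaviour:

```lua
-- hypothetical helper: pull the single value of a zero-dimensional ClTensor
-- back into Lua by copying through a one-element FloatStorage
local function readScalar(zeroDimClTensor)
   local hostStorage = torch.FloatStorage(1)
   hostStorage:copy(zeroDimClTensor:storage())  -- one-element device->host copy
   return hostStorage[1]
end
```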
added
|
Division by point tensor works now :-)
Result:
commits: |
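For contrast, a sketch of the pattern this avoids with stock APIs (whether every call shown is implemented for ClTensor is an assumption): the reduction result has to come back to Lua as a number, with the attendant sync, before it can be used in the division:

```lua
require 'cltorch'

local a = torch.FloatTensor(128, 128):uniform():cl()

-- without point tensors: the reduction result returns as a Lua number
-- (a GPU->CPU copy and sync), and only then can we divide by it
local s = a:sum()
a:div(s)
```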
Think I've exhausted all the nngraph-agnostic optimizations, ie per-module improvements. Looking again at nngraph-level fusion. (Edit: PS, cleaned a bunch of my earlier posts from this thread, to tidy it up a bit) (Edit2: I guess I can start with forward-only Narrow fusion, and then look at other fusions later) |
Just out of curiosity, if I could get nngraph to fuse various nodes, reckon that could be published as a paper anywhere, or is it basically pure engineering? If I had to guess, I'd guess: engineering, but it's mostly a guess, hence asking :-) |
Update: kernel fusion is mostly working. Just got a few ... "edge"-cases :-P to work out. |
@hughperkins that's exciting. I'm taking a look to see what you've done :) Edit: I guess it hasn't been pushed yet. |
Hmmm, yeah, I'm in two minds about pushing it until it demonstrably works on char-rnn: 1. in case I suddenly figure out a way of turning it into a paper, and 2. I don't want eg Google suddenly adding some obvious bit to it, and then patenting it :-P |
I think the cautiousness is definitely warranted (more for correctness and for research). I doubt Google would do that :D. Another set of folks who already fuse kernels for similar stuff are Nervana in their Neon framework. They introduce "MOPs" which they automatically analyze/fuse. I never tried their stuff beyond simple trivial benchmarking. |
Ok. As far as papers... I suppose the thing that makes a paper a paper, rather than just a computer program, is that it is general enough to be applied to other areas. For example, group theory and map-reduce are both quite general concepts, and the key invention in, for example, map-reduce is not running a program on lots of computers, but creating a general, simple framework which can be applied in many situations. Similarly, presumably saying "I wrote a program to reduce kernels in Torch" is not very paperable; it's kind of like something one could plausibly include in a master's thesis, whereas if I could abstract a key idea or two that made it possible to fuse the kernels, maybe that could become a paper, assuming it's original? |
Hmmm, seems there are a bunch of existing papers actually:
(Edit: hmmm, the third one, http://www.researchgate.net/publication/220244769_Automatic_fusions_of_CUDA-GPU_kernels_for_parallel_map , is a pretty good description of what I'm doing for char-rnn, except it goes beyond where I am at the moment, since it's actually trying to choose appropriate sub-groups to fuse, rather than just lumping as much stuff together into one giant molasses kernel :-P ) |
Since the Fousek et al paper covers everything I have to say, and then some, I've pushed my branch up to https://github.com/hughperkins/clnn/tree/fused-modules commits here: https://github.com/hughperkins/clnn/commits/fused-modules The code is a bit of a mess right now ... Key concepts are:
That's it for high-level concepts.

In terms of how char-rnn looks, here is a picture of char-rnn prior to last week's PR. You can see that nodes 16, 18 and 28 in this diagram split the Apply island. Here is char-rnn after rearranging the Narrows and changing to Split: you can see that there is a nice Apply island from nodes 16, 18, 19, down to node 3.

In terms of technical implementation, creating the dynamic kernels is handled by the UserKernel implementation I created for Sergey a couple of weeks ago: https://github.com/hughperkins/cltorch#custom-user-kernels . Then a new network module, nn.Apply, sits on top of this. The fusion process looks as follows:
Tests are in https://github.com/hughperkins/clnn/blob/fused-modules/test-fusion.lua Run by doing:
Or for a specific test, eg:
What's working:
Known issues:
|
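As a rough conceptual illustration of the Apply-island idea described above (this is not the actual fused-modules code; the snippets and placeholder scheme are made up): adjacent pointwise modules each contribute one per-element expression, and the expressions are concatenated into a single kernel body so the whole island becomes one launch:

```lua
-- one per-element snippet per pointwise module in the island (illustrative)
local snippets = {
   'float a = tanh({{x}});',             -- e.g. from an nn.Tanh node
   '{{out}} = 1.0f / (1.0f + exp(-a));', -- e.g. from a following nn.Sigmoid node
}

local body = table.concat(snippets, '\n   ')

-- stitch the snippets into one OpenCL kernel source string
local src = [[
kernel void fused_apply(global const float *x, global float *out, int n) {
   int i = get_global_id(0);
   if (i >= n) return;
   ]] .. body:gsub('{{x}}', 'x[i]'):gsub('{{out}}', 'out[i]') .. [[

}
]]

print(src)
```

Presumably something like this generated source is what nn.Apply hands to the custom user kernel support linked above, though the details here are guessed.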
This is really nice work. I really like the fact that the fused graph is still somewhat in lua-land and won't have crazy debugging issues. |
Thank-you! :-) |
Crazy idea: make an nngraph 'optimizer'.
I'm calling it crazy so I don't accidentally commit myself to doing something I then find won't work for reason x, y, z :-P Plus, it is a little way out there.
Thinking of creating an optimizer for nngraph that takes a graph as input, and then culls edges where it can. Eg, maybe it replaces nn.Narrow with torch.Narrow, and removes the nn.Narrow node. Since I haven't actually implemented this yet, I don't know how possible this is going to be, and whether the obstacles I meet will be merely challenging, or weeks of work.
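A minimal sketch of the graph-walking part, just to show the shape of such a pass; it leans on nngraph internals (gModule.forwardnodes, node.data.module), so the field names should be treated as approximate:

```lua
require 'nn'
require 'nngraph'

-- find candidate nodes (here: nn.Narrow) that an optimizer pass could try to
-- replace with a plain tensor narrow, or fold into a neighbouring node
local function findNarrowNodes(gmod)
   local found = {}
   for _, node in ipairs(gmod.forwardnodes) do
      local m = node.data and node.data.module
      if m and torch.type(m) == 'nn.Narrow' then
         table.insert(found, node)
      end
   end
   return found
end
```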
I'm also thinking of something more general, and more gpu-specific: implementing gmodule within clnn, walking the graph within clnn, and joining some of the nodes together during the forward/backward calls, eg replacing a Linear + a Tanh by a single kernel that does both in one launch.
The overriding motivation for all of this is basically that, on certain gpus, I notice that the overhead of launching kernels in char-rnn appears to dominate the runtime. I haven't used any particular diagnostic tools to confirm this, and perhaps I should, but I notice that I can increase the data size 20 times whilst the execution time only increases by about 10-20%, which suggests to me that it's not a calc/data issue, but is plausibly linked to kernel launches.
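One cheap way to test that hypothesis without a profiler (a sketch; cutorch shown here, and cltorch would need its own synchronize/finish equivalent): time the forward pass with explicit synchronization and see how much the per-iteration time moves as the input grows:

```lua
require 'sys'
require 'cutorch'

local function timeForward(net, input, iters)
   net:forward(input)        -- warm-up (kernel compilation, allocations)
   cutorch.synchronize()
   local start = sys.clock()
   for i = 1, iters do
      net:forward(input)
   end
   cutorch.synchronize()
   return (sys.clock() - start) / iters
end

-- if multiplying the input size by 20 barely changes the per-iteration time,
-- the run is dominated by per-kernel launch overhead rather than by the math
```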