No speed-up despite sparsity #15
Hi, digging a bit deeper, I tried to use the library directly rather than the Python wrappers...
These are the timings:
The input is a 1-channel numpy array with 95% sparsity...
Using timeit on sess.run is incorrect for many reasons, including the large overhead of session startup. Proper CUDA time measurement in TensorFlow is not trivial, and I recommend inspecting the benchmarking code, in particular the cuda_timer_start and cuda_timer_end custom ops. You also have to properly preheat the GPU and throw away the initial statistics while the clocks ramp up (which could be one of many reasons why your observed timings are skewed). There are many other nontrivial details for proper GPU benchmarking that you can find in our benchmarking code, including some TensorFlow and cuDNN internals such as either disabling the internal TF autotuner or letting it run for at least one session.run(), just off the top of my head. Also see my answer here:
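For reference, a minimal warm-up pattern in TF 1.x style looks roughly like the sketch below. The tensors, shapes, and iteration counts are illustrative placeholders, not the benchmark code shipped with SBNet, and the timed loop still includes sess.run overhead rather than pure kernel time.

```python
import time
import numpy as np
import tensorflow as tf  # TF 1.x graph/session API assumed

# Hypothetical graph: a stand-in op for whatever is being benchmarked.
x = tf.placeholder(tf.float32, shape=[1, 256, 256, 64])
y = tf.layers.conv2d(x, 64, 3, padding='same')

feed = {x: np.random.rand(1, 256, 256, 64).astype(np.float32)}

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Warm-up: let cuDNN autotune and GPU clocks ramp up; discard these runs.
    for _ in range(20):
        sess.run(y, feed_dict=feed)

    # Timed runs. CUDA-event-based timing (as in SBNet's cuda_timer ops)
    # is more accurate than wall-clock timing around sess.run.
    times = []
    for _ in range(100):
        t0 = time.time()
        sess.run(y, feed_dict=feed)
        times.append(time.time() - t0)
    print("median per-run time: %.3f ms" % (1000 * np.median(times)))
```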
I appreciate that it might not be perfect due to the overheads; however, even if I run the operations 100 or 1000 times (-n100 instead of -n10), I get roughly the same mean timings and smaller standard deviations. Similarly, if I use the TF Timeline object, the result is the same. Also, for all practical purposes, the time taken by sess.run is the one people are interested in; it will dictate whether people use the operations or not... and initialization overheads / transfers to the GPU should affect both the normal convolution and this method.
In this case the block size is also only 3, which is very suboptimal for this method. I would recommend a block size of at least 5-10 for a 3x3 convolution kernel, so that Winograd convolutions can be leveraged and overlap overhead is reduced.
Also, template instantiations for some block sizes are missing from the source. I recommend autotuning over block size for your specific problem (i.e. searching for the fastest block/sparsity trade-off and the block size that fits your sparsity pattern best). Sorry, at first I didn't realize you were using block size 3. For our internal detector the smallest block size we used was 5, and it went up to 29 for the high-resolution layers.
Thanks, using larger block sizes present in the macro helped.
And for stacked 1x1, 3x3, 1x1 with relus:
so there is a real need to avoid gathering/scattering unless really necessary. Cheers
@andrei-pokrovsky Is there an automated way to tune the block size hyperparameter, or can we learn it?
As a side note, with a 3x3 block size for a 3x3 kernel you will be getting about 9 pixels of overlap per pixel (extra memory bandwidth to replicate that data into a block), so I think about a 9x slowdown is expected, plus you lose Winograd; I'm slightly surprised you only got a 10x slowdown and not the 20x I would roughly expect. It's also possible that at those kernel sizes and resolutions the GPU compute is underutilized and the convolution is bandwidth-bound (I'd have to crunch the numbers to say exactly), which might be why you were only seeing a 10x slowdown.
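One rough way to see why larger blocks reduce the overlap overhead is to estimate the input replication factor: a gathered block of size b with a k×k kernel reads (b + k - 1)² input pixels to produce b² outputs. This is my own simplified accounting and may not match the exact figures above, but it shows the trend.

```python
def replication_factor(block, kernel):
    """Rough estimate of input-pixel replication for a gathered block:
    (block + kernel - 1)^2 inputs read per block^2 outputs produced."""
    halo = block + kernel - 1
    return (halo * halo) / float(block * block)

for b in (3, 5, 10, 29):
    print("block %2d: ~%.2fx input replication for a 3x3 kernel"
          % (b, replication_factor(b, 3)))
# block  3: ~2.78x   block  5: ~1.96x   block 10: ~1.44x   block 29: ~1.14x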
Wrt average pooling: if your tolerance cut-off is fixed, then average pooling will produce higher sparsity than max pooling, and the GPU can become underutilized for such a small tensor / high sparsity, which is my current guess for why you are seeing a lower speedup. The GPU needs a decent amount of work left over after the block gather to realize a close-to-linear speedup, or the same speedup as with max pooling. This is just my guess right now; I haven't run your code yet.

For inference there are many alternative solutions, such as TensorRT (SBNet could be integrated via the IPlugin interface) and ONNX with a TensorRT backend. You also don't necessarily have to freeze the graph for inference in TensorFlow. Another alternative is the nvGraphAddNode API, etc. There are many inference solutions out there, and getting the perfect one might require some engineering work. I personally find that TensorFlow out of the box is too heavyweight for production inference, and it's not that hard to roll your own inference mini-framework if you are not trying to be middleware and are just trying to solve your specific application: you capture the graph, export the dependencies and parameters, do a topological sort on the graph, and you are basically done. There's some work there, but you have full control over your source, as opposed to closed-source libraries like TensorRT. Plus you can always splice in subgraphs from TRT and have more control that way.

Wrt session overhead, I don't think timing a session with timeit is representative of the actual workload where you have a full network wrapped in a single session. So I recommend using the provided timing operation based on CUDA event wrappers; refer to the Stack Overflow post I referenced earlier for details on why. HTH
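To illustrate why the pooling choice changes the number of active blocks, here is a rough TF 1.x sketch of reducing a dense mask to per-block activity with max versus average pooling and a fixed threshold. The block size and tolerance are made-up values, and this is not SBNet's own mask-reduction op, just the underlying idea.

```python
import tensorflow as tf

mask = tf.placeholder(tf.float32, [1, 96, 96, 1])  # dense [N, H, W, 1] mask
bsize, tol = 8, 0.1                                 # illustrative block size / cut-off

# Max pooling keeps a block active if ANY pixel in it exceeds tol;
# average pooling keeps it only if the block is active "on average",
# so fewer blocks survive and the sparsity is higher.
max_score = tf.nn.max_pool(mask, [1, bsize, bsize, 1], [1, bsize, bsize, 1], 'SAME')
avg_score = tf.nn.avg_pool(mask, [1, bsize, bsize, 1], [1, bsize, bsize, 1], 'SAME')

active_max = tf.reduce_sum(tf.cast(max_score > tol, tf.int32))
active_avg = tf.reduce_sum(tf.cast(avg_score > tol, tf.int32))
```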
If you really want to continue using timeit to time a single sparse convolution, I recommend wrapping 50-100 repeated subgraphs, each containing a single sparse convolution, into one session; this reduces the session overhead so the measurement is more representative of a full end-to-end inference run of a whole network in a single session. In this scenario, pay attention to feeding different randomized inputs into each separate convolution and make sure all the outputs are consumed, otherwise TensorFlow will be aggressive about optimizing out unused subgraphs or subgraphs with duplicated inputs.
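A sketch of that idea (shapes and the stand-in convolution are placeholders; the point is many independent copies, each with its own random input, all feeding one consumed output so nothing is pruned):

```python
import numpy as np
import tensorflow as tf

def build_one_conv(inp):
    # Stand-in for the single sparse convolution being benchmarked.
    return tf.layers.conv2d(inp, 64, 3, padding='same')

inputs, outputs = [], []
for i in range(50):
    x_i = tf.placeholder(tf.float32, [1, 128, 128, 64], name='x_%d' % i)
    inputs.append(x_i)
    outputs.append(build_one_conv(x_i))

# Consume every output so TF cannot optimize away any of the 50 subgraphs.
total = tf.add_n([tf.reduce_sum(o) for o in outputs])

feed = {x_i: np.random.rand(1, 128, 128, 64).astype(np.float32) for x_i in inputs}
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(total, feed_dict=feed)  # one session.run amortized over 50 convolutions
```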
@dhingratul Wrt autotuning: the autotuner in this case simply tries all the different block sizes and picks the fastest. You can check how we do it in the timing code that is included with the distribution. We don't try non-square block sizes, which could potentially work better for a particular sparsity pattern (for instance 16x8, 32x4, etc.), but those template instantiations would have to be added explicitly to the C++ code as described in the readme. More generally speaking, autotuning CUDA kernels is a fairly extensive subject in and of itself and is a bit outside the scope of this project, but you can look at projects like NVIDIA's jitify, etc.
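A bare-bones version of that search is sketched below. It is purely illustrative: `time_sparse_conv` is a hypothetical callable standing in for whatever timing harness you use (e.g. the CUDA-event-based ops from the benchmark code), and the candidate sizes must be among the block sizes actually instantiated in the C++ templates.

```python
def autotune_block_size(time_sparse_conv, candidate_sizes=(5, 9, 13, 17, 25, 33)):
    """Try each instantiated block size and keep the fastest one.

    `time_sparse_conv(bsize)` is assumed to build the sparse convolution at
    that block size, run it with warm-up, and return a median time in ms.
    """
    timings = {b: time_sparse_conv(b) for b in candidate_sizes}
    best = min(timings, key=timings.get)
    return best, timings
```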
Thanks for all the detailed tips! I got it running fine now :-)
Hello @andrei-pokrovsky
or
x is a tensor with channels > 1, and the first way is how you use it in the sparse_conv_lib.py file. However, I am unsure how you select a block for channels greater than 1 when there is a defined numerical mask threshold. Do you take a sum internally along the channel direction, or a max? I cannot get the 'expected' output compared to a normal dense convolution if I use the first option, but I can get an identical output to a dense convolution if I use the second one.
Right now the implementation expects/requires a tensor of shape [N, H, W, 1] for the mask. This is somewhat redundant since it's always a single channel. In the end it's either max or average pooling per block, so you can come up with your own way of reducing C channels to 1, such as averaging across channels or some other reduction.
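For example (a hedged sketch, not code from the library), either of these reductions produces the [N, H, W, 1] mask the op expects; which one is appropriate depends on whether a pixel should count as active when any channel exceeds the threshold or only when the channel average does:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 64, 64, 256])  # multi-channel feature map

# Pixel is active if ANY channel is large at that location:
mask_max = tf.reduce_max(x, axis=3, keepdims=True)

# Pixel is active only if the channel average is large:
mask_mean = tf.reduce_mean(x, axis=3, keepdims=True)
```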
Hello,
Nice paper! Unfortunately, I am unable to reproduce your speed-ups.
This is what I do:
and it seems the sparse version, despite ~95% sparsity, is 10x slower!?
I am pretty sure there is something I am misunderstanding from your code. Could you clarify?
I tried looking into your benchmark scripts, but I'd like to isolate a single layer first...
Thanks!