perf improvements for depthwise convolutions #3265
Conversation
@ngimel looks great! Can you explain the changes with the loops / warps? Everything else is straightforward.
const int nwarps = blockDim.x / WARP_SIZE;
const int imageElements = outputWidth * outputHeight;
// use one warp per item: warp `batch` of this block handles batch items
// batch, batch + nwarps, batch + 2*nwarps, ...
for (int batchIdx = batch; batchIdx < batchSize; batchIdx += nwarps){
// decompose the flattened index: c is the remainder modulo inputChannels,
// n is the quotient (one division plus a multiply-subtract instead of a mod)
indtmp1 = indtmp2;
indtmp2 = indtmp1/inputChannels;
const int c = indtmp1 - indtmp2 * inputChannels;
const int n = indtmp2;
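The same quotient/remainder decomposition, sketched in Python with made-up sizes (these numbers are illustrative, not from the PR):

```python
# Recover (n, c) from a flattened n * inputChannels + c index.
# The CUDA code above computes the same thing with one division and a
# multiply-subtract, avoiding a separate modulo operation.
inputChannels = 16
flat = 37                        # illustrative flattened index
n, c = divmod(flat, inputChannels)
assert flat == n * inputChannels + c
print(n, c)  # 2 5
```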
@@ -155,7 +189,7 @@ __global__ void spatialDepthwiseConvolutionAccGradParameters(
 int bidx = blockIdx.x;
 int kW = bidx % kernelWidth;
 int kH = (bidx / kernelWidth) % kernelHeight;
-int ch = (bidx / channelStride) % kernelChannels;
+int ch = (bidx / channelStride);
template <typename T, typename AccT, typename IndexType>

const int WARP_SIZE = 32;
const int MAX_BLOCK_SIZE = 256;
const IndexType offset0 = (n * inputChannels + inputChannel) * inputHeight * inputWidth;
// KH_LIMIT is a compile-time constant, so the kernel loops can be fully unrolled
#pragma unroll
for (int kH = 0; kH < KH_LIMIT; ++kH) {
#pragma unroll
const int laneId = threadIdx.x % WARP_SIZE;  // lane within the warp
const int batch = threadIdx.x / WARP_SIZE;   // warp index within the block
const int nwarps = blockDim.x / WARP_SIZE;
const int imageElements = outputWidth * outputHeight;
@@ -164,25 +198,31 @@ __global__ void spatialDepthwiseConvolutionAccGradParameters(
AccT grad = ScalarConvert<float, AccT>::to(0.0);

// Block-stride loop over the number of elements we need to reduce
for (IndexType idx = threadIdx.x; idx < blockElements; idx += blockDim.x) {
const int laneId = threadIdx.x % WARP_SIZE;
In order to use this new depthwise conv, do I have to install cuDNN v7?
@elliothe no.
@soumith In order to get the fastest depthwise convolution in PyTorch, what should I do now? I have installed the latest PyTorch from source, with CUDA 8.0 and cuDNN v6. However, when I try to replace the normal spatial conv by combining a depthwise conv and a summation, it is tens of times slower. Is this speed degradation normal?
@elliothe how did you actually "replace" normal spatial conv with depthwise conv? All you need to do to use the depthwise kernels is create an `nn.Conv2d` with `groups` set to the number of input channels.
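For concreteness, a minimal sketch (mine, not from the thread) of a layer that takes the depthwise path, in current PyTorch syntax; it assumes a CUDA device is available:

```python
import torch
import torch.nn as nn

in_channels = 32

# groups == in_channels makes this a depthwise convolution: each input
# channel is convolved with its own filter, which is the case the
# specialized kernels in this PR are dispatched for.
depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                      padding=1, groups=in_channels).cuda()

x = torch.randn(16, in_channels, 64, 64, device="cuda")
out = depthwise(x)  # shape: (16, 32, 64, 64)
```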
@soumith Normal spatial convolution is `nn.Conv2d(input_channels, output_channels, kernel_size)`. I replaced it with a depthwise convolution, `nn.Conv2d(input_channels, input_channels*output_channels, kernel_size, groups=input_channels)`, followed by a summation over the input-channel groups. When the input_channels and output_channels are large, the speed degrades a lot.
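A sketch of the replacement being described, reconstructed from the thread (the exact code isn't shown there; shapes are illustrative):

```python
import torch
import torch.nn as nn

in_channels, out_channels = 64, 128
N, H, W = 8, 32, 32

# The attempted decomposition: a depthwise conv with depth multiplier
# out_channels (so in_channels * out_channels output maps), followed by
# a sum over the input-channel groups.
depthwise = nn.Conv2d(in_channels, in_channels * out_channels,
                      kernel_size=3, padding=1, groups=in_channels)

x = torch.randn(N, in_channels, H, W)
y = depthwise(x)
# group the maps by input channel and sum the input channels out
y = y.view(N, in_channels, out_channels, H, W).sum(dim=1)
print(y.shape)  # torch.Size([8, 128, 32, 32])
```

With large `in_channels * out_channels` this materializes far more intermediate channels than a single fused convolution would, which is consistent with the slowdown reported above.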
Wouldn't you also see this "degradation" with nn.Conv2d(input_channels, input_channels*output_channels)? |
That's a fair point. Currently, when deciding between cudnn and the depthwise-separable kernels, the only thing that's checked is that number_of_groups = input_channels. However, when depth_multiplier is large (that is, the number of output channels = <large number> * the number of input channels), using cudnn and just dispatching a number of regular convolution kernels might be faster. Are layers like this encountered in practice?
@ngimel I believe this is going to be a research direction, since depthwise-separable kernels have become popular due to some recent state-of-the-art NNs, like MobileNets and Xception. It would be very helpful for researchers if PyTorch could support this kind of operation efficiently.
In MobileNet and Xception the number of output channels is equal to the number of input channels (maybe in some Xception layers it's 2*input_channels, I don't remember off the top of my head). For those layers the kernels in PyTorch provide much better performance than what used to be available (the benchmarks for MobileNet comparing with the previous cudnn implementation are in #3057; this PR slightly improves on #3057).
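For reference, the depthwise-separable pattern those networks use, sketched with illustrative channel counts:

```python
import torch.nn as nn

def depthwise_separable(in_channels, out_channels, stride=1):
    # MobileNet-style block: a 3x3 depthwise conv (groups == in_channels,
    # the case these kernels target) followed by a 1x1 pointwise conv
    # that mixes channels.
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride,
                  padding=1, groups=in_channels, bias=False),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

block = depthwise_separable(128, 256)
```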
I have upgraded my PyTorch to 0.3.0, but I found that m = nn.Conv2d(128, 256, kernel_size=3, groups=128) is still 2 times slower than m = nn.Conv2d(128, 256, kernel_size=3). I am really confused by this; do I need to upgrade PyTorch to another version, or do I need CUDA 9? @ngimel
@KeCh96 I replied on your other post. Please refrain from posting the same question in multiple different places.
Summary: This PR activates faster depthwise convolution kernels for Volta and Turing GPUs using cudnn >= 7600. The script to benchmark the current PyTorch master branch and this PR branch can be found [here](https://gist.github.com/ptrblck/4590cf20721d8f43296c9903abd4a774). (50 warmup iterations, 1000 iterations for timing) I've used #3265 to create a similar benchmark and added a few additional setups. Since the results are quite long, I've uploaded them in a spreadsheet [here](https://docs.google.com/spreadsheets/d/13ByXcqg7LQUr3DVG3XpLwnJ-CXg3GUZJ3puyTMw9n2I/edit?usp=sharing). Times are given in ms per iteration. We've benchmarked this PR on a DGX1 using V100 GPUs. The current workload check in `check_cudnn_depthwise_workload` is quite long and can be moved to another file, if wanted. CC ngimel (Thanks for the support while benchmarking it ;) ) Pull Request resolved: #22302 Differential Revision: D16115057 Pulled By: ezyang fbshipit-source-id: bad184658518e73b4d6b849d77e408f5a7a757de
The biggest performance improvements are due to templating the kernels. The benchmarks comparing with #3057 performance are above; I've taken sizes from https://github.com/marvis/pytorch-mobilenet/blob/master/benchmark.py#L19-L46, and some are slightly different from what was listed in #3057. Benchmarks are for 50 iterations; time is given per iteration.
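The exact benchmark script isn't reproduced here, but a timing loop of the general shape used for numbers like these might look as follows (warmup count, layer sizes, and the `bench_ms` helper are illustrative, not from the PR):

```python
import time
import torch
import torch.nn as nn

def bench_ms(layer, x, warmup=50, iters=50):
    # warm up so one-time allocation costs don't distort the measurement
    for _ in range(warmup):
        layer(x)
    torch.cuda.synchronize()  # drain queued kernels before starting the clock
    start = time.time()
    for _ in range(iters):
        layer(x)
    torch.cuda.synchronize()  # and again before stopping it
    return (time.time() - start) / iters * 1000.0  # ms per iteration

x = torch.randn(32, 128, 56, 56, device="cuda")
layer = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128).cuda()
print(f"{bench_ms(layer, x):.3f} ms/iter")
```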