Accelerated 3D depthwise convolution implementation #40801
Conversation
💊 CI failures summary and remediations
As of commit 4a03290 (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
Cheers @linziyi96 and many thanks for the PR. Just a quick question: can you reuse the test cases from my old PR #31885? Otherwise, thanks again for the PR - I'll try it out on the internal codebase as well to double-check that everything runs smoothly, but I'll leave the thorough PR review to someone more qualified.
Hi @linziyi96, I tried your 3D depthwise convolution implementation for X3D (https://arxiv.org/abs/2004.04730), but did not see an obvious acceleration. If it's convenient, I'd like to discuss this problem with you. Could you leave your email or contact me at tongzhan@smail.nju.edu.cn? Thanks.
@lly-zero-one has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
const int in_row_start = out_row * dH - pH;
const int in_frame_start = out_frame * dT - pT;

scalar_t sum = (scalar_t)0;
For fp16, sum should be accscalar_t (float), not scalar_t.
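As a hedged illustration of the pattern being requested (a minimal sketch, not the PR's actual kernel; accumulate_dot and its arguments are hypothetical):

#include <ATen/AccumulateType.h>

// Accumulate in a wider type so fp16 inputs don't lose precision or
// overflow; at::acc_type<scalar_t, true> maps half -> float on CUDA.
template <typename scalar_t>
__device__ scalar_t accumulate_dot(const scalar_t* a, const scalar_t* b, int n) {
  using accscalar_t = at::acc_type<scalar_t, /*is_cuda=*/true>;
  accscalar_t sum = accscalar_t(0);
  for (int i = 0; i < n; ++i) {
    sum += static_cast<accscalar_t>(a[i]) * static_cast<accscalar_t>(b[i]);
  }
  return static_cast<scalar_t>(sum);  // narrow once, on the final write
}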
const int out_frame_end = in_frame + pT;

const scalar_t* kernel_ptr = kernel[in_channel * channel_multiplier].data();
scalar_t sum = (scalar_t)0;
same comment about accscalar_t
}

template <int dim>
std::vector<int64_t> get_output_size(
There's already a function conv_output_size in ConvUtils.h; no need to have a separate one.
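For reference, a minimal sketch of the per-dimension arithmetic that ATen's conv_output_size helper performs (the formula is the standard one; the function name out_size here is hypothetical):

// Standard convolution output-size formula, applied per spatial dim.
int64_t out_size(int64_t in, int64_t kernel, int64_t pad,
                 int64_t stride, int64_t dilation) {
  return (in + 2 * pad - dilation * (kernel - 1) - 1) / stride + 1;
}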
TORCH_CHECK(dilation.size() == dim,
    "dilation length should be ", dim, ", but got ", dilation.size());

TORCH_CHECK(input.defined(),
Usually inputs being defined is not checked, as there's no way to get undefined inputs from Python, and they will soon error out (e.g. on a .size() call) anyway.
output_ = output.unsqueeze(0);
}
Tensor weight_ = weight.contiguous();
Tensor bias_ = bias.defined() ? bias.contiguous() : bias;
No call to input_.contiguous()?
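A hedged one-line sketch of what the reviewer appears to be asking for, mirroring the weight_ and bias_ lines above (exact placement is hypothetical):

Tensor input_ = input.contiguous();  // hypothetical fix suggested by the review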
TORCH_CHECK(padding[i] * 2 + input.size(i + 2) <= int_max,
    "Padded input tensor is too large.");
}
TORCH_CHECK(grad_output_.size(0) * grad_output_.size(2) < int_max - block / C10_WARP_SIZE &&
Comments here on why the conditions look like this would be helpful.
const int warpid = threadIdx.x / C10_WARP_SIZE;
const int nwarps = blockDim.x / C10_WARP_SIZE;

scalar_t grad = (scalar_t)0;
accumulation should be in accscalar_t
scalar_t grad = (scalar_t)0;
int batch = warpid / oT;
int gout_frame = warpid - batch * oT;
for (int outer_pos = warpid; outer_pos < input.size(0) * oT;
A comment on the strategy would be useful here (each warp accumulates over a 2D image and loops over batches plus the 3rd dimension).
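A hedged sketch of that strategy as commented code (variable names follow the snippet above, but the loop bodies are illustrative, not the PR's actual kernel):

// Strategy: each warp owns one (batch, output-frame) pair at a time and
// strides over the remaining pairs; within a pair, the warp's lanes split
// the 2D spatial positions and accumulate partial weight gradients.
for (int outer_pos = warpid; outer_pos < input.size(0) * oT; outer_pos += nwarps) {
  const int batch = outer_pos / oT;
  const int gout_frame = outer_pos - batch * oT;
  for (int pos = threadIdx.x % C10_WARP_SIZE; pos < oH * oW; pos += C10_WARP_SIZE) {
    // grad += grad_output[batch][ch][gout_frame][pos] * matching input element
  }
}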
sdata[threadIdx.x] = grad;
__syncthreads();

assert(__popc(blockDim.x) == 1);
If you really need this assert, use CUDA_KERNEL_ASSERT; however, it can be done on the host side.
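A hedged sketch of the host-side alternative (block is assumed to be the launch-time block size; __popc(x) == 1 is the power-of-two test the device assert performs):

// Check on the host, before the kernel launch, instead of asserting
// on the device: blockDim.x must be a power of two.
TORCH_CHECK(block > 0 && (block & (block - 1)) == 0,
    "block size must be a power of two, got ", block);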
@linziyi96 Thanks for the contribution. Could you try to resolve the comments and get the PR into PyTorch?
@linziyi96 Any plans to merge it? How can I use this patch without merging?
Summary: This pull request (#40801) has become an important part of recent 3D models, brings a significant improvement in speed, and has been open for a while, so I decided to resolve the previous review comments and modify it a bit so that it can be merged into the latest version of PyTorch. Pull Request resolved: #51027 Reviewed By: albanD Differential Revision: D26414116 Pulled By: ngimel fbshipit-source-id: 562c099f4d7f6d603a9c2f2e2a518bc577b0d8ee
Are there any plans to merge this pull request?
#51027, which provides this functionality, has been merged.
This is another attempt to resolve the slow 3D depthwise (a.k.a. channel-wise) convolution in cuDNN (previously attempted in #31885). 3D depthwise convolutions are seeing increasing use in various recent works (e.g. https://arxiv.org/abs/2004.04730, https://arxiv.org/abs/1904.02811), but currently cuDNN's 3D depthwise convolution is usually even slower than a regular dense convolution, making it very time-consuming in practice to train such models.
I have implemented a CUDA kernel and found that it brings a clearly noticeable performance gain. It takes as reference the 2D implementations (PyTorch: https://github.com/pytorch/pytorch/blob/master/aten/src/THCUNN/SpatialDepthwiseConvolution.cu; TensorFlow: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/depthwise_conv_op_gpu.h) and existing attempts (#31885), and was further tuned with the NVIDIA profiler. According to my tests, the forward+backward speedup is at least 2.5x and in many cases 5x~20x. The timing results are attached at the end. They are intended to cover the most common use cases, but due to limited time and limited access to different devices, they may still be some way from complete. Further tests with different configurations or on different devices are certainly welcome.
Although it has already been used internally for some time, I'm aware that it might need further refinement to handle some less common cases and to follow general PyTorch coding principles more closely. I've listed the TODO points I've come up with so far, some of which include questions I'm not yet clear about:
Any other suggestions are also appreciated. I can likely spend some more time on this PR. Feel free to tag it as WIP if necessary.
Below are the timing results, in ms. The kernel size is always (3, 3, 3). The test code does a 50-iteration warm-up and then another 50 iterations for timing. The timing code is after the table.
V100 results (these machines are not fully under my control, so I can't upgrade their drivers; 7.6.3 is the highest cuDNN version that comes with an official build)
Timing code:
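The original script is not reproduced here; below is a hedged reconstruction based on the description above (50 warm-up iterations, then 50 timed iterations of forward+backward for a depthwise Conv3d with a (3, 3, 3) kernel). All shapes are illustrative placeholders, not the configurations from the table.

import torch

def bench(n=4, c=64, t=32, h=56, w=56, iters=50):
    x = torch.randn(n, c, t, h, w, device="cuda", requires_grad=True)
    # groups == in_channels makes the convolution depthwise
    conv = torch.nn.Conv3d(c, c, kernel_size=3, padding=1, groups=c).cuda()
    for _ in range(iters):  # warm-up
        conv(x).sum().backward()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):  # timed forward + backward
        conv(x).sum().backward()
    end.record()
    torch.cuda.synchronize()
    print(f"avg iteration: {start.elapsed_time(end) / iters:.3f} ms")

if __name__ == "__main__":
    bench()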