Depthwise Conv1d performance (a naive CUDA kernel is 10x faster) #75747
Labels
module: cuda
Related to torch.cuda, and CUDA support in general
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🚀 The feature, motivation and pitch
Please improve the CUDA performance of Depthwise Conv1d :)
FYI, I wrote a naive CUDA kernel and it's already 10x faster than PyTorch:
https://github.com/BlinkDL/RWKV-CUDA
RTX 3090:
pytorch = fwd 14 ms, bwd 65 ms
CUDA kernel v3 = fwd 0.8 ms, bwd 5.5 ms
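For reference, the operation being benchmarked is a depthwise Conv1d, i.e. a grouped convolution with `groups == in_channels`, so each channel is convolved with its own 1-D filter and channels never mix. A minimal pure-Python sketch of those semantics (shapes and the "valid" padding choice here are illustrative assumptions, not taken from the linked kernel):

```python
def depthwise_conv1d(x, w):
    """Depthwise 1-D convolution (cross-correlation form, no padding).

    x: [C][T] input, one row per channel.
    w: [C][K] weights, one filter per channel.
    Returns [C][T-K+1].
    """
    C, T = len(x), len(x[0])
    K = len(w[0])
    out = []
    for c in range(C):  # channels are processed independently
        row = []
        for t in range(T - K + 1):
            row.append(sum(x[c][t + k] * w[c][k] for k in range(K)))
        out.append(row)
    return out

# Tiny example: 2 channels, length 4, kernel size 2.
x = [[1.0, 2.0, 3.0, 4.0],
     [1.0, 0.0, 1.0, 0.0]]
w = [[1.0, 1.0],    # channel 0: moving sum
     [1.0, -1.0]]   # channel 1: adjacent difference
print(depthwise_conv1d(x, w))  # → [[3.0, 5.0, 7.0], [1.0, -1.0, 1.0]]
```

Because each output element depends only on K inputs from a single channel, the op parallelizes trivially (one thread per output element), which is why even a naive CUDA kernel can beat a generic grouped-convolution path.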
Alternatives
No response
Additional context
No response
cc @ngimel