Depthwise Conv1d performance (a naive CUDA kernel is 10x faster) #75747
Labels
module: cuda
Related to torch.cuda, and CUDA support in general
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🚀 The feature, motivation and pitch
Please improve the CUDA performance of Depthwise Conv1d :)
FYI, I wrote a naive CUDA kernel and it's already 10x faster than PyTorch:
https://github.com/BlinkDL/RWKV-CUDA
RTX 3090:
pytorch = fwd 14 ms, bwd 65 ms
CUDA kernel v3 = fwd 0.8 ms, bwd 5.5 ms
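For reference, the operation being benchmarked is a depthwise Conv1d, i.e. a grouped convolution with `groups == in_channels`, so each channel is convolved with its own 1-D filter and channels never mix. A minimal pure-Python sketch of those semantics (shapes and the "valid" padding choice here are illustrative assumptions, not taken from the linked kernel):

```python
def depthwise_conv1d(x, w):
    """Depthwise 1-D convolution (cross-correlation form, no padding).

    x: [C][T] input, one row per channel.
    w: [C][K] weights, one filter per channel.
    Returns [C][T-K+1].
    """
    C, T = len(x), len(x[0])
    K = len(w[0])
    out = []
    for c in range(C):  # channels are processed independently
        row = []
        for t in range(T - K + 1):
            row.append(sum(x[c][t + k] * w[c][k] for k in range(K)))
        out.append(row)
    return out

# Tiny example: 2 channels, length 4, kernel size 2.
x = [[1.0, 2.0, 3.0, 4.0],
     [1.0, 0.0, 1.0, 0.0]]
w = [[1.0, 1.0],    # channel 0: moving sum
     [1.0, -1.0]]   # channel 1: adjacent difference
print(depthwise_conv1d(x, w))  # → [[3.0, 5.0, 7.0], [1.0, -1.0, 1.0]]
```

Because each output element depends only on K inputs from a single channel, the op parallelizes trivially (one thread per output element), which is why even a naive CUDA kernel can beat a generic grouped-convolution path.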
Alternatives
No response
Additional context
No response
cc @ngimel